In today's world of digital infrastructure, data center owners spend countless hours on design due diligence — extracting better quality from vendors and contractors and demanding commissioning agents find all of the mistakes others missed — just so they can brag about building a data center that will never fail.
But, the dilemma is, they do fail — and they do it on a regular basis. Fortunately, built-in redundancy, extensive commissioning, and startup staff training can reduce the frequency of interruptions.
That's not the end of the story though. In fact, it's just the beginning — maybe the first three chapters out of 100 or more. If the team did everything right, then this data center fairytale is off to a good start, but ... will it last?
Ideally, operating staff would be brought in during the construction and commissioning processes, allowing for a seamless transfer of knowledge to the first generation of operators. This insight goes a long way toward keeping abnormal operating events in the "save" category versus the "outage" one.
The roles of staff during outages tend to be clear. They each have defined roles and processes to follow, and they know their end game is to restore services ASAP. How well they perform depends on their individual skills and knowledge of the systems they need to restore.
The role of management is not as clear, however, as there is a multitude of outage levels. Where and when does management get involved, and when is the situation escalated to the C-suite? Most everyone has escalation procedures that generally work well through middle management, but, procedures or not, there is always reluctance to immediately notify the C-suite. Then there is the issue of when the primary C-suite contact is unreachable (e.g., on vacation or in transit). Who is in charge then?
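One way to remove the ambiguity about who is in charge is to write the escalation order down ahead of time and always hand the incident to the first person on the list who can actually be reached. The following is a minimal sketch of that idea, not a prescribed procedure; the `Contact` type, the `incident_lead` function, and the roles in the example chain are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Contact:
    """A person in the escalation chain (hypothetical structure)."""
    name: str
    reachable: bool


def incident_lead(chain: Sequence[Contact]) -> Optional[Contact]:
    """Return the first reachable contact in priority order, or None."""
    for contact in chain:
        if contact.reachable:
            return contact
    return None


# Hypothetical chain, listed from highest to lowest priority.
chain = [
    Contact("primary C-suite contact", reachable=False),  # e.g., in transit
    Contact("designated deputy", reachable=True),
    Contact("on-call operations director", reachable=True),
]

lead = incident_lead(chain)
print(lead.name if lead else "no one reachable")
```

In this example the incident passes to the designated deputy. The point of the design is that the decision is made before the outage, so no one has to debate authority while services are down.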
Colocation, hyperscale cloud provider, and internal enterprise clients (yes, they do still exist) have questions when outages occur — what happened, what is the repair time, when will services be restored, and what is being done to make sure this doesn’t happen again?
On the surface, these seem like straightforward questions, but, in practice, they tend to be poorly handled. The frontline responders have a tendency (or they've been instructed) to say nothing — just deny the issue and hope it goes away. Middle management seems to say whatever satisfies the customer at the time even if there's not enough information available to comment.
Indeed, managing outages today is a race to acknowledge the incident before the client demands answers. Think power and cable, for instance. When consumers can't access the services they pay for, they call the provider. No one wants to hear about an areawide outage after 45 minutes of listening to elevator music and being transferred to "the right department" over and over — they want to hear about it as soon as the call starts from an automated recording.
It's hard to believe situations like this actually exist, but they do. In fact, it was just about a year ago when a major cable provider had a partial outage one day at 4 a.m. It took until noon for the operator to acknowledge there was a problem. After that, it took another 10 hours to repair.
While cable is not a mission critical service, the example does show that immediate knowledge of outages is important, even at the frontline staff level. The aforementioned cable provider learned this lesson the hard way after being chastised on social media.
Full data center outages are much more critical, tend to last longer, and often require a C-suite executive (CxO) to be front and center.
So, what does the CxO have to do to handle the situation?
There are seven fundamental elements — tactical and strategic — that need to be addressed.
1. Connectivity to the Field
In a crisis, one cannot wait for reports to flow through management layers. The CxO needs to receive unwhitewashed data directly from the people addressing the cause of the outage. If proximity allows, the CxO should go directly to the scene to gain firsthand knowledge of the situation. If proximity does not allow, then connectivity via security cameras and/or live cellphone video is the next best thing.
This prepares the CxO to discuss the situation and strategically think about securing standby resources that might be needed both for and beyond the immediate situation.
2. Routine Communication
Outside of the facts, this is probably the most important step to be prepared for. With the data center down, the regular means of communication may not be operational, so the service provider needs to be prepared to initially connect with clients through other independent means, such as social media, contact lists (hard copies), and landlines.
In data center environments, a few seconds can seem like a lifetime, so initial communication acknowledging the situation followed by regular updates is essential to keeping clients informed.
Setting up an initial virtual status meeting within two hours and, subsequently, every four hours keeps the clients informed, allowing them to form plans for their own companies.
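The meeting cadence described above is simple enough to compute and publish the moment an outage is declared. The sketch below is one way to do that; it treats "within two hours" as a latest-by deadline for the first meeting, and the function name `status_meeting_times` and the outage start time are assumptions made for illustration.

```python
from datetime import datetime, timedelta


def status_meeting_times(outage_start, count,
                         first_within=timedelta(hours=2),
                         interval=timedelta(hours=4)):
    """Return the latest start times for the first `count` status meetings."""
    first = outage_start + first_within
    return [first + i * interval for i in range(count)]


# Hypothetical outage that begins at 4 a.m.
start = datetime(2024, 1, 15, 4, 0)
for meeting in status_meeting_times(start, 4):
    print(meeting.strftime("%Y-%m-%d %H:%M"))
```

For a 4 a.m. outage, this yields meetings at 06:00, 10:00, 14:00, and 18:00. Sharing the full list in the first acknowledgment tells clients exactly when the next update will arrive.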
3. Assigned Agent
As the number of affected clients grows, the ability of the CxO to address specifics diminishes. Furthermore, clients may have specific concerns they cannot discuss on a group call. Having an assigned person as the primary interface for each client will alleviate much of the tension during the group calls.
4. Clear Statements
There was a time when clients knew extraordinarily little about their IT operations or data centers. Those days are long gone. Clients today are tech savvy and, during a crisis, are well-equipped to challenge most any scenario put forth. The CxO must be clear and understandable when making any status reports/statements. If information is not yet available, say so. Lay out a process to address the situation versus overpromising without all of the facts.
5. Listen and Empathize
Above all, the CxO needs to listen to the pain points and develop actions to alleviate the worst of them whenever possible. Recognizing the stress the outage is placing on client operations is crucial.
6. Specific Commitments
Vague commitments do not work. Saying “We have all available resources working on it” tells clients nothing. It is far better to state specifics even if the specific is not the desired result. For example, if the UPS failed and the service tech is still an hour away, then say that. This would be followed by explaining what is being done to get the raw utility power to the data center floor. Let the client decide if they want to run anything on the unprotected utility power. Give them a choice to restart or wait it out.
7. Lessons Learned Recap
Within 30 days of an outage, data center operators need to hold a recap of lessons learned, both internally with employees and externally with clients.
Give the clients a chance to explain how they were affected and discuss what would have made things better from their perspective.
These meetings should result in a solid action plan to address any shortcomings of the site, procedures, or means of communication when the facility is down.
In the end, this is all about building trust with the client, which will go a long way in retention and referrals.