In previous columns, I’ve discussed the skills shortage, educational institutions that are scaling up curriculums, and the certified mission critical operator (CMCO) program (www.mccerts.com) for validating data center operational skills. If the industry were at a standstill, these traditional learning experiences would be sufficient. Despite the global pandemic and the economic chaos that has followed, the data center industry is not at a standstill. If anything, it is accelerating and challenging the skills of those who must meet the daily operational demands with whatever limited knowledge they have at their disposal.
So, in the data center world of constant change, where does good, honest, and transparent data center operational knowledge come from?
In my early days, I remember a West Coast colocation that lost half of the power in its facility. This went on for over 24 hours with rumors spreading rapidly throughout the industry about the company’s imminent demise.
Why would anybody colocate in such an unreliable facility? Why would existing clients stay?
The market spoke, and, within one month of the incident, the facility went from 55% occupancy to nearly 100% occupancy.
This success story is a credit to the colo’s management and staff, who kept their clients so well-informed about repair status during the incident that those clients became their best sales tool. It was their transparency that not only saved them but also helped them to excel.
When things go wrong in a data center, those affected have no stomach for double talk, deflecting statements, or passing the buck. Clients just want the truth, so they can make good decisions about their own operations. Over the years, I have evaluated many incidents, and the review usually begins with everyone sitting in a room pointing fingers at each other. Eventually we get to the facts, create a “lessons learned package,” and update operational scripts.
The lessons learned and updated scripts improve the situation for the colo staff and their clients, as they all learn something new about their data centers and now have processes in place to address repeat incidents. Unfortunately, that is about as far as the knowledge train goes. Seldom does the information ever get shared beyond the walls of the data center where it occurred, and it may not even get shared with other groups at the same facility. When this happens, only those involved in the incident gain knowledge, while the rest of the industry is left in the dark.
It was once acceptable not to talk about incidents for fear the competition could twist the facts, create false rumors, or use incident knowledge against you. But the industry is maturing. Tens of thousands of data centers now operate around the globe, and their clients are more mobile than in the past. These clients are demanding more transparency in day-to-day operations.
Data center clients have become much savvier themselves and have their own remote monitoring systems with time-stamped reports of how and when their operations were affected by a data center incident. Even without such documentation, there are so many smartphones in and around the data center that chances are an incident will be recorded and posted on social media before the operations group has identified the cause.
The days of keeping an incident hush-hush or sweeping it under the rug are gone — hopefully, forever.
Routine surveys conducted by the Uptime Institute show the number of data center incidents is on the rise. Whether this is because the industry is becoming more open about incidents or because more incidents are occurring can be a matter of debate, but that is not the real issue.
Sure, as an industry, we are becoming more open to acknowledging that we have incidents (everyone has them). Now, we need to take the next step by sharing the facts, responses, and solutions as part of a global database that will look at incident types, frequencies of occurrence, operational impacts, and more. Company name and site location are immaterial; this isn’t about pointing fingers but about creating the opportunity for growth. A global knowledge database allows us to train the next generation of data center operators by learning from the mistakes of the industry’s past.
At the start of this article, I mentioned a West Coast colocation operator. While this operator had a brilliant outcome, whatever happened to all the detailed knowledge gained from the incident? Was it ever written down to be shared with the next generation of operators and design engineers? No. The knowledge resides only with those who were directly involved. When those people leave the industry, their experience and knowledge leave with them.
To address the constant knowledge drain, a nonprofit group, the Data Center Incident Reporting Network (DCIRN, www.dcirn.org), was formed in 2017. You may notice from my bio that I serve as CEO of the Americas region for DCIRN, and I think now is as good a time as any to share a little bit more about what that entails.
The DCIRN mission is to establish and analyze a database of data center incidents, operating anomalies, and service interruptions to identify, categorize, and quantify the causes. These are the basic parameters necessary for the industry’s continuous service delivery improvements. The database allows the industry to learn from its collective experience and becomes the foundation for teaching the next generation of operators.
Gathering the data
The database will be built by everyone on the frontlines of the industry today. Data center operators have the best vantage point to present incidents in their proper context. Whether the incidents are caused by humans, component failures, system hiccups, installation, or design oversights, operators are in the best position to provide the lessons learned and operational changes.
In years past, information would come from vendors, consultants, and trade associations. They continue to have the advantage of market knowledge because they’ve worked on many sites, affording them many experiences of issues and how they were addressed.
There also is a significant amount of knowledge that can be gleaned from those involved in commissioning. The certified commissioning authority (CxA), the CM, prime subs, and the operating team can all contribute invaluable data that, no doubt, can be utilized to minimize future incidents at data centers as well as other sites. This may include everything from absent control wires to inoperable valves to uncalibrated sensors.
The insurance industry, too, is developing processes aimed at minimizing future claims and lowering premiums.
Incidents come in many forms and are not restricted to “data center down.” Minor incidents, like a UPS going into bypass or a chiller tripping offline, are just as important, and DCIRN wants to record them. Don’t forget those near misses or “saves,” which are valuable tools in avoiding downtime. By collecting global data, DCIRN will be able to identify inherent issues in products or designs, just as the data collected on air bags led to product recalls and design changes for the auto industry.
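For readers who think in data structures, here is a rough sketch of what an anonymized incident record and a simple tally by cause might look like. It is not DCIRN's actual schema or tooling. The field names, the cause and severity categories, and the sample reports are all assumptions invented for illustration, drawn loosely from the incident types mentioned in this column.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Cause(Enum):
    # Hypothetical categories, based on the causes named in this column.
    HUMAN = "human"
    COMPONENT_FAILURE = "component failure"
    SYSTEM = "system"
    INSTALLATION = "installation"
    DESIGN = "design oversight"


class Severity(Enum):
    # Hypothetical scale; near misses are recorded alongside real outages.
    NEAR_MISS = "near miss"
    MINOR = "minor"
    OUTAGE = "outage"


@dataclass(frozen=True)
class IncidentReport:
    # Deliberately no company name or site location in the record.
    summary: str
    cause: Cause
    severity: Severity
    lesson_learned: str


def count_by_cause(reports):
    """Return how many reports fall under each cause."""
    return Counter(r.cause for r in reports)


def share_by_cause(reports):
    """Return each cause's fraction of all reports (empty dict if none)."""
    counts = count_by_cause(reports)
    total = sum(counts.values())
    return {cause: n / total for cause, n in counts.items()} if total else {}


# Made-up sample reports, for illustration only.
reports = [
    IncidentReport("UPS went into bypass", Cause.COMPONENT_FAILURE,
                   Severity.MINOR, "Add bypass alarm to the shift checklist"),
    IncidentReport("Chiller tripped offline", Cause.COMPONENT_FAILURE,
                   Severity.MINOR, "Review trip setpoints"),
    IncidentReport("Wrong breaker almost opened during maintenance",
                   Cause.HUMAN, Severity.NEAR_MISS,
                   "Require two-person verification"),
    IncidentReport("Uncalibrated sensor found after handover",
                   Cause.INSTALLATION, Severity.MINOR,
                   "Verify calibration during commissioning"),
]

near_misses = [r for r in reports if r.severity is Severity.NEAR_MISS]

for cause, share in share_by_cause(reports).items():
    print(f"{cause.value}: {share:.0%}")
```

Even a tally this simple shows the idea: once reports from many sites share a common shape, patterns such as a recurring component failure become visible without anyone having to name the site.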
DCIRN awaits the input and support from everyone in the data center industry.