Earlier this year, the Ponemon Institute released its third analysis of the cost of data center outages as part of the Data Center Performance Benchmark Series. Now, with three reports conducted over a six-year period using the same methodology, we can compare cost, causes, and duration of data center downtime events from 2010 to 2016.
The comparison shows that, while progress was made between 2010 and 2013, there are signs that this trend may be reversing. Costs continued to rise in the most recent report (Figure 1), which was expected; however, the duration of downtime events, which declined between 2010 and 2013, rose in 2016 to almost the same levels documented in 2010 (Figure 2) when the industry was still dealing with the fallout from the global recession of 2008.
Also of note: many of the leading causes of downtime identified in 2010 remained leading causes in 2016 (Figure 3). Cybercrime grew significantly over the course of the three studies — accounting for 22% of outages in 2016. Yet other leading causes, such as UPS system failure and human error, did not experience significant declines over the six-year period.
This can be interpreted as good news or bad news. The good news is that data center operators can significantly reduce their risk of downtime by addressing these causes, which are largely preventable. The bad news is that we have been aware of these causes for six years now and have made little progress in reducing them.
The lack of progress may be attributable to the increasing complexity of data center management and the multiple priorities that now compete for data center resources. Where availability was once jobs one, two, and three for data center managers, today they must address concerns over speed of deployment, efficiency, cost management, and productivity while working to ensure uninterrupted availability.
There may also be a perception that effectively addressing these root causes requires significant capital investments. An analysis of the root causes makes clear that this is not the case. Following are five strategies any organization can implement today to minimize its vulnerability to unplanned outages without making major capital investments.
1. Battery Monitoring
The initial Ponemon research on the causes of downtime in 2010 included a survey of 453 individuals responsible for data center operations, which identified UPS battery failure as the leading cause of downtime. Of the 95% of participants that had experienced an outage in the previous two years, a whopping 65% experienced an outage as a result of UPS battery failure. Studies by the Emerson Network Power service business have likewise identified battery failure as the number one cause of outages broadly classified as UPS system failure.
Batteries are the weak link in the critical power system. They have a limited lifespan, which is dictated by the frequency of discharge, but also affected by temperature, charging cycles and other factors. It’s impossible to predict with any certainty the lifespan of a particular battery.
Integrated battery monitoring strengthens this weak link. Battery monitoring systems provide continuous visibility into battery health — including cell voltage, resistance, current, and temperature — without requiring a full discharge and recharge cycle. This allows batteries to be utilized fully while preventing unanticipated failure. These systems also support predictive analysis, which can optimize replacement cycles.
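The per-cell threshold check at the heart of such a system can be sketched in a few lines. The thresholds and data structure below are illustrative assumptions for a VRLA 2V cell, not values from any particular monitoring product; real systems use manufacturer-specific alarm limits:

```python
from dataclasses import dataclass

@dataclass
class CellReading:
    cell_id: str
    voltage: float      # volts
    resistance: float   # milliohms (internal resistance)
    temperature: float  # degrees Celsius

# Illustrative alarm thresholds -- assumptions, not vendor values.
VOLTAGE_MIN = 2.1          # per-cell float voltage floor
RESISTANCE_MAX_PCT = 1.25  # alarm at a 25% rise over the cell's baseline
TEMP_MAX = 30.0            # degrees Celsius

def check_cell(reading: CellReading, baseline_resistance: float) -> list[str]:
    """Return a list of alarm strings for one cell reading."""
    alarms = []
    if reading.voltage < VOLTAGE_MIN:
        alarms.append(f"{reading.cell_id}: low voltage {reading.voltage:.2f} V")
    if reading.resistance > baseline_resistance * RESISTANCE_MAX_PCT:
        rise = reading.resistance / baseline_resistance - 1
        alarms.append(f"{reading.cell_id}: internal resistance up {rise:.0%} over baseline")
    if reading.temperature > TEMP_MAX:
        alarms.append(f"{reading.cell_id}: over-temperature {reading.temperature:.1f} C")
    return alarms
```

Because internal resistance is compared against each cell's own baseline rather than an absolute number, a slowly degrading cell can be flagged for replacement well before it fails under load — the predictive analysis the text describes.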
Data centers dependent on batteries for ride-through power should strongly consider an integrated battery monitoring system to ensure batteries provide the necessary backup power when needed. In our experience, it is the single most important thing an organization can do to prevent UPS system-related downtime.
2. Preventive Maintenance
UPS system failure can also be addressed through a disciplined approach to preventive maintenance. All electronics contain limited-life components that need to be inspected frequently, and serviced and replaced periodically, to prevent catastrophic failures. If a UPS is not serviced properly, the risk of unplanned failure increases.
A study of 5,000 three-phase UPS units with more than 185 million combined operating hours found that the frequency of preventive maintenance visits correlated with an increase in mean time between failure (MTBF) (Figure 4). Preventive maintenance conducted every other month increased MTBF more than 80-fold compared to no preventive maintenance.
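To put an 80-fold MTBF improvement in concrete terms, here is a back-of-the-envelope calculation. The baseline MTBF and fleet size below are made-up illustrative figures, not numbers from the study, and it assumes a constant failure rate:

```python
HOURS_PER_YEAR = 8760

def expected_failures_per_year(mtbf_hours: float) -> float:
    """Expected failure count per unit-year, assuming a constant failure rate."""
    return HOURS_PER_YEAR / mtbf_hours

baseline_mtbf = 50_000               # hypothetical MTBF with no preventive maintenance
improved_mtbf = baseline_mtbf * 80   # bimonthly PM, per the 80-fold figure

fleet = 100  # hypothetical number of UPS units
print(f"No PM:        {fleet * expected_failures_per_year(baseline_mtbf):.1f} failures/year")
print(f"Bimonthly PM: {fleet * expected_failures_per_year(improved_mtbf):.2f} failures/year")
```

Under these assumptions, a 100-unit fleet goes from roughly 17 expected failures a year to well under one — which is why even a modest maintenance budget tends to pay for itself against the outage costs in Figure 1.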
This isn’t to suggest that every UPS should have six preventive maintenance visits annually. That typically isn’t cost-effective. Most organizations can optimize their maintenance investment through two preventive maintenance visits annually.
Preventive maintenance is a common target when budget cuts are mandated, but it is important to recognize that these cuts carry a cost in the form of increased risk. As the 2016 Cost of Data Center Outages report documents, the cost of downtime is growing, and savings gained by cutting preventive maintenance could result in a large, unanticipated expense.
3. Policies and Procedures
The publication of the first Ponemon study, along with other industry educational efforts, increased awareness of the vulnerability of unshielded, unlabeled, or poorly positioned emergency power off (EPO) buttons. That's the low-hanging fruit in the Human Error category and an issue most organizations should have addressed by now.
Yet, human error continues to account for more than one in five outages. Clearly, minimizing human error isn’t as simple as shielding a button. It requires well-documented procedures, consistent training and regular practice.
One of the challenges we often face when working with a customer on a power system upgrade is that the one-line diagram no longer reflects the current state of the data center, which has evolved since the original one-line was created. It’s essential to have a clear, up-to-date picture of what’s in the data center and how it is configured to respond efficiently to an outage.
Equally important is documenting the tasks required to respond effectively to outages and establishing a schedule to practice for outage events. Two best-practice options: schedule regular “pull-the-plug” tests to ensure people and equipment react appropriately during an event, or schedule less extreme simulations, such as automated battery tests.
The key is to balance the level of risk you are willing to absorb against the need to accurately simulate real-world conditions, while performing these tests frequently enough that personnel become comfortable acting under the pressure of an outage.
4. Enhanced Thermal Management
Thermal and water-related issues showed little improvement between 2013 and 2016, accounting for 12% of outages in 2013 and 11% in 2016.
One factor is likely the same preventive maintenance issue noted as a contributor to UPS system failure. When precision cooling units aren’t subject to regular maintenance, mechanical components will eventually wear to the point of failure. If the unit is not being remotely monitored, that failure may not be noticed until increased temperatures begin to affect server operation.
In addition to preventive maintenance, another solution to thermal challenges is the use of intelligent thermal control systems. These controls enable machine-to-machine communication so thermal units across a facility can work as a team. They automate cooling system operational routines, such as temperature and airflow management, valve auto-tuning, lead/lag, and other factors that enhance overall system performance. In addition, they provide centralized visibility into unit operation that can be used to guide maintenance and help ensure any failure doesn’t affect IT systems.
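The lead/lag behavior mentioned above can be illustrated with a toy rotation scheme: one unit leads, the rest stand by, the lead role rotates to even out runtime wear, and a standby unit is promoted on failure. This is a simplified sketch of the concept only; real thermal controls coordinate over building-management protocols and also manage temperature, airflow, and valve tuning:

```python
from collections import deque

class LeadLagController:
    """Toy lead/lag rotation for a group of cooling units.

    The unit at the front of the ring is the lead; all others lag (standby).
    """

    def __init__(self, unit_ids: list[str]):
        self._ring = deque(unit_ids)

    @property
    def lead(self) -> str:
        return self._ring[0]

    @property
    def lag(self) -> list[str]:
        return list(self._ring)[1:]

    def rotate(self) -> str:
        """Advance the lead role to the next unit (scheduled wear leveling)."""
        self._ring.rotate(-1)
        return self._ring[0]

    def failover(self, failed_unit: str) -> str:
        """Drop a failed unit from rotation and promote the next unit to lead."""
        self._ring.remove(failed_unit)
        return self._ring[0]
```

The centralized visibility the text describes is what makes `failover` possible: the control system knows a unit has failed and can promote a standby before room temperatures drift.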
When chilled water is used for heat removal, a leak detection system should also be employed. These systems use sensors installed at critical points throughout the data center to detect potentially hazardous moisture levels and trigger alarms.
5. Centralizing Infrastructure Management
Data center infrastructure management (DCIM) is the final piece of the availability puzzle. DCIM vendors have made real progress in making DCIM easier to deploy and use, and it has become a valuable tool for organizations seeking to maximize availability.
Two capabilities in particular can help prevent downtime. The first is the ability to consolidate monitoring data across all systems to highlight potential infrastructure issues before they impact operations. The second is the ability to better understand the interdependencies between data center systems. This is especially important as data center capacity management becomes more dynamic. As loads are shifted to available resources, it's critical to know whether the infrastructure supporting those resources has the capacity to support the new load, to prevent problems such as exceeding UPS capacity or creating hot spots that can damage equipment.
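The capacity check behind that kind of load shift can be reduced to a simple guard. The function and the 90% headroom figure below are illustrative assumptions, not a description of any DCIM product's logic:

```python
def can_accept_load(current_load_kw: float, capacity_kw: float,
                    incoming_kw: float, headroom: float = 0.9) -> bool:
    """Check whether a UPS (or rack circuit) can absorb a shifted load
    while staying under a safety headroom (90% of rated capacity by default).
    """
    return current_load_kw + incoming_kw <= capacity_kw * headroom
```

For example, shifting a hypothetical 30 kW workload onto a 200 kW UPS already carrying 150 kW lands exactly at the 180 kW headroom limit and is allowed, while a 40 kW workload would be rejected. The point is that the check needs consolidated, current data — rated capacity, live load, and the interdependency between the workload and the UPS feeding it — which is exactly what centralized DCIM monitoring provides.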
While DCIM can impact many aspects of data center operations, for many organizations its primary benefit is the visibility it provides into operating conditions across systems and the role that visibility can play in preventing downtime.
The job of managing a data center is increasingly complex and resources are always limited. Yet, businesses are more dependent on their data centers than ever and the cost of downtime continues to rise, with costs for some facilities exceeding $2 million per incident. Many of the causes of downtime are preventable through easily accessible systems, such as battery monitoring and thermal controls, and improved policies and procedures in the areas of maintenance and preparation.
I’m fairly confident that when Ponemon conducts a fourth study in 2019, costs will be higher than they are today. But I’m also hopeful that the frequency and duration of outages will be much lower. We have the tools and knowledge to make that possible. We just have to put them into practice.