In May of this year, the Uptime Institute unveiled its Outage Severity Rating (OSR) system, which, according to the organization, aims “to help the digital infrastructure and data center community better understand and articulate service outages in the context of how each incident affects the business.”
Chris Ludeman, director, business development for the Uptime Institute, presented a session on the OSR system as well as the findings of the 9th annual Global Data Center Survey at the 2019 Spring 7x24 Conference held June 2-5, 2019, at the Boca Raton Resort & Club, Boca Raton, FL.
The presentation, titled “Latest Data Center Outage Trends, Causes and Costs,” was a deep dive into the OSR, which Ludeman described as a rating system similar to the hurricane rating system. It ranges from Level 1, a negligible outage with little to no service disruption, to Level 5, which is categorized as a business/mission critical outage causing major and damaging disruption of services and/or operations.
According to Ludeman, the rating system was developed to quantify data center outages and as a response to Uptime’s research, which draws on annual survey results from its members, reporting from industry trade publications, and Uptime’s Abnormal Incident Reports (AIRs) database. In 2018, the annual survey showed that 31% of respondents had an IT downtime incident or severe degradation in the past year, 48% had an outage at their own site or a service provider’s in the past three years, and 80% reported that their most recent outage was preventable. Of these, the top three causes were on-premises data center power failure (31%), network failure (30%), and software/IT systems failure (28%).
Ludeman said that the proportion of serious outages has fallen in the past three years. From 2016 to 2018, the percentage of Level 5 outages dropped from 11% to 4%. Uptime also asked about the cost of outages. Fifty percent cost under $100,000 and 18% cost between $100,000 and $250,000. The most severe outages accounted for the smallest shares: $5 million to $10 million (1%), $10 million to $20 million (2%), and over $20 million (1%).
Ludeman also said that the cause of reported outages is changing. Whereas in the past power failures shouldered the blame, in 2018 only 11% of outages were caused by power vs. 28% in 2017. In 2018, IT systems and network failures caused 35% and 32% of outages, respectively, up from 32% and 19% in 2017. The survey found that IT and network outages were mainly caused by poorly managed upgrades, incorrect programming, failure to back up systems, and data corruption due to configuration and programming errors causing hardware failure. By contrast, Ludeman said, power failures are due to weather events such as lightning strikes, operator and utility failure, UPS failures, and damage due to power surges.
Ludeman also discussed outages by industry sector, stating that cloud/internet giants were not immune to failure: the survey found that 20% of total outages came from that sector, with SaaS, telecom/network services, and other accounting for 12% of industry sector outages.
What do these survey results mean? Ludeman said the main takeaways are that outages continue to be expensive and that two things need to be addressed: management shortcomings in visibility and accountability, and the lack of a better understanding of the complex failures that are distributed across silos.
Visit the Uptime Institute for survey results.