Mission Critical

With more than 30 years of experience in embedded controls and power, Peter Panfil leads global power sales for Emerson Network Power’s Liebert AC Power business.

Re-Evaluating Data Center Availability In 2016

Five strategies for improving the reliability of your critical infrastructure.

Figure 1. Costs for a full data center outage. Costs have continued to rise with the largest increases being seen in maximum costs, which now exceed $2 million per incident.
Figure 2. The duration of unplanned outages decreased from 2010 to 2013, but rose between 2013 and 2016.
Figure 3. Several of the primary causes of downtime identified in 2010 continued as leading causes in 2016.
Figure 4. Two preventive maintenance visits per year is a practical tool to enhance operations without endangering data center operations.
July 18, 2016
Peter Panfil
KEYWORDS data center infrastructure management / data center power / data centers

Earlier this year, the Ponemon Institute released its third analysis of the cost of data center outages as part of the Data Center Performance Benchmark Series. Now, with three reports conducted over a six-year period using the same methodology, we can compare cost, causes, and duration of data center downtime events from 2010 to 2016.

The comparison shows that, while progress was made between 2010 and 2013, there are signs that this trend may be reversing. Costs continued to rise in the most recent report (Figure 1), which was expected; however, the duration of downtime events, which declined between 2010 and 2013, rose in 2016 to almost the same levels documented in 2010 (Figure 2), when the industry was still dealing with the fallout from the global recession of 2008.

Also of note is the fact that many of the leading causes of downtime identified in 2010 remained as leading causes in 2016 (Figure 3). Cybercrime grew significantly over the course of the three studies — accounting for 22% of outages in 2016. Yet, other leading causes, such as UPS system failure and human error, did not experience significant declines throughout the six-year period.

This can be interpreted as good news or bad news. The good news is that data center operators can significantly reduce their risk of downtime by addressing these causes, which are largely preventable. The bad news is that we have been aware of these causes for six years now and have made little progress in reducing them.

The lack of progress may be attributable to the increasing complexity of data center management and the multiple priorities that now compete for data center resources. Where availability was once job one, two, and three for data center managers, today they must address concerns over speed-of-deployment, efficiency, cost management, and productivity while working to ensure uninterrupted availability.

There may also be a perception that effectively addressing these root causes requires significant capital investments. An analysis of the root causes makes clear that this is not the case. Following are five strategies any organization can implement today to minimize its vulnerability to unplanned outages without making major capital investments.

1. Battery Monitoring

The initial Ponemon research on the causes of downtime in 2010 included a survey of 453 individuals responsible for data center operations that identified the leading cause of downtime as UPS battery failure. Of the 95% of participants that experienced an outage in the previous two years, a whopping 65% experienced an outage as a result of UPS battery failure. Studies by the Emerson Network Power service business have also identified battery failure as the number one cause of outages broadly classified as UPS system failure.

Batteries are the weak link in the critical power system. They have a limited lifespan, which is dictated by the frequency of discharge, but also affected by temperature, charging cycles and other factors. It’s impossible to predict with any certainty the lifespan of a particular battery.

Integrated battery monitoring strengthens this weak link. Battery monitoring systems provide continuous visibility into battery health — including cell voltage, resistance, current, and temperature — without requiring a full discharge and recharge cycle. This allows batteries to be utilized fully while preventing unanticipated failure. These systems also support predictive analysis, which can optimize replacement cycles.
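As a rough illustration of the kind of check a battery monitoring system performs, the Python sketch below flags cells whose readings drift out of range. It is a minimal sketch, not any vendor's implementation: the `CellReading` fields, the function names, and every threshold are assumptions chosen for illustration, and real limits should come from the battery manufacturer.

```python
from dataclasses import dataclass
from typing import Dict, List

# All thresholds are illustrative assumptions, not vendor or standards values.
RESISTANCE_RISE_LIMIT = 0.25   # flag a cell whose resistance is 25% above baseline
MIN_FLOAT_VOLTS = 2.20         # hypothetical per-cell float-voltage window
MAX_FLOAT_VOLTS = 2.30
MAX_TEMP_C = 30.0              # hypothetical upper temperature limit


@dataclass
class CellReading:
    cell_id: str
    volts: float
    resistance_mohm: float
    baseline_mohm: float   # resistance recorded at commissioning (must be > 0)
    temp_c: float


def cell_warnings(r: CellReading) -> List[str]:
    """Return warnings for one cell; an empty list means nothing was flagged."""
    warnings = []
    rise = (r.resistance_mohm - r.baseline_mohm) / r.baseline_mohm
    if rise > RESISTANCE_RISE_LIMIT:
        warnings.append(f"resistance up {rise:.0%} over baseline")
    if not MIN_FLOAT_VOLTS <= r.volts <= MAX_FLOAT_VOLTS:
        warnings.append(f"voltage {r.volts:.2f} V outside float window")
    if r.temp_c > MAX_TEMP_C:
        warnings.append(f"temperature {r.temp_c:.1f} C above limit")
    return warnings


def string_report(readings: List[CellReading]) -> Dict[str, List[str]]:
    """Map cell_id to its warnings for every cell with at least one flag."""
    report = {}
    for r in readings:
        found = cell_warnings(r)
        if found:
            report[r.cell_id] = found
    return report
```

The point of the sketch is that these checks run on continuous measurements, so a weakening cell can be caught and replaced without discharging the string to test it.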

Data centers dependent on batteries for ride-through power should strongly consider an integrated battery monitoring system to ensure batteries provide the necessary backup power when needed. In our experience, it is the single most important thing an organization can do to prevent UPS system-related downtime.

2. Preventive Maintenance

UPS system failure can also be addressed through a disciplined approach to preventive maintenance. All electronics contain limited-life components that need to be inspected frequently, and serviced and replaced periodically, to prevent catastrophic failures. If not serviced properly, the risk of unplanned UPS failure increases.

A study of 5,000 three-phase UPS units with more than 185 million combined operating hours found that the frequency of preventive maintenance visits correlated with an increase in mean time between failures (MTBF) (Figure 4). Preventive maintenance conducted every other month increased MTBF more than 80-fold compared to no preventive maintenance.
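For readers who want to run the same comparison on their own fleet, MTBF is conventionally calculated as total operating hours divided by the number of failures observed. The short Python sketch below shows that arithmetic. The function names and the fleet numbers are hypothetical and invented for illustration only; they are not taken from the study.

```python
def mtbf_hours(operating_hours: float, failures: int) -> float:
    """Mean time between failures: total operating hours / failure count."""
    if failures <= 0:
        raise ValueError("MTBF is undefined without at least one observed failure")
    return operating_hours / failures


def improvement_factor(mtbf_maintained: float, mtbf_unmaintained: float) -> float:
    """How many times larger the maintained group's MTBF is."""
    return mtbf_maintained / mtbf_unmaintained


# Hypothetical fleet data, invented for illustration (not from the study).
unmaintained = mtbf_hours(operating_hours=1_000_000, failures=50)
maintained = mtbf_hours(operating_hours=1_000_000, failures=2)
factor = improvement_factor(maintained, unmaintained)
print(f"{unmaintained:,.0f} h vs {maintained:,.0f} h, a {factor:.0f}-fold difference")
```

Tracking this ratio per maintenance tier is one way to judge where additional visits stop paying for themselves.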

This isn’t to suggest that every UPS should have six preventive maintenance visits annually. That typically isn’t cost-effective. Most organizations can optimize their maintenance investment through two preventive maintenance visits annually.

Preventive maintenance is a common target when budget cuts are mandated, but it is important to recognize that these cuts carry a cost in the form of increased risk. As the 2016 Cost of Data Center Outages report documents, the cost of downtime is growing, and the savings from cutting preventive maintenance could be wiped out by a large, unanticipated expense.

3. Policies and Procedures

The publication of the first Ponemon study, along with other industry educational efforts, increased awareness of the vulnerability of unshielded, unlabeled, or poorly positioned emergency power off (EPO) buttons. That’s the low-hanging fruit in the human error category and an issue most organizations should have addressed by now.

Yet, human error continues to account for more than one in five outages. Clearly, minimizing human error isn’t as simple as shielding a button. It requires well-documented procedures, consistent training and regular practice.

One of the challenges we often face when working with a customer on a power system upgrade is that the one-line diagram no longer reflects the current state of the data center, which has evolved since the original one-line was created. It’s essential to have a clear, up-to-date picture of what’s in the data center and how it is configured to respond efficiently to an outage.

Equally important is documenting the tasks required to respond effectively to outages and establishing a schedule to practice for outage events. Two best practice options: schedule regular “pull-the-plug” tests to ensure people and equipment react appropriately during an event; or schedule less extreme simulations, such as automated battery tests.

The key is to balance the level of risk you are willing to absorb with the need to accurately simulate real-world conditions, and to perform these tests frequently enough that personnel become comfortable acting under the pressure of an outage.

4. Enhanced Thermal Management

Thermal and water-related issues showed little improvement between 2013 and 2016, accounting for 12% of outages in 2013 and 11% in 2016.

One factor is likely the same preventive maintenance issue noted as a contributor to UPS system failure. When precision cooling units aren’t subject to regular maintenance, mechanical components will eventually wear to the point of failure. If the unit is not being remotely monitored, that failure may not be noticed until increased temperatures begin to affect server operation.

In addition to preventive maintenance, another solution to thermal challenges is the use of intelligent thermal control systems. These controls enable machine-to-machine communication so thermal units across a facility can work as a team. They automate cooling system operational routines, such as temperature and airflow management, valve auto-tuning, lead/lag, and other factors that enhance overall system performance. In addition, they provide centralized visibility into unit operation that can be used to guide maintenance and help ensure any failure doesn’t affect IT systems.

When chilled water is used for heat removal, a leak detection system should also be employed. These systems use sensors installed at critical points throughout the data center to detect potentially hazardous moisture levels and trigger alarms.

5. Centralizing Infrastructure Management

Data center infrastructure management (DCIM) is the final piece of the availability puzzle. DCIM vendors have made real progress in making DCIM easier to deploy and use, and it has become a valuable tool for organizations seeking to maximize availability.

Two capabilities, in particular, can help prevent downtime. The first is the ability to consolidate monitoring data across all systems to highlight potential infrastructure issues before they impact operations. The second is the ability to better understand the interdependencies between data center systems. This is especially important as data center capacity management becomes more dynamic. As loads are shifted to available resources, it’s critical to know whether the infrastructure supporting those resources has the capacity to support the new load, to prevent problems such as exceeding UPS capacity or creating hot spots that can damage equipment.
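The capacity question raised here can be pictured as a simple pre-move check. The Python sketch below is a minimal illustration under stated assumptions, not a description of any DCIM product: the `Ups` record, the function names, and the 80% utilization ceiling are all hypothetical, and a real check would also cover cooling capacity at the destination.

```python
from dataclasses import dataclass


@dataclass
class Ups:
    name: str
    rated_kw: float   # nameplate capacity
    load_kw: float    # load currently carried


# Assumed policy: keep each UPS at or below 80% of rated capacity.
# The 80% figure is an illustrative choice, not a value from the article.
MAX_UTILIZATION = 0.80


def can_accept(ups: Ups, added_kw: float, limit: float = MAX_UTILIZATION) -> bool:
    """True if moving added_kw onto this UPS keeps it within the limit."""
    return ups.load_kw + added_kw <= ups.rated_kw * limit


def headroom_kw(ups: Ups, limit: float = MAX_UTILIZATION) -> float:
    """Load in kW that can still be added before the limit (never negative)."""
    return max(0.0, ups.rated_kw * limit - ups.load_kw)
```

For example, a hypothetical 500 kW UPS carrying 350 kW has about 50 kW of headroom under this policy, so `can_accept` would allow a 40 kW move and reject a 60 kW move.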

While DCIM can impact many aspects of data center operations, for many organizations its primary benefit is the visibility into operating conditions across systems and the role that visibility can play in preventing downtime.

The job of managing a data center is increasingly complex and resources are always limited. Yet, businesses are more dependent on their data centers than ever and the cost of downtime continues to rise, with costs for some facilities exceeding $2 million per incident. Many of the causes of downtime are preventable through easily accessible systems, such as battery monitoring and thermal controls, and improved policies and procedures in the areas of maintenance and preparation.

I’m fairly confident that when Ponemon conducts a fourth study in 2019, costs will be higher than they are today. But I’m also hopeful that the frequency and duration of outages will be much lower. We have the tools and knowledge to make that possible. We just have to put them into practice.

Copyright ©2019. All Rights Reserved BNP Media.
