Figure 1. Leading causes of downtime in data centers


 

A recurring criticism of the Uptime Institute’s Tier Classification System was that it did not do enough. There are aspects of a data center operation, critical to ongoing uptime, that the four tiers are silent upon.

When developed in the 1990s, the tiers filled a specific need. “At that time, there was no grading criteria or performance standard for data center,” explains Pitt Turner, Uptime Institute executive director and Tier Classification System co-author. “The tiers started with the request of an owner in growth mode through M&A. When the executives walked through a new data center, they were sure they had a ‘great’ new data center. Tiers were developed to fill this need: to help educate executives that just because a computer room has clean raised floor does not mean it meets the business needs. Tiers were created to establish a simple and clear way to differentiate the increasing functionality of data center site infrastructure. The tier standard is now the international reference for data center performance” (see figure 1).

On July 1, 2010, the Uptime Institute substantially expanded tiers to address additional fundamentals of data center uptime. This expansion was necessary because the tiers guide the design, construction, and commissioning of a facility, but those are only the beginning of developing a successful data center: the beginning of an unrelenting endeavor from day one of operations to obsolescence of the data center.

Operational sustainability is a language and rating system that complements the four tiers with a focus on the operation of the data center infrastructure. “Poor operations are more important than the tier of the design because they can defeat even a fault tolerant infrastructure. Operations are the showstopper at any data center,” states Vince Renaud, co-author of both the tier and operational sustainability standards. Before elaborating on the development and focus of operational sustainability, it is important to reaffirm the intent and use of the Tier Classification System.

The greatest vulnerability of the tiers is to strip them of their business case. The ‘work’ of the tiers is to align an organization’s tolerance (or intolerance) for downtime with its infrastructure solution. Divorced from the business case, the tiers are subjective. Misapplication of the tiers as an adjectival rating system (good-better-best) is inconsistent with their intent. Misuse of the tiers results in a design solution that is temporary or meaningless in terms of business value. In other words, knives make short-lived screwdrivers.
 

Table 1. Comparison of tiers as defined by the Uptime Institute's Tier Classification System

 

Used effectively, the tiers are protection against penalty and pain. The cost of downtime, in both real dollars and incalculable adverse effects, drives infrastructure investment. Selection of a response tier results from diligent investigation and assessment of the owner’s reliance on its data center to remain profitable.

Calculating the cost of downtime is crucial to the use of the tiers and is sometimes shortchanged even by the largest IT-centric organizations. This is a considerable effort that requires cash, staff, and months to complete. The cost of downtime is the linchpin of the justification for a data center and its performance level. Calculating this cost involves investigation into the IT needs of the various data center clients, whether internal or external. Then, the clients must be brought to consensus on the penalties that the organization will endure for planned or unplanned outages. The anticipated hurt will be both tangible and intangible: lost revenue, fines, sagging share value, adverse market perception, and increased regulatory attention.
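The tangible part of that calculation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the client names, dollar figures, and the `ClientPenalty` and `tangible_outage_cost` names are hypothetical and do not come from the Uptime Institute, and intangible effects such as market perception are deliberately left out because they resist a simple sum.

```python
from dataclasses import dataclass


@dataclass
class ClientPenalty:
    """Agreed tangible penalties for one data center client (all values hypothetical)."""
    name: str
    lost_revenue_per_hour: float  # revenue lost for each hour of outage
    fine_per_outage: float        # contractual or regulatory fine per outage event


def tangible_outage_cost(clients, outage_hours, outage_count=1):
    """Sum the tangible penalties across all clients for a given amount of downtime."""
    return sum(
        c.lost_revenue_per_hour * outage_hours + c.fine_per_outage * outage_count
        for c in clients
    )


# Hypothetical figures, standing in for the consensus reached with each client.
clients = [
    ClientPenalty("order processing", 50_000.0, 10_000.0),
    ClientPenalty("internal email", 2_000.0, 0.0),
]
cost = tangible_outage_cost(clients, outage_hours=4)
print(f"Tangible cost of a 4-hour outage: ${cost:,.0f}")
```

A per-client breakdown like this keeps the consensus step visible: each client's agreed penalty is a separate, reviewable input rather than a single blended estimate.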

Ignoring these tasks can lead to a tier decision based on subjective criteria. For example, the Uptime Institute regularly encounters data center projects that are justified in terms of pride of ownership (e.g., world class, gold-plated). If the infrastructure investment is based on the ‘best’ outside the context of business requirements, the project will be cancelled or shrunk when subjected to a rigorous analysis by its internal or external oversight. This holds true for enterprise or colocation data centers.
 

Table 2. Attributes of tiers defined for operational stability

 

Using the tiers effectively also involves understanding the focus and limitations of that system. For example, the Tier Classification System rates the infrastructure at each site. Organizations may operate a centralized high-functionality data center or a number of lower-functionality sites. In the latter scenario, the redundancy is in the IT topology rather than the site-level infrastructure. However, site redundancy does not yield aggregate tier ratings (Tier II + Tier II ≠ Tier IV). The reason is that, regardless of IT redundancy, each site is fundamentally constrained in terms of maintenance opportunities and fault response.

The Tier Classification System does not provide any assurances as to the effectiveness with which the infrastructure is located or operated. These decisions can be broken into three elements: site, building, and human factors.

The operation of the site is more impactful than the tier of the design. The recurring and imminent nature of operations issues is best illustrated by real data on failures reported to the Uptime Institute’s Abnormal Incident Report database in 2010. On average, the sites reporting to the Uptime Institute database have 60,000 sq ft of computer room and a 24 x 7 performance objective. A sample of the 2010 failures is provided, along with a summary of lessons learned.

  • A UPS system was EPO’d due to human error.
      • Lesson learned: Proper scripting and work procedures would have prevented this outage.
  • Normal maintenance transfer of a single UPS module was thwarted by a selector switch that hung up and went unnoticed.
      • Lesson learned: Had the position of the switch been noted during troubleshooting when the UPS first gave “Not OK to Transfer,” additional troubleshooting efforts would have ensued and the outage would have been avoided.
  • A single smoke detector alarmed, resulting in EPO of HVAC and critical power to the floor. The cause was traced to improper fire panel sequencing that had not been fully checked during commissioning.
      • Lesson learned: A fully detailed sequence of operations for the fire suppression system, thorough commissioning, and comprehensive training of the site staff in EPO abort procedures would have prevented this outage.
  • A vendor closed a breaker assumed to have been previously opened for the work effort. Failure analysis revealed that the breaker had been opened by another vendor a year prior while working on the EPO system.
      • Lesson learned: Had proper work scripting and work procedures been followed the previous year, and closed out, this outage would have been avoided.


Figure 2. A junk pile on a raised floor suggests that the owner of even a well-designed system might be running unnecessary risks

 

Rick Schuknecht, vice president of operational sustainability, analyzed 15 years of operations-related failures, revealing ‘alarming’ conditions.

“Trending from that database gives us a glimpse of the ‘State of the Union’ as it relates to data center operations. The call-to-action to management and site operations team is clear: a) 70 percent of the reported failures are directly attributable to human error; b) manufacturing (infrastructure equipment) and operational issues combined to form more than 80 percent of all event reports submitted; and c) process failures accounted for 50 percent of all operational failures reported. These trends were also a call-to-action for the Uptime Institute as a standards-issuing body,” Schuknecht said. (See figure 2.)

There are other components of a site operation that represent latent, rather than catastrophic, risks to sustained operations. The lack of a rigorous maintenance plan will result in deferred maintenance, which reduces the reliability of the equipment and extends the duration of an outage. In the worst case, poor housekeeping procedures will result in a fire due to the introduction of combustibles. However, is a lesser condition acceptable, such as unstable IT performance and voided server warranties due to contamination?

The root cause of these failures is not just human error, but management error, as Ken Brill, founder of the Uptime Institute, explains: “Seventy percent of all failures are associated with human activity. Two thirds of the 70 percent is what I now call management error: things management has allowed to happen which are now well known to have risk of catastrophic consequences (see figures 3 and 4). It is clear to me that capital investment in engine generators, uninterruptible power systems, and other physical ‘things,’ is insufficient to ensure information uptime. In fact, sites with good and well motivated facility staffs have consistently out-performed those with all the latest infrastructure equipment but who have poor leadership, inadequate training or staffing, or fail to use procedures when performing critical work.”
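As a quick arithmetic check on the figures Brill quotes, two thirds of 70 percent works out to a little under half of all failures. The short Python sketch below makes that explicit; the variable names are illustrative only, and the split assumes the two quoted fractions can simply be multiplied.

```python
from fractions import Fraction

# Figures quoted by Ken Brill above.
human_share = Fraction(70, 100)       # failures associated with human activity
management_of_human = Fraction(2, 3)  # portion of those he calls management error

# Share of ALL failures in each group (assumes the fractions multiply directly).
management_share = human_share * management_of_human
other_human_share = human_share - management_share

print(f"Management error: {float(management_share):.1%} of all failures")
print(f"Other human activity: {float(other_human_share):.1%} of all failures")
```

On these figures, management error alone accounts for roughly twice as many failures as all other human activity combined.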

As of 2009, there was no industry-recognized method to tie the robustness of the site management program to data center performance. There was no language to impress upon management that site operations and uptime were inextricably joined. The Uptime Institute responded in 2009 by launching the development of a new standard that would go beyond the infrastructure and directly address the human factors, shell, and property of the data center.

The Uptime Institute development team was made up of data center professionals with previous hands-on operations experience. This was critical to the core objectives: a) produce an owners’ standard; b) define the characteristics of site management, globally, without constricting unique, proprietary, or progressive site management methodologies; and c) avoid conflict with the purview of authorities having jurisdiction (AHJs), local codes, regulations, or corporate groups. For example, the Operational Sustainability standard is silent upon safety as this is the purview of the Corporate Safety group.

The Uptime Institute’s development team knew firsthand the territorialism that comes with the intensive demands on a data center operation. An intrusive standard that forced a single way of doing things would be resisted. On the other hand, a narrow standard would be disregarded. The appropriate balance was met in a holistic standard that did not force uniformity. The solution was behaviors rather than requirements and a rating system that weighed effectiveness rather than pass/fail.
 

 

The concept was named operational sustainability, a term signifying sustained operations over the long term. Operational sustainability ‘success’ is measured in terms of operations, maintenance, and risk identification and avoidance. Operational sustainability is founded on three principles:

  • Proactive
      • Focused on anticipating the needs of the data center, rather than reacting to them
      • Dedicated to continuous improvement
  • Practiced
      • Disciplined approach to data center management
      • Exhaustive and exercised processes and procedures
  • Informed
      • Free flow of information, including site ‘wisdom’
      • Management decisions are all made from a position of knowledge

If these principles are followed, operational sustainability becomes an organizational phenomenon that redefines business as usual. Activities and actions that are inconsistent with sustaining operations are not tolerated. Among submarine operators, ‘hot shot’ firefighters, and flight crews, this communal, unfailing, fierce commitment to a common objective is known as a high reliability organization. Each member of the group is aware of his own and collaborative tasks to achieve the common objective. In the case of operational sustainability, the common objective is uptime.

From the outset, it was decided that the standard would be multi-functional. It would both define the aspects of site management and rate them in terms of influence on performance. Prioritization allows data center management teams to address the aspects that are most likely to result in an outage. This keeps anecdotes from misdirecting resources. Additionally, the standard would form the basis of a ‘triage’ plan that could be presented to management to ensure attention to the imminent issues.

The prioritization had to have business focus to speak to upper management’s prerogatives, and ‘real-life’ grit to attain credence with the on-the-ground resources. The business focus was attained by mapping to the tiers. The real-life credibility was provided by its basis in the experience of the data center operators.
 

Figure 3. Underfloor debris suggests a lack of proper procedure in the data center that could threaten its reliability

Operational Sustainability

Analysis of the Uptime Institute’s Abnormal Incident Reports database informed the structure and the prioritization of the Standard. (The database is currently 4,500 events strong, including 400 outages, and is the only one of its kind in the world.) This data justified the three elements of operational sustainability: Management & Operations, Building Characteristics, and Site Location. The database was then used to slice each element into categories, components, and behaviors. A summary of the three elements follows:

  • Management & Operations. The most impactful element. The bad news is that the greatest risk to a data center is the staff there to ensure its performance. The good news is that, of the three elements, it is the most changeable. Once the infrastructure is installed, the site location and building characteristics are locked in; modifications are possible, but constrained and compromised in execution. However, the operations program can always be overhauled. Management & Operations breaks down into the categories of staffing plan, training, processes, and procedures. In short, all those opportunities for staff to come into contact with the infrastructure.
     
  • Building Characteristics. Similar to the site itself, the type of building housing the data center can be the result of pressures beyond uptime. This element addresses whether the building is single or multi-story, single or multi-tenant, repurposed, etc. Additionally, security and access are weighed. This element also addresses enhancements to the design solution beyond the Tier objective, for example, a fault-tolerant UPS in an otherwise Tier III facility.
     
  • Site Location. It was important to posit that the property was not fated or unavoidable, but the result of management decisions. If cost or convenience drives selection, the primary purpose of the data center could be exposed. This element addresses the natural and human risks associated with the property. The ideal data center site selection is driven by a weighting of the associated risks. However, other drivers, such as available land or inclusion in a larger project or campus plan, can lead to the selection of a site better suited for office use than for a building that must operate continuously. Accordingly, operational sustainability weighs both the risks and the mitigation measures in place.


 

First and foremost, “Tier Standard: Operational Sustainability” is meaningful for data center owners and operators. It is a common language for data centers that is linked to performance levels. It is also a means to establish the behaviors for staffing, training, maintenance, and other risk mitigation measures, and to justify these resources in terms of the performance objective rather than vague and subjective ‘best practices’ advice.

Additionally, attention to operational sustainability does not come at the cost of other data center initiatives. “I am concerned that current industry emphasis on energy efficiency is causing some sites to lose focus on their basic mission which is uptime,” says Ken Brill. “I believe the operational sustainability process ideally balances reliability with the many actions required to also produce energy efficiency.”

“Tier Standard: Operational Sustainability” is released at no cost and available for download at the Uptime Institute’s website at all times. Changes to the standard will be effected through the established deliberation and voting process of the owners’ advisory committee. This committee of owners and operators will ensure that “Tier Standard: Operational Sustainability” meets the current and evolving requirements of those responsible for sustained operations.