Defining Data Center Availability
It is not just about four levels of prescribed redundancy.
Happy 2018. If you did not hear the terms container, cloud, colocation/MTDC, hybrid, edge, modular, and OCP applied to data centers enough times last year, it is time to update your subscription to Mission Critical Magazine. While each has its special flavor and primary purpose, they are all variations on the basic function of a “data center”: a secure facility that provides conditioned, uninterrupted power and proper environmental operating conditions for IT equipment.
In my last column, “The Age Of The Megawatt Minute,” I pondered whether it was time to consider reducing UPS back-up time requirements in the age of virtualized computing, multi-site geo-diversity with data duplication, and, last but not least, cloud services. Traditionally, the data center industry is risk averse, and for good reason, especially when it comes to ensuring power availability to the IT equipment.
That brings us to today’s discussion of defining “availability.” In the data center universe, many people use the terms “availability” and “reliability” interchangeably, as if they were the same thing. Moreover, for some, the term “redundancy” also seems to imply availability.
RELIABILITY IS NOT AVAILABILITY
Reliability is the ability of a system or component to perform its required functions under stated conditions for a specified period of time. The reliability of a component is a prediction based on a statistical measure known as Mean Time Between Failures (MTBF), typically expressed in hours (e.g., 100,000 hours). The claimed MTBF of a component, device, or sub-system is usually specified by the manufacturer of the item.
The availability of a system is typically expressed as a percentage of time. For data centers, it is stated as “uptime,” quantified by the number of “9s.” It should be noted that the proverbial five 9s (i.e., 99.999%) is the system availability reference standard originally set by “Ma Bell” back when everyone used landlines.
While five 9s sounds impressive, based on 8,760 hours a year it still represents about 5.3 minutes of downtime per year. Even six 9s corresponds to about 32 seconds of downtime per year. In today’s 7x24 environment, this is clearly not acceptable, since IT power supplies can only tolerate a disruption of less than 20 milliseconds. Moreover, the downtime does not necessarily occur as a single outage in a given year. It could be multiple failures of only a few seconds each, adding up to 32 seconds per year; although obviously catastrophic, that would still be mathematically accurate and meet the claim of six 9s.
The important difference, as far as availability claims go, is whether the number is projected or historic. In the case of a newly built data center or its design, it can only be a projection (presumably based on its level of redundant equipment and the sophistication of its fault-tolerance control systems). In contrast, historic availability numbers represent actual past operating experience. However, as they say in the stock market, history is no guarantee of future performance. An N+1 facility may not have had any outages for over five years, while a site designed as 2N+1 may have dropped the critical load in its first year of operation.
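The arithmetic behind the “9s” is easy to check. The short Python sketch below converts a number of 9s into downtime per year, using the same 8,760-hour year as above. The function name is my own, and the two 9s case is included only for comparison.

```python
HOURS_PER_YEAR = 8760  # same basis used in the text (a non-leap year)


def downtime_seconds_per_year(nines: int) -> float:
    """Return the yearly downtime allowed by an availability of `nines` 9s.

    Two 9s means 99%, five 9s means 99.999%, and so on.
    """
    unavailability = 10.0 ** (-nines)  # e.g. 5 nines -> 0.00001
    return unavailability * HOURS_PER_YEAR * 3600


for n in (2, 5, 6):
    seconds = downtime_seconds_per_year(n)
    print(f"{n} nines: {seconds:,.1f} s/year ({seconds / 60:,.2f} min)")
```

Note that the result is only a yearly total. It says nothing about whether that downtime arrives as one outage or as many short ones.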
Redundancy represents the additional equipment that can provide the required power or cooling (defined as “N”) should the primary source or equipment become unavailable, either through failure or during maintenance. However, this simple statement does not ensure that the transfer to the secondary or additional equipment is seamless or instantaneous. A simple example is the loss of primary utility power and the time it takes a backup generator to start and be able to supply power to the load (typically 10 to 30 seconds). Clearly this does not work for IT equipment, and it necessitates the use of a UPS with enough energy storage to cover the expected ride-through time (as discussed in “The Megawatt Minute”). For cooling systems, the acceptable time varies based on the type of cooling system and can range from five to 30 minutes for low-density facilities to as little as 15 to 60 seconds for very high-density IT equipment.
Resiliency (fault tolerance) is combined with redundant equipment to control the power and cooling that support the IT load. Redundant equipment in and of itself does not preclude a momentary or short interruption. We use a fault-tolerant design in conjunction with an amount of redundant equipment (N+1, N+2, etc.) and critical paths (N, 2N, etc.) to deliver power and cooling (as well as networking) within an acceptable timeframe, allowing the IT equipment to operate without disruption.
While having “reliable” equipment may reduce the chances of a system failure, it does not ensure availability. The true basis of “availability” is essentially the combined result of the redundant equipment, the fault-tolerant design, and the control and transfer times of the power and cooling systems, each of which has a different allowable tolerance for interruptions. In effect, never base your availability expectations on the projected reliability of equipment; even a brick can fail.
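The ride-through requirement can be sketched as a simple comparison between UPS runtime and generator start time. The Python below is only an illustration. The function names, the 1,000 kW load, and the stored-energy figure are hypothetical, and the sketch ignores inverter losses, battery aging, and discharge-rate effects.

```python
def ups_runtime_seconds(usable_energy_kwh: float, it_load_kw: float) -> float:
    """Idealized UPS runtime: stored energy divided by load, in seconds."""
    return usable_energy_kwh / it_load_kw * 3600.0


def covers_generator_start(usable_energy_kwh: float,
                           it_load_kw: float,
                           generator_start_s: float,
                           margin_s: float = 0.0) -> bool:
    """True if the UPS can carry the load until the generator is online."""
    runtime = ups_runtime_seconds(usable_energy_kwh, it_load_kw)
    return runtime >= generator_start_s + margin_s


# Hypothetical site: 1,000 kW of IT load with one minute of stored energy.
load_kw = 1000.0
stored_kwh = load_kw / 60.0  # one minute at full load

print(ups_runtime_seconds(stored_kwh, load_kw))
print(covers_generator_start(stored_kwh, load_kw, generator_start_s=30.0))
```

The same comparison applies to cooling, with the allowable ride-through time replaced by whatever the cooling design and IT density can tolerate.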
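One way to see why MTBF alone does not determine availability is the common steady-state approximation, availability = MTBF / (MTBF + MTTR), where MTTR is the mean time to restore service. The formula and the restoration times below are my own illustrative assumptions, not figures from a manufacturer or a standard. Only the 100,000-hour MTBF comes from the example above.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state approximation: uptime / (uptime + downtime)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


MTBF_HOURS = 100_000  # the example MTBF figure quoted earlier

# Hypothetical restoration times for the same component.
scenarios = [
    ("30-second transfer", 30 / 3600),
    ("8-hour repair", 8.0),
    ("72-hour repair", 72.0),
]

for label, mttr_hours in scenarios:
    availability = steady_state_availability(MTBF_HOURS, mttr_hours)
    print(f"{label:>20}: {availability * 100:.5f}%")
```

The same “reliable” component yields very different availability depending on how quickly service is restored, which is exactly where redundancy and transfer time come in.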
Taking a more holistic approach, The Green Grid is working on the first release of its Open Standard for Data Center Availability (OSDA). While not intended to be in direct contention with the Uptime Institute’s four-level Tier system, the OSDA concept adapts the classic view of redundancy levels for facility power and cooling systems. However, it also incorporates multi-site data replication in the overall scheme to increase the logical availability of the application, not just the status of the facility infrastructure. The OSDA system is also more flexible in that it allows differing levels of redundancy for power and cooling, rather than imposing an inflexible framework that does not recognize that some organizations (or some applications) may require higher electrical redundancy, such as 2(N+1), but only want N+1 cooling redundancy. When it is fully developed, the OSDA platform and toolset can be used to evaluate, on a scale of 1 to 10, how multi-site data replication can provide the same or higher levels of application availability (which is why we build data centers in the first place), even while using lower redundancy levels of site infrastructure.
Then there is the cloud, deemed the “perfect” solution by management, since it presumably eliminates all the capital and operational costs and personnel associated with a physical data center, as well as the IT hardware. While it is blindly presumed to be always available, in reality the underpinnings of cloud service providers are far more nebulous, or totally opaque. Despite this, even today many organizations, large and small, government and commercial, have not settled on a meaningful method to evaluate the availability of cloud computing services.
Computing architecture has become highly dynamic and continues to evolve at an ever-increasing pace, and it has become clear that most enterprise organizations have forgone building or operating their own new facilities. Many have gone to colocation providers for their facilities, which they can evaluate using traditional methodologies based on infrastructure redundancy. As a result, the hybrid approach of colocation and cloud has become the current favorite strategy for many organizations.
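To illustrate why replication across sites can offset lower facility redundancy, here is a simplified Python sketch. It is not the OSDA method or scoring scale. It is a basic parallel-availability calculation that assumes site failures are independent and that failover between sites is instantaneous and flawless, neither of which is guaranteed in practice. The site availability figures are hypothetical.

```python
from math import prod


def combined_availability(site_availabilities):
    """Availability of an application that is up if ANY one site is up.

    Assumes independent site failures and perfect failover, so the
    application is down only when every site is down at the same time.
    """
    all_sites_down = prod(1.0 - a for a in site_availabilities)
    return 1.0 - all_sites_down


# Hypothetical example: two modest sites, each at three 9s.
two_sites = combined_availability([0.999, 0.999])
print(f"Two replicated sites at 99.9% each: {two_sites * 100:.4f}%")
```

Under these optimistic assumptions, two replicated sites with modest individual availability outperform a single, far more redundant facility. Correlated failures or a slow failover would reduce that advantage.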
THE BOTTOM LINE
My previous “Tier Wars” article provoked many readers to express some strong opinions regarding the long-held industry yardstick of “data center availability,” the four-level Tier Classification System created by Ken Brill, founder of the Uptime Institute. While still a valuable (and fundamental) concept and index, it only evaluates data center facility infrastructure, not the availability of the IT hardware, software, and, of course, the data itself. In the age of virtualization and data replication, the redundancy level of the facility power and cooling infrastructure, while still important, should no longer be the sole element in evaluating the availability of computing systems and stored data.
So the functional resilience of the software and the requirements of the application should be among the more significant considerations when planning and architecting your overarching computing strategies. As an example, the Open Compute Project puts forth a total re-imagination of the physical, electrical, and logical aspects of IT hardware, as well as the electrical equipment, mechanical infrastructure, and the design of the building itself, based on the hyperscale and operational considerations of its members, such as Facebook, Google, and Microsoft. In many cases their facility redundancy levels are relatively low (e.g., “N” or N+1 for some systems), but their overall availability is high due to their software failover redundancy and multi-site data replication. While some characteristics of their requirements differ radically from those of traditional enterprise organizations, some aspects of those designs, equipment, and software strategies should be considered and adopted if appropriate.
And last but not least, there is bitcoin, promoted as the basis of the world’s future currency (as well as a get-rich scheme). Most of the newest and largest bitcoin data centers seem to be the antithesis of traditional data center facilities. In point of fact, many are built without a UPS or back-up generator, and with little or no cooling. They are driven by a sole purpose: the lowest cost to operate bitcoin mining rigs, which can simply stop without damage when power is lost and resume mining as soon as power is restored. So relying on utility power alone, even at only two 9s of availability, is much more cost effective than bearing the substantial additional initial and operating costs of a complete power chain.
Nevertheless, ever larger colocation facilities and cloud service data centers, as well as hybrid solutions, will dominate the landscape for the next few years, and organizations need to evaluate the long-term costs and risks of each overall solution. In this ever-evolving computing environment, what constitutes “availability” is a choice best made based on purpose, not on a fixation with a rigid four-level facility rating system.