Reliability vs. Complexity In The Data Center
Avoid the conflicts.
I have been working in critical facilities for over 35 years, including the commercial nuclear power industry, aerospace, a financial institution, and the professional services realm. For the last 10-plus years I have been involved with providing third-party commissioning services and facilities management consulting. One thing that has become sometimes painfully apparent is that there is often a conflict between achieving the desired level of reliability and keeping the matters at hand from becoming overcomplicated. This concept applies not just to the infrastructure topology but to nearly all aspects of facility operations and maintenance.
If you look at the evolution of critical facilities you will see a fairly consistent increase in complexity as the requirements and expectations of sustaining continuous operations became more and more demanding. When I first started working in data centers in the mid-1980s, static uninterruptible power systems (UPS) were just beginning to be the norm and sites still required annual outages to perform critical maintenance activities. In many instances, maintenance staff were required to perform “energized work,” retorquing connections while the gear was still energized. Even so, most sites had to shut down for the main utility “triennial” switchgear maintenance.
In response to the ever-increasing industry expectations of 7x24xforever continuous operations, the critical facility design firms looked to innovation to come up with better and more resilient designs. The manufacturers of critical facility products joined the movement by developing more resilient and capable equipment and systems.
Manual transfer switches were replaced with automatic transfer switches, which were in turn surpassed by static transfer switches. Centralized monitoring and control systems evolved into distributed control systems, which eventually included mirrored-redundant processors. Chiller manufacturers started marketing “quick restart” control packages, and sequences-of-operations were devised to shorten the recovery of central chilled water plants following power outages.
In 1993, Kenneth Brill founded the Uptime Institute and soon thereafter published a white paper titled “Tier Classifications Define Site Infrastructure Performance,” and the industry started categorizing various site topologies by the “Tier Classification” system.
The telecom industry also established standard classifications for reliable infrastructure when the Telecommunications Industry Association published ANSI standard TIA-942, Telecommunications Infrastructure Standard for Data Centers. Around 2004, ASHRAE Technical Committee 9.9 (TC9.9) was formed and began publishing various white papers and guidelines addressing common industry interests, especially how to design, deploy, and support the growing IT industry and its associated computer and communications equipment. In recognition that the telecom and data center industries were converging, the term “Datacom” was coined, combining “data centers” with “telecom,” and TC9.9 then began publishing the “Datacom Book Series.”
The critical facility industry began using many terms that we now take for granted, such as single point of failure, fault tolerant, and concurrently maintainable. System-level redundancies were defined as “N,” “N+1,” “2N,” and “2(N+1).” As if this wasn’t enough, there were hybrid designs such as “2(N+1)/3” where equipment redundancies were spread across three separate line-ups. A “Tier-1” site would have known single points of failure and require periodic outages to accomplish required maintenance and repair activities, whereas a “Tier-4” site had to be both concurrently maintainable and fault tolerant. This means that even when an entire system, including its distribution path, was taken out of service, the remaining infrastructure would have sufficient equipment-level redundancy to remain fault tolerant.
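The redundancy shorthand above can be read as a simple count of installed units. The sketch below is illustrative only: the function name `installed_units` and the example load of three units are assumptions, not part of any standard, and it counts equipment only. It says nothing about distribution paths, which matter just as much to a tier rating.

```python
def installed_units(n_required: int, scheme: str) -> int:
    """Return how many units are installed when n_required units carry the load.

    Hypothetical helper for illustration; covers only the four basic schemes.
    """
    schemes = {
        "N": n_required,                # just enough capacity, no spare
        "N+1": n_required + 1,          # one spare unit
        "2N": 2 * n_required,           # two complete systems
        "2(N+1)": 2 * (n_required + 1), # two complete systems, each with a spare
    }
    return schemes[scheme]


# Example: a load that needs three units (an assumed figure).
n = 3
for scheme in ("N", "N+1", "2N", "2(N+1)"):
    total = installed_units(n, scheme)
    print(f"{scheme}: {total} installed, {total - n} beyond the required {n}")
```

Every unit beyond the required N is also one more piece of equipment to maintain, control, and understand, which is where the added complexity comes from.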
Along the way, the critical facility industry started measuring reliability based upon standard definitions and operational requirements. A hypothetical Tier-1 site would provide 99.671% uptime (or 28.8 hours of downtime annually). Likewise, a Tier-2 site would provide 99.749% uptime (22 hours of downtime annually), a Tier-3 site 99.982% uptime (1.6 hours of downtime annually), and a Tier-4 site would provide 99.995% uptime (or less than 30 minutes of downtime annually).
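The conversion from an uptime percentage to annual downtime is simple arithmetic: the unavailable fraction multiplied by the hours in a year. A minimal sketch follows, assuming a non-leap year of 8,760 hours; the names `HOURS_PER_YEAR`, `annual_downtime_hours`, and `TIER_UPTIME` are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years ignored (assumption)


def annual_downtime_hours(uptime_percent: float) -> float:
    """Convert an uptime percentage into expected hours of downtime per year."""
    return (100.0 - uptime_percent) / 100.0 * HOURS_PER_YEAR


# Hypothetical uptime figures as quoted in the text.
TIER_UPTIME = {
    "Tier-1": 99.671,
    "Tier-2": 99.749,
    "Tier-3": 99.982,
    "Tier-4": 99.995,
}

for tier, pct in TIER_UPTIME.items():
    hours = annual_downtime_hours(pct)
    print(f"{tier}: {pct}% uptime -> {hours:.1f} hours/year ({hours * 60:.0f} minutes)")
```

The same function works in reverse as a sanity check on any availability claim: a few hundredths of a percent separate a site that is down for hours each year from one that is down for minutes.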
But the hypothetical did not always match reality. Some Tier-2 and Tier-3 sites were outperforming Tier-4 sites. More research was performed, and the results showed that most of the unanticipated outages and mission impacts were not due to equipment or system failures. In most cases the root cause wasn’t even directly related to the infrastructure at all. It was due to “human activities,” with the majority being directly related to “human error.”
What had occurred was that the engineering and design community, in its quest for ever more reliable infrastructure and topologies, had introduced so much complexity that it exceeded the capabilities of most site operations and maintenance staff to understand, manage, and operate the sites when the inevitable anomalies and failure scenarios materialized. In essence, the required facilities management processes did not keep pace with the increasing site complexity.
Control sequences-of-operations became more convoluted in trying to address, through automation, every conceivable mode of operation and configuration as well as all anticipated failure scenarios. Not only did operating procedures become far more detailed, but they became far more site specific. Likewise, O&M staff training requirements, maintenance demands, and operating protocols also became far more site specific. Unfortunately, many sites failed to give the operating staff and facilities management organization the level of care needed to ensure the site was operated and maintained in the manner necessary to optimize site performance.
I wish I could report that this trend has finally reversed, but that is not the case. As critical facilities started to pursue better energy efficiency and lower utility bills, they have continued to introduce even more complexity. Free-cooling “economizer” solutions, full “hot/cold aisle” containment, and rack/row-level HVAC controls are clear examples of new complexities being overlaid upon the already complex “tier” related topologies. New data hall temperature control strategies use “artificial intelligence” to “teach” each computer room air conditioner/handler (CRAC/CRAH) how it influences the overall room cooling, not only during normal operations but also during various failure scenarios and as the IT equipment deployment changes over time. In each case we seem to rely more and more on complex systems to perform continuously, with less and less reliance on human intervention.
The Achilles’ heel is that when the infrastructure fails to perform as expected, the emergency responders require higher technical expertise to the point of becoming specialists, and very few sites have the necessary specialists on staff. Those of us who are involved with the startup and acceptance testing of these highly complex sites and systems are starting to see that even the manufacturers, vendors, and local technical representatives are being challenged to stay fully abreast of their own products and to deliver optimized system performance from them. There is still much wisdom in the well-known phrase “keep it simple,” especially when your goal is to provide continuous operations through high reliability.