In the world of mission-critical computing, the term data center and its implied level of availability has always referred to the physical facility and its power and cooling infrastructure. With all the marketing hype of the cloud, what constitutes a data center may be up for re-examination.
The concept of defining availability and uptime classification levels was originally conceived by Ken Brill, founder of the Uptime Institute, almost 20 years ago, when he created the Tier system of availability (Tier I, II, III, or IV, with the specific use of Roman numerals considered part of the Institute's copyrighted and trademarked intellectual property). The highest level of availability is Tier IV, generally described as "Fault Tolerant and Concurrently Maintainable." However, despite the demand for the ultimate level of availability, this highest level is quite costly and not easily achieved. In fact, as of December 2012, there were only eight Uptime-certified Tier IV constructed data centers in the world, according to the Uptime Institute's website.
It should be noted that the Telecommunications Industry Association (TIA) has a similar framework, known as Tier 1-4, which was formalized and originally released in 2005 as TIA-942 and updated in 2010 as TIA-942-2. The TIA documents define the requirements for each tier level. However, unlike the Uptime Institute, the TIA does not review designs or certify constructed data centers.
In essence, both organizations' availability tier classifications, redundancy requirements, and recommendations are based primarily on the redundancy levels of the physical power and cooling equipment, independent distribution paths and system failover, and the ability to perform maintenance on the facility's infrastructure without impacting the critical load.
Now before you all start wondering if I have some secret source at the Uptime Institute or the TIA that no other blogger or columnist has, I will explain what I am defining as a “Tier 5” data center. I recently saw an advertisement from VMware for its vCloud Suite 5.1 software that stated that they had virtualized the data center and declared it the “Software-Defined Data Center,” and the ad concluded with the statement, “It is the datacenter for the new cloud era.”
The availability of cloud computing (itself a nebulous term) depends not just on the physical infrastructure of the data center, although that is certainly part of overall availability, but on the diversity and fault tolerance of the IT systems. The essence of the cloud concept is the presumption of 100% availability via the Internet, from anywhere and on any device. Presumably, data is replicated to multiple physical sites (either fully synchronously in real time, or synced within a stated or unstated period for critical data service transactions). Nonetheless, there seems to be a fundamental and popular belief among the general public that the cloud is somehow magically always available (there are no 9s in "always") and inherently fault tolerant. Even some in the IT industry are shocked and surprised when there is an outage at a major service provider such as Amazon, or when Google's Gmail is down.
In point of fact, even before this ad appeared, some hosting, colocation, and managed service providers had been offering virtual private data centers to deliver public and private clouds using VMware's software.
So what does this really mean to the future definitions of a data center? As I initially stated, in the traditional mission critical world the term “data center” and its implied level of availability has always referred to the physical facility and its power and cooling infrastructure, and indirectly to the data communications network that connects it to the outside world. However, by adding “virtual,” perhaps the boundaries and reference points of availability tiers may need to be redefined.
While I will not spend too much time on all the advantages that virtualization has brought to the computing environment, one of its key features is the ability to seamlessly move applications running on a virtual machine from one physical server to another (and now across storage and networks as well), whether the other server is in the same rack, the next rack, a rack in another row, or even in another data center. The tier-level definitions of physical equipment and power-path redundancy do not translate easily or directly to the concept, or the actuality, of virtualization or cloud computing.
This adds a new dimension to the functional definition of availability that is outside the scope of the existing facility-based tier system, which currently does not include or factor in the IT architecture's computing redundancy and failover capabilities (beyond utilizing IT equipment with redundant dual-corded power supplies).
So, if in fact we can replicate and synchronize data and applications, as well as provision and move them across physical devices in diverse locations, we must consider that we can exceed the availability and fault-tolerance levels of computers located in a single physical building, even a fully qualified Tier 4-designed and -built site. In effect, two or more physical sites acting as one virtual data center can potentially exceed the projected 99.999% availability that a Tier 4 site is presumably designed to achieve, yet which is not immune to human error or to natural catastrophic events that could bring it down or destroy it. In practical terms, it also means that two or more lower-tier facilities (Tier 2 or even Tier 1) can potentially provide a higher level of overall availability and business continuity (rather than mere disaster recovery) than any single Tier 4 facility (or even Tier IV, if you are willing to engage Uptime).
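To put rough numbers on this, consider the commonly quoted design-availability figures of 99.741% for a Tier II facility and 99.995% for Tier IV (used here purely for illustration). If two Tier II sites fail independently and either one can carry the full load, the virtual data center is down only when both sites are down at once. A back-of-the-envelope sketch:

```python
def combined_availability(site_availabilities):
    """Availability of a set of independent sites where any one
    surviving site can carry the full load: the system is down
    only when every site is down simultaneously."""
    downtime_product = 1.0
    for a in site_availabilities:
        downtime_product *= (1.0 - a)
    return 1.0 - downtime_product

TIER_II = 0.99741   # commonly quoted Tier II design availability
TIER_IV = 0.99995   # commonly quoted Tier IV design availability

two_tier_ii = combined_availability([TIER_II, TIER_II])
print(f"Two Tier II sites: {two_tier_ii:.5%}")   # ~99.99933%
print(f"One Tier IV site:  {TIER_IV:.5%}")
print("Two Tier II sites beat one Tier IV:", two_tier_ii > TIER_IV)
```

The independence assumption is the catch: a shared power grid, correlated weather, or a common software fault can make "independent" sites fail together, which is why geographic and architectural diversity matter as much as the arithmetic.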
While my 2+2 = Tier 5 math may seem absurd to some, consider the ultimate purpose of the data center: to provide computing. The power, cooling, and physical security of the facility are a necessary prerequisite to support the computing equipment, but they do not provide the fundamental deliverable, which is the computing itself.
While not every application may yet fail over perfectly or seamlessly, we cannot underestimate the importance of rethinking and including the IT systems' own resilience as part of the overall availability goal when deciding on the redundancy levels of facility-based infrastructure required to meet the desired level of overall system availability.
The ability to shift computing loads across hardware and to be fault tolerant is not new; it has been done many times for dedicated mission-critical systems, long before the advent of the Internet or virtualization software. Server clustering technology, coupled with redundant replicated data storage arrays, has been available and in use for over 20 years. More recently, Internet-based architectures have proven effective on a broad scale at search and social media firms such as Google, Yahoo, and Facebook, which use multiple, physically diverse sites that can still respond to web requests even when there are failures of a server or group of servers, and even major site outages.
And while I recognize that retrying or redirecting a free Google search is not the same as maintaining the availability and integrity of a commercial or financial enterprise's transactional database, the latter is well within the capabilities of today's IT systems. We have had mission-critical fault-tolerant clusters and real-time data replication, proven and tested across physically separated sites, for the major stock exchanges, where a single trade may be worth billions of dollars (in effect, the cost of failure is measured in millions of dollars per millisecond). The New York Stock Exchange, like many others, runs systems optimized for transaction processing and replicates data synchronously to storage arrays generally located within 25 miles (40 km) of each other (sometimes closer, due to the low-latency requirements of high-speed trading), so that even in the event of a major catastrophic event, every transaction is recorded at both sites.
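The 25-mile figure is not arbitrary: synchronous replication means every committed write waits for the remote site's acknowledgment, so distance translates directly into added transaction latency. A rough sketch, assuming light travels through optical fiber at about 200,000 km/s (roughly 5 microseconds per km, and ignoring switching and protocol overhead):

```python
FIBER_US_PER_KM = 5.0  # ~speed of light in fiber: 1 km adds ~5 microseconds

def sync_replication_rtt_us(distance_km):
    """Minimum round-trip delay added to every synchronous write:
    the data travels to the remote array and the ack travels back."""
    return 2 * distance_km * FIBER_US_PER_KM

for km in (10, 40, 100):
    print(f"{km:>4} km -> at least {sync_replication_rtt_us(km):.0f} microseconds per write")
```

At 40 km that is at least 0.4 ms added to every committed transaction before any storage or protocol overhead, which is why latency-sensitive trading systems keep replication distances short.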
Even within the same site, it is possible to improve availability and fault tolerance while reducing the redundancy and complexity of a Tier 3 or 4 design, by partitioning a single physical building into independent, fully autonomous smaller sections (or adjacent buildings), each with only a Tier 1 or Tier 2 level of redundancy, linked to each other only by communications networks that allow the IT equipment to replicate data and applications.
THE BOTTOM LINE
Albert Einstein once stated, "Not everything that can be counted counts, and not everything that counts can be counted." In the existing tier system, we can count the redundant hardware and project the availability, typically expressed by the number of 9s. However, the projected number of 9s does not really count if a data center experiences an actual failure.
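For readers counting along, each additional 9 translates into an order of magnitude less allowable downtime per year, a quick illustration:

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # = 525,960 minutes

def downtime_minutes_per_year(availability):
    """Expected annual downtime implied by a projected availability."""
    return (1.0 - availability) * MINUTES_PER_YEAR

for label, a in (("three 9s", 0.999), ("four 9s", 0.9999), ("five 9s", 0.99999)):
    print(f"{label} ({a:.3%}): {downtime_minutes_per_year(a):.1f} min/year")
```

Five 9s allows only about five minutes of downtime per year, a projection that a single real failure can wipe out for a decade.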
By maximizing redundancy and distributing the computing resources and fault tolerance of your organization’s computing architecture, 2+2 could equal 5. So before you design or build your next data center, discuss if and how your organization’s IT architecture can improve overall computing system availability while reducing the dependency on the physical redundancy of the individual facility infrastructure.
The holistic approach of including an evaluation of the resiliency of the IT architecture in the availability design and calculations should be part and parcel of the overall business requirements when making decisions regarding the facility tier level and number of physical data centers, as well as their geographic locations. This can potentially reduce costs and greatly increase overall availability as well as survivability during a crisis.
Hopefully, my introducing the conceptual term Tier 5 for the virtual data center will not be considered frivolous, nor start a wave of balderdash comments by the traditionalists in the data center community. Ideally, it should be a motivator for a sense of shared responsibility by both the IT and facilities departments, as well as a catalyst for the re-evaluation of how data center availability is ultimately architected, defined, and measured, in the age of virtualization and cloud-based computing — without the hype or posturing.