Modern data center power systems represent the ultimate in optimization for reliability. This is a necessity, since the computer and IT equipment that these power systems serve are very sensitive to even momentary loss of power. Tier IV systems (Uptime Institute classification system) rank at the highest level and boast a representative site availability of 99.99 percent, or four nines. But just what does availability really mean? Can a single number really communicate all the necessary information about the reliability of a power system? What about individual component statistics such as MTBF? Do other characteristics, such as maintainability, affect how often there will be problems with power to critical equipment?




Figure 1. Exponential probability density (a) and distribution (b)

The Nines

Logic dictates that the longer a given component is in service, the more likely failure becomes. The exponential probability density and its associated exponential probability distribution, both shown in figure 1, characterize such behavior. Assuming that the time between failures of a given component follows the exponential density and distribution leads in turn to the assumption of a constant mean time between failures (MTBF, defined as the mean exposure time between consecutive failures of a component) and an associated constant failure rate (the mean [arithmetic average] number of failures of a component or system per unit exposure time) for that component. This creates a fundamental issue: the use of constant MTBF and failure rate in reliability analysis rests on the assumption that both the probability density and distribution functions for the time between component failures are exponential.
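As a minimal sketch of the exponential model described above, the snippet below evaluates the density, the distribution, and the resulting failure rate for a single component. The function names and the 100,000-hour MTBF are illustrative assumptions, not values taken from any survey or manufacturer.

```python
import math

def exp_density(t, mtbf):
    """Exponential probability density f(t) = (1/MTBF) * exp(-t/MTBF)."""
    lam = 1.0 / mtbf
    return lam * math.exp(-lam * t)

def exp_distribution(t, mtbf):
    """Exponential distribution F(t) = 1 - exp(-t/MTBF): probability of failure by time t."""
    return 1.0 - math.exp(-t / mtbf)

def failure_rate(t, mtbf):
    """Failure rate f(t) / (1 - F(t)); for the exponential model this equals 1/MTBF at any t."""
    return exp_density(t, mtbf) / (1.0 - exp_distribution(t, mtbf))

MTBF_HOURS = 100_000.0  # hypothetical value, for illustration only
for hours in (1_000.0, 50_000.0, 100_000.0):
    print(f"t={hours:>9.0f} h  F(t)={exp_distribution(hours, MTBF_HOURS):.4f}  "
          f"rate={failure_rate(hours, MTBF_HOURS):.2e} per hour")
```

The probability of having failed by a given time grows with service time, while the computed failure rate stays at 1/MTBF throughout. That constant rate is exactly the assumption the analysis depends on.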

Engineered systems can be considered to consist of multiple components, each of which has its own set of failure characteristics. System-level reliability analysis considers how these components connect to form the system and uses this information to calculate the various aspects that describe the reliability of the system. There are a number of methods available to do this.
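One of the simpler methods is the reliability block diagram, in which components are combined in series (all must work) or in parallel (any one suffices). The sketch below assumes independent component failures and uses hypothetical availability figures; it illustrates the idea and is not a substitute for a full study.

```python
from math import prod

def series_availability(availabilities):
    """All components must be up: multiply the individual availabilities."""
    return prod(availabilities)

def parallel_availability(availabilities):
    """At least one component must be up: 1 minus the product of the unavailabilities."""
    return 1.0 - prod(1.0 - a for a in availabilities)

# Hypothetical power path: utility feed -> two redundant UPS modules -> distribution panel
ups_pair = parallel_availability([0.999, 0.999])
power_path = series_availability([0.9995, ups_pair, 0.9999])
print(f"UPS pair: {ups_pair:.6f}")
print(f"Full path: {power_path:.6f}")
```

Redundancy raises the availability of the UPS stage, but the series elements on either side of it still cap the result for the whole path.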

Whatever method is used, it can calculate certain “reliability indices” for the system. The most common of these is availability, defined as the ability of an item - under combined aspects of its reliability, maintainability, and maintenance support - to perform its required function at a stated instant of time or over a stated period of time.





From this rather vague definition come two different types of availability:
  • Inherent Availability (Ai): The instantaneous probability that a component or system will be up. Ai considers only downtime for repair due to failures. No logistics time, preventative maintenance, etc., is included.
  • Operational Availability (Ao): The instantaneous probability that a component or system will be up. Ao differs from Ai in that it includes all downtime: unscheduled maintenance (repair due to failures) and scheduled maintenance, including any logistics time.
A second set of indices deals with downtime, defined as:
  • Mean Downtime (MDT): The average downtime caused by scheduled and unscheduled maintenance, including any logistics time.
  • Repair Downtime (Rdt): The total downtime for unscheduled maintenance (excluding logistics time) for a given time period.
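As a rough illustration of how the two availability indices diverge, the sketch below uses the commonly quoted steady-state ratios: inherent availability as MTBF / (MTBF + mean time to repair), and operational availability as mean uptime / (mean uptime + mean downtime). The function names and all of the hour values are hypothetical.

```python
def inherent_availability(mtbf_h, mttr_h):
    """Ai: counts only repair time after failures (no logistics or scheduled maintenance)."""
    return mtbf_h / (mtbf_h + mttr_h)

def operational_availability(mean_uptime_h, mean_downtime_h):
    """Ao: counts all downtime, scheduled and unscheduled, including logistics time."""
    return mean_uptime_h / (mean_uptime_h + mean_downtime_h)

# Hypothetical figures: 4 h of hands-on repair per failure, but 24 h of total
# downtime once logistics delays and scheduled maintenance are included.
ai = inherent_availability(mtbf_h=50_000.0, mttr_h=4.0)
ao = operational_availability(mean_uptime_h=50_000.0, mean_downtime_h=24.0)
print(f"Ai = {ai:.6f}")
print(f"Ao = {ao:.6f}")
```

With the same uptime, the extra downtime counted by Ao always pulls it below Ai, which is why a study reporting only Ai tends to look better than the facility will perform.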




Figure 2. Illustration of system reliability changes with time

Reliance on Availability

Availability is the most frequently quoted specification regarding the reliability of a data center. A much sought-after goal is five nines of availability. With 99.999 percent availability, only 0.001 percent of a given year, or 5 minutes, 15.36 seconds, is downtime. Is this one outage of 5:15.36? Or is it five 1-minute outages plus another 15.36-second outage? In reality, availability is too vague a figure to make such specific predictions when used alone. The repair downtime or, if available, the mean downtime should be used along with the availability to describe the reliability of the system.
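The arithmetic behind the nines can be sketched as follows, assuming a 365-day year as in the figure quoted above. The function names and the outage counts are hypothetical, chosen only to show that one availability figure is compatible with very different outage patterns.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 365-day year, matching the 5 min 15.36 s figure

def annual_downtime_seconds(availability):
    """Expected downtime per year implied by an availability figure (0..1)."""
    return (1.0 - availability) * SECONDS_PER_YEAR

def mean_outage_seconds(availability, outages_per_year):
    """Average outage length if the annual downtime is split across a given number of outages."""
    return annual_downtime_seconds(availability) / outages_per_year

for label, a in [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]:
    minutes, seconds = divmod(annual_downtime_seconds(a), 60)
    print(f"{label}: {int(minutes)} min {seconds:.2f} s of downtime per year")

# The same five-nines figure, spread over different (hypothetical) outage counts
for count in (1, 6, 60):
    print(f"{count} outage(s) per year: {mean_outage_seconds(0.99999, count):.2f} s each")
```

Availability fixes only the total; the number and duration of individual outages have to come from the downtime indices.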

Even more worrisome is the difference between the two types of availability. Operational availability provides a real-world measure of the availability of the system. Inherent availability provides a way of comparing system designs without factoring in maintenance and logistics concerns, which can vary from facility to facility. If a reliability study calculates only the inherent availability, the true operational availability of the system will likely be lower. Also, the system reliability, and thus the availability, decreases over time as the aggregate effect of component aging, as illustrated in figure 2. The true operational availability is as given in the figure, i.e., mean uptime divided by the sum of mean uptime and mean downtime.

Figure 3. Illustration of time-varying nature of component failure rate

Component MTBFs

Assumptions made about MTBF for each system component in most reliability analyses have a dramatic effect on the outcome of the analysis, and therefore valid data are essential. In the absence of specific data from the component manufacturer, survey data such as that presented in the Uptime Institute’s Tier Classifications Define Site Infrastructure Performance can be used. Indeed, using such data is recommended in lieu of manufacturer-specific data unless that manufacturer’s component will be used on the project in question or unless that manufacturer’s data are the worst-case for all of the potential component manufacturers being considered for the project.

In reality, component MTBFs are not constant. The MTBF of a component changes over the lifetime of the component. This change is shown graphically in the bathtub curve (figure 3). The period of heightened failure rate at the beginning of component life is known as infant mortality. Ideally, failures associated with infant mortality are caught during the commissioning process. The period of heightened failure rate at the end of component life is known as wearout. Wearout makes it necessary to follow the manufacturer’s recommended maintenance schedule in order to avoid lowering the reliability of the system. Between the infant mortality and wearout points the failure rate is approximately constant. It is for this period of time, and this period of time only, that the results of a reliability analysis using constant component MTBFs are valid.
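A common way to represent the three regions of the bathtub curve, though not one the analysis above relies on, is the Weibull failure-rate function: a shape parameter below 1 gives a falling rate (infant mortality), exactly 1 gives a constant rate (the exponential case), and above 1 gives a rising rate (wearout). The shape and scale values below are hypothetical.

```python
def weibull_hazard(t, shape, scale):
    """Weibull failure rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1.0)

SCALE_HOURS = 100_000.0  # hypothetical characteristic life
regions = {"infant mortality": 0.5, "useful life": 1.0, "wearout": 3.0}

for name, shape in regions.items():
    early = weibull_hazard(1_000.0, shape, SCALE_HOURS)
    late = weibull_hazard(80_000.0, shape, SCALE_HOURS)
    print(f"{name:>16}: rate at 1,000 h = {early:.2e}, at 80,000 h = {late:.2e}")
```

Only the middle case yields the same rate at both times, which is the region where a constant-MTBF analysis applies.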

Using constant MTBFs is also problematic because, for most types of components, it is very difficult or impossible to supply such a number without survey data. Most component types have many failure modes, and these failure modes are not necessarily independent of one another. Attempts to supply such data by any method other than field history data (which can take years to build up) or a rigorous reliability study of the component in question can produce overly optimistic results.

Another aspect of component failure rates is the definition of “failure.” In establishing failure rates for components, the manufacturer and the end-user must agree on this point; otherwise, component failure rates have no common basis of meaning. Per IEEE Std. 493-2007, IEEE Recommended Practice for the Design of Reliable Industrial and Commercial Power Systems, failure is defined as the termination of the ability of a component or system to perform a required function.

Predictably, disagreements can arise regarding the precise meaning of this definition.




Maintenance

A hallmark of a well-designed data center power system is maintainability. The necessity of maintainability is unavoidable: in order to keep component MTBFs at their normal, nearly constant values, maintenance must be performed. Otherwise, the wearout line of the bathtub curve may be crossed, with a resulting increase in failure rate and a corresponding decrease in MTBF. Maintenance can also help pinpoint abnormal sources of component deterioration, such as overloaded circuits, improperly set protective devices, and changing voltage conditions.

As illustrated in figure 2, the reliability of the system is not constant but rather decreases over time. Ideally, when reliability reaches the lowest acceptable level, maintenance brings the system back to an acceptable level, and the process repeats. In reality, however, the need for maintenance is rarely quantified with this degree of detail. For a maintainable system, the reliability of the system during the repair time does not go to zero as shown in figure 2, but only to a reduced level. The better the maintainability of the system, the higher the reliability level during maintenance. Features such as parallel power paths and proper switching devices that allow component isolation help achieve such maintainability.

The need for benchmarking is another driver of component maintenance. During maintenance, tests such as high-potential tests and thermal scans are performed. The results of such tests are most meaningful when they are tracked over time. An abrupt change in a test result usually signals a problem, and the best way to notice such a change is by comparing with test results from previous maintenance periods. Computerized storage of such records facilitates this process.
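As one possible sketch of such tracking, the snippet below flags a new test result that departs sharply from the baseline set by earlier maintenance periods. The three-standard-deviation threshold, the function name, and the thermal-scan readings are all hypothetical; a real program would use acceptance criteria appropriate to the test and the equipment.

```python
from statistics import mean, stdev

def flag_abrupt_change(history, latest, threshold_sigmas=3.0):
    """Return True if `latest` deviates from the mean of `history` by more than
    `threshold_sigmas` sample standard deviations."""
    if len(history) < 2:
        raise ValueError("need at least two earlier readings to establish a baseline")
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        return latest != baseline
    return abs(latest - baseline) > threshold_sigmas * spread

# Hypothetical thermal-scan readings (deg C rise over ambient) for one connection,
# recorded at five previous maintenance periods
previous = [12.1, 11.8, 12.4, 12.0, 12.3]
print(flag_abrupt_change(previous, 12.6))  # within normal scatter
print(flag_abrupt_change(previous, 19.5))  # well outside the baseline
```

The comparison is only as good as the stored history, which is why consistent record keeping across maintenance periods matters.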

Commissioning

Commissioning is not just the simple startup of components, which only tests the component in question and is designed to bring it up to the point where it can be energized. At the highest level, commissioning tests whole systems and across systems to make sure all components work together properly. Commissioning takes a system-level approach, with the goal being to ensure that the facility is functioning according to its intended purpose. It should test real-world conditions. Often, system interoperability problems can only be found through commissioning.

Emergency contingency procedures are a must to allow speedy resolution of power system issues while minimizing the impact on critical loads. Such procedures should list, step by step, the actions to be taken in a given type of emergency. Unfortunately, such procedures are not the norm for data centers, and even where they exist their use depends upon trained maintenance staff. In many instances, maintenance is contracted out, leaving minimal on-site staff to cope with emergencies. In some cases, even the on-site staff is not familiar with system operation, instead relying upon component manufacturers to supply this familiarity. The result is that when an emergency does occur, no one is familiar enough with the system to properly implement any emergency procedures that are in place.

Adequate system documentation is also a must, and this documentation must be maintained as critical load components and the infrastructure to support them are added. If required, the services of the original engineer of record for the facility should be retained to keep system documentation up to date. Having such documentation available in key, easily accessible locations (such as posting single-line diagrams for easy reference) is also a must.




Summary

Data centers are dynamic systems, and therefore any type of analysis that provides a snapshot picture of the performance of such a system may give overly optimistic results. A rigorous reliability analysis, when used to compare alternate system designs or rank the relative performance of different facilities, is a powerful tool. However, other uses of the results of such analysis, such as attempting to predict the actual availability of the facility over time, require real-world factors to be taken into account: over-reliance on availability figures, the complexities associated with component MTBFs, the importance of commissioning and maintenance, and the need for adequate emergency procedures and trained maintenance staff. Being cognizant of these real-world factors, going beyond the nines and implementing accordingly, gives the best chance for success in keeping data center power systems up and running.