Figure 1. Availability vs. cost


Reliability modeling enables an owner to make a business case for an investment in a new or upgraded mission-critical facility by answering the question: Is improving a facility or building a new (Tier III, smart Tier III, or Tier IV) facility a good capital investment? Reliability modeling, an analysis of the true cost of facility downtime, enables corporate executives to determine whether additional redundancy and reliability are worth the additional cost.

Consider the following hypothetical scenarios in which two different companies chose two different levels of reliability based on the outcomes of reliability modeling and Uptime Institute cost estimates for Tier levels. Company A, a credit card company, processes $10 billion a day in transactions from three large data centers, or approximately $138 million per hour per data center. In this scenario, a two-hour outage of a data center would cost the company $276 million in lost transactions. A new 150,000-square-foot (sq ft) Tier III data center facility, which has a few single points of failure, may have an availability of 99.99 percent and would cost $160 million. A new 150,000-sq-ft Tier IV data center facility that has no single points of failure has an availability of 99.999 percent and would cost $225 million. Although the additional investment is $65 million, the Tier IV data center could potentially achieve a return on investment, or savings, of $276 million in avoided costs of a two-hour outage. Reliability modeling done by EDG2 formed the basis for Company A’s decision that the additional investment in the Tier IV facility would be worthwhile.

Figure 2. Sample reliability block diagram

Company B has a typical enterprise data center that handles the day-to-day operations for its 50,000 employees, generating $6 million per hour of revenue. A two-hour outage would cost the company $12 million. A new Tier III 30,000-sq-ft data center facility would cost $15 million. A new Tier IV data center facility would cost $37.5 million.

But there is an additional option. A fault-tolerant Tier III facility, which eliminates single points of failure with the same component count (what EDG2 calls a “smart Tier III” data center), would provide availability of 99.9995 percent at a cost of $25 million. In this case, a 30,000-sq-ft smart Tier III facility would require a $10 million increase in infrastructure over the cost of a Tier III data center, but it would pay for itself with one fewer two-hour outage anticipated in the first year.

After accounting for further outages, investing in a Tier IV facility is financially unwise as it would take Company B nearly 10 years to recover the additional cost. Based on the results of reliability modeling, the company decided that the smart Tier III facility represented the wisest investment.
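The arithmetic behind both scenarios can be checked with a short sketch. It only counts how many avoided two-hour outages would repay the extra capital cost, using the dollar figures quoted above; the function name is hypothetical, and the sketch does not reproduce EDG2's reliability model or the payback period in years, which would also depend on how often outages are expected.

```python
def breakeven_outages(extra_capital, cost_per_hour, outage_hours=2.0):
    """Avoided outages needed to repay the extra capital cost (illustrative helper)."""
    return extra_capital / (cost_per_hour * outage_hours)


# Company A: Tier IV ($225M) vs. Tier III ($160M), $138M per hour per data center
company_a = breakeven_outages(225e6 - 160e6, 138e6)

# Company B: smart Tier III ($25M) vs. Tier III ($15M), $6M per hour
company_b_smart = breakeven_outages(25e6 - 15e6, 6e6)

# Company B: Tier IV ($37.5M) vs. smart Tier III ($25M), $6M per hour
company_b_tier4 = breakeven_outages(37.5e6 - 25e6, 6e6)

print(f"Company A, Tier IV over Tier III:        {company_a:.2f} avoided outages")
print(f"Company B, smart Tier III over Tier III: {company_b_smart:.2f} avoided outages")
print(f"Company B, Tier IV over smart Tier III:  {company_b_tier4:.2f} avoided outages")
```

A result below 1.0 means a single avoided two-hour outage more than covers the added investment, which is the case for Company A's Tier IV upgrade and Company B's smart Tier III upgrade.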

RELIABILITY VS. AVAILABILITY

Consultants and other “experts” often talk about “five-9s” of reliability. That is a misnomer. The actual reliability numbers are much lower, more in the range of two-9s. The confusion comes in the differences between the definitions of availability and reliability.

Reliability is the probability that an item will perform its function for a stated time period; it is not a guarantee. Availability is the probability that an item will perform its required function under given conditions at a stated instant in time.

For example, a Boeing 757 is a very reliable aircraft. The jet may fly halfway around the world two or three times before it is then placed into the shop for one or two weeks of maintenance. However, according to USA Today, United Airlines recently grounded its entire 757 fleet to fix a computer problem. This aircraft is very reliable but has low availability due to frequent required maintenance.

Figure 3. Maintenance affects availability.

The availability of a system is always calculated the same way, regardless of past or future events. Reliability numbers, however, depend on the period of time considered: the longer the period, the lower the reliability, regardless of the system design. The reliability numbers will indicate how often a facility’s infrastructure will require maintenance. Availability numbers imply the expected annual downtime as a percentage, which could mean that a facility with an availability of 99.9 percent was down for one 8.76-hour outage or for 525 one-minute outages over the course of a year. If a facility has 99.999 percent availability, it implies that the downtime will be 5.26 minutes per year. But it is more likely that it may suffer a 63-minute outage over the 12-year life expectancy of the facility infrastructure as there is no such thing as a one-minute outage. In fact, average downtime events are actually closer to two to four hours in duration. This factor should be considered in all calculations and estimations.
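The downtime figures quoted above follow directly from the availability percentage. A minimal sketch of the conversion, assuming an 8,760-hour year (the function name is illustrative):

```python
HOURS_PER_YEAR = 8760  # 365 days; leap years ignored


def annual_downtime_minutes(availability_percent):
    """Expected downtime per year, in minutes, implied by an availability figure."""
    unavailability = 1.0 - availability_percent / 100.0
    return unavailability * HOURS_PER_YEAR * 60


for pct in (99.9, 99.99, 99.999):
    minutes = annual_downtime_minutes(pct)
    print(f"{pct}% available: {minutes:.2f} min/year, "
          f"{minutes * 12:.1f} min over a 12-year life")
```

The conversion gives only an expected total. It says nothing about whether that total arrives as many short interruptions or as one long outage.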

DECISION-MAKING

Reliability modeling enables an owner to make informed decisions by comparing the availability vs. costs associated with various infrastructure design options. These comparisons can be utilized for determining the following:



• Infrastructure architecture: Design goals and “acceptable outages,” as in the examples above.

• Infrastructure improvements: Current reliability vs. goals and outcomes. A review of actual and expected outages can unveil one or more single points of failure that can be eliminated to enhance availability and reliability.

• Identification of the weakest link in the infrastructure system architecture: Sometimes the system can be redesigned to eliminate a single point of failure with the cost of the redesign outweighing the cost of the likely downtime if the change wasn’t made. Although human error is the most common cause of data center outages, many outages are attributed to single points of failure.

• Availability vs. cost for different types of facilities: A Tier I facility has the lowest cost, but also the lowest availability (99.7 percent). The cost of a Tier II facility is higher, but so is the availability (99.75 percent). The availability continues to increase for Tier III (99.98 percent), smart Tier III (99.9992 percent) and Tier IV (99.9995 percent). Note that the cost of a smart Tier III is 75 percent less than the cost of a Tier IV, with only a minuscule decrease in availability.



Here are four easy tests that can identify a system's reliability:
  • Chain Saw Test: Can the system operate if a pipe, feeder, or control wire is cut?

  • Shot Gun Test: Can the system operate if a component is broken?

  • Fire Bomb Test: Can the system operate if an entire room is taken out? Is the system compartmentalized?

  • Hand Grenade Test: Can the system operate without key personnel?

If the answer to all four is “yes,” then the system is probably a highly reliable system.

Different configurations can affect the reliability of a system. For example, a system with components in series will fail if one component fails. A system with parallel components will fail if the common point of coupling fails, but will sustain a redundant component failure. Building redundant component pathways eliminates single points of failure.
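The effect of configuration can be illustrated with the standard series and parallel formulas used in reliability block diagrams. This is a sketch only: it assumes component failures are independent, and the component values (0.99 and 0.999) are hypothetical numbers chosen for illustration, not data from the scenarios above.

```python
from math import prod


def series(*values):
    """All components must work: multiply the individual probabilities."""
    return prod(values)


def parallel(*values):
    """At least one component must work: 1 minus the chance that all fail."""
    return 1.0 - prod(1.0 - v for v in values)


component = 0.99   # hypothetical probability that one component works
coupling = 0.999   # hypothetical common point of coupling (e.g., a shared bus)

two_in_series = series(component, component)
two_in_parallel = parallel(component, component)
parallel_with_coupling = series(two_in_parallel, coupling)

print(f"Two components in series:        {two_in_series:.6f}")
print(f"Two components in parallel:      {two_in_parallel:.6f}")
print(f"Parallel pair + common coupling: {parallel_with_coupling:.6f}")
```

With these numbers the redundant pair is far better than the series chain, but the common point of coupling caps the result below its own value, which is why redundant pathways matter as well as redundant components.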

OTHER FACTORS TO CONSIDER

In addition to reliability and availability, there are several other factors to consider in designing a facility: maintainability, scalability, flexibility, and simplicity.

Figure 4. Like jet airliners, generators are highly reliable when maintained.

  • Maintainability: The probability that an item can be repaired in a given interval of time determines its maintainability. In most cases, a system will be maintainable if all single points of failure are eliminated.

  • Scalability: A system should be designed so that the infrastructure can grow with load demand. A system’s reliability means little if it is not able to handle the projected load demand in five or seven years (or less).

  • Flexibility: The facility should be designed so that the electrical and mechanical systems can be easily reconfigured to adapt to changing technologies.

  • Simplicity: The more complex a system, the more potential failure points, and the more unreliable the system will become. Beyond the necessary level of redundancy, additional redundancies may add so much complexity that they actually impede reliability and availability rather than improving them.


COST OF 'DOWNTIME'

Company owners and facilities managers can determine the cost of downtime by understanding the company’s profitability on an hourly or daily basis and the criticality of the system itself. While the failure of a facility supporting a highly popular service like video on demand might be a nuisance for a few hours, the failure of a mission-critical facility supporting a hospital information system can mean lost lives as well as the cost of repairing the facility owner’s image. The cost of the failure of the latter system will continue long after the actual failure is repaired.

Most facilities are designed with a useful infrastructure life of approximately 12 years before UPS and other significant electrical and mechanical components must be replaced. So any improvements designed to increase reliability should pay for themselves in less time. The actual building’s useful life should be much longer.

Another consideration in the calculation is the net present value (NPV) of the cost of downtime over the infrastructure’s usable life. All other factors being equal, an outage costing $1 million in lost revenue today is more damaging than an outage resulting in $1 million in lost revenue five years from now.
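The NPV point can be made concrete with the usual discounting formula. The sketch below is illustrative only: the 8 percent discount rate and the $1 million expected annual downtime cost are assumptions, not figures from the scenarios, and the function names are hypothetical.

```python
def present_value(amount, years_from_now, discount_rate):
    """Value today of a cost incurred a given number of years in the future."""
    return amount / (1.0 + discount_rate) ** years_from_now


def npv_of_downtime(expected_annual_cost, life_years, discount_rate):
    """Sum of discounted expected downtime costs for years 1..life_years."""
    return sum(
        present_value(expected_annual_cost, year, discount_rate)
        for year in range(1, life_years + 1)
    )


RATE = 0.08  # assumed discount rate, for illustration only

outage_today = present_value(1_000_000, 0, RATE)
outage_in_five_years = present_value(1_000_000, 5, RATE)
lifetime_cost = npv_of_downtime(1_000_000, 12, RATE)  # 12-year infrastructure life

print(f"$1M outage today:        ${outage_today:,.0f}")
print(f"$1M outage in 5 years:   ${outage_in_five_years:,.0f}")
print(f"$1M/year over 12 years:  ${lifetime_cost:,.0f} in today's dollars")
```

Comparing the discounted cost of expected downtime with the additional capital cost of a more reliable design puts both sides of the decision in today's dollars.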

AVAILABILITY

Consider the capital cost for the different types of mission-critical infrastructures and the value of the additional availability. As system redundancies are added, the installation costs go up, the maintenance costs go up, and additional expertise may be required to support the increased complexity. Another consideration is that adding components creates more points of failure and will eventually lead to a point of diminishing returns whereby an infrastructure topology can become so complex that adding redundant components will actually reduce the availability of the infrastructure. Similarly, the fact that an existing system can be modified does not necessarily mean that the modification is practical. To make sense of this, it is important to perform availability vs. cost comparisons for both renovation projects and new construction projects to ensure that money is being well spent.

RELIABILITY AND MAINTENANCE

Data center facilities can significantly improve their reliability by performing twice-yearly maintenance and annual assurance testing on all major pieces of equipment to uncover unknown failures. A reliability analysis will typically show a significant decrease in reliability within six months to a year. Performing maintenance on the critical infrastructure every six months is like resetting the clock on reliability for the next six months. Failing components identified during the assurance tests can be replaced without risking unscheduled downtime.

Figure 5. Multiple rooftop chillers support this Tier IV data center

To calculate reliability, an engineer determines the amount that downtime would cost the enterprise on an hourly basis, the mean time to repair (MTTR) and the mean time between failures (MTBF), all of which should be available from company data. MTTR is the mean time to restore the system to operating condition. MTBF is the statistical point at which 63 percent of a large homogeneous population of items will have failed.

For example, a homogeneous population of UPS modules has an MTBF of 250,000 hours. That means that in about 28 years, 63 percent of the UPS units will have failed. Note that MTBF significantly exceeds the average lifetime.
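The 63 percent figure is what the exponential (constant failure rate) model predicts, and the same model shows why reliability and availability give such different numbers. The sketch below assumes that model, which the text does not state explicitly; the four-hour MTTR is also an illustrative assumption, and the function names are hypothetical.

```python
import math

HOURS_PER_YEAR = 8760


def reliability(hours, mtbf_hours):
    """Probability of running `hours` without failure (constant failure rate assumed)."""
    return math.exp(-hours / mtbf_hours)


def steady_state_availability(mtbf_hours, mttr_hours):
    """Long-run fraction of time the item is in working order."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


MTBF = 250_000  # hours, the UPS example above
MTTR = 4        # hours, assumed for illustration

failed_at_mtbf = 1.0 - reliability(MTBF, MTBF)        # about 0.632
one_year = reliability(HOURS_PER_YEAR, MTBF)
twelve_years = reliability(12 * HOURS_PER_YEAR, MTBF)
availability = steady_state_availability(MTBF, MTTR)

print(f"Fraction failed at t = MTBF: {failed_at_mtbf:.3f}")
print(f"Reliability over 1 year:     {one_year:.4f}")
print(f"Reliability over 12 years:   {twelve_years:.4f}")
print(f"Steady-state availability:   {availability:.6f}")
```

Under these assumptions the availability of a single module is very high even though the probability of surviving a 12-year service life without any failure is much lower, which matches the distinction drawn in the reliability vs. availability section.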

BENEFITS VS. LIMITATIONS OF MODELING

Reliability modeling can help an enterprise make decisions on infrastructure architecture, as in the example at the beginning, as well as on other financial decisions that compare reliability and cost for infrastructure improvements and component changes (e.g., eliminating the weakest link). However, there are also some important limitations to reliability modeling as it is generally practiced in the industry, which should be considered. Data such as MTBF aren’t as readily available from manufacturers of mechanical components as they are from manufacturers of electrical components, so reliability modeling is not typically used for making architecture and upgrade improvements for mechanical systems and components. 

Figure 6. Reliability modeling right-sizes the electrical infrastructure.

As a result, the reliability of a mechanical system may be significantly lower than that for an electrical system. When these data are available for mechanical systems and components, for example, from IEEE, it is important to incorporate them into reliability calculations.

Soft costs, such as damage to a company’s reputation due to an outage, differ from business to business (e.g., entertainment vs. health care). So while soft costs are not part of actual reliability modeling, they should be taken into account when considering system design. Reliability modeling and analysis is a useful tool in making practical business decisions to right-size the mission-critical infrastructure. While every facility is critical to its owner’s mission, not all facilities require the same level of reliability or availability. Making a business case for reliability will address budget issues and provide justification for selecting the right infrastructure for your facility, or for making practical modifications to improve your facility’s infrastructure, based on informed and unbiased decisions.