In today’s complex data center environment, operators must continuously juggle multiple integrated mechanical, electrical, and data systems to minimize the risk of critical failure. Data center managers traditionally have relied on preventive maintenance (PM) programs to manage risk and keep facilities and systems running. Reliability and performance are the goals, and detailed maintenance routines include tasks such as overhauling a rotary uninterruptible power supply (UPS) or generator and replacing system components on a schedule to prevent failure during live operation.

However, some of these standard operating procedures are not optimal and can even introduce more risk into the system. For example, replacing certain non-critical components as part of routine maintenance can increase the risk of a critical failure rather than reduce it. While it may be counterintuitive, data center managers would be better served by letting certain components “run-to-fail” (continue operating in place until the component wears out or fails naturally).

Reliability-centered maintenance (RCM) is a different approach that uses analysis to determine the appropriate failure management strategy for each component in a system, so that performance is ensured in the most cost-effective way. RCM seeks to preserve system function over equipment function, not just operation for operation’s sake. This approach can yield a quick return for any data center.


RCM was first developed by the aviation industry to manage system-wide maintenance. Airline maintenance is a high-stakes endeavor: in a data center, a critical component failure may cause system downtime; in the airline industry, critical component failures can have potentially life-threatening consequences.

Early PM programs were designed based on the assumption that periodic system overhauls are needed to ensure reliability and safety (see figure 1). Replacing components before they had a chance to wear out and fail seemed a prudent approach. However, as maintenance costs began to rise in the 1960s without delivering any corresponding increase in reliability, airlines began to question this assumption. The FAA formed a Maintenance Steering Group (MSG) to research the issue and recommend new approaches.

The full findings of the FAA’s Maintenance Steering Group were incorporated in a 1968 report titled MSG-1. The MSG determined that overhaul maintenance is indeed flawed. The overhaul philosophy assumes that most equipment follows a “bathtub curve” (see figure 2): there is an initial break-in period during which risk of failure is high, followed by a fairly long stable operations phase, and lastly a definable wear-out point. In an overhaul, components are removed from the system prior to their defined wear-out point and replaced with new components.

Surprisingly, the study found that implementation of overhauls in many cases provided no improvement in safety or reliability. In fact, in some cases, the performance of the system actually worsened after an overhaul.

Overhauls increase system risk for two reasons:

  • As maintenance is performed, there is always some potential for the introduction of maintenance (human) error.
  • As new components are introduced into the system, a certain percentage will suffer from “infant mortality” (failure during the initial break-in period).

In the airline industry at that time, as in the data center industry today, overhaul limits were often not determined by analysis of the actual performance parameters of components, leading to high maintenance costs with little benefit.

“Data center maintenance is all about reliability, but without knowing it, data center managers are introducing risks into their environment by doing things the same way they have always done them. It is counter-intuitive, but sometimes it might be better to not maintain something until it fails,” said Gary Lee, president/CEO, Universal Asset Management.

In fact, many components have no wear-out characteristic; there isn’t a universal bathtub curve, but rather many failure rate profiles for various components (see figure 3). For a component with pronounced wear-out, for example, it may be more cost-effective and less risky to replace the component proactively as part of a routine overhaul. But for other components with different failure rate curves, run-to-fail might be more cost-effective. For equipment with a constant or decreasing failure rate, the overhaul approach means a substantial portion of the useful life of the component population is being sacrificed.
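One common way to make these failure rate profiles concrete is the Weibull hazard function, whose shape parameter determines whether the failure rate falls, stays flat, or rises with age. The article does not name a specific model, so the Weibull choice, the function names, and the sample values in this minimal Python sketch are illustrative assumptions only.

```python
import math


def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate h(t) = (shape/scale) * (t/scale)**(shape-1)."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def failure_rate_trend(shape: float, scale: float = 10.0,
                       early: float = 1.0, late: float = 9.0) -> str:
    """Compare the hazard early and late in life (hypothetical time units)."""
    h_early = weibull_hazard(early, shape, scale)
    h_late = weibull_hazard(late, shape, scale)
    if math.isclose(h_early, h_late):
        return "constant"
    return "increasing" if h_late > h_early else "decreasing"


# Hypothetical shape values chosen only to show the three profiles.
for shape in (0.5, 1.0, 3.0):
    print(shape, failure_rate_trend(shape))
```

Under this assumption, a shape below 1 corresponds to infant mortality, a shape of 1 to a constant failure rate, and a shape above 1 to wear-out. Only the last profile gives an age-based replacement a clear rationale.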


RCM was first applied to the Boeing 747. The key to RCM is an analytical process that looks at the reliability of various components and the severity of the consequences if component failure occurs. The consequences are the measurable impact of an equipment failure in the areas of operation, performance, safety, environmental impact, and economics. An organization can evaluate the value of each maintenance task by considering all of these factors.

RCM represents a sea change from the way data centers are typically managed today, where the priority is on preventing or minimizing failures. With an RCM approach, however, the goal is not necessarily to prevent all failures, but to manage them cost-effectively, without compromising safety or performance, while meeting mission requirements. RCM reduces the consequences of failures to an acceptable level. The focus is not on component failure, but on function failure.

“The airline industry and NASA have proven that reliability and costs can improve significantly with an effective reliability-centered maintenance program. The data center industry can learn in a lot of ways from other industries, affording it the opportunity to adopt and implement reliability-centered maintenance as business demands continue to increase in this sector,” said Jason Schafer, research manager, Datacenter Technologies, 451 Research.


In the context of RCM, there are multiple maintenance disciplines that, if used in concert, will optimize system performance while minimizing maintenance costs. These include PM actions and corrective maintenance, together with the predictive and proactive techniques described below.

PM actions and corrective maintenance are taken to preserve functionality, reduce unplanned downtime, and minimize impacts to mission performance. By their nature, PM actions require some level of investment in activities that go beyond simple corrective maintenance such as inspecting, monitoring, refurbishing, or replacing. The RCM process helps organizations evaluate the trade-off between investment in these activities and overall operating costs.

Corrective maintenance or run-to-fail (RTF, also referred to as reactive maintenance or repair) responds to failures only after they occur as a result of deterioration in equipment condition from active use. RTF may be the most effective approach for many types of equipment where the consequences of a failure are deemed to be minimal or acceptable. RCM analysis compares the risk and cost of failure against the cost of PM activities that would be required to mitigate or prevent that failure. RCM employs an integrated “failure management strategy” that helps determine the proper balance between planned (preventive) and run-to-fail approaches.
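The comparison described above can be sketched as a simple expected-cost calculation. This is a minimal illustration, not a full RCM analysis: the Component fields, the function names, and all figures are hypothetical, and a real analysis would also weigh safety, environmental, and mission consequences that a single cost number does not capture.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    failures_per_year_rtf: float  # expected failures per year with no PM
    failures_per_year_pm: float   # expected failures per year with PM in place
    cost_per_failure: float       # repair plus consequence cost of one failure
    pm_cost_per_year: float       # annual cost of the PM tasks themselves


def annual_cost_rtf(c: Component) -> float:
    """Expected yearly cost if the component is simply run to failure."""
    return c.failures_per_year_rtf * c.cost_per_failure


def annual_cost_pm(c: Component) -> float:
    """Expected yearly cost with PM: the PM spend plus residual failures."""
    return c.pm_cost_per_year + c.failures_per_year_pm * c.cost_per_failure


def cheaper_strategy(c: Component) -> str:
    """Pick the lower expected-cost strategy (cost only, by assumption)."""
    return "run-to-fail" if annual_cost_rtf(c) <= annual_cost_pm(c) else "preventive"


# Hypothetical components with made-up numbers.
fan = Component("cooling fan", 0.2, 0.1, 300.0, 150.0)
ups = Component("UPS battery string", 0.1, 0.02, 500_000.0, 5_000.0)
for c in (fan, ups):
    print(c.name, cheaper_strategy(c))
```

With these made-up inputs, the low-consequence fan favors run-to-fail while the high-consequence UPS component favors preventive work, which is the kind of balance the failure management strategy is meant to find.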

For non-critical system components, for example a single cooling fan in an array of fans, the impact of a failure would be minimal. The RCM approach includes identifying components where investment in redundancy and backup yields a better net cost than investing in PM. For example, if a system is configured so that 30 percent of fans in an array can fail before system performance is compromised, an organization can save time and money by delaying maintenance until 25 percent of fans have failed, rather than replacing each failed fan on an ongoing ad-hoc basis, or engaging in routine fan overhauls before failure occurs. Letting each fan run to the end of its life extracts maximum value from the original equipment investment.
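The fan example above amounts to a two-threshold rule, sketched here in Python. The 25 and 30 percent figures come from the example; the function name, the returned labels, and the treatment of exactly 30 percent as still acceptable are assumptions made for illustration.

```python
def fan_array_action(total_fans: int, failed_fans: int,
                     trigger_percent: int = 25, limit_percent: int = 30) -> str:
    """Decide what to do with a fan array given how many fans have failed."""
    if total_fans <= 0 or not 0 <= failed_fans <= total_fans:
        raise ValueError("fan counts are out of range")
    # Integer arithmetic avoids floating-point surprises at the thresholds.
    if failed_fans * 100 > limit_percent * total_fans:
        return "degraded"              # beyond what the array can tolerate
    if failed_fans * 100 >= trigger_percent * total_fans:
        return "schedule replacement"  # batch-replace before the limit is hit
    return "keep running"              # remaining fans still cover the load


# A hypothetical array of 20 fans.
for failed in (4, 5, 7):
    print(failed, fan_array_action(20, failed))
```

Keeping the trigger below the limit leaves a margin in which the batch replacement can be planned rather than rushed.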

  • Predictive Testing & Inspection (PT&I). PT&I is a cornerstone of RCM. By understanding the statistical performance and failure parameters of each component, identifying key indicators that are the precursors to failure, and setting up inspection or sensor regimens to monitor for those precursor signs in critical components, an organization can design the optimal RCM program. Based on ongoing PT&I monitoring, maintenance teams can respond to real-time data to repair or replace equipment when the data indicates that its condition is deteriorating or that it is at risk of failure.
  • Proactive Maintenance. In environments like data centers where the consequences of failure can vary dramatically, proactive maintenance provides a solid body of evidence-based information and data that help organizations better understand and plan for performance parameters, and make decisions based on sound technical and economic justification. For example, proactive maintenance resources and activities can incorporate equipment specification, failed-part analysis, and reliability engineering.
  • Other Activities. A comprehensive RCM approach also includes actions to address issues that compromise acceptable levels of reliability. These actions can include redesign, procedural changes, training, improvements to maintenance manuals, or insertion of new technology in the maintenance or operational environment. For example, figure 4 shows a facility dashboard that integrates multiple factors to help track and predict cooling capacity.
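The PT&I item in the list above describes watching for precursor signs in sensor data. A minimal Python sketch of that idea follows, assuming a single hypothetical indicator (such as a vibration reading) and a simple rule: raise a flag when the average of the most recent readings reaches an alert level. The function name, window size, and values are illustrative assumptions, not part of any particular monitoring product.

```python
from statistics import mean


def needs_attention(readings: list[float], alert_level: float,
                    window: int = 3) -> bool:
    """Return True when the mean of the last `window` readings reaches alert_level."""
    if len(readings) < window:
        return False  # not enough data yet to judge a trend
    return mean(readings[-window:]) >= alert_level


# Hypothetical readings for two pieces of equipment.
healthy = [2.0, 2.1, 2.0, 2.2]
deteriorating = [2.0, 3.9, 4.2, 4.5]
print(needs_attention(healthy, alert_level=4.0))
print(needs_attention(deteriorating, alert_level=4.0))
```

Averaging over a short window is one simple way to avoid reacting to a single noisy reading; a real PT&I program would choose indicators and limits from the failure data for each component.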


RCM is an integrated approach: it employs PM, PT&I, run-to-fail, and proactive maintenance techniques to take advantage of each of their respective strengths to ensure facility and equipment operability and efficiency, while providing the required reliability and availability, at the lowest cost. NASA, in its February 2002 Reliability Centered Maintenance Guide for Facilities and Collateral Equipment, explains, “RCM seeks the optimal mix of Condition-Based Actions, other Time- or Cycle-Based actions, or a Run-to-Failure approach…it is an ongoing process that gathers data from operating systems performance and uses this data to improve design and future maintenance. These maintenance strategies, rather than being applied independently, are integrated to take advantage of their respective strengths in order to optimize facility and equipment operability and efficiency while minimizing life-cycle costs.”

The benefits of implementing RCM in the data center environment are compelling and have already been proven in the even more demanding maintenance environment of the airline industry.

For data center managers, RCM benefits include:

  • Improved Performance. Maintenance organizations can focus on the most critical equipment elements, with shorter work lists reducing extensive and costly shutdowns. There are fewer burn-in problems from unnecessary replacements, and unreliable components are identified more efficiently.
  • Higher Quality. The discipline of the RCM process results in a better understanding of equipment capacity and capability, a better set of equipment setup specifications and requirements, confirmation or redefinition of equipment operating procedures, and a clearer definition of maintenance tasks and objectives.
  • Cost Effectiveness. Using an RCM approach dramatically reduces unnecessary routine maintenance, helps prevent or even eliminate expensive failures, and introduces defined decision guidelines for acquiring new maintenance technology. The NASA RCM Guide states, “The flexibility of the RCM approach to maintenance ensures that the proper type of maintenance is performed on equipment when it is needed. Maintenance that is not cost effective is identified and not performed.”
  • Life Cycle Costs. RCM reduces life-cycle costs by optimizing maintenance workloads and providing a clearer view of spares and staffing requirements. It also enables organizations to realize a longer useful life of expensive equipment items through the use of condition-based maintenance techniques. The NASA RCM Guide explains, “the cost of repair decreases as failures are prevented and preventive maintenance tasks are replaced by condition monitoring. The net effect is a reduction of both repair and a reduction in total maintenance cost. Often energy savings are also realized from the use of PT&I techniques.”
  • Enhanced Maintenance Data. RCM provides a better understanding of equipment in its operating context, which in turn leads to more accurate and complete drawings and manuals, and allows maintenance schedules to be more adaptable to changing circumstances.


Currently, data center managers are unknowingly introducing risks into their environment by following traditional, overhaul-oriented maintenance regimens. The conservative nature of data center operations has stalled the adoption of proven methods from other industries. However, with the right combination of analysis, tools, and procedures, an RCM program can be implemented smoothly to reduce cost and risk.

In the data center industry, where 86 percent of downtime is due to infrastructure failure and human error (according to a study commissioned by Sun Microsystems), an effective RCM program can have significant impact.

“RCM yields results very quickly; most organizations can complete an RCM review on existing equipment and achieve substantial benefits in less than a year. It is also an ideal approach for determining the maintenance requirements of new equipment of all kinds. When applied correctly, it transforms both the maintenance requirements themselves and the way in which the maintenance function as a whole is perceived,” said Lee.

Most significantly, past experience has demonstrated that data centers can realize substantial savings in maintenance costs. The NASA RCM Guide reports that “savings of 30 percent to 50 percent in the annual maintenance budget are often obtained through the introduction of a balanced RCM program.” As energy, labor, and operating costs continue to increase in the data center industry, there is a compelling case for the adoption of RCM.