In the data center industry, equipment redundancy is widely utilized to achieve high system availability, often required to be in the range of 99.999% (five nines). However, the required level of redundancy is dependent on equipment reliability.
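To make the five-nines target concrete, the sketch below converts an availability figure into the downtime it allows per year. The function name and the 365.25-day year are illustrative assumptions, not part of any standard.

```python
# Convert an availability target into an allowed-downtime budget per year.
# Assumptions: a 365.25-day year; the function name is illustrative only.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def allowed_downtime_minutes(availability: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    targets = [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]
    for label, target in targets:
        print(f"{label}: {allowed_downtime_minutes(target):.1f} minutes per year")
```

Under these assumptions, five nines leaves roughly 5.3 minutes of downtime per year.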

For instance, an N+1 system cannot achieve high availability if its components are unreliable enough that simultaneous failures become likely. Reliability affects availability, but the two are not the same: availability also depends on how quickly failed equipment is restored. Reliability also impacts operating costs, because more downtime means more maintenance and repair spending.
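The dependence of N+1 availability on component quality can be illustrated with a simple k-of-n model. This is a minimal sketch that assumes identical units failing independently of one another, which real plants only approximate (common-cause failures are ignored). The function name and the example values are hypothetical.

```python
from math import comb


def k_of_n_availability(k: int, n: int, unit_availability: float) -> float:
    """Probability that at least k of n identical, independent units are available."""
    if not 0 <= k <= n:
        raise ValueError("require 0 <= k <= n")
    a = unit_availability
    # Sum the binomial probabilities of exactly i units being up, for i = k..n.
    return sum(comb(n, i) * a**i * (1.0 - a) ** (n - i) for i in range(k, n + 1))


if __name__ == "__main__":
    # Hypothetical N+1 arrangement: 4 units needed, 5 installed.
    for unit in (0.99, 0.999):
        system = k_of_n_availability(4, 5, unit)
        print(f"unit availability {unit}: system availability {system:.6f}")
```

In this model, five units at 99% availability each give a system availability of only about 99.9%, while units at 99.9% bring the same N+1 arrangement to roughly five nines.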

Reliability is defined as the probability that an item will perform its intended function for a specified interval under stated conditions. Regarding reliability, there are some important questions to ask, including, but not limited to, the following:

  • Does your data center use reliability centered maintenance (RCM) concepts to optimize your maintenance efforts?
  • Have you completed an equipment criticality analysis?
  • Do you track mean time between failures (MTBF) routinely?
  • Have you optimized your preventive maintenance (PM) plans?
  • Do you track equipment failures and improve your processes accordingly?
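The reliability definition and the MTBF question above can be tied together with a small calculation. The sketch assumes a constant failure rate (the exponential model), which is a simplification that ignores infant mortality and wear-out. The function names and the fleet figures are hypothetical.

```python
import math


def mtbf_hours(total_operating_hours: float, failure_count: int) -> float:
    """Mean time between failures: cumulative operating hours divided by failures."""
    if failure_count <= 0:
        raise ValueError("at least one failure is needed to estimate MTBF")
    return total_operating_hours / failure_count


def reliability(interval_hours: float, mtbf: float) -> float:
    """Probability of surviving the interval, assuming a constant failure rate."""
    return math.exp(-interval_hours / mtbf)


if __name__ == "__main__":
    # Hypothetical fleet: 5 similar units running all year, 2 failures logged.
    fleet_mtbf = mtbf_hours(total_operating_hours=5 * 8760, failure_count=2)
    print(f"MTBF: {fleet_mtbf:.0f} hours")
    print(f"One-year reliability per unit: {reliability(8760, fleet_mtbf):.2f}")
```

With these made-up numbers the MTBF is 21,900 hours, yet each unit still has only about a two-in-three chance of running a full year without failure.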

The Goal: Minimize Expenses, Maximize Reliability

In today’s competitive market, operating expenses must be minimized without sacrificing reliability and uptime. Many data centers develop their critical equipment scopes of service solely based upon OEM service recommendations. While this may produce adequate results, it is not typically optimal. Many times, these recommendations cater to the best interests of the service organizations, not the end user. In fact, there are often better methods that use RCM principles to improve reliability while lowering cost.

While RCM programs have proven to be effective, they can be expensive and resource-intensive. They involve creating a detailed failure modes and effects analysis (FMEA) and populating decision worksheets, which require specialized knowledge and can be very time-consuming. With this in mind, implementing a comprehensive RCM program in a data center is not generally cost-effective. By contrast, a PM optimization program that employs key RCM elements and historical information about common failure modes has proven economical and effective in other industries and offers a good model for data center adoption.
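One lightweight way to borrow from FMEA without a full RCM study is to rank known failure modes by a risk priority number (RPN), the product of severity, occurrence, and detection scores. The sketch below illustrates that general technique and is not a description of any particular operator's worksheet. The class, the 1-10 scoring scale, and the example entries are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    equipment: str
    description: str
    severity: int    # 1 (negligible) to 10 (catastrophic), assumed scale
    occurrence: int  # 1 (rare) to 10 (frequent), assumed scale
    detection: int   # 1 (easily detected) to 10 (hard to detect), assumed scale

    @property
    def rpn(self) -> int:
        """Risk priority number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def rank_failure_modes(modes: list[FailureMode]) -> list[FailureMode]:
    """Return failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Hypothetical entries, for illustration only.
EXAMPLE_MODES = [
    FailureMode("CRAH unit", "fan bearing wear", 5, 4, 2),
    FailureMode("UPS", "battery string degradation", 8, 5, 4),
    FailureMode("Generator", "fuel contamination", 9, 3, 3),
]

if __name__ == "__main__":
    for mode in rank_failure_modes(EXAMPLE_MODES):
        print(f"{mode.rpn:>4}  {mode.equipment}: {mode.description}")
```

PM tasks can then be reviewed starting with the highest-ranked failure modes, where optimization effort is most likely to pay off.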

The figure below represents the widely known potential failure to functional failure curve (P-F curve) with preventive and predictive maintenance strategies.

The P-F curve is a fundamental RCM concept that can be successfully employed without completing an exhaustive analysis. Many such reliability tools can be implemented to significantly improve the condition and useful life of assets as outlined below.
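As one example of applying the P-F curve without a full analysis, a common rule of thumb is to inspect at least twice within the P-F interval, the time between a detectable defect and functional failure. The sketch below encodes that rule. The function names, the default of two inspections, and the 90-day interval are assumptions for illustration.

```python
def max_inspection_interval(pf_interval_days: float, checks_within_pf: int = 2) -> float:
    """Longest inspection interval that still fits the required checks in the P-F interval."""
    if pf_interval_days <= 0 or checks_within_pf < 1:
        raise ValueError("P-F interval must be positive and checks_within_pf at least 1")
    return pf_interval_days / checks_within_pf


def interval_can_catch_failure(
    current_interval_days: float, pf_interval_days: float, checks_within_pf: int = 2
) -> bool:
    """True if the current inspection interval satisfies the rule of thumb."""
    return current_interval_days <= max_inspection_interval(pf_interval_days, checks_within_pf)


if __name__ == "__main__":
    # Hypothetical defect that becomes detectable about 90 days before failure.
    print(max_inspection_interval(90))         # longest acceptable interval, in days
    print(interval_can_catch_failure(30, 90))  # monthly inspection
    print(interval_can_catch_failure(180, 90)) # semiannual inspection
```

A task whose interval is longer than the P-F interval allows is a candidate for a shorter interval or a different detection method.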

The Solution: Implement a Reliability Program

In 2017, RagingWire made the decision to strategically implement a reliability program for its data centers. The company started by hiring a reliability engineer with a manufacturing background.

The initial reliability initiatives included:

  1. Scopes of Service
    1. Developed for 81 categories of equipment.
    2. Inputs were OEM recommendations and codes from governing organizations (IEEE, ANSI/NETA, ASHRAE, NFPA).
    3. Equipment list included support equipment, such as forklifts, pallet lifts, elevators, lightning protection, overhead doors, dock levelers, gates, and water supply systems.
    4. Used to create task lists for all equipment and set up in a computerized maintenance management system (CMMS) for PM plans.
  2. CMMS
    1. Standards developed and documented.
    2. Program redeployed to remove information on screens that was not used or needed.
    3. Added reliability fields, such as failure, cause and remedy codes, and useful life.
    4. Entered corrective work orders for internal and external work activities.
    5. Conducted training across the company on the implemented changes.
    6. Established a training matrix for ongoing annual training and for new hires.
    7. Established an advisory team that meets monthly to discuss CMMS use and changes that would enhance the program.
    8. Created a detailed user’s guide.
    9. Set up environmental health and safety (EHS) periodic requirements to ensure accomplishment.
  3. Reliability
    1. Roadmap created with responsibility assigned.
    2. Created a reliability steering team.

*Note: The reliability pyramid is a great resource and can be found at

  1. Cost savings
    1. Cost reduction team was created, including engineering and operations personnel.
    2. Purchasing team negotiated national agreements for major equipment and expenses.
    3. Annual savings of $250,000 realized for utilizing the scopes of service.
  2. PMs
    1. Engaged an oil analysis company for diesel generator and transformer oil, with online reporting.
    2. PM optimization process implemented with FMEAs for critical equipment.
  3. Asset management
    1. Definition of an asset established and asset list created.
    2. Equipment hierarchy defined.
    3. Equipment criticality established.
    4. Determined maintenance strategies: predictive maintenance (PdM), PM, failure finding, redesign, run to failure.
  4. Root cause analysis (RCA)
    1. Program developed with approved policy and detailed procedure.
    2. RCA software selected to solidify the process.
    3. Training conducted with selected engineering and operations personnel.
  5. Documentation created: PM optimization policy and procedure, thermography policy and procedure, predictive maintenance policy, oil analysis policy and procedure, motor circuit analysis policy, vibration analysis policy, and CMMS employment policy.
  6. Workflow established for work order processing.

The Future: More Initiatives to Come

Down the road, more initiatives are planned, including:

  1. Procedure for determining equipment expected useful life to aid in developing the capital plan.
  2. Establishing PdM and condition-based maintenance (CBM) programs.
  3. Using reliability key performance indicators (KPIs) for identifying opportunities for continuous improvement.
  4. Creating a storeroom management program for proper identification of spare parts needed on-site and storage for easy access.

Typical benefits expected to be realized from reliability programs include a reduction of equipment failures and maintenance costs, improved work order efficiency, increased asset useful life, and a safer environment due to reduced risk from equipment maintenance.

In addition, some side benefits include capturing equipment history for asset management and annual budgeting, systematically eliminating root causes of experienced failures, and evaluating maintenance activities for continuous improvement opportunities.

RagingWire has experienced cost savings and improved work efficiency from its new reliability program. It is expected that capturing failure data and improving the maintenance process will continue to enhance expected useful life of assets, thereby reducing capital expenditures. Key metrics are being tracked to ensure that expectations match results. Focusing beyond the inherent redundancy in data centers by prioritizing reliability is a major step toward the goal of becoming the low-cost provider.