The data center industry continues to face growing demand for reliability and high availability in both the IT and support infrastructure systems (mechanical, electrical, and fire protection, or MEFP), a trend that will not diminish in the foreseeable future. Best practice, continuous process improvement, and change management programs are implemented or reinvented to meet these demands. Extensive MOPs, SOPs, and ESOPs are crafted to ensure the “bad actors” are reduced or eliminated. Changing the culture within a data center organization to orient the attitudes of all stakeholders toward high availability remains a priority at the C-Level. Those C-Level executives and their operations managers interested in achieving cultural change may best be served by adopting what the aviation industry implemented in the 1960s: reliability centered maintenance (RCM). The aim of RCM is sustained data integrity and availability of all systems, IT and MEFP alike.

 

RCM BACKGROUND

In the 1960s, United Airlines engaged a number of its engineers to assess the efficacy of the preventative maintenance program for its new fleet of 747s. This group, the Maintenance Steering Group (MSG), published the MSG-1 Handbook, establishing RCM on the 747 commercial airliner and ensuring high availability and the safety of the traveling public.

In 1978, Stanley Nowlan and Howard Heap published a report titled Reliability Centered Maintenance after an exhaustive study of aircraft failure modes and effects, updating earlier RCM techniques for optimizing the maintenance of complex systems. In 1983, Stanley Nowlan began collaborating with John Moubray, delving deeper into RCM practices; that work resulted in the 1991 publication of Moubray’s book, RCM II: Reliability Centered Maintenance.

Moubray went on to develop a suite of training and support services designed to transfer the technology of RCM to industrial clients, founding Aladon Ltd in 1986 and Aladon LLC in the U.S. in 1998. RCM2 is defined as a process used to determine what must be done to ensure that any physical asset continues to do what its users want it to do in its present operating context.

The goal of RCM is consequence mitigation rather than failure avoidance. The automotive industry embraced RCM in the late 1980s. Dr. Klaus Blache’s Reliability & Maintenance Team at General Motors worked with Ford, Chrysler, Boeing, Caterpillar, Pratt & Whitney, Rockwell International, and many other contributing organizations to create a reliability and maintainability guideline. The result was a 1993 publication by the National Center for Manufacturing Sciences, Inc. and the Society of Automotive Engineers (SAE) titled Reliability and Maintainability Guideline for Manufacturing Machinery and Equipment (publication M-110). In 1999, SAE issued JA1011, Evaluation Criteria for RCM Processes, establishing criteria for RCM processes (with NAVAIR and Aladon/John Moubray as major contributors). In 2002, SAE issued JA1012, A Guide to the RCM Standard, amplifying and clarifying key RCM concepts and terms from SAE JA1011.

 

WHY RCM FOR IT SYSTEMS?

Applying the RCM methodology to IT systems complements change management. The rigorous approach of developing an operating context; completing an information worksheet with functions, functional failures, failure modes, and failure effects; and then completing a decision worksheet to review the failure consequences and identify proposed or default tasks that reduce or eliminate them reinforces the foundation of change management. Moreover, RCM will likely result in cultural change, leading all participants to the improved organizational performance outlined below.

Remember that on July 8, 2015, some 50 years after establishing RCM for its aircraft, United Airlines grounded hundreds of flights because of computer problems in its ground-based IT network, not because of a failed system or component on its planes. Computer experts say the problems could be blamed on the use of larger and more complicated computer systems that are not supported with sufficient staffing, testing, or backup systems.

A Wall Street Journal article offered the following regarding the outage: “Today’s problems with reliability are more fundamental, a reflection of the complexity of contemporary networks, the volume of data, the pace of change, insufficient organizational and cultural practices, and a legacy of arcane and poorly written business software that traditionally put little emphasis on usability…”.

While the author in no way places the safety of lives in aviation on par with data center availability, recent operational interruptions, data breaches, and natural and man-made threats and disasters have had a significant impact on lives because of the loss of data. Electricity grids, credit cards, social media, communication networks, and public transportation have all become indispensable to everyday modern life. The RCM methodology, developed for complex systems, particularly mechanical ones, is applicable to the complex systems and processes that comprise IT networks, mitigating the consequences of failure.

Furthermore, many regulatory boards and standards institutes have developed requirements and guidelines for data integrity. Table 1 provides a list of regulatory and compliance standards which set minimum requirements for sustaining business operations, disaster recovery, business continuity management (BCM), and information and communication technology (ICT) continuity.

The list in Table 1 demonstrates the importance placed on sustained data integrity, business continuity planning, disaster recovery, and ICT continuity, all of which require a reliably designed and maintained IT system and support infrastructure. Consider a centralized network, where a simple loss of connection between the server and its clients is enough to cause a failure; in a peer-to-peer (P2P) network, the connections to every node must be lost to cause a data sharing failure. In a centralized system, the administrators are responsible for all data recovery and backups, while in P2P systems each node requires its own backup system. Each network has its advantages and disadvantages, along with its own failure modes and failure consequences, as the sketch below illustrates.
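To make the contrast concrete, here is a minimal sketch of the failure arithmetic, assuming independent link failures and illustrative probabilities that are not drawn from any real network:

```python
# Illustrative comparison of failure exposure in centralized vs. P2P topologies.
# Both values below are assumed examples, not measurements.

p_link_failure = 0.01  # probability any single connection is down (assumed)
n_peers = 5            # number of peer connections in the P2P network (assumed)

# Centralized: the single server link is a single point of failure.
p_central_outage = p_link_failure

# P2P: data sharing fails only if every peer connection is lost
# (assuming independent link failures).
p_p2p_outage = p_link_failure ** n_peers

print(f"Centralized outage probability: {p_central_outage:.2%}")  # 1.00%
print(f"P2P outage probability:         {p_p2p_outage:.2e}")      # 1.00e-10
```

The point is not the specific numbers but the structure: the centralized topology concentrates failure consequences in one link, which is exactly the kind of distinction an RCM operating context must capture.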

Leveraging the operating context required in an RCM analysis allows a team of IT operators, maintainers, and external subject matter experts to rigorously analyze a network. The team lists functions, functional failures, causes of failure, and failure effects. Finally, a decision matrix serves as a focusing tool for reviewing the failure consequences and determining proposed task(s) intended to reduce or eliminate those consequences. One way to capture that worksheet is sketched below.
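As a minimal sketch, the information worksheet described above can be represented as a simple record; the field names follow the worksheet columns named in the text, and the example row is hypothetical:

```python
# A minimal sketch of one RCM information worksheet row as a data structure.
# The example values are hypothetical, for illustration only.

from dataclasses import dataclass

@dataclass
class WorksheetRow:
    function: str            # what the users want the asset to do
    functional_failure: str  # a way it fails to fulfill that function
    failure_mode: str        # the cause of the functional failure
    failure_effect: str      # what happens when the failure occurs

row = WorksheetRow(
    function="Pass traffic between the core switch and the storage array",
    functional_failure="Unable to pass traffic",
    failure_mode="Optical transceiver degrades beyond link tolerance",
    failure_effect="Storage I/O errors; applications time out until failover",
)
print(row.failure_mode)
```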

Consider, again, a disk array, where the applications utilizing the hardware, along with statistical data on disk failure rates, can be analyzed to determine both the opportunities for failure (and possible mitigations) and the cost to the associated business applications should a failure occur. While most disk arrays leverage some degree of redundancy, the rigorous RCM review of the operating context, functions, functional failures, failure modes, and failure effects identifies default task(s), or determines whether a newly proposed maintenance task for the array is technically feasible and worth doing. A worked reliability calculation for such an array follows.
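As a hedged sketch of the kind of statistical analysis mentioned above, the survival probability of a k-of-n redundant array can be computed from an assumed per-disk survival probability (the numbers below are illustrative, not from the source):

```python
# Hypothetical reliability sketch for a redundant disk array: the array
# survives as long as at least k of its n disks survive (a RAID-like
# k-of-n arrangement), assuming independent disk failures.

from math import comb

def array_survival(n: int, k: int, p_disk: float) -> float:
    """Probability that at least k of n independent disks survive."""
    return sum(comb(n, i) * p_disk**i * (1 - p_disk)**(n - i)
               for i in range(k, n + 1))

p_disk = 0.97  # assumed per-disk survival probability over the review period

print(f"Single disk (1 of 1):  {array_survival(1, 1, p_disk):.4f}")
print(f"RAID-6-like (6 of 8):  {array_survival(8, 6, p_disk):.6f}")
```

Redundancy narrows the failure probability dramatically, but it also creates hidden failure modes (a dead spare no one notices), which is precisely why the RCM decision process asks whether each failure is evident to operators.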

 

IMPLEMENTATION

RCM has proven to be a leading methodology for failure mitigation in many industries, its training techniques and support services born from years of development in the airline industry. The data center's IT systems and supporting infrastructure of MEFP systems are poised to realize significant benefits from the implementation of RCM. If the RCM process is correctly applied, it makes the following contributions to the performance of the organization:

 

• Greater safety and environmental integrity

• Improved operating performance (uptime, output, product quality, and customer service)

• Greater maintenance cost-effectiveness

• Greater motivation of individuals

• Better teamwork

• A comprehensive database (long-term asset life cycle management and financial savings).

 

How does RCM benefit a data center’s IT and MEFP infrastructure? A trained and certified RCM facilitator/practitioner leads a team of site-specific operations & maintenance (O&M) personnel and external subject matter experts to assess an asset’s functions and associated performance standards.

The first requirement of the RCM process is to establish the operating context for the system, which should include the business case for the pilot analysis and the overall mission statement of the entire organization, and must include a plant-level, machine-level, and analysis-level outline. The team then identifies functional failures, failure modes, and failure consequences, and finally is led through a decision process to identify proactive tasks or default actions that reduce or eliminate failure consequences. The RCM methodology is used to determine the maintenance requirements of any physical asset, system, or process in its current operating context to ensure it continues to do whatever its users want it to do. And when that operating context changes, the system is reevaluated to determine whether it can support the new parameters, resulting in revised O&M requirements, hiring or (re)training personnel, or a functional/physical design change.

The RCM process entails asking seven questions about the asset or system under review, as follows:

 

• What are the functions and associated performance standards of the asset in its present operating context?

• In what ways does it fail to fulfill its functions?

• What causes each functional failure?

• What happens when each failure occurs?

• In what way does each failure matter?

• What can be done to predict or prevent each failure?

• What should be done if a suitable preventative task cannot be found?

 

The strength of RCM is the way it provides simple, precise, and easily understood criteria for deciding which (if any) predictive and/or preventative tasks are technically feasible in any context and, if so, for deciding how often they should be done and who should do them. In addition to a preventative task’s technical feasibility, whether it is worth doing is governed by how well it deals with the consequences of the failure. If a preventative task cannot be found that is both technically feasible and worth doing, then a suitable default action must be taken. The essence of the task selection process is as follows:

For hidden failures, a predictive and/or preventative task is worth doing if it reduces the risk of the multiple failure associated with that function to a tolerably low level. If such a task cannot be found, then a scheduled failure-finding task must be performed. If a suitable failure-finding task cannot be found, a secondary default decision is reached requiring a redesign.

For failures with safety or environmental consequences, a predictive and/or preventative task is only worth doing if it reduces the risk of that failure on its own to a very low level, if it does not eliminate the failure altogether. If a task cannot be found that reduces the risk of the failure to a tolerably low level, the item must be redesigned or the process must be changed.

For failures with operational consequences, a predictive and/or preventative task is only worth doing if the total cost of doing it, over a period of time, is less than the cost of the operational consequences plus the cost of repair over the same period. If this criterion is not met, the initial default action is no scheduled maintenance (if it is met and the operational consequences are still unacceptable, the secondary default action is to redesign).

For failures with non-operational consequences, a predictive and/or preventative task is only worth doing if the total cost of doing it, over a period of time, is less than the cost of repair over the same period; otherwise, the default action is no scheduled maintenance (if the repair costs are too high, the secondary default action is, possibly, to redesign). This selection logic is sketched in code below.
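As a minimal sketch of the task-selection essence just described (the categories and cost comparison are simplified; a real decision worksheet involves far more judgment from the review team):

```python
# Simplified sketch of the RCM task-selection logic described above.
# Consequence categories follow the four paragraphs in the text; the
# costs are placeholders supplied by the caller.

from enum import Enum, auto

class Consequence(Enum):
    HIDDEN = auto()
    SAFETY_ENVIRONMENTAL = auto()
    OPERATIONAL = auto()
    NON_OPERATIONAL = auto()

def select_task(consequence: Consequence,
                proactive_task_found: bool,
                task_cost: float = 0.0,
                repair_cost: float = 0.0,
                operational_loss: float = 0.0) -> str:
    """Return the proposed or default action for one failure mode."""
    if consequence is Consequence.HIDDEN:
        if proactive_task_found:
            return "scheduled proactive task"
        return "scheduled failure-finding task; redesign if none is suitable"
    if consequence is Consequence.SAFETY_ENVIRONMENTAL:
        if proactive_task_found:
            return "scheduled proactive task"
        return "redesign the item or change the process (compulsory)"
    if consequence is Consequence.OPERATIONAL:
        # Worth doing only if cheaper than downtime losses plus repair.
        if proactive_task_found and task_cost < operational_loss + repair_cost:
            return "scheduled proactive task"
        return "no scheduled maintenance (redesign if losses stay unacceptable)"
    # Non-operational: compare task cost against repair cost alone.
    if proactive_task_found and task_cost < repair_cost:
        return "scheduled proactive task"
    return "no scheduled maintenance (possibly redesign if repairs cost too much)"

print(select_task(Consequence.OPERATIONAL, True,
                  task_cost=5_000, repair_cost=2_000, operational_loss=40_000))
```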

All too often, O&M policies employ practices used for all similar assets without considering the consequences of failure in different operating contexts. This results in large numbers of maintenance schedules that are wasteful because, while not necessarily wrong in a technical sense, they achieve nothing. In fact, a maintenance task may cause the very failure it was intended to prevent. Unnecessary tasks also expose workers’ health and safety, as well as the environment, to the risk of asset failure while the task is being performed.

The comprehensive database resulting from an RCM analysis includes the operating context, the RCM information worksheet, and the RCM decision worksheet, all of which can be leveraged for the entire life cycle of the asset. The fact that the RCM analysis is a “living document” makes it possible to adapt to changing circumstances without having to reconsider all maintenance policies, and to demonstrate that maintenance programs are built on rational foundations, thereby meeting the audit requirements of regulators and standards.

Modern data centers employ redundant components and systems designed to maximize a facility’s uptime. Component and system redundancies are intended to allow for concurrent maintainability and/or fault tolerance while sustaining IT processes. Paramount to successful operation and maintenance is understanding the operating context of the system, particularly its redundancy aspects. Consideration must be given to which parts of the redundancy are hot-standby or cold-standby and whether the system is operated and maintained to maximize system uptime. If so, have the operational checks been optimized by leveraging scientific methodologies that are sensible and defensible?

These redundancies within a data center are the arrangement of like components, each having similar control devices, with the redundant (protective) component configured to support the protected component. The RCM process analyzes failure modes and effects to understand the failure consequences, then asks a group of subject matter experts (SMEs) who know the asset best whether the failure is evident to operators under normal operating conditions and whether there is a proactive task that is technically feasible and worth doing to reduce or eliminate those consequences.

On redundant systems, in many instances, there are failure modes which are not evident to the operator under normal operating conditions and may only become evident under multiple failure conditions. These are “hidden” failures requiring the identification of tasks to secure the availability needed to reduce the probability of a multiple failure to a tolerable level.

A task that reduces the probability of a multiple failure could be an “on-condition” task, where there is a clear potential failure condition that is manageable. For example, “Check unit temperature level” may be too vague. A possible alternative is: “Visually inspect Thermal unit 14-a temperature using SOP S327 (f). If above 145°F, schedule for repair at next available downturn.” Likewise, for a custom-developed box containing code that can no longer be modified, the team might update the operating context, identify the failure mode, and schedule replacement with a coded device that allows for a user interface at the earliest downtime. A minimal automated version of the temperature check is sketched below.
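As a sketch only, the on-condition task quoted above could be automated along these lines; the sensor-reading function and unit name are hypothetical, and the 145°F limit comes from the example task wording:

```python
# Minimal sketch of the on-condition temperature check described above.
# read_temperature_f() is a hypothetical stand-in for a real sensor or
# building-management-system query.

TEMP_LIMIT_F = 145.0  # threshold from the example task wording

def read_temperature_f(unit: str) -> float:
    """Placeholder for a real sensor query (hypothetical)."""
    return 152.3  # stubbed reading for the example

def on_condition_check(unit: str) -> None:
    temp = read_temperature_f(unit)
    if temp > TEMP_LIMIT_F:
        # Potential failure condition found: schedule repair rather than
        # waiting for a functional failure.
        print(f"{unit}: {temp:.1f}F exceeds {TEMP_LIMIT_F}F; "
              f"schedule repair at next available downturn")
    else:
        print(f"{unit}: {temp:.1f}F within limits")

on_condition_check("Thermal unit 14-a")
```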

Another task could be a scheduled restoration, or a scheduled discard, where there is an identifiable age at which the probability of failure increases rapidly. A simple sketch of such wear-out behavior follows.
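As a hypothetical illustration of “an age at which the probability of failure increases rapidly,” a Weibull hazard with a shape parameter greater than 1 rises with age; the parameter values below are assumptions, not data from the article:

```python
# Hypothetical Weibull wear-out sketch: a shape parameter above 1 means the
# hazard (instantaneous failure rate) rises with age, which is the condition
# under which scheduled restoration or discard makes sense.

beta = 3.0  # shape parameter > 1: wear-out (assumed)
eta = 10.0  # characteristic life in years (assumed)

def weibull_hazard(t: float) -> float:
    """Instantaneous failure rate at age t (per year)."""
    return (beta / eta) * (t / eta) ** (beta - 1)

for age in (2, 5, 8, 10):
    print(f"age {age:>2} yr: hazard = {weibull_hazard(age):.3f} per year")
# The steeply rising hazard suggests restoring or discarding the item
# before it enters the steep region.
```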

Finally, there is a task labeled “failure-finding,” where it may be possible to test the item at a practical interval that reduces the probability of a multiple failure to a tolerable level. This involves a statistically proven methodology employing reliability data to establish a practical interval, particularly for failures involving safety and environmental consequences. For failures with operational and non-operational consequences, the methodology employs cost criteria for optimizing the interval. This is particularly instrumental for informed decision making on intervals for standby generator systems, redundant cooling units, and associated infrastructure components. For IT systems, consider analyzing the mean time between failures (MTBF) of critical-path communication devices, and evaluate spare parts or replacement devices along with fail-over plans. A worked interval calculation is sketched below.
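One widely cited approximation from the RCM literature (an assumption here, not stated in this article) sets the failure-finding interval (FFI) for a single protective device at roughly twice the tolerable unavailability times the device’s MTBF; the numbers below are illustrative:

```python
# Hedged sketch of a common failure-finding interval approximation:
#   FFI ≈ 2 * U * MTBF
# where U is the tolerable unavailability of the hidden (protective)
# function and MTBF is the mean time between failures of the protective
# device. Both inputs below are assumed example values.

mtbf_years = 10.0         # assumed MTBF of a standby generator start system
tolerable_unavail = 0.02  # assumed: accept a 2% chance it fails to start

ffi_years = 2 * tolerable_unavail * mtbf_years
print(f"Failure-finding interval ≈ {ffi_years:.2f} years "
      f"({ffi_years * 12:.1f} months)")
# 0.40 years, i.e., a functional test roughly every 4.8 months
```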

All too often, systems designed to operate as standby (cold) redundant are actually operated as parallel (hot) redundant. Operation in this manner reduces system reliability because it introduces excessive wear on the redundant components. The RCM process demonstrates that proactive tasks can be identified to maximize system reliability, so that redundant components are treated as designed and maximum life-cycle management is realized. In short, RCM identifies the “safe minimum” of work necessary to ensure sustainably safe and economical operation of all data center assets.

Moreover, in an era demanding high availability and lower energy consumption, operators would benefit from reviewing design drawings, validating the operating context, and ensuring redundant systems are optimized with regard to operation, testing, and maintenance. These steps are completed under an RCM analysis, and failure-finding task interval analysis helps eliminate legacy intervals, giving all stakeholders the opportunity to employ reliability data for particular equipment, make informed decisions on proactive maintenance tasks, and optimize those tasks for greater maintenance cost-effectiveness.

 

CONCLUSION

“What’s old is new again”: the time is now for a paradigm shift by senior managers to engage RCM. It is ideal, if not critical, given the demand for highly available data centers. The RCM paradigm is a distinct set of concepts and methods that constitutes a legitimate contribution to the entire data center environment, allowing all stakeholders to realize a reliability-based culture and thought process. The airline industry’s 50-plus years of RCM implementation have demonstrated that it is a rigorous method for reducing and eliminating the failure consequences of complex systems. RCM has been employed by few operators of MEFP systems within data centers, but those who have employed it have benefited from improved availability.

The complex structure of IT systems will benefit from the rigorous analysis process of RCM for enhancing change management techniques. The support infrastructure of MEFP systems will realize improved reliability and availability through optimized maintenance procedures, accurate spares analysis, and a thorough vetting of the operating context to ensure capabilities. A holistic engagement of IT and MEFP systems through an RCM analysis enhances site-wide reliability and high availability. The data center industry has experienced a substantive maturation process around design and commissioning; RCM implementation embraces the long-term role of operating and maintaining assets. Finally, the RCM process will deliver documented O&M procedures built around a detailed operating context for the operators of the asset and a transferable body of knowledge for future operators, thereby demonstrating a sustainable business practice.

 

NEXT STEPS

The University of Tennessee Knoxville Reliability & Maintainability Center offers ongoing reliability & maintainability training courses and has established an R&M Data Center Boot Camp (August 16-18, 2016) in partnership with the Oak Ridge National Lab Supercomputing Complex. The Aladon Network offers RCM training courses to introduce the concepts and develop in-house facilitators to lead an organization into sustainable reliability centered maintenance practices. The business management decision to embrace RCM to increase a data center’s reliability and availability may be the best decision senior managers make.