Why do data center owners consistently miss their greatest opportunities to enhance the potential for continuous operation of their facilities’ systems? The answer may be simple: they do not see examples of comprehensive, yet relatively simple, facilities operations programs when they visit other facilities in their industry. In addition, the issue is rarely discussed at industry conferences. Most conferences and data center tours focus on equipment technology and design configuration, with emphasis on what is new and what level of redundancy is selected.
This emphasis is understandable but fails to address the most frequent cause of facilities-related downtime: human error. Industry data, such as that published in Symantec’s 2008 State of the Data Center Report, the January 6, 2006, edition of Processor, and Mission Critical’s January 2009 webinar survey, consistently demonstrate that human error accounts for the vast majority of facilities-related data processing interruptions. Very few data center owners have actively used this information to ensure continuous operation in their facilities. Assessments of over 100 data centers demonstrate that facilities-related downtime is most often caused by inadequate staff coverage, a missing procedure, an incorrect procedure, or inadequate training. Advances in reliable designs and equipment over the years have reduced the number of interruptions to data processing caused by failures of facilities systems and components.
As a result, the amount of investment needed to achieve an optimal operations program is far less than that made in the building and systems installed. Likely, it is only the industry’s lack of awareness and focus on facilities operations opportunities that creates this dichotomy.
The astounding rates of success (multiple years between accidents) in industries where life safety is the focus demonstrate that continuous manned coverage, thorough procedures, and rigorous training programs are the keys to success. Not only do those who operate nuclear submarines attend multi-month specialized classroom training prior to working on a submarine, but they also spend 14 to 18 months “qualifying” to operate the various systems once they are onboard. According to Matt Beckman, former Navy nuclear propulsion plant supervisor, submarine crew members are initially provided an orientation to the ship and a systems overview. From that point forward, they participate in two to three hours of training each week. Only after a two-year period do they begin to train others, and only after three years are they normally qualified to train others on all systems. A nuclear submarine team literally operates with over a thousand procedures, most of which are written by the individuals who design the systems.
By contrast, data center facilities team members are typically brought on board just before a new facility is complete, with little knowledge of the construction process. They may have had some other critical facilities experience, but most likely will not. Unless the owner has transferred several team members from another company facility, all will be learning the company culture and objectives while trying to learn to operate the new facility’s complex systems.
Most commonly, their only training will occur as part of the commissioning process, meaning it will be informal and rushed (in order to meet the promised start-of-operations date). Design engineering and commissioning consultants repeatedly report the prevalence of this scenario (which the author has also witnessed first-hand as a facilities manager and as a consultant).
Typically, few, if any, of the team members will have a chance to operate the equipment as each system is tested. After the hectic commissioning period, some owners will take advantage of manufacturers’ offers to provide general training. Ordinarily, this happens when the facilities team is still new, so they do not retain much of the training. In addition, this training is rarely site-specific. Unlike submarines, ships, or aircraft, which have a limited number of models, each data center facility is unique. The configuration of the systems will vary even when several facilities are simultaneously designed to a “standard.” Regional variances in equipment availability as well as building site variances make this an unavoidable reality. As a result, manufacturer-provided training is mostly beneficial as a systems overview.
One owner of a new data center recently provided a two-month window for training the facilities team between the end of commissioning and the start of operations. This critical manufacturing organization also had the foresight to procure a thorough set of site-specific procedures that were utilized during this extended “hands-on” practice period. As part of this effort, the organization gained the additional benefit of testing and editing each of these procedures. All this was achieved before introducing any risk to data center operations. Companies that utilize this approach have found the experience invaluable for their facilities teams. Team members retain much more knowledge as a result of this extended practice, which enables them to better respond to and resolve unexpected incidents before they result in downtime. They also have a better chance of seamlessly performing system “changes-of-state” (transferring equipment offline and back) during planned PM activities.
Unfortunately, most owners consider scheduling this much time for testing and training an unaffordable luxury because they need the new computer operations space immediately. Unexpected growth in processing demand, a delayed approval process for the new facility project, or both are the most common causes of this haste.
Without a substantial testing and training window, the average facilities team will remain uncomfortable with the data center’s systems and their normal operation until many months after startup. Several owners who have inherited this challenging situation have helped to accelerate the learning curve by ensuring their staff’s participation in all planned preventive maintenance activities, so they may observe and practice the steps necessary to transfer equipment “offline” and to restore a system to “normal.”
This repeated experience, combined with the creation of site-specific procedures, permits facilities teams to acquire ownership of these critical change-of-state activities, rather than relying on the system service vendor to perform or lead these activities. Because of the variance in installed equipment configurations from one customer facility to another, reliance on vendor technicians for system transfers presents an increased risk.
Practicing system change-of-state procedures is just one method successful organizations have implemented to hasten the acquisition of knowledge when the luxury of a dedicated training window is not available. Additional strategies include: