Spring is here, and you and your data center managed to survive this past winter’s harsh weather without any major impact. In fact, you just completed all your scheduled semi-annual preventive maintenance (PM) procedures. You are feeling comfortable and about to leave for a long-overdue three-day weekend. Suddenly the inbox on your smartphone explodes with a torrent of emails and text messages as hundreds of alarms make it very clear this is not going to be a good day. Sure enough, two minutes later the calls start coming in from various IT managers and your facilities team.
It takes you a few minutes to scroll through all the alerts to understand what just happened: one of your data center facilities suffered a catastrophic failure in a portion of the power chain. Now you are getting so many incoming calls that it is very difficult to reach the key personnel on your team to find out exactly what happened. Of course, being an experienced professional, you stay calm, since your site has redundant uninterruptible power supply (UPS) power and cross-system tie-points. You think that someone on your staff may know what to do, because last year the UPS service technician performing PM showed some of your team how to bypass the UPS and use procedures to cross-feed the other side of the power distribution system from the remaining operational system.
You also recall a set of documents that were in your office when you were hired several years ago; they are probably still on the bookshelf. The three binders, labeled “Standard Operating Procedures” (SOP), “Methods of Procedure” (MOP), and “Emergency Operating Procedures” (EOP), were written when the data center was originally built 22 years ago. The SOP has been periodically reviewed and updated from an administrative perspective, primarily because your human resources and legal departments required it for compliance purposes. However, the MOP and EOP were untouched. You also realize that all the UPSs (and their respective bypass panels) were replaced three years ago, but the relevant details and procedures for the new equipment were never documented.
After a few minutes (which seem like an hour), you finally reach your site supervisor. Unfortunately, Joe, the primary house electrician who would know what to do, has already left for the weekend. It turns out that the new person on duty was not there when the bypass and transfer procedures were demonstrated by the service technician during the last PM. Your day, not to mention your long weekend and perhaps your career, is definitely not looking very good at all…
You can use your imagination to guess what happens after that. While the scenario above is not literally true, it represents an amalgam of different events and circumstances that have occurred over the years. Fortunately, the majority of people in the mission critical world may spend their entire careers without being involved in an actual system failure, much less being put in the position of having to directly respond to a critical incident.
Moreover, new equipment generally offers higher reliability, which, coupled with redundant systems, results in fewer disruptions and total outages. Availability has also been enhanced by investments in automatic system fault tolerance, which is presumed to maintain availability in the event of a component failure, such as the loss of a single UPS module.
Fault tolerance in most modern systems is based, in whole or in part, on some form of software-driven control system that governs the failover sequence of events should some component in the electrical or mechanical system fail. Nonetheless, there can be circumstances where it does not all go as planned and human intervention may be required to mitigate or rectify a problem. Of course, most systems are operational 7x24 and, if properly maintained, rarely exhibit major problems that would require immediate intervention by operations staff. While highly reliable systems are what we expect and strive for, the downside is that they can lead to complacency in both management and operational personnel. This in turn may mean that staff personnel have no idea what needs to be done if an emergency occurs.
Virtually every organization performs some form of regular maintenance on its UPS, as well as on normally off-line standby systems such as backup generators, which receive regular start-up tests and exercising in addition to periodic service.
However, in most cases this testing typically involves disarming the automatic transfer switch (ATS), rather than disconnecting from the utility and letting the ATS trip automatically to perform a live transfer of the critical load. Some facility managers feel this is unnecessary or risky and may avoid this type of test even on an annual basis, or perhaps entirely, out of fear that it may fail and they would be blamed for an outage caused by the test. I have heard several stories and seen situations where the generator always started during the monthly tests, but when an actual utility failure occurred, the generator started and ran while the ATS did not transfer. A complete utility disconnect test with full technical support present would have discovered this problem without dropping the critical load and avoided the actual outage.
In many organizations, PM is done by outside field service engineers from various equipment manufacturers or third-party service providers. For example, the maintenance bypass of the UPS is typically performed by a field service engineer as part of a PM visit. In some cases, the engineer may have been escorted by someone from the staff or perhaps just the security department.
However, it is less likely that the operational staff will observe or participate in the actual bypass or de-energizing of the UPS. In addition, it is unlikely that they will bring the house SOP-MOP-EOP documents with them to ensure that the documented procedures are being followed, and that they are explained clearly and correctly. In some large, well managed organizations, however, in-house staff are mandated to use the MOP during PMs, to ensure that the documents are correct and that the outside technician is fully aware of all the related procedures.
Moreover, some of those staffers who could potentially be called on to respond during an emergency may not be comfortable or willing to execute the steps required. Some procedures can be especially intimidating if they involve flipping the handle on large high-current circuit breakers in a maintenance bypass panel (or other large switchgear) when staff have never actually seen it done or practiced performing it themselves. Equally important is to stress that if they are not properly trained, they should not do anything they are not qualified for during a crisis.
Besides the assumption that automated systems will react perfectly should a component fail, employee productivity and economics come into play. One common scenario: “There is not enough time or budget to properly train my staff and update all my documentation.” Limited training budgets also result in fewer personnel with critical system expertise, which can lead to a classic form of denial, typically manifested by managers keeping their heads in the sand, since “Joe” will know what to do in an emergency; he was here when they built the site. This is especially true in smaller organizations, but in some large ones as well. This outlook places great reliance on the skills and knowledge of a few key personnel.
However, in many cases, “Joe” (it seems almost every organization has a “Joe”) is reluctant to transfer all that knowledge and experience, whether out of concern for job security, hope for better compensation, or simply ego (or any combination of reasons). Unless the next smartphone app can scan his brain, download his knowledge, and turn it into well-documented procedures, “Joe” will remain a critical element when dealing with an emergency. Proactive rather than reactive management means recognizing that resources must be allocated to keep documentation readily available and up to date, and that knowledge transfer and continuity are just as critical. Eventually, “Joe” may leave or retire, perhaps without passing along that valuable information.
Consistently maintaining the SOP, MOP, and EOP documentation is essential, especially as equipment is updated and when operating procedures and policies change. In addition to updating these documents, making everyone aware of their importance is part of the responsibility of every manager, regardless of the size and type of data center. The procedures should be clearly written for the specific facility, rather than just a generalized list of instructions. They should include photos of the actual equipment and the controls related to the steps involved. For example, a step should read “Move the MBB handle LEFT to the ON position,” along with a close-up picture of the MBB clearly labeled “ON” and “OFF.”
It is also critical to make the documents easily accessible by producing them in digital format, as well as in waterproof hardcopy form, kept in binders in a secure and reachable location. Moreover, highly visible notices should be placed near the critical equipment advising of the location of the MOP and EOP binders. These manuals can be created by in-house resources in conjunction with equipment vendors, by outside consultants, or by a combination of both; in any case, they should be updated as necessary.
During a crisis, these documents can make the difference between correctly executing a step that mitigates a problem and dropping the critical load. Training sessions and emergency drills should be part of operational management’s responsibility. Moreover, while the procedures should be written by technical personnel, it is during training sessions that staff feedback can help improve the clarity of the procedures, while the staff become familiar with the equipment. While each organization differs in how much staff time and expense it can allocate to these tasks, it simply cannot afford to ignore them.
THE BOTTOM LINE
The time to find out whether your team can react properly to an emergency and avoid a catastrophic outage in your facility is before a real critical event occurs. It is vital to make sure senior management understands and supports this. Today’s economics are focused on greater risk avoidance and increased productivity by investing in and relying more on automation, while in some cases budgeting less for properly training skilled personnel. This only increases the risk of an outage and worsens other exposures if a critical control system failure occurs.
Moreover, people can be hurt if they try to respond to a problem without proper training and access to the MOP and EOP, or if the documentation is unclear or incorrect. So besides the direct and indirect costs of an outage and the damage to an organization’s reputation, lawsuits and even an investigation by OSHA or other government agencies can follow if an injured person was improperly trained or not trained at all.
So train your team, schedule your emergency drills, and keep your MOP updated and readily accessible. Then, hopefully, if you ever do have a crisis, you will not need a real mop and bucket, or have to hastily update your resume.