Data Center Emergency Preparedness
Expecting the unexpected.
The dictionary defines emergency as “an unexpected and sudden event that must be dealt with urgently.” It defines preparedness as “readiness for action.” So, emergency preparedness is being ready for sudden, unexpected events. Unless there’s a crystal ball handy to predict the future, how does one prepare for the unexpected?
The key is to expect the unexpected and to focus on the symptoms rather than the causes. The first task should be to identify the potential hazards that a site can reasonably be expected to face based on its geographical location. A site in Chicago should consider ice storms and blizzards, whereas a site in Miami should consider hurricanes and hail. A site in southern California should consider earthquakes and wildfires, whereas a site in Kansas City should be concerned with severe winds and tornados. There are good reasons that site selection and due diligence are given so much importance when considering where to build or purchase a critical facility.
CATEGORIZING THE HAZARDS
The next step is to sort the hazards or “events” into two categories: events that typically occur with 24 hours or more of advance notice, and those that could have less than an hour of advance notice. Hurricanes can be predicted days before they reach a given location, whereas tornados may give little to no advance notice whatsoever. Most disasters that occur with 24 hours or more of advance warning also tend to affect a large region. These events can cause widespread destruction and disrupt critical off-site services, such as the electric power grid, municipal water systems, communication systems, and roads and bridges, that can take days or weeks (or in extreme cases, months) to restore.
On the other hand, most disasters that occur with little or no warning typically affect a much smaller area. Tornados; severe thunderstorms; wind, hail, and localized flooding; and hazardous gas or material spills are good examples. As destructive as these events may be, the relatively small area affected allows for a much quicker response and recovery of affected critical services, typically within 24 to 48 hours. However, some events, such as a major earthquake, have the potential to occur without notice and affect wide areas.
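The two-way split described above can be captured in a few lines of code. The following is a minimal Python sketch, not a prescribed tool: the `Hazard` fields, the function names, and the warning-time figures are illustrative assumptions, while the 24-hour threshold comes from the categories above. Area affected is tracked separately from warning time because some events, such as a major earthquake, give no notice yet still affect a wide region.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hazard:
    name: str
    warning_hours: float  # typical advance notice (illustrative assumption)
    wide_area: bool       # True if the event tends to affect a large region


def notice_category(hazard: Hazard, threshold_hours: float = 24.0) -> str:
    """Sort a hazard by how much advance notice it typically gives."""
    if hazard.warning_hours >= threshold_hours:
        return "advance notice"
    return "little or no notice"


def plan_for_extended_outage(hazard: Hazard) -> bool:
    """Wide-area events can disrupt off-site services for days or longer,
    regardless of how much warning they give."""
    return hazard.wide_area


# Hypothetical entries; a real list comes from the site's own hazard assessment.
HAZARDS = [
    Hazard("hurricane", warning_hours=72, wide_area=True),
    Hazard("tornado", warning_hours=0.25, wide_area=False),
    Hazard("major earthquake", warning_hours=0, wide_area=True),
    Hazard("hazardous material spill", warning_hours=0, wide_area=False),
]

for h in HAZARDS:
    print(f"{h.name}: {notice_category(h)}, "
          f"plan for extended outage: {plan_for_extended_outage(h)}")
```

Keeping the list in one place also makes it easier to revisit entries when site conditions change.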
So far, the hazards discussed are somewhat static in nature: hurricanes, earthquakes, tornados, and river floods can be assessed and planned for in advance, and chances are the threat doesn’t change much over time. Some hazards are more dynamic. The risks associated with events such as transportation accidents that release hazardous materials, construction-related hazards (both on-site and off-site), and terrorist attacks should be re-evaluated periodically or whenever the likelihood of the initiating conditions changes.
Trying to create unique procedures and processes for each and every situation is not necessary. This is where the symptoms must be treated, not the cause. A loss of utility power, for example, calls for largely the same response whether an ice storm or a hurricane caused it. The first step is to identify which systems, resources, and services are critical to sustaining operations, and then determine how these are exposed to risk by the events identified above. Some obvious criticalities are power (off-site utility and on-site generation), water (domestic “city” water and on-site storage and/or wells), and of course staff (on-site staff but also the fuel oil supplier’s staff, etc.).
LONG-TERM VS. SHORT-TERM
Back to the two categories: long-term events with advance notice vs. short-term events with little or no warning. Let’s discuss power as an example. The first step should be to assess the reliability of the utility service. Is there one feeder or two? Are they from separate off-site substations or the same one? Is the site close to the point of generation or far down the line? Are the feeders overhead or buried underground? Regardless, most sites will provide some on-site generation to allow site power autonomy when the utility power eventually fails.
For on-site power generation, ensure that at least a minimum amount of fuel oil is available to carry the load for the short-term event. This is frequently set at 24 hours, with the expectation that events that last longer allow time to contact a fuel oil delivery service and have additional fuel oil provided when necessary. There is a shortcoming with this strategy. What happens when the event affects a larger region? Large-scale competition ensues as everyone with a diesel generator starts calling for emergency fuel oil deliveries. What if the fuel delivery service doesn’t have power to pump the fuel into the tanker trucks? What if the roads are flooded between the fuel oil supplier and the site? What if the delivery truck driver stays home to handle a personal disaster of their own? What if the fuel oil delivered is of poor quality?
Part of good emergency planning is to eliminate as many “what ifs” as possible and take as much control of the situation as possible. In this case, there are a number of proactive steps that can help. First is to have a formal service level agreement (SLA) in place with a reputable fuel oil supplier who also provides delivery service (eliminate the middleman if possible). The SLA should include guaranteed delivery, including priority over other potential clients. Better still is to have multiple SLAs in case one supplier becomes incapable of meeting its contracted commitments. These separate suppliers should be in separate locations with independent paths to the site so one road or bridge closure doesn’t impact all deliveries.
For potentially large, and therefore long-term, events, consider pre-staging extra tankers on-site so that runtime before the first delivery is extended. This means having a designated pad for the tankers that allows on-site staff to transfer the fuel oil to the permanent bulk oil storage tanks. This also means the tankers should include transfer pumps, or the site has to have its own suitable capabilities. Imagine how frustrating it would be to have the needed fuel oil on-site without the means to make use of it! And in case of poor fuel quality, each generator should have 100% capacity, redundant (duplex) fuel oil filters and water separators.
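The fuel planning discussed above comes down to simple arithmetic: runtime equals usable fuel divided by the burn rate at the expected load. The sketch below is a rough illustration only. The function names, tank sizes, and burn rate are hypothetical figures chosen for the example; actual values must come from the site’s own tank capacities and generator consumption data.

```python
from typing import Sequence


def autonomy_hours(burn_rate_gph: float,
                   bulk_tank_gal: float,
                   staged_tanker_gal: Sequence[float] = ()) -> float:
    """Estimate generator runtime before the first fuel delivery.

    burn_rate_gph     -- total fuel burned per hour at the expected load
    bulk_tank_gal     -- usable fuel in the permanent bulk storage tanks
    staged_tanker_gal -- usable fuel in each tanker pre-staged on-site
    """
    if burn_rate_gph <= 0:
        raise ValueError("burn rate must be positive")
    total_fuel = bulk_tank_gal + sum(staged_tanker_gal)
    return total_fuel / burn_rate_gph


def meets_target(hours: float, target_hours: float = 24.0) -> bool:
    """Check runtime against a minimum target (24 hours is a common choice)."""
    return hours >= target_hours


# Hypothetical site: 400 gal/h burn rate and 10,000 gal of usable bulk storage.
baseline = autonomy_hours(burn_rate_gph=400, bulk_tank_gal=10_000)

# Same site with two hypothetical 7,500 gal tankers pre-staged before a storm.
extended = autonomy_hours(burn_rate_gph=400, bulk_tank_gal=10_000,
                          staged_tanker_gal=[7_500, 7_500])

print(f"Baseline runtime: {baseline:.1f} h")       # 25.0 h
print(f"With staged tankers: {extended:.1f} h")    # 62.5 h
```

The staged fuel only counts toward runtime if it can actually be transferred to the bulk tanks, which is why the transfer pumps matter as much as the tankers themselves.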
All of this great planning will be for naught unless there are trained and competent personnel on site for the duration of the event. This takes serious advance planning as well. If the event is of such magnitude and severity (e.g., Hurricanes Katrina and Sandy) as to last more than a few days, then it should be assumed that site staff will have family, homes, and other personal concerns that compete with the site for attention. Even the most dedicated and loyal employee has to be able to care for their home and family. It becomes imperative that employers address these concerns in advance to the extent possible. The site should be stocked with provisions sufficient for staff to essentially live on site when required. Safe transportation should be provided to shuttle staff home and back (e.g., four-wheel drive trucks with chains during blizzards). Hotel accommodations should be arranged for essential staff’s families, especially if special needs come into play.
ESTABLISH STANDARD OPERATING PROCEDURES
For the truly unexpected emergency that affords little or no advance warning, standard operating procedures that include emergency operations and response, coupled with staff training and drills, provide the best protection. Emergency procedures aren’t just for equipment or systems. They should include “severe weather/high wind” procedures. This could include having security monitor the emergency weather service for warnings and watches. Activation of the procedure could include establishing a “stand-down” posture where all critical work is halted or deferred until the situation clears.
Consider opening a “war room” where site operations management monitors the site infrastructure, places calls to vendors and contractors to be ready to mobilize if required, and dispatches on-site staff to police the site and ensure that doors are closed and sealed (and diked if prudent), that potential wind-blown projectiles (discarded sheet metal, plastic, etc.) are secured so they cannot blow into outdoor louvers, cooling towers, etc., and that roof hatches are locked and secured. Other such procedures could include “flooding/roof leak,” “HAZMAT event,” “severe outdoor contamination (wildfire, dust, ash, fumes, etc.),” and other viable events.
It is important to note, of course, that in many of today’s critical operations, mission critical redundancy is provided by IT redundancy (virtual redundancy). Some sites have low-latency, mirrored-redundant processors in sites within relatively close proximity. If both sites are in jeopardy, operations must be transferred to a remote site that is not in the same geographical area. Business continuity plans should be in place and tested annually to transfer all critical operations to the remote site until the event has passed and normal operations are restored.
A good practice is to have a flood response container that includes a wet-vac, plastic tarp, absorbent socks/booms, flashlights, duct tape, clean rags, etc. Larger sites may have multiple kits staged in strategic locations. The contents should be inventoried and checked routinely as a standard preventive maintenance task to ensure availability when needed. Similarly, other pre-packaged emergency response kits should be maintained as appropriate. Most sites have a HAZMAT spill response kit, but kits can also be kept for emergency lighting (small generator, approved gasoline container, light “trees,” etc.) and for outdoor contamination (activated carbon filters, temporary pre-filters, or a temporary means to seal outdoor intakes and operate the site HVAC in a total recirculation mode).
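The routine kit inventory described above lends itself to a simple checklist comparison. The following Python sketch shows one way such a preventive maintenance check might be recorded. The item list follows the flood response container described here, but the required quantities, the `shortfalls` function name, and the sample counts are assumptions made for illustration.

```python
# Required contents of a flood response kit. Quantities are hypothetical
# and should be set by each site for its own kits.
REQUIRED_FLOOD_KIT = {
    "wet-vac": 1,
    "plastic tarp": 2,
    "absorbent sock/boom": 10,
    "flashlight": 4,
    "duct tape roll": 2,
    "clean rags (bundle)": 2,
}


def shortfalls(required: dict, on_hand: dict) -> dict:
    """Return each item that is below its required quantity, with the
    number of units needed to restock it."""
    missing = {}
    for item, needed in required.items():
        have = on_hand.get(item, 0)  # an item not counted is treated as absent
        if have < needed:
            missing[item] = needed - have
    return missing


# Example count from a hypothetical inspection: two flashlights are missing
# and no duct tape was found in the container.
counted = {
    "wet-vac": 1,
    "plastic tarp": 2,
    "absorbent sock/boom": 10,
    "flashlight": 2,
    "clean rags (bundle)": 2,
}

to_restock = shortfalls(REQUIRED_FLOOD_KIT, counted)
print(to_restock)  # {'flashlight': 2, 'duct tape roll': 2}
```

The same comparison can be reused for the other kits by swapping in a different required-contents list.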
Finally, whatever preparations and planning are put in place should be supported by formal staff training and real-life drills. These plans must be tested and verified as safe and effective before being relied upon. Emergencies are by their very nature high-stress events. Staff should have confidence, based on practice and demonstration, that the plans and procedures work. Effective disaster planning and training builds a more effective operations staff and strengthens communications between facilities, IT, and other business units. The site staff develops a bigger-picture understanding of how the facility systems work together and how to protect the enterprise. This planning allows for a more effective response both in day-to-day operations and in the truly unpredictable emergency event.
The author would like to acknowledge Ethan Thomason, vice president at Primary Integration, for his help with this article.