When new facilities start to come online, project participants are often surprised by a particular series of events. These events usually unfold as disparate systems start to integrate, but things don't go as planned. Let's relive one such integration process and see how things can go awry, despite the best efforts of all parties.

This particular scenario also raises questions about the rate of temperature rise in failure mode and the survivability of high-density operations during a loss of cooling.

Site Description

Our scenario unfolds in a new data center with:
  • A pre-action sprinkler system zoned by area
  • A three-foot (ft) high raised floor
  • A 16-ft height from finished floor to a finished acoustical tile ceiling
  • Standard CRAC units distributed throughout the room
The site is complete, and all systems are in test mode. The owner has done an early move-in of some network gear and racks.

The Event

Saturday morning: As a precaution, prior to load bank testing the installed PDUs, the electrical contractor has shut the main sprinkler valves on the pre-action sprinkler system zone and electrically isolated the main fire pump. The load banks are set up inside the data center and connected to a 300-kVA PDU.

The building controls contractor is working on the chiller controls, remote from the data center space. The chillers are not online, but the CRACs are running.

The electrician begins a full load test of a single PDU. Within minutes the data center begins heating up. The project manager goes to check on the cooling.

The combination smoke/heat detector soon goes into alarm, releasing the pre-action solenoid. Immediately a 155 degree F sprinkler head releases, allowing water to flow.

The test stops, and all hands address the sprinkler system to minimize damage.

Analysis

My years of data center postmortems lead me to conclude that failures are always caused by a series of events that come together at a single point in time. If any one of these events had not taken place when it did, the failure most likely would have been avoided.

Testing is the deliberate and intentional effort to learn the limitations of a facility before it goes live. Testing identifies shortcomings of design and errors of installation, and it highlights training needs for the operations staff, so that all of these can be mitigated before the data center becomes operational.

In this scenario, testing achieved all of these goals.

The testing crew had good intentions when it isolated the pre-action system, but it neglected to isolate the jockey fire pump. The controls and electrical testing contractors failed to foresee the lack of chiller capacity at that point in time, and no one predicted that load bank testing a single 300-kVA PDU in a 20,000-square-foot room with 16-ft clear height and a 3-ft high raised floor would be an issue.

Result

The 300°F-plus horizontal discharge from the two in-room temporary load banks connected to the 300-kVA PDU stratified at the 16-ft high ceiling. This activated the heat function of the combination smoke/heat detectors, followed immediately by the melting of the 165°F sprinkler head, which released water onto the floor.
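
To put rough numbers on why the room average looked safe while the ceiling did not, here is a minimal sketch of a lumped energy balance for air, dT/dt = P / (rho x V x cp). Everything beyond the figures quoted above is an assumption: the 300-kVA PDU is treated as a 300-kW resistive load, the starting air temperature is taken as 72°F, the 2-ft ceiling layer is hypothetical, and the model ignores the thermal mass of the building, air leakage, and CRAC fan heat. It is an order-of-magnitude illustration, not a design calculation.

```python
# Lumped (well-mixed) air energy balance: dT/dt = P / (rho * V * cp).
# All constants and inputs are illustrative assumptions, not site measurements.

AIR_DENSITY_KG_M3 = 1.2    # assumed: air near room temperature at sea level
AIR_CP_J_KG_K = 1005.0     # specific heat of air at constant pressure
FT3_TO_M3 = 0.0283168      # cubic feet to cubic metres


def heating_rate_f_per_min(load_kw, floor_area_ft2, layer_depth_ft):
    """Air temperature rise (deg F per minute) if all heat stays in one layer."""
    volume_m3 = floor_area_ft2 * layer_depth_ft * FT3_TO_M3
    air_mass_kg = AIR_DENSITY_KG_M3 * volume_m3
    kelvin_per_s = load_kw * 1000.0 / (air_mass_kg * AIR_CP_J_KG_K)
    # Convert K/s to deg F/min (a 1 K change equals a 1.8 deg F change).
    return kelvin_per_s * 60.0 * 9.0 / 5.0


def minutes_to_reach(target_f, start_f, rate_f_per_min):
    """Minutes for the layer to climb from start_f to target_f at a fixed rate."""
    return (target_f - start_f) / rate_f_per_min


if __name__ == "__main__":
    # 16 ft = whole room above the raised floor; 2 ft = hypothetical ceiling layer.
    for depth_ft in (16.0, 2.0):
        rate = heating_rate_f_per_min(300.0, 20_000.0, depth_ft)
        minutes = minutes_to_reach(165.0, 72.0, rate)
        print(f"{depth_ft:4.0f} ft layer: {rate:5.1f} F/min, {minutes:5.1f} min to 165 F")
```

Under these assumptions, the fully mixed room warms at about 3°F per minute and needs roughly half an hour to reach 165°F. If the same heat were trapped in a 2-ft layer at the ceiling, the rate would be eight times higher, and the layer would reach 165°F in about four minutes.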

Lessons Learned

  • Use a fire system expert to isolate the fire system, not an electrical contractor.
  • The design called for smoke detectors on the pre-action system, but the fire contractor furnished combination heat/smoke detectors, which were the latest "technologically advanced" detectors for the pre-action system. The contractor never programmed out the heat function from the pre-action alarm circuit.
  • In a high-bay data center, heat will stratify at the ceiling when the CRACs are only drawing return air from 7 ft above finished floor (AFF).
Many things could have been done differently and produced a different outcome; however, our focus is on the basic issue of rapid heat rise.

Data center mechanical systems, like any man-made system, will eventually experience a failure. Who is:
  • Taking a second look at the pre-action sprinkler systems, given the rapid rate of temperature rise in high-density data centers?
  • Calculating the time it takes from HVAC failure to melting of a sprinkler head?
  • Evaluating the sequences of operation and validating the application of the new advanced "improved" designs?
Even if the solenoid valves are not activated, this incident clearly demonstrates that a sprinkler head can release, de-pressurizing the pre-action sprinkler system and, at the very least, releasing residual water onto the computing hardware.
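
On the second question above, one commonly used way to estimate how quickly a sprinkler element responds once hot air reaches it is the Response Time Index (RTI) model, in which the element temperature lags the gas temperature: dT_link/dt = (sqrt(u) / RTI) x (T_gas - T_link). The sketch below solves that equation for a constant gas temperature. The RTI of 100 (m·s)^0.5, the 1 m/s air velocity, the 72°F starting temperature, and the gas temperatures at the head are all assumptions chosen for illustration; the 165°F rating is the one quoted in the Result section (the event narrative quotes 155°F). Treat it as a rough sketch, not a substitute for a fire protection engineer's analysis.

```python
import math


def link_activation_time_s(gas_f, start_f, rating_f, rti, velocity_m_s):
    """Seconds for a sprinkler element to reach its rating in constant-temperature gas.

    Closed-form solution of dT_link/dt = (sqrt(u) / RTI) * (T_gas - T_link).
    rti is in (m*s)^0.5 and velocity_m_s in m/s. Temperatures may stay in
    deg F because only the ratio of temperature differences enters the result.
    """
    if start_f >= rating_f:
        return 0.0                      # element is already at or above its rating
    if gas_f <= rating_f:
        return math.inf                 # gas never gets the element to its rating
    time_constant_s = rti / math.sqrt(velocity_m_s)
    return time_constant_s * math.log((gas_f - start_f) / (gas_f - rating_f))


if __name__ == "__main__":
    # Assumed gas temperatures at the head: undiluted 300 F discharge and a cooler 200 F mix.
    for gas_f in (300.0, 200.0):
        seconds = link_activation_time_s(gas_f, 72.0, 165.0, rti=100.0, velocity_m_s=1.0)
        print(f"gas at {gas_f:.0f} F: element reaches 165 F in about {seconds:.0f} s")
```

With these assumed inputs the element reaches its rating in roughly one to two minutes, which is consistent with a head releasing very soon after the heat detector alarms. The total time from HVAC failure to release is this lag plus however long the air at the ceiling takes to climb past the rating.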

Is this problem lurking as a surprise to operators as data centers are populated and approach their design capacities? Let's hear from you.

Also, I would like to hear from the design community as to how you calculate the rate of temperature rise in a failure mode.