Thermal Runaway: The Unexpected Killer Of Your Data Center
Avoid a cascade of failures by planning ahead.
HISTORIC COOLING APPROACHES
Historically, data centers were cooled by computer room air conditioning (CRAC) units spread around the perimeter of the room, with cold air forced under the floor. With low power usage in the racks, this was sufficient to cool all the equipment. Since the advent of blade systems and the growth in switches and storage, power usage per square foot has soared, along with the heat.
The introduction of aisle containment, free air cooling, in-row cooling, water cooling, air flow monitoring, and better room design has delivered significant improvements in cooling. Some of these, such as aisle containment, can be retrofitted to a data center at limited cost and with little disruption to operations. This is critical because retrofitting not only extends the life of a data center but also makes economic sense.
Many of these technologies, however, are only being deployed in new builds. Free air cooling, a huge subject in its own right, can be done as part of a complete refurbishment, but some options, such as a heat wheel or large plenum, have to be part of the building fabric. Water cooling has to be carefully designed and implemented to ensure that there is no risk of power and water coming into contact.
Another approach that can be used in any data center is to increase the input temperature. Until the early 2000s, it was not unusual for a large percentage of the computer equipment inside a data center to be on a three- to five-year lease. At the same time, advances in internal IT system cooling were not high on the agenda of manufacturers. This meant that generational replacement of hardware delivered some gain in cooling efficiency, but not a huge amount.
In the last decade, however, we have had a number of significant changes. The end of the dotcom recession and the current recession have meant systems are being kept much longer. The introduction of blade systems and the massive heat increases they bring have ushered in an era of highly efficient cooling inside the systems.
As a result of all this, increasing the input temperatures into servers and storage systems can produce appreciable savings in power and cooling. When only a single server needs extra cooling, the electrical cost of running its internal fan harder can be less than the cost of injecting more cold air into the room.
With all of this, why the doom and gloom of thermal runaway and data center meltdown?
First, there is no suggestion that any of these technologies is not fit for purpose. Each of them can cool data centers at a lower cost than simple CRAC and forced air. The risk arises when a combination of technologies is applied either wrongly or without proper failsafe planning.
The starting point here is the input temperature. Depending on the technology used for cooling, it can take an hour or more to remove just a couple of degrees of heat from a data center. It takes far less time for heat to increase. A complete failure of cooling could see temperatures rise in minutes. Even after cooling is resumed, temperatures may continue to rise if the cooling system does not have enough excess capacity to cope.
As we increase input temperatures, we shrink the gap between the acceptable input temperature and the level at which failure becomes more likely. The older the equipment, the lower that failure temperature is. As temperatures rise, the fans inside equipment work harder, pulling in more air to try to cool the equipment, which reduces the volume of cool air available to other systems.
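As a rough illustration of how a higher input temperature shortens the time available after a complete cooling failure, the sketch below estimates how fast room air heats up and how long it takes to reach a failure threshold. It assumes all IT heat goes into the room air alone, ignoring the thermal mass of equipment, walls, and floor, so it gives a pessimistic figure and real rooms will warm more slowly. The function names, the 100 kW load, the 500 cubic meter room, and the 35 °C limit are hypothetical values chosen for illustration only.

```python
# Approximate properties of air at room conditions.
AIR_DENSITY_KG_PER_M3 = 1.2
AIR_SPECIFIC_HEAT_J_PER_KG_K = 1005.0


def air_heating_rate_k_per_s(it_load_watts, room_volume_m3):
    """Rate of air temperature rise (K/s) with no cooling at all.

    Assumption: heat is absorbed only by the air in the room.
    """
    air_mass_kg = AIR_DENSITY_KG_PER_M3 * room_volume_m3
    return it_load_watts / (air_mass_kg * AIR_SPECIFIC_HEAT_J_PER_KG_K)


def minutes_to_limit(it_load_watts, room_volume_m3, inlet_c, limit_c):
    """Minutes until the air reaches limit_c, starting from inlet_c."""
    rate = air_heating_rate_k_per_s(it_load_watts, room_volume_m3)
    return (limit_c - inlet_c) / rate / 60.0


if __name__ == "__main__":
    # Hypothetical room: 100 kW of IT load in 500 m^3, 35 C failure limit.
    for inlet_c in (22.0, 27.0):
        minutes = minutes_to_limit(100_000, 500, inlet_c, 35.0)
        print(f"inlet {inlet_c:.0f} C: {minutes:.2f} minutes to 35 C")
```

Under these assumptions, raising the inlet temperature from 22 °C to 27 °C cuts the margin from roughly 1.3 minutes to roughly 0.8 minutes. The absolute numbers matter less than the proportion: the time to react shrinks in step with the temperature gap.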
Any cooling failure, therefore, has the potential to cause not just a single system failure but a cascade of failures. As other systems begin to overheat, they respond by drawing in more air, increasing the rate at which cool air is replaced by hotter air. This is known as a positive feedback loop.
The solution is two-fold:
• Model or test the impact of a complete cooling system failure. Identify the point at which temperature starts to rise and how fast it climbs.
• Add failover capacity that can be brought into play immediately as failure occurs to prevent the start of the overheating process.
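The two steps above can be sketched as a very simple time-stepped model: primary cooling fails at time zero, and failover cooling of a given capacity starts after a given delay. This is a minimal lumped model, not a substitute for proper airflow modeling or a live test. The function name, the single thermal mass figure for the whole room, and every number in the example are hypothetical assumptions.

```python
def simulate_cooling_failure(it_load_kw, thermal_mass_kj_per_k, start_temp_c,
                             failover_kw, failover_delay_s,
                             duration_s=1800, step_s=1.0):
    """Simulate room temperature after primary cooling fails at t = 0.

    Assumptions: the room is one lumped thermal mass, failover cooling
    switches on at full capacity after failover_delay_s, and cooling
    never drives the room below its starting temperature.
    Returns (peak_temp_c, list_of_temperatures_per_step).
    """
    temp_c = start_temp_c
    peak_c = start_temp_c
    history = []
    t = 0.0
    while t < duration_s:
        cooling_kw = failover_kw if t >= failover_delay_s else 0.0
        net_kw = it_load_kw - cooling_kw
        # kW * s = kJ, divided by kJ/K gives the temperature change in K.
        temp_c += net_kw * step_s / thermal_mass_kj_per_k
        temp_c = max(temp_c, start_temp_c)
        peak_c = max(peak_c, temp_c)
        history.append(temp_c)
        t += step_s
    return peak_c, history


if __name__ == "__main__":
    # Hypothetical room: 100 kW load, 2000 kJ/K thermal mass, 22 C start.
    scenarios = {
        "immediate, 120 kW": (120.0, 0.0),
        "60 s delay, 120 kW": (120.0, 60.0),
        "60 s delay, 80 kW (undersized)": (80.0, 60.0),
    }
    for name, (capacity_kw, delay_s) in scenarios.items():
        peak_c, _ = simulate_cooling_failure(100.0, 2000.0, 22.0,
                                             capacity_kw, delay_s)
        print(f"{name}: peak {peak_c:.1f} C")
```

In this sketch, failover that starts immediately with spare capacity holds the room at its starting temperature, a delayed start lets the temperature climb before it recovers, and undersized failover never recovers at all. Comparing scenarios like these against the equipment's failure temperature is one way to judge how much failover capacity is needed and how fast it has to start.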
For many data center owners, this will mean adding some cost back into the data center. While this may seem unpalatable, the alternative is likely to cost more, both in the short term through replacement of equipment and in the long term through loss of trust and business.