Data Center Thermal Management
Moving beyond cooling.
In my last column, I wrote about the fate of the enterprise data center. Only a few years ago, enterprise facilities managers were happy if they kept their data centers “cool” (at 68°F or lower in many cases). However, as IT equipment power densities rapidly increased, they were generally satisfied if most areas were kept cool enough, hopefully with just a few hot spots (over 80°F), which they typically tried to mitigate by “overcooling” the entire room. In 2007, The Green Grid (TGG) was formed and created the power usage effectiveness (PUE) facility energy efficiency metric, and suddenly we all seemed to realize that cooling was using a lot of the energy going into the data center infrastructure.
This revelation led to, and also coincided with, the 2008 release of the 2nd edition of ASHRAE’s TC9.9 Thermal Guidelines, which increased the “recommended” temperature range to 64.4° to 80.6°F, in response to the improved thermal tolerance of modern IT hardware and to save energy. The bigger news, which was often overlooked then (and even now), is that the temperatures were to be measured at the air entering the IT equipment, not in the room.
This was followed in 2011 by the release of the “allowable” ranges (A1-A4), which gave data center managers further opportunities to save “cooling” energy by increasing IT intake temperatures into the expanded ranges and increasing the use of so-called “free cooling.” As a result, data center cooling system designs have evolved, some to a minor degree and others more radically, in an effort to meet the rising heat density and improve energy efficiency.
This year ASHRAE released the 4th edition of the thermal guidelines, which brought to light that IT equipment can operate at even broader humidity ranges in addition to the expanded allowable ranges introduced in 2011. This helps further improve cooling system efficiency by reducing the endless humidification-dehumidification cycles used in many systems.
So what is thermal management, and how is it different from cooling (free or otherwise)? While it may seem like a matter of semantics, there is an important difference in design approach and at the technical level. Generally speaking, we have traditionally “cooled” the data center by means of so-called “mechanical” cooling. Although I will not delve into the thermodynamic details, this involves the vapor compression refrigeration process, which has been the basis of almost all air conditioners and refrigerators for nearly the past 100 years. This basic process requires energy to drive a motor for the mechanical compressor that drives the system (in reality it is a “heat pump,” since it transfers heat from one side of the system to the other).
We have been using this same mechanical cooling process to cool the data center. That is no big news, but how is this different from thermal management? While the energy efficiency of mechanical cooling systems has continuously improved since their inception, it represents only a part of the total energy used to transfer the heat from the heat-producing components inside the IT equipment (CPU, memory, disk drives, power supplies, etc.) to the outside air. In the case of air-cooled IT hardware, there are many points along the thermal transfer route that use energy (IT fans) or have a relatively poor thermal transfer coefficient (such as the material, size, and number of fins on a CPU heat sink), which requires more airflow and results in increased IT fan energy. This also increases the airflow requirements of the fans in the cooling units (computer room air conditioning [CRAC], computer room air handlers [CRAH], or other types).
However, to better understand and optimize the process, we really need to take a holistic approach and consider the heat transfer process and the energy required at each junction from beginning to end. Until recently, almost no one on the facility side knew or cared about the internal fan energy or thermal characteristics of the IT equipment, except to know how many watts it required. There was almost no attention to IT equipment airflow requirements in cooling system designs until the heat density of blade servers forced the issue and IT manufacturers began publishing airflow requirements and offering online configuration tools that give both the heat load and the airflow requirements for various models.
Even then, many cooling system designers were unaware of the issue and still just specified the cooling system based on watts per square foot (e.g., 1 megawatt of IT load in a 10,000-sq-ft facility equals 100 W/sq ft). For a while that approach continued to be the norm, until the power density at the rack level went from 2 kW to 5 kW and continued to increase. High-density racks generated hot spots, even though the total room heat load was less than the rated capacity of the facility's cooling system. In some cases, CRAC temperatures were lowered to try to mitigate the worst-case areas. In other cases, more CRACs were installed near the high-power racks in a “brute force” attempt to overcome localized high-temperature areas.
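The gap between a room-average figure and what an individual rack sees can be made concrete with a little arithmetic. The sketch below is illustrative only: the 1 MW load, the 10,000 sq ft room, and the 2 kW and 5 kW rack figures come from the example above, while the 10 kW rack and the 25 sq ft of floor area attributed to each rack (footprint plus its share of aisle space) are assumptions chosen for illustration.

```python
# Room-average power density versus the local density around a single rack.
# AREA_PER_RACK_SQ_FT and the 10 kW rack are illustrative assumptions.

def watts_per_sq_ft(total_watts: float, area_sq_ft: float) -> float:
    """Average power density over a given floor area."""
    return total_watts / area_sq_ft

ROOM_AREA_SQ_FT = 10_000
IT_LOAD_W = 1_000_000
AREA_PER_RACK_SQ_FT = 25  # assumed: rack footprint plus share of aisles

room_average = watts_per_sq_ft(IT_LOAD_W, ROOM_AREA_SQ_FT)

# Local density for racks at several power levels (kW per rack).
local_density = {
    rack_kw: watts_per_sq_ft(rack_kw * 1000, AREA_PER_RACK_SQ_FT)
    for rack_kw in (2, 5, 10)
}

print(f"Room average: {room_average:.0f} W/sq ft")
for rack_kw, density in local_density.items():
    print(f"{rack_kw} kW rack: {density:.0f} W/sq ft locally")
```

Under these assumptions a 2 kW rack sits below the 100 W/sq ft room average while a 5 kW rack is at twice that, which is one way to see why a room sized on the average can still develop hot spots.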
Only when we began to realize that just throwing more raw cooling capacity at higher-power racks did not really solve the problem did we begin to understand that the issue was poor control of the airflow from the cooling system output to the IT intake, along with the failure to prevent the hot return air from mixing with the supply air, and not a lack of total cooling capacity. Thus airflow management became an important aspect of a cohesive thermal management strategy.
As with many changes to traditional enterprise data center practices, new concepts are not readily accepted, much less immediately adopted. While it is considered normal today, it took many years for the industry to implement the basic cold aisle and hot aisle layout. Although it is more commonly deployed now, it also took a long time to accept and implement containment (either hot or cold aisle) or other airflow management methods, such as chimney cabinets. Cohesively monitoring, controlling, and optimizing the entire end-to-end process as a complete system in the data center is the essence of thermal management. This occurs in both the design and operating stages of the entire data center, and should include understanding how heat is managed and transferred inside the IT equipment.
Previously, IT equipment had a limited range of minimum to maximum power draw in relation to workload. With the focus on energy efficiency, we began to see a much larger dynamic power range in relation to the computing activity, providing an overall energy reduction and making energy use more proportional to computing load.
For example, only a few years ago a typical low-cost 1U commodity server with a single CPU and a 400 W power supply might have idled at 150 W and drawn 250 W under full load. In contrast, an Energy Star server with similar performance would idle at only 55 W but might draw 200 to 250 W at full load, resulting in a heat load ratio of roughly 4-5 to 1.
In addition, the thermal management systems in the servers (and now other Energy Star equipment) are designed to suppress fan speeds whenever possible to save energy. This results in a much wider range of airflow during normal operations and equipment can also have a 4-5 to 1 range of cubic feet per minute (CFM) requirements. The indirect result is that the temperature increase (delta-T) will also vary and can range from only 10°F to as much as 40°F (or even higher in some cases).
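The link between heat load, airflow, and delta-T can be sketched with the common sensible-heat approximation for air, BTU/hr ≈ 1.08 × CFM × delta-T (°F), together with 1 W = 3.412 BTU/hr. The 1.08 factor assumes air at roughly sea-level density, and the function names are hypothetical; treat this as a rough estimate rather than a manufacturer's airflow specification.

```python
# Sensible-heat approximation for air at roughly sea-level density:
#   BTU/hr = 1.08 * CFM * delta_T_F   and   1 W = 3.412 BTU/hr
BTU_PER_HR_PER_WATT = 3.412
SENSIBLE_HEAT_FACTOR = 1.08  # assumes standard air density

def required_cfm(heat_watts: float, delta_t_f: float) -> float:
    """Airflow (CFM) needed to carry heat_watts at a given temperature rise."""
    if delta_t_f <= 0:
        raise ValueError("delta_t_f must be positive")
    return heat_watts * BTU_PER_HR_PER_WATT / (SENSIBLE_HEAT_FACTOR * delta_t_f)

def temperature_rise_f(heat_watts: float, cfm: float) -> float:
    """Temperature rise (deg F) across equipment for a given airflow."""
    if cfm <= 0:
        raise ValueError("cfm must be positive")
    return heat_watts * BTU_PER_HR_PER_WATT / (SENSIBLE_HEAT_FACTOR * cfm)

# A server at 250 W full load, across the delta-T range mentioned above.
for delta_t in (10, 20, 40):
    print(f"250 W at {delta_t} F rise: {required_cfm(250, delta_t):.1f} CFM")
```

For a fixed heat load, halving the airflow doubles the delta-T, so a server that throttles its fans at a given load returns hotter air to the cooling units.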
Getting data center designs to accommodate and adapt to these changes is all part of end-to-end thermal management. For instance, in the Facebook-Open Compute design (which uses direct outside air as a primary source of cooling), the airflow management system uses static pressure control to ensure that as server fans increase in speed, the building fans also speed up to maintain a slightly positive supply air pressure. Some newer cooling systems also offer the option of underfloor static pressure control of variable-speed fans. Those systems still rely on data center personnel to try to match the floor tile grilles to the estimated airflow of a rack or area. However, if properly coupled with cold aisle containment, the supply airflow can become somewhat more proportional to the IT airflow demands.
Furthermore, variable heat loads were also related to virtualized systems, which could control the state and number of servers needed to service a variable workload, resulting in better resource efficiency but also further increasing the dynamic range of power and heat load. In some cases this also caused traveling hot spots as servers were spun up or reacted to greater loads.
As a result, some vendors began to offer dynamic airflow solutions such as fan-assisted floor tiles with remote temperature sensors placed at the top of the rack in front of the fan tile. Of course this required additional energy to power the fans in the floor tiles (as well as more cabling under the floor). More recently, some vendors have offered motorized dampers controlled by rack-based temperature sensors; the dampers open and close to try to maintain temperature at the rack. While this is perhaps an interesting solution for a data center with significant airflow issues that cannot be solved by simpler methods such as aisle containment, it should not be the basis of a new design.
We have traditionally purchased IT equipment from the manufacturer based on the system performance specifications and components (CPU, memory, storage, network) and then simply designed the “cooling” requirements based on the heat generated. This was simply expressed in terms of watts (or Btu). So if we needed to cool 100 kW of heat load, we calculated that we needed at least 30 tons of cooling. But actual experience told us that one 30 ton cooling unit would not be sufficient. So most data centers added 20% to 30% to the cooling capacity (before adding units for N+1 or 2N redundancy), based on nameplate ratings. This was the conservative “overcooling” approach that favored extra cooling capacity to minimize risk, even if it meant higher capital cost for the cooling system, as well as lower energy efficiency.
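The sizing arithmetic described above can be written out as a short sketch. It uses the standard conversion of one ton of refrigeration to 12,000 BTU/hr (about 3.517 kW); the function name is hypothetical, and the 20% and 30% margins are the ones quoted above.

```python
# One ton of refrigeration = 12,000 BTU/hr; 1 kW = 3,412 BTU/hr.
KW_PER_TON = 12_000 / 3_412  # about 3.517 kW per ton

def cooling_tons(heat_kw: float, margin: float = 0.0) -> float:
    """Tons of cooling for a heat load, with an optional safety margin."""
    return heat_kw / KW_PER_TON * (1 + margin)

heat_load_kw = 100
print(f"Exact conversion: {cooling_tons(heat_load_kw):.1f} tons")
for margin in (0.20, 0.30):
    tons = cooling_tons(heat_load_kw, margin)
    print(f"With {margin:.0%} margin: {tons:.1f} tons")
```

The exact conversion for 100 kW comes to a little over 28 tons, which is why it was rounded up to 30 tons before any margin or redundancy was added.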
Moreover, the thermal transfer process (and the need for thermal management) continues once the heat has been transferred to the coils of the cooling units (assuming we are speaking of a closed-loop “cooling” system). In a CRAC DX system, the basics of the compressor and evaporator coil design and temperatures were relatively fixed until recently. Previously, temperatures were basically controlled by turning the compressors on and off (hopefully not frequently) to vary the average supply temperature. And in the case of chilled water CRAHs, cooling capacity and supply temperature were varied by a motorized valve that regulated the amount of chilled water flowing through the coil.
However, until recently the controls of most cooling units sensed the return temperature, not the supply temperature. This meant that if you set the control system for 70°F, it would engage the cooling function once return temperatures went above 70°F, resulting in supply temperatures of about 50°F because of the typical delta-T of 18° to 20°F of most cooling units. In effect, overcooling was inherent in the system design, yet unknown to most of the operators who set the temperatures. Even when they wanted to improve efficiency by raising the setpoint to 75°F, the cooling system would still deliver supply air at about 55°F. While I am not going to discuss the other elements of the thermal transfer process in this article (such as pumps, chillers, cooling towers, and any other heat rejection equipment), they can also be dynamically managed and optimized as part of the thermal management system.
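The effect of return-air control can be shown with a small worked calculation. This is a simplified sketch that treats the unit's delta-T as fixed at the 18° to 20°F quoted above; real units vary with load and airflow, and the function name is hypothetical. The 64.4°F figure is the lower end of the recommended intake range cited earlier.

```python
# Supply temperature implied by a return-air setpoint and a fixed unit delta-T.
RECOMMENDED_MIN_INTAKE_F = 64.4  # lower end of the recommended range
UNIT_DELTA_T_RANGE_F = (18, 20)  # typical delta-T of the cooling unit

def supply_temp_f(return_setpoint_f: float, unit_delta_t_f: float) -> float:
    """Approximate supply air temperature when control is on return air."""
    return return_setpoint_f - unit_delta_t_f

for setpoint in (70, 75):
    coldest = supply_temp_f(setpoint, max(UNIT_DELTA_T_RANGE_F))
    warmest = supply_temp_f(setpoint, min(UNIT_DELTA_T_RANGE_F))
    below_range = warmest < RECOMMENDED_MIN_INTAKE_F
    print(f"Return setpoint {setpoint} F -> supply {coldest} to {warmest} F "
          f"(below recommended minimum: {below_range})")
```

In both cases the supply air comes out colder than the low end of the recommended intake range, which is the built-in overcooling described above.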
THE BOTTOM LINE
If some of “thermal management” sounds like the functions that some existing BMS systems already perform, it is similar, but a BMS does not monitor or adapt to what is happening at the rack and IT equipment. However, we are just beginning to see the tip of the iceberg regarding end-to-end thermal management, as some DCIM systems coupled with CFD analysis begin to help us understand the dynamic thermal conditions as IT equipment changes its heat load and airflow in response to workloads. In some cases DCIM software can control the “cooling” system in the facility in response to temperature sensors in the aisles and racks. Only by measuring temperatures entering the IT equipment (either by sensors or by polling the IT equipment for airflow and delta-T) will we have a basis to optimize thermal management under changing conditions and during cooling equipment maintenance or failure.
To help better understand and visualize this, TGG recently released the Performance Indicator (PI) metric, which is described as a method for assessing and visualizing data center cooling performance in terms of three key metrics: thermal conformance, thermal resilience, and energy efficiency. This introduces and defines several key aspects:
PUE ratio (PUEr): How effectively is the facility operating in relation to defined energy efficiency ratings?
IT Thermal conformance: Acceptable IT temperatures during normal operation
IT Thermal resilience: Acceptable IT temperatures during cooling failure or maintenance
While it may seem complex at first, it brings scrutiny to the tradeoffs regarding cooling system energy efficiency, as well as to the effectiveness of redundant cooling units (and their impact on maintaining airflow to IT equipment) under varying loads and system failure conditions.
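To make the three indicators above less abstract, here is a minimal sketch of how they might be computed from rack intake readings. The definitions are simplified assumptions for illustration and are not TGG's official calculation: PUEr is taken as a reference PUE divided by the measured PUE, conformance as the share of intake readings inside the recommended range during normal operation, and resilience as the share inside the ASHRAE class A1 allowable range (15° to 32°C) during a cooling failure. All readings and PUE values are hypothetical.

```python
# Illustrative sketch of the three PI indicators; definitions are simplified
# assumptions, not The Green Grid's official calculation.

RECOMMENDED_F = (64.4, 80.6)   # recommended range cited in this article
ALLOWABLE_A1_F = (59.0, 89.6)  # ASHRAE class A1 allowable range (15 to 32 C)

def fraction_within(intake_temps_f: list[float],
                    limits_f: tuple[float, float]) -> float:
    """Share of IT intake readings that fall inside limits_f (inclusive)."""
    if not intake_temps_f:
        raise ValueError("at least one intake temperature is required")
    low, high = limits_f
    inside = sum(1 for t in intake_temps_f if low <= t <= high)
    return inside / len(intake_temps_f)

def pue_ratio(reference_pue: float, measured_pue: float) -> float:
    """Assumed form of PUEr: reference (target) PUE over measured PUE."""
    return reference_pue / measured_pue

# Hypothetical rack intake readings in deg F.
normal_operation = [68, 72, 75, 79, 82, 77, 70, 81]
cooling_unit_failed = [75, 80, 84, 88, 92, 86, 78, 89]

conformance = fraction_within(normal_operation, RECOMMENDED_F)
resilience = fraction_within(cooling_unit_failed, ALLOWABLE_A1_F)
efficiency = pue_ratio(reference_pue=1.4, measured_pue=1.75)

print(f"Thermal conformance: {conformance:.1%}")
print(f"Thermal resilience:  {resilience:.1%}")
print(f"PUEr:                {efficiency:.1%}")
```

Viewing the three numbers side by side is the point: raising supply temperatures may improve the PUE figure while lowering conformance or resilience, and the sketch makes that tradeoff visible.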
Hyperscalers such as Facebook and Google, who have the luxury of designing their IT equipment and the facility as part of a cohesive thermal management strategy, have been able to reduce energy use substantially. And while enterprise “reliability” requirements are different from those of search and social media, ultimately data center designers need to pay close attention to the changing characteristics of the IT hardware and utilize control systems that can incorporate and respond to variable cooling loads while maintaining redundant cooling systems.
Moreover, while we will continue to improve and manage air-cooled IT equipment (as well as the entire thermal transfer equipment chain), it will always be much less thermally efficient than liquid-cooled IT hardware, due to the fact that liquids are several thousand times more effective at thermal transfer than air. Liquid cooling allows far greater CPU performance, power density, and overall energy efficiency, as well as waste energy reuse. So while liquid cooling is not yet a mainstream solution for most data centers, it seems to be gaining momentum, so stay tuned as we examine the recent developments in liquid cooling.
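One way to put a number on “several thousand times” is to compare how much heat a given volume of water and of air can carry per degree of temperature rise. The sketch below uses approximate textbook property values near room temperature; volumetric heat capacity is only one measure of thermal transfer effectiveness, and it is used here as an illustrative assumption about what the comparison refers to.

```python
# Approximate fluid properties near room temperature (textbook values).
WATER = {"density_kg_m3": 998.0, "cp_kj_per_kg_k": 4.18}
AIR = {"density_kg_m3": 1.2, "cp_kj_per_kg_k": 1.005}

def volumetric_heat_capacity(fluid: dict) -> float:
    """Heat carried per cubic meter per kelvin, in kJ/(m^3*K)."""
    return fluid["density_kg_m3"] * fluid["cp_kj_per_kg_k"]

water_capacity = volumetric_heat_capacity(WATER)
air_capacity = volumetric_heat_capacity(AIR)
ratio = water_capacity / air_capacity

print(f"Water: {water_capacity:.0f} kJ/(m^3*K)")
print(f"Air:   {air_capacity:.2f} kJ/(m^3*K)")
print(f"Water carries roughly {ratio:.0f} times more heat per unit volume")
```

With these values the ratio lands in the range of three to four thousand, consistent with the order of magnitude stated above.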
For the immediate future, most enterprise data centers will continue to use standardized OEM IT hardware, but if they are to survive the age of cloud-based hyperscale services, they need cooling systems that are more intelligent and interactive with the dynamic heat load and airflow characteristics of the IT equipment. Only then will we begin to fully realize true thermal management of the IT equipment and be able to optimize the entire “chip-to-atmosphere” heat transfer process, rather than just data center cooling.