The 'Rackonomics' of Liquid Cooling
Show me the money.
In several of my past columns, I have discussed liquid cooling. It would appear my articles over the past five years have had their effect, since lately it seems like almost everyone is offering some form of liquid cooling system. Moreover, chip manufacturers have begun to design more powerful processors that can only be liquid-cooled, so the end must be in sight for air-cooled computing hardware.
While that prediction may seem a bit of an overreach, there is clearly greater awareness of and interest in liquid cooling across different sectors, driven by various applications and motivations. Liquid cooling was originally used on early mainframes 50 years ago. While it effectively disappeared for a few decades, it began to reemerge with new supercomputers and high-performance computing (HPC) systems. Nonetheless, air-cooled IT equipment still dominates traditional data centers, and the primary factors that drive the mainstream enterprise and colocation facilities markets remain function, performance, and cost-effectiveness.
Cost-effectiveness can be seen through several different lenses: initial upfront investment (CapEx), operational cost (OpEx), and return on investment (ROI).
Moreover, these factors break down differently for a data center facility and the IT equipment (ITE). At this juncture, the marriage of air-cooled ITE and the formula for designing and building data centers is relatively mature, even though the scale of the facilities has changed in both physical space and power. The metrics for ITE relative to the facility are focused on the number of racks and the power density per rack. This ultimately becomes an economic decision as to how much space and power distribution (number and size of circuits per rack) is required for a given amount of ITE power load (e.g., per megawatt).
For example, at an average of 5 kW per rack, 1 MW of ITE load requires 200 racks and 200 circuits and PDUs (400 if redundant). In the case of a colo facility, the customer pays for the required space, number of circuits, and, of course, the power/energy. Obviously, at 10 kW per rack, the number of racks, circuits, and space is reduced by 50%, but the total power/energy is the same. This ratio is clearly oversimplified: many other factors (network cabling and equipment, storage systems, and the types and power density of servers/blade servers) affect the accuracy of this reduction.
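The rack arithmetic above can be sketched in a few lines of Python. This is only an illustration of the simplified ratio, not a sizing tool: the function name is hypothetical, and it assumes one circuit/PDU per rack (two when redundant), which is how the 200 and 400 figures work out.

```python
import math

def racks_and_circuits(load_kw, kw_per_rack, redundant=False):
    """Return (racks, circuits) needed for a given ITE load.

    Assumption: one circuit/PDU per rack, doubled for redundancy.
    Ignores cabling, storage, and mixed-density effects.
    """
    racks = math.ceil(load_kw / kw_per_rack)
    circuits = racks * (2 if redundant else 1)
    return racks, circuits

# 1 MW of ITE load at 5 kW and 10 kW per rack
print(racks_and_circuits(1000, 5))                  # (200, 200)
print(racks_and_circuits(1000, 5, redundant=True))  # (200, 400)
print(racks_and_circuits(1000, 10))                 # (100, 100)
```

Doubling the density halves the racks and circuits while the total load stays at 1 MW, which is the whole point of the ratio.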
While there is no question that average ITE power density requirements have gone up to meet performance demands, in theory, this should require less space and fewer racks per megawatt of critical load. In effect, this rack power density ratio impacts both the CapEx and OpEx of the facility and the IT equipment owners/users, which is why I refer to it as “rackonomics,” a term coined in 2008 by Blade Network Technologies in relation to a different concept.
Many older data center facilities (enterprise owned or colo) were effectively limited to about 5 kW per rack, but newer facilities for air-cooled ITE are designed for 10 to 20 kW per rack. That still does not ensure that the maximum benefits of the higher power-to-space ratio are fully realized. This ratio assumes all racks are actually loaded at or near that power level, which is not usually the case in a mixed ITE environment. In many typical enterprise scenarios (and enterprise in a colo), the max-to-min kW per rack ratio is on the order of 5:1 to 10:1, and the max-to-average kW per rack ratio may be 3:1 to 5:1. It is these ratios that impact the rackonomics.
Even air-cooled hyperscale environments, which use thousands of 1U servers or blade chassis, rarely have average power densities at or beyond 10 to 15 kW per rack. This is due in part to fan power and fan energy costs (both the cooling system fans and the internal ITE fans), which tend to increase substantially as power densities increase. Although hyperscalers (colo and internet-cloud services scale) build large facilities, the building shell cost represents a relatively low percentage of the CapEx in relation to the power and cooling infrastructure cost.
Liquid cooling scenarios are primarily focused on higher power density applications (25, 50, and 100 kW per rack). These can be deployed with minimal or no impact on an air-cooled facility. For example, at 50 kW per rack, only 20 liquid-cooled ITE racks are required to support 1 MW, compared to 100 racks at 10 kW per rack. In many cases, liquid-cooled IT equipment does not need mechanically chilled water. This saves the CapEx and OpEx of a chiller as well as the cost of 80% fewer racks, PDUs, and associated electrical circuits.
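As a rough check on the 80% figure, the comparison can be written out in Python. The function name and the 1 MW default are assumptions for illustration, and the sketch counts racks only; it says nothing about the actual cost of racks, PDUs, or circuits.

```python
import math

def rack_reduction(base_kw_per_rack, dense_kw_per_rack, load_kw=1000):
    """Compare rack counts for the same load at two power densities.

    Returns (base_racks, dense_racks, percent_fewer_racks).
    Assumes every rack is loaded to its stated density.
    """
    base_racks = math.ceil(load_kw / base_kw_per_rack)
    dense_racks = math.ceil(load_kw / dense_kw_per_rack)
    percent_fewer = 100 * (base_racks - dense_racks) / base_racks
    return base_racks, dense_racks, percent_fewer

# 10 kW air-cooled racks versus 50 kW liquid-cooled racks for 1 MW
print(rack_reduction(10, 50))  # (100, 20, 80.0)
```

In practice, racks are rarely all loaded to the design density, so the real reduction depends on the max-to-average ratios discussed earlier.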
One of the issues that seems to come up is that liquid cooling systems (for standard air-cooled IT), such as in-row, rear-door, or enclosed cabinets, are more expensive than traditional raised-floor perimeter cooling systems. While this is true to a certain extent, a lot of the plumbing costs are related to retrofitting them into an existing traditional system in an operating data center. It is also true that at the moment, liquid-cooled IT equipment is more expensive than its air-cooled counterparts. However, pricing is driven by several factors. The first is volume, especially if you are comparing the cost of 1U “commodity” servers (which may draw 200 to 500 W) with one or two CPUs and a relatively small amount of memory against a low-volume liquid-cooled server with high-power CPUs and power densities of 1 to 2 kW per 1U.
This is where a liquid-cooled server based on liquid-cooled heat sinks on the CPUs and memory begins to change the rackonomics. Major OEMs like Lenovo have been offering more HPC server and blade server models with liquid cooling. A full rack of these liquid-cooled servers can run at 50 to 60 kW, compared to three or four racks of 15- to 20-kW air-cooled servers with comparable CPU/core count and memory capacity to support the same computing workload. While not directly cost competitive as of yet (in part due to low volumes), the rackonomics ratio starts to become more apparent. The cooling OpEx also comes into play, since these servers can operate using warm water (ASHRAE W4).
On the other end of the spectrum, the Open Compute Project (OCP) formed the Advanced Cooling Solutions (ACS) subproject in July 2018 to focus on how liquid cooling can be used to improve cooling performance and energy efficiency, as well as lower the cost of the hardware.
This July, Cerebras Systems announced the wafer scale engine (WSE) processor, which it claims is the largest commercial chip ever manufactured, built to solve the problem of deep learning compute. The WSE has 1.2 trillion transistors packed onto a single 215-by-215-mm chip with 400,000 artificial intelligence (AI)-optimized cores connected by a 100 Pbit/s interconnect. And by the way, in case you were wondering: The WSE can draw up to 15 kW, and, yes, you guessed it ... it needs to be liquid cooled!
The Bottom Line
To be realistic, it is hard to match the relative convenience of being able to “rack and stack” and service most air-cooled IT equipment. Moreover, we have seen a huge improvement in cooling systems for air-cooled ITE as well as the ITE itself. However, that simplicity did not just suddenly happen. It took many years for data center layouts and cabinets to evolve from the front-facing, solid-front IT cabinets with top-to-bottom airflow from the mainframe days into the hot-aisle/cold-aisle layouts that we take for granted today. In fact, even into the late ’90s, there were few universal ITE cabinets.
That being said, liquid cooling systems come in many flavors, shapes, and form factors, and they are still evolving. Liquid-cooled ITE offers many technical and “rackonomic” advantages (just ask any vendor). However, form factors and other items are not interchangeable between vendors, which holds back some buyers, and liquid cooling may not make sense for every application or situation. Mainstream data center owners and users are risk-averse and don’t accept change quickly. When the internet giants first began to use direct air-side “free cooling” more than a decade ago, traditional enterprise data center users just laughed.
It took quite a few years for the conservative data centers to move beyond ASHRAE’s first edition of the “Thermal Guidelines” (2004), which defined the original “recommended” 68° to 77°F environmental envelope. In 2011, ASHRAE released its third edition of the “Thermal Guidelines,” which incorporated direct air-side free cooling information. It also introduced the “allowable” environmental envelope categories A1-A4, which went as high as 113°F ITE intake temperatures.
Very few people are aware that ASHRAE has had liquid cooling guidelines since 2006. This year, ASHRAE released a whitepaper titled “Water-Cooled Servers: Common Designs, Components, and Processes.” Currently, multiple organizations, including ASHRAE, The Green Grid, the Open Compute Project (OCP), Open19, and the U.S. Department of Energy (DOE), are collaborating to create a framework, guidelines, and specifications for liquid cooling that will cover form factor, piping, quick-disconnect dripless connectors, coolant distribution manifolds, etc., for both ITE and racks.
As a response to climate change, the data center industry and ITE manufacturers have made many improvements in overall energy efficiency. Nautilus floating barges or submerged data centers, such as Microsoft’s Project Natick, tout their cooling efficiency by using seawater for cooling. Rejecting waste heat into the water, instead of the air, is more energy efficient than traditional compressor-based mechanical cooling. Nonetheless, even at a PUE of 1.0x, every megawatt of heat is rejected into the environment, so waterborne data centers still do not really mitigate climate change.
And while there have been some efforts at energy recovery, it is very difficult or costly to effectively harvest the waste heat of air-cooled ITE. One of the long-term benefits of liquid cooling lies in the ASHRAE W4 (up to 113°F) and W5 (above 113°F) categories of IT equipment, which can deliver rejected fluid temperatures of 140° to 150°F. These waste heat fluid temperature ranges offer more cost-effective opportunities for a significant percentage of energy recovery.
As we enter the next decade, hyperscale data centers with gigawatt-level campuses have become the new normal. They are leading the way regarding energy efficiency and utilizing sustainable energy sources. I believe the ongoing development of liquid cooling and widening adoption will be led by these hyperscalers, and not just for the “rackonomics.”