Liquid Cooling Moves Upstream to Hyperscale Data Centers
Is this the inflection point for liquid cooling?
You may have noticed that liquid cooling has had a lot more activity and press coverage this past year. Nonetheless, to many users in traditional enterprise data centers and to colocation providers, it still has the reputation of being a niche solution for limited and specialized computing segments, such as high performance computing (HPC) and supercomputing applications. While the thermal transfer effectiveness and energy efficiency of liquid cooling compared to air are well known, enterprise deployments have been relatively limited. And while I am a great proponent of liquid cooling, to be fair, it is not the best solution for every application. There are still misunderstandings about when and how (or even if) liquid cooling should be implemented in a “conventional” data center environment.
Part of the issue is the relative simplicity and convenience of deploying existing air cooled IT equipment (ITE): just rack and stack servers in a cabinet and you are done. Moves, adds, and changes (MAC) are easy; just bring a screwdriver. Everything works fine until your ITE cabinet power density increases beyond 5 kW and thermal issues arise. The dreaded “hot spot” has become more common, especially since the introduction of blade servers. For example, a classic 10,000 sq ft raised floor facility designed for a 1 MW critical load has an average power density of 100 W/sq ft. When filled with 300 cabinets, that represents an average of about 3.3 kW per cabinet. In contrast, a typical blade server chassis can draw 4 to 6 kW per 8U to 10U chassis, so installing four blade chassis results in 16 to 24 kW per cabinet, and you can guess the result. Depending on the age and the design of the data center, this becomes a tactical issue once you move past installing a few blade servers here and there. While you have some leeway if you spread out the higher density ITE cabinets, it becomes more difficult, if not impossible, to cluster them together, which is normally what IT departments want to do with blade servers and SAN storage systems. Even if you use 1U servers, which can draw 200 to 500 watts each, 40 servers in a rack represent a similar density issue of 8 to 20 kW per cabinet.
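For readers who want to rerun the density arithmetic with their own numbers, here is a minimal Python sketch. The input figures are the ones quoted in the paragraph above; the function names are hypothetical and chosen only for illustration.

```python
# Back-of-the-envelope power density figures from the paragraph above.
# Function names are illustrative assumptions, not from any standard tool.

def watts_per_sq_ft(critical_load_w: float, floor_area_sq_ft: float) -> float:
    """Average design power density of the raised floor."""
    return critical_load_w / floor_area_sq_ft


def avg_kw_per_cabinet(critical_load_w: float, cabinets: int) -> float:
    """Average power available per cabinet if the load is spread evenly."""
    return critical_load_w / cabinets / 1000.0


def cabinet_kw_range(units: int, low_w: float, high_w: float) -> tuple[float, float]:
    """Low and high cabinet draw for `units` identical chassis or servers."""
    return (units * low_w / 1000.0, units * high_w / 1000.0)


if __name__ == "__main__":
    # 1 MW critical load on a 10,000 sq ft floor with 300 cabinets
    print(watts_per_sq_ft(1_000_000, 10_000), "W/sq ft")
    print(round(avg_kw_per_cabinet(1_000_000, 300), 1), "kW per cabinet (average)")

    # Four blade chassis at 4 to 6 kW each
    print(cabinet_kw_range(4, 4_000, 6_000), "kW per cabinet (blades)")

    # Forty 1U servers at 200 to 500 W each
    print(cabinet_kw_range(40, 200, 500), "kW per cabinet (1U servers)")
```

Comparing the last two results with the per-cabinet average shows how far a fully loaded cabinet exceeds what the room was designed to cool on average.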
One of the more common solutions is adding close-coupled cooling systems, also called supplemental cooling. This was initially introduced as row based cooling over 15 years ago in response to the blade server. However, it was originally met with resistance, since it commonly used chilled water piped to row cooling units placed right next to the ITE. And so began the fear of water in the data center, which I later termed “data center hydrophobia” in 2014.
There are many variations of “liquid cooling” based on the concept of close-coupled cooling for air cooled ITE, such as aisle based overhead cooling systems and enclosed cabinet cooling, where the cooling coil is incorporated into the cabinet or added to existing cabinets in the form of “rear door cooling.” Moreover, water is not the only liquid; a variety of systems use dielectric fluids such as refrigerants or engineered fluids to mitigate concerns about water leaks impacting the ITE. Field experience with actual deployments running at 15 to 30 kW per cabinet has helped provide a comfort level in the data center community, and these systems are becoming more accepted; however, they still represent only a small percentage of cooling systems for air cooled ITE.
The more recent liquid cooling developments include the modification of standard air cooled ITE by replacing the heat sinks on the CPUs with liquid cooled heat sinks (i.e., a liquid cooled heat exchanger, or LCHX), which removes heat from the largest heat source in the ITE. Originally this was seen by the server manufacturers as an unapproved modification that would violate the warranties; more recently it has become an approved modification for some CPU and server manufacturers (some even have liquid cooled heat sinks on the memory chips). Of course, this really crossed the hydrophobia threshold, since liquid now had to be plumbed to and flow through each server, something that not every IT department was ready to accept or be comfortable with. However, that may be beginning to change as more OEMs offer it as a factory-approved option.
And then there is the ultimate thermal solution: immersion cooling of IT hardware, which has been on the periphery since the beginning of the decade. Originally introduced by Green Revolution Cooling, it involves submerging the ITE in a tub of non-conductive dielectric fluid (such as mineral oil), which effectively absorbs 100% of the heat from all of the components. This typically involves using standard 1U servers or open motherboards that have been modified by removing the fans and modifying or removing the hard drives. Another, more recent implementation of immersion cooling uses industry standard server cards in sealed modules with dripless connectors. This allows for easy installation and removal of server modules in a manner similar to blade servers.
While this may not be for everyone, it was an ideal thermal solution for some HPC systems and supercomputers. Think in terms of 100 to 200 kW per cabinet. Even more interesting, it was jumped on by the Bitcoin miners, who built some fairly large scale facilities, such as the Bitfury 40-MW facility in 2015, which used phase change immersion cooling fluid (3M Novec). Of course, as the price of Bitcoin and other cryptocurrencies fell, so did the economics of mining (liquid cooled or otherwise).
With each passing year, liquid cooling advocates proclaim (or hope) that we are on the cusp of an upswing in adoption. But now it seems that we may actually be at the beginning of a true inflection point, as demonstrated by Google’s third-generation artificial intelligence (AI) processor, the Tensor Processing Unit (TPU) 3.0. Announced in May of 2018, it requires liquid cooled heat sinks because it draws significantly more power than previous generations of TPUs. What makes this all the more significant is that this is a production system, not just a test platform: Google is using it to provide its AI as a cloud service.
Google and the other hyperscalers originally led the way in changing the rigid perception that the data center must be maintained at 68°F and 50% relative humidity (RH). They successfully operated at higher and wider environmental ranges by using outside air directly to provide “free cooling,” which was considered unthinkable 10 years ago. In the same way, their use of liquid cooling may help drive its more widespread acceptance. Moreover, their example may also have helped prompt ASHRAE to reconsider its relatively narrow environmental envelope, which resulted in the introduction of the expanded A1-A4 “allowable” ranges in the 3rd edition of the ASHRAE Thermal Guidelines in 2011; the envelope was broadened again in 2015 in the 4th edition.
While the industry is well aware of the ASHRAE Thermal Guidelines for air cooled IT, fewer are aware that the 2015 guidelines now also cover liquid cooling categories W1 to W5. Despite the ASHRAE Guidelines, there are still many myths and misconceptions about liquid cooling in the industry. To address this, The Green Grid (TGG) published the Liquid Cooling Technology Update white paper #70 and released it to the public in 2017. It reviews the factors to consider for new facilities, as well as when it may make sense to add liquid cooling to accommodate very high-density applications in existing data centers. It also covers recently developed liquid cooling technologies that may not be covered by current ASHRAE publications. The paper defines and clarifies liquid cooling terms, system boundaries, topologies, and heat transfer technologies, and it can be downloaded for free. Access it at https://bit.ly/2sE6M3c.
There are other recent signs that liquid cooling may become more commonly accepted. At Skybox Datacenters in Houston, DownUnder GeoSolutions (DUG) is building a huge geophysically-configured supercomputer, fully submerging standard HPC servers in specially-designed tanks filled with polyalphaolefin dielectric fluid. The use of liquid cooling allowed cost effective, energy efficient deployment of high density computer modeling for energy companies, bringing new levels of performance to oil and gas exploration.
The bottom line
Intel and other chip manufacturers can and want to deliver faster, more powerful processors; however, these require 250 to 500 watts of heat removal, which makes liquid cooling virtually mandatory. Nevertheless, in mainstream data centers, air cooled ITE still predominates, thus limiting processor power levels.
I and others have been discussing the developments and advantages of “liquid cooling” for years, and I believe it is clearly the next logical step for all the hyperscalers. One sign of this is that the Open Compute Project (OCP) is responding to an industry need to collaborate on liquid cooling and other advanced cooling approaches. OCP has formed the “Advanced Cooling Solutions” project to harmonize the supply chain, thus enabling delivery of building blocks for wider, quicker, and easier adoption of liquid cooled servers, storage, and networking gear.
There is also the “Open Standards Harmonization Working Group,” which includes Lawrence Berkeley National Laboratory, Intel, Facebook, LinkedIn, Google, and Microsoft, as well as the China Institute of Electronics, Alibaba, Baidu, and Tencent. The working group released a progress update on the draft of the “Open Specification for a Liquid Cooled Server Rack” in June of 2018.
Besides the fear of leaks, cost is an important factor: many users have the perception that liquid cooled servers are more expensive to purchase and maintain than air cooled servers. While this is somewhat true at the moment, it is primarily due to the lower production volume of liquid cooled servers and the associated heat transfer systems. On the operating side, liquid cooled servers can reduce or eliminate the need for mechanical chillers, since the CPU chip can be “cooled” by “warm water” (ASHRAE Class W4: 95-113°F) or even “hot water” (W5: above 113°F).
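For readers who work in metric units, the warm and hot water thresholds quoted above can be converted with a short sketch like the following. The Fahrenheit limits are the ones given in the text, not values taken directly from the ASHRAE publication, and the names are hypothetical.

```python
# Convert the water temperature limits quoted in the text to Celsius.
# The Fahrenheit figures come from the article; the names are illustrative.

def fahrenheit_to_celsius(temp_f: float) -> float:
    """Standard Fahrenheit-to-Celsius conversion."""
    return (temp_f - 32.0) * 5.0 / 9.0


W4_RANGE_F = (95.0, 113.0)  # "warm water" range as quoted in the text
W5_FLOOR_F = 113.0          # "hot water" starts above this temperature

if __name__ == "__main__":
    low_c, high_c = (fahrenheit_to_celsius(t) for t in W4_RANGE_F)
    print(f"W4 as quoted: {low_c:.0f} to {high_c:.0f} °C")
    print(f"W5 as quoted: above {fahrenheit_to_celsius(W5_FLOOR_F):.0f} °C")
```

Supply water in this range is well above typical outdoor temperatures in many climates, which is why heat can often be rejected without mechanical chillers.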
As various large scale projects by the hyperscalers validate the IT performance increases and energy efficiency benefits of liquid cooling, they will help increase adoption and volume manufacturing, which in turn will drive prices lower. This will ultimately make liquid cooled systems more cost effective to purchase and operate than air cooled ITE.
So will this be the year that breaks the liquid cooling logjam? That is yet to be seen, but if there is anything that drives hyperscalers (as well as most businesses) to adopt new technology, it’s the flow of money!