Maintaining Cooling in a Hot Crisis
Power is not the only essential service
A 2006 AFCOM membership survey revealed that one out of every four data centers will experience a business disruption serious enough to affect the company’s ability to continue business as usual. Of those surveyed, 77 percent had experienced at least one business disruption in the past five years, and 15 percent described that disruption as “very serious.” One of the largest concerns facing data center facility managers is the loss of cooling capacity, particularly in the middle of the night when staff may not be present, because it can lead to the loss of customer data. A cooling unit failure led to the well-publicized data loss at Nokia’s Ovi personal portal last year.
Cooling failures can range from the loss of an individual cooling unit within the data center to the loss of a chiller, which, depending on the design, could take down multiple data centers. Any loss of cooling capability will be accompanied by a rise in room temperature and rack inlet temperatures. CFD modeling can be an essential tool for answering key questions on facility managers’ minds, including, “What does the failure of a specific cooling (CRAC) unit do to local server inlets?” and, in the case of a major failure, “How much time will elapse before critical temperatures are reached at the rack inlets, and which of them will reach those limits first?”
A new study by Opengate Data Systems found that a typical data center running at 5 kilowatts (kW) per server cabinet will experience a thermal shutdown within three minutes of a power outage. Higher density cabinets with 10 kW will shut down in less than a minute.
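The minutes-to-shutdown figures above can be sanity-checked with a crude lumped-capacitance estimate. The sketch below assumes the room air is perfectly mixed and ignores the thermal mass of the equipment, walls, and raised floor, so it understates the real time available; the air properties and the 10 K rise threshold are assumptions, while the room dimensions and heat load come from the model described later in this article.

```python
# Crude lumped-capacitance estimate of room-air heatup after a total
# cooling failure. A sketch, not a CFD result: it assumes perfectly
# mixed room air and ignores equipment/structural thermal mass.
AIR_DENSITY = 1.2   # kg/m^3, near sea level (assumed)
AIR_CP = 1005.0     # J/(kg*K), specific heat of air (assumed)

def seconds_to_rise(heat_load_w, room_volume_m3, delta_t_k):
    """Time for perfectly mixed room air to warm by delta_t_k kelvin."""
    heat_capacity = room_volume_m3 * AIR_DENSITY * AIR_CP  # J/K
    return delta_t_k * heat_capacity / heat_load_w

# Room from the modeled facility: 1,800 sq ft x 10 ft = 18,000 cu ft
room_m3 = 18_000 * 0.0283168      # cubic feet -> cubic meters
t = seconds_to_rise(heat_load_w=315_250, room_volume_m3=room_m3,
                    delta_t_k=10)
print(f"~{t:.0f} s for a 10 K rise")
```

Even this optimistic model gives only tens of seconds before a 10 K rise at a 315 kW load, consistent with shutdowns occurring within minutes.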
Mark Monroe, chief technology advisor at Integrated Design Group, said that most data centers use set points between 68°F and 72°F. Monroe did an informal survey of 14 data centers at one high-tech company and found that eight had the temperature set at 68°F, five at 72°F, and one at 74°F, even though the corporate policy called for set points of 74°F or above.
“If you’re running at 68°F, you’re running in the bottom quarter of the ASHRAE recommended temperature range,” he said recently. “There’s no reason why you can’t move to 78°F. This is a really simple thing to do, and you can save as much as 3 to 4 percent of the cooling system energy for each degree Fahrenheit that you increase the temperature.”
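Monroe’s rule of thumb is easy to turn into a quick estimate. The sketch below applies his 3 to 4 percent-per-degree figure linearly; the linear scaling and the function name are assumptions for illustration, since real chiller and economizer behavior is nonlinear.

```python
# Back-of-the-envelope cooling-energy savings from raising the
# supply-air set point, using the quoted 3-4 percent per degree
# Fahrenheit. Linear scaling is an assumption.
def setpoint_savings(old_f, new_f, pct_per_degree=0.035):
    """Fraction of cooling-system energy saved (rule of thumb)."""
    return (new_f - old_f) * pct_per_degree

# Moving from 68 F to 78 F at 3 percent per degree:
print(f"{setpoint_savings(68, 78, 0.03):.0%}")   # roughly 30 percent
```

At 3 to 4 percent per degree, the 68°F-to-78°F move Monroe describes would save roughly 30 to 40 percent of cooling-system energy.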
To illustrate this point, a relatively small 1,800 square foot (sq ft) raised-floor data center was modeled using CFD techniques. The room is 10-ft high with an 18-in. supply plenum. The IT equipment is distributed in twelve rows of racks, some of which contain gaps between the servers (figure 1). Each row contains five racks; with a total heat load of 315.25 kW in the room, the racks carry an average heat load of 5.25 kW each. The heat density in the room is 174 W/sq ft. Each of the four CRACs along one side of the room delivers a 60°F supply temperature at 12,000 cubic feet per minute (CFM). The combined CRAC flow rate is 18 percent above the total airflow demand from the IT equipment. Two rows of Tate GrateAire-24 tiles (56 percent open) line the cold aisles.
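The 18 percent oversupply figure can be cross-checked against the heat load with the common airflow shortcut CFM ≈ 3,160 × kW / ΔT(°F). The snippet below works backward from the stated CRAC flow and oversupply to the implied average server temperature rise; the 3,160 constant is the standard rule of thumb, and the back-calculated ΔT is an inference, not a value from the model description.

```python
# Cross-check the stated CRAC oversupply against the heat load using
# the shortcut CFM ~= 3160 * kW / delta_T(F). The implied server
# delta-T is inferred, not given in the model description.
CRAC_CFM = 4 * 12_000       # four CRACs at 12,000 CFM each
OVERSUPPLY = 0.18           # supply is 18 percent above IT demand
HEAT_LOAD_KW = 315.25

it_demand_cfm = CRAC_CFM / (1 + OVERSUPPLY)
implied_delta_t_f = 3160 * HEAT_LOAD_KW / it_demand_cfm
print(f"IT demand ~{it_demand_cfm:,.0f} CFM, "
      f"implied server delta-T ~{implied_delta_t_f:.0f} F")
```

The numbers are self-consistent with an average server temperature rise in the mid-20s °F, a typical value for this class of equipment.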
STEADY-STATE RESULTS

As a first step, a steady-state calculation is performed using CoolSim to obtain a picture of the data center under normal operating conditions. For the IT equipment, the most important result is the maximum inlet temperature on the racks (figure 2). The 2008 ASHRAE guidelines recommend a maximum inlet temperature of 80.6°F but publish an allowable maximum of 90°F. Of the 400 servers in the room, 27 are above the recommended maximum and four are at or above the allowable maximum, with values of 90°F, 90°F, 93°F, and 94°F. These four servers are in the two circled regions in figure 2. Their average inlet temperatures are all below the allowable limit, however, with values of 82°F, 86°F, 85°F, and 83°F, respectively. This is acceptable performance for the data center as a whole, but certainly not optimal. Ideally, every rack would have a maximum inlet temperature below the recommended maximum.
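Classifying inlet temperatures against the two ASHRAE thresholds is a simple post-processing step. The helper below applies the 80.6°F recommended and 90°F allowable limits quoted above; the sample temperatures are illustrative stand-ins, not the actual 400 CFD probe values.

```python
# Classify rack inlet temperatures against the 2008 ASHRAE limits
# quoted in the article: recommended max 80.6 F, allowable max 90 F.
RECOMMENDED_MAX_F = 80.6
ALLOWED_MAX_F = 90.0

def classify(inlet_temps_f):
    """Return (count above recommended max, count at/above allowable max)."""
    above_rec = [t for t in inlet_temps_f if t > RECOMMENDED_MAX_F]
    at_or_above_allowed = [t for t in inlet_temps_f if t >= ALLOWED_MAX_F]
    return len(above_rec), len(at_or_above_allowed)

sample = [72.0, 78.5, 81.2, 85.0, 90.0, 93.0, 94.0]  # hypothetical values
print(classify(sample))
```

Run against the full set of 400 maximum inlet temperatures, a check like this would reproduce the 27-above-recommended, four-at-or-above-allowable tally reported for this model.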
Figure 3 shows temperature contours 3 ft above the floor. There are several high-temperature regions on the exhaust sides of the racks. These produce high room temperatures and are of particular concern in this data center because of the gaps between the equipment in the racks. The path lines in figure 4 show return air in one of these regions leaking through those gaps and heating the supply air to unsafe temperatures. Steady-state results such as these provide helpful information in advance of a transient CRAC failure calculation, since probes can be positioned in the problem areas to track the rising temperatures.
TRANSIENT CRAC FAILURE ANALYSIS: PARTIAL FAILURE

The data center model, built in CoolSim, is exported to Airpak, which performs two transient calculations. At the start of the first transient run, the two CRACs on the left side of the room (in plan view) are disabled. That is, their fans are shut down, and the CRACs are represented as hollow blocks with adiabatic boundary conditions on all sides. Because two CRACs continue to operate, this case represents a partial failure of the cooling system. Monitor points are created at four locations in the data center: two in hot aisles and two in cold aisles, as shown in figure 5. The steady-state data are used as the starting point, and a transient calculation is performed for approximately two minutes following the CRAC failure. A time-step of 0.1 seconds (s) is used and data are saved every 15 s.
Figure 6 shows the temperatures recorded at the monitor points during the first 2 minutes following the failure of the two CRACs. The temperatures in all four locations change initially, but they soon stabilize at new values. One hot aisle temperature increases dramatically while the other increases but soon returns to slightly below its initial value. One cold aisle temperature shows a marked increase while the other shows a decrease.
Taking a closer look, the point with the highest final temperature, Hot_Aisle_2, is closest to one of the working CRACs. When two of the CRACs are disabled, air rushes out of the two working units to fill the plenum, and the high-speed air induces negative (downward) flow through some of the nearby perforated tiles. Figure 7, which shows the y-component of velocity on the top surface of the vent tiles two minutes after the failure, illustrates this effect, which is common in data centers. The velocity is negative in front of the servers that exhaust into the region around Hot_Aisle_2. This lack of supply air starves the servers, causing them to draw air from nearby rack exhausts instead. Contours of temperature on a plane 5 ft above the floor, shown in figure 8, further illustrate the resulting hot spots.
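Flagging reverse-flow tiles of the kind just described is another straightforward post-processing step on the CFD output. The sketch below filters tile-top vertical velocities for negative values; the tile names and velocities are made up for illustration.

```python
# Flag perforated tiles with negative vertical (y) velocity, i.e.
# tiles being pulled downward into the plenum instead of supplying
# cold air. Names and values here are illustrative, not CFD output.
tile_y_velocity_fpm = {            # feet per minute; positive = upward supply
    "cold_aisle_1_tile_3": 410.0,
    "cold_aisle_1_tile_4": 325.0,
    "cold_aisle_2_tile_1": -85.0,  # hypothetical reverse flow near a CRAC
    "cold_aisle_2_tile_2": -40.0,
}

starved = sorted(name for name, v in tile_y_velocity_fpm.items() if v < 0)
print(starved)
```

Sorting the flagged tiles by location makes it easy to correlate them with the hot spots seen in the temperature contours, as was done here for the tiles near Hot_Aisle_2.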