A 2006 AFCOM membership survey revealed that one out of every four data centers will experience a business disruption serious enough to affect the company’s ability to continue business-as-usual. Of those surveyed, 77 percent had at least one business disruption in the past five years, and 15 percent admitted this disruption was “very serious.” One of the largest concerns facing data center facility managers is the loss of cooling capacity, particularly in the middle of the night, when staff may not be present, leading to the potential loss of customer data. A cooling unit failure led to the well-publicized data loss at Nokia’s Ovi personal portal last year.
“A cooler broke down in the hosting center that we run the Chat service in,” explained Kristian Luoma, product manager for Contacts on Ovi. “This event led to two catastrophic consequences from our point of view. First, we had to ramp down the service for a very long period, in fact most of the morning and afternoon before we had a service break. Second, our database broke down. Despite the fact that we had regular back-ups, we were not able to set it right.”
Cooling failures can range from the loss of an individual cooling unit within the data center to the loss of a chiller, which could take down multiple data centers, depending on the design. Any loss of cooling capability will be accompanied by an increase in the room temperature and rack inlet temperatures. CFD modeling can be an essential tool to answer key questions on facility managers’ minds including, “What does the failure of a specific cooling (CRAC) unit do to local server inlets?” and, in the case of a major failure, “How much time will elapse before critical temperatures are reached at the rack inlets, and which of them will reach those limits first?”
As the pressure to reduce data center energy consumption (and the consequent carbon emissions) continues to rise, data center managers are encouraged to increase the cooling system baseline temperatures, known as set points, inside their data centers. An energy savings of approximately 4 percent can be obtained for every degree of upward change in the cooling system set point. Yet many data center managers are reluctant to increase the cooling set points for fear that hot spots will result and that the time available to respond to a cooling system failure will shrink. Here again, CFD modeling can accurately predict this scenario, providing data center managers with valuable information such as which air management techniques can eliminate hot spots, how much time will elapse before servers reach a critical temperature, and which of the servers are most vulnerable.
A new study by Opengate Data Systems found that a typical data center running at 5 kilowatts (kW) per server cabinet will experience a thermal shutdown within three minutes of a power outage. Higher density cabinets with 10 kW will shut down in less than a minute.
“Thermal runaways can wreak havoc on a data center causing instant data loss and lost revenue,” said Martin Olsen, director of product management and development for Active Power, the maker of UPS flywheel systems that sponsored the study. But if the facility is well managed, the benefits of raising the set point outweigh the risks. Further, in the case of a complete cooling failure, the temperature rise is so rapid that the increased set point does not help much.
Mark Monroe, chief technology advisor at Integrated Design Group, said that most data centers use set points between 68°F and 72°F. Monroe did an informal survey of 14 data centers at one high-tech company and found that eight had the temperature set at 68°F, five at 72°F, and one at 74°F, even though the corporate policy was for set points to be 74°F or above.
“If you’re running at 68°F, you’re running in the bottom quarter of the ASHRAE recommended temperature range,” he said recently. “There’s no reason why you can’t move to 78°F. This is a really simple thing to do, and you can save as much as 3 to 4 percent of the cooling system energy for each degree Fahrenheit that you increase the temperature.”
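The per-degree figure quoted above can be turned into a rough estimate of what a set point change might save. The sketch below is illustrative only: the function name and its parameters are hypothetical, and the article does not say whether the per-degree savings add up linearly or compound, so both readings are shown as assumptions.

```python
def cooling_energy_savings(old_setpoint_f, new_setpoint_f,
                           savings_per_degree=0.04, compounding=False):
    """Estimate the fraction of cooling-system energy saved by raising the set point.

    savings_per_degree is the rule-of-thumb fraction per degree Fahrenheit
    (0.03 to 0.04 in the article). Whether it adds linearly or compounds
    is an assumption; the article does not specify.
    """
    degrees = new_setpoint_f - old_setpoint_f
    if degrees < 0:
        raise ValueError("new set point must not be lower than the old one")
    if compounding:
        # Each degree saves a fraction of whatever energy use remains.
        return 1.0 - (1.0 - savings_per_degree) ** degrees
    # Simple linear reading of the rule of thumb, capped at 100 percent.
    return min(1.0, savings_per_degree * degrees)


if __name__ == "__main__":
    # The 68°F to 78°F move suggested in the quote, at both quoted rates.
    for rate in (0.03, 0.04):
        linear = cooling_energy_savings(68, 78, rate)
        compound = cooling_energy_savings(68, 78, rate, compounding=True)
        print(f"{rate:.0%} per °F: linear {linear:.1%}, compounded {compound:.1%}")
```

Either reading is only a rule of thumb; actual savings depend on the cooling plant and climate.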
To illustrate this point, a relatively small 1,800 square foot (sq ft) raised-floor data center was modeled using CFD techniques. The room is 10-ft high with an 18-in. supply plenum. The IT equipment is distributed in twelve rows of racks, some of which contain gaps between the servers (figure 1). Each row contains five racks, and with a total heat load of 315.25 kW in the room, the racks have an average heat load of 5.25 kW. The heat density in the room is 174 W/sq ft. Each of the four CRACs along one side of the room delivers air at a 60°F supply temperature and a flow rate of 12,000 cubic feet per minute (CFM). The combined CRAC flow rate is 18 percent above the total airflow demand from the IT equipment. Two rows of Tate GrateAire-24 tiles (56 percent open) line the cold aisles.
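The room figures above are related by simple arithmetic, which the following sketch reproduces. The variable names are illustrative; the IT airflow demand is not stated in the article and is inferred here from the 18 percent surplus, so treat it as a derived estimate.

```python
# Inputs taken from the room description.
ROWS = 12
RACKS_PER_ROW = 5
TOTAL_HEAT_LOAD_KW = 315.25
CRAC_COUNT = 4
CRAC_FLOW_CFM = 12_000
CRAC_SURPLUS = 0.18  # combined CRAC flow is 18 percent above IT demand

# Derived quantities.
rack_count = ROWS * RACKS_PER_ROW
avg_rack_load_kw = TOTAL_HEAT_LOAD_KW / rack_count
total_crac_flow_cfm = CRAC_COUNT * CRAC_FLOW_CFM
# Inferred, not quoted: demand = supply / (1 + surplus).
it_airflow_demand_cfm = total_crac_flow_cfm / (1 + CRAC_SURPLUS)

if __name__ == "__main__":
    print(f"racks: {rack_count}")
    print(f"average rack load: {avg_rack_load_kw:.2f} kW")
    print(f"combined CRAC flow: {total_crac_flow_cfm:,} CFM")
    print(f"implied IT airflow demand: {it_airflow_demand_cfm:,.0f} CFM")
```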
STEADY-STATE RESULTS
As a first step, a steady-state calculation is performed using CoolSim to obtain a picture of the data center under normal operating conditions. For the IT equipment, the most important result is the maximum inlet temperature on the racks (figure 2). The 2008 ASHRAE guidelines recommend a maximum inlet temperature of 80.6°F, but publish an allowed maximum value of 90°F. Of the 400 servers in the room, 27 are above the recommended temperature maximum and four are above the allowed temperature maximum, with values of 90°F, 90°F, 93°F, and 94°F. These four servers are in the two circled regions in figure 2. Their average inlet temperatures are all below the acceptable limit, however, with values of 82°F, 86°F, 85°F, and 83°F, respectively. This is acceptable performance for the data center as a whole, but certainly not optimal. In ideal conditions, all of the racks would have maximum inlet temperatures below the recommended maximum value.
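The screening described above, sorting inlet temperatures against the recommended and allowed maximums, can be sketched in a few lines. This is a minimal illustration, not part of CoolSim: the function names are hypothetical and the sample temperatures are made up for the example.

```python
from collections import Counter

# Thresholds quoted from the 2008 ASHRAE guidelines in the article.
RECOMMENDED_MAX_F = 80.6
ALLOWED_MAX_F = 90.0


def classify_inlet(temp_f):
    """Place one inlet temperature into a threshold band."""
    if temp_f <= RECOMMENDED_MAX_F:
        return "within recommended"
    if temp_f <= ALLOWED_MAX_F:
        return "within allowed"
    return "above allowed"


def summarize(inlet_temps_f):
    """Count how many inlets fall into each band."""
    return Counter(classify_inlet(t) for t in inlet_temps_f)


if __name__ == "__main__":
    # Hypothetical maximum inlet temperatures (°F), not the study's data.
    sample = [72.0, 79.5, 82.0, 86.0, 93.0, 94.0]
    for band, count in summarize(sample).items():
        print(f"{band}: {count}")
```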
Figure 3 shows temperature contours 3 ft above the floor. There are several high temperature regions on the exhaust sides of the racks. These produce high room temperatures and are of particular concern in this data center where there are gaps between the equipment in the racks. In figure 4, path lines show return air in one of these regions leaking through gaps between the equipment and heating the supply air to unsafe temperatures. Steady-state conditions such as these provide helpful information in advance of a transient CRAC failure calculation, since probes can be positioned in the problem areas to track the increasing temperatures.
TRANSIENT CRAC FAILURE ANALYSIS: PARTIAL FAILURE
The data center model, built in CoolSim, is exported to Airpak, which performs two transient calculations. At the start of the first transient run, the two CRACs on the left side of the room (in plan view) are disabled. That is, their fans are shut down, and the CRACs are represented as hollow blocks with adiabatic boundary conditions on all sides. Because two CRACs continue to operate, this case represents a partial failure of the cooling system. Monitor points are created at four locations in the data center: two in hot aisles and two in cold aisles, as shown in figure 5. The steady-state data are used as the starting point, and a transient calculation is performed for approximately two minutes following the CRAC failure. A time-step of 0.1 seconds (s) is used and data are saved every 15 s.
Figure 6 shows the temperatures recorded at the monitor points during the first 2 minutes following the failure of the two CRACs. The temperatures in all four locations change initially, but they soon stabilize at new values. One hot aisle temperature increases dramatically while the other increases but soon returns to slightly below its initial value. One cold aisle temperature shows a marked increase while the other shows a decrease.
Taking a closer look, the point with the highest final temperature, Hot_Aisle_2, is closest to one of the working CRACs. When two of the CRACs are disabled, the air rushes out from the two working CRACs to fill the plenum, and the high-speed air causes negative flow through some of the nearby perforated tiles. Figure 7 illustrates this effect, which is common in data centers, by showing the y-component of velocity on the top surface of the vent tiles after two minutes. The velocity is negative in front of the servers that exhaust in the region of the point Hot_Aisle_2. This lack of supply air starves the servers, causing them to draw air from nearby rack exhausts instead. Contours of temperature on a plane 5 ft above the floor, shown in figure 8, further illustrate the resulting hot spots.
One final point should be made regarding the partial failure mode represented here. In order to reach equilibrium with two CRACs shut off, the remaining two CRACs have to work harder so that an overall heat balance is achieved in the room. While these CRACs can operate above their rated cooling capacity in the model, they may not be able to do so in practice. This exercise therefore illustrates the changes to the flow and thermal patterns in the room, but the steady-state condition reached may not correspond to a situation that is sustainable without further study of the specific CRACs in use.
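The heat-balance point can be made concrete with a back-of-the-envelope calculation. The sketch below assumes the room heat load is shared equally among the running CRACs and that each running unit keeps delivering 12,000 CFM; both are simplifying assumptions, the function names are hypothetical, and the rated capacity of the actual units is not given in the article.

```python
TOTAL_HEAT_LOAD_KW = 315.25
CRAC_FLOW_CFM = 12_000
# IT airflow demand inferred from the stated 18 percent surplus with four CRACs.
IT_AIRFLOW_DEMAND_CFM = 4 * CRAC_FLOW_CFM / 1.18


def per_crac_load_kw(active_cracs, total_heat_load_kw=TOTAL_HEAT_LOAD_KW):
    """Heat each running CRAC must remove, assuming an equal split."""
    if active_cracs <= 0:
        raise ValueError("at least one CRAC must be running")
    return total_heat_load_kw / active_cracs


def supply_to_demand_ratio(active_cracs):
    """Ratio of delivered CRAC airflow to the IT equipment's airflow demand."""
    return active_cracs * CRAC_FLOW_CFM / IT_AIRFLOW_DEMAND_CFM


if __name__ == "__main__":
    for n in (4, 2):
        print(f"{n} CRACs running: {per_crac_load_kw(n):.1f} kW each, "
              f"airflow at {supply_to_demand_ratio(n):.0%} of IT demand")
```

Under these assumptions, losing two of the four units doubles the heat each remaining unit must remove and drops the delivered airflow well below what the IT equipment draws, which is why the specific units' rated capacity matters.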
TOTAL FAILURE
The second failure scenario studied corresponds to the case where all four CRACs are suddenly disabled. All supply and return fans are removed from the CRACs, and only the blocks representing their housing remain. Because there is no mechanism in place for heat removal, this simulation will not reach steady state. Instead, the temperature throughout the data center will continue to rise with time. Temperatures at the four monitor points are shown during the first minute (see figure 9). The temperatures in both the hot and cold aisles fluctuate until 40 or 50 s, after which they increase at a steady rate, with the two cold aisle temperatures approaching the same value and the two hot aisle temperatures doing the same.
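A crude energy balance gives a feel for why the temperatures climb so quickly once all heat removal stops. The sketch below is not the CFD calculation used in the study: it treats the room air as a single well-mixed volume, ignores the plenum and the thermal mass of the equipment and building, and uses textbook values for air density and specific heat. It therefore only indicates the order of magnitude of the average rise rate, not the local temperatures a CFD model resolves.

```python
FT3_TO_M3 = 0.0283168          # cubic feet to cubic meters
AIR_DENSITY_KG_M3 = 1.2        # assumed, typical near room temperature
AIR_CP_J_PER_KG_K = 1005.0     # assumed specific heat of air


def well_mixed_rise_rate_f_per_s(heat_load_kw, floor_area_sqft, height_ft):
    """Average air temperature rise rate (°F/s) with no heat removal.

    Lumped model: dT/dt = Q / (rho * V * cp), with only the room air
    absorbing the heat. All other thermal mass is ignored.
    """
    volume_m3 = floor_area_sqft * height_ft * FT3_TO_M3
    heat_capacity_j_per_k = volume_m3 * AIR_DENSITY_KG_M3 * AIR_CP_J_PER_KG_K
    rate_k_per_s = heat_load_kw * 1000.0 / heat_capacity_j_per_k
    return rate_k_per_s * 1.8  # a change of 1 K equals a change of 1.8 °F


def seconds_to_rise(delta_f, rate_f_per_s):
    """Time for the average air temperature to rise by delta_f degrees."""
    return delta_f / rate_f_per_s


if __name__ == "__main__":
    # Room figures from the article: 315.25 kW, 1,800 sq ft, 10 ft high.
    rate = well_mixed_rise_rate_f_per_s(315.25, 1800, 10)
    print(f"average rise rate: {rate:.2f} °F/s")
    print(f"time to rise 30 °F: {seconds_to_rise(30, rate):.0f} s")
```

Because real equipment and structure absorb some heat, the true average rise is slower than this estimate, while local hot spots can heat faster.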
Figure 10 illustrates the rapid change in the temperatures in the room, where four images represent the conditions at 30 s intervals. The minimum temperature is 60°F and the maximum is 200°F. It is clear that in the case of a complete failure, very little time is available for back-up generators to start up and restore cooling to the data center, regardless of the initial set point.