It’s been six years since ASHRAE relaxed its operating environment recommendations for data centers. While the current guidelines afford significant energy savings opportunities, many operators are reluctant to take advantage of them, wary of system failures as temperatures rise. A 2012 survey at Gartner’s Annual Data Center Conference found that only 40% of respondents had facilities operating with setpoints above 72°F.
For many years, temperatures between 68°F and 72°F were considered optimal for data center operations, a practice that can be traced back to the 1950s. The Green Grid notes, “There is evidence that the operating range was initially selected on the suggestion that this choice would help keep punch cards from becoming unstable.” (The Green Grid, 2012) Data center visitors are no doubt familiar with the “ice box” conditions prevalent in many facilities.
In 2004, ASHRAE’s Technical Committee 9.9 published its first set of guidelines targeted exclusively at data center environments. Interestingly, the primary goal of that release was to establish a common set of environmental operating conditions for IT equipment manufacturers. The work was actively promoted by the committee and was soon recognized as guidance for data center operators as well. Indeed, the ASHRAE recommendations have been incorporated into the ANSI/TIA 942 and BICSI 002 standards.
In Europe, ETSI’s EN 300 019-1-3 provides similar guidance for data centers and other enclosed locations, whereas telecom central offices are addressed in the Telcordia NEBS documents GR-63-CORE and GR-3028-CORE, ANSI T1.304 (1997), and the European ETSI EN 300 series standards.
ASHRAE’s guidelines have been updated twice (in 2008 and 2012). Each update expanded some of the guidelines, commensurate with IT hardware developments and improved data center practices. The current version allows “near-full-time use of free cooling techniques in most of the world’s climates.” A comparison of the various guidelines is given in Figures 1 and 2.
For IT hardware operation, only the inlet air temperature matters, a factor explicitly noted in the recommendations. This temperature typically differs from both the supply and return air temperatures measured at the air-handling unit feeding the data center. These relationships are shown in Figure 3. Note that the heat gain across the equipment can range from 7°F to 60°F. Thus, sensor placement must be considered when establishing the facility setpoint.
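The arithmetic behind this distinction can be sketched in a few lines. The model below is a deliberate simplification with hypothetical numbers: it ignores recirculation and bypass air, and simply assumes the return air is warmer than the server inlets by the equipment heat gain, so a return-air-controlled setpoint implies a much colder inlet.

```python
def inlet_from_return(return_setpoint_f: float, server_delta_t_f: float) -> float:
    """Estimate server inlet temperature from a return-air setpoint.

    Simplified model: assumes all supply air passes through the IT
    equipment exactly once (no recirculation or bypass), so return
    air is warmer than inlet air by the server heat gain.
    """
    return return_setpoint_f - server_delta_t_f

# A 70°F return setpoint with a 20°F heat gain across the servers
# implies roughly 50°F at the equipment inlets -- well below the
# recommended envelope.
inlet_f = inlet_from_return(70.0, 20.0)
```

Under these assumptions, a facility that believes it is running "warm" at a 70°F setpoint may actually be super cooling its equipment inlets.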
A recent look at 17 facilities showed that although the majority had temperature setpoints of 70°F and higher, most were controlled based on return air temperature. After accounting for server heat gain and mixing within the facility, the inlet air temperatures in these facilities ran between 49°F and 60°F. The facilities were in fact operating below the recommended envelope, a condition that some studies show can hurt hardware reliability. Facilities operating this way are also less energy efficient, as examined below.
Referring once again to Figure 3, it is worthwhile to note that the temperature gradient between the top and bottom of the cabinet can be as high as 15°F to 25°F. Raised floor facilities without aisle containment tend toward higher temperature differences as air mixing and stratification increase. In such facilities, “super cooling” the supply air (below 68°F) can help ensure that the equipment at the higher levels doesn’t overheat. Of course, this approach means equipment located closer to the floor is over cooled, with the penalties noted above. Improved airflow management is generally a better approach than super cooling in such situations.
Saving energy and utility costs is one of many factors that data center operators must balance. A higher supply air temperature can improve efficiencies throughout the cooling infrastructure through these mechanisms:
• Cooling equipment coefficient of performance improves at higher temperatures.
• Higher operating temperatures increase cooling equipment capacities.
• Economizer effectiveness and availability are greater at higher setpoints.
• Raising space temperature setpoints enables variable-frequency drives to operate at reduced speed.
Obviously, the savings that can be achieved in a particular facility depend on its location, HVAC system architecture, and the specific equipment employed. Figure 4 illustrates the relative energy use of sample systems at one location across a range of supply temperatures.
While the greatest advantage is achieved with direct air systems, all of these architectures demonstrate greater infrastructure efficiency at higher supply air temperatures. In this scenario, a facility with a conventional chilled water central plant and no economizer could save 5% of its data center cooling costs by increasing the setpoint from 68°F to 78°F.
Just as facility infrastructure is affected by the operating setpoints, so too is IT hardware. Servers in particular, and to a lesser extent, storage and networking equipment, are designed to protect themselves in extreme operating environments. When a microprocessor’s case temperature exceeds 165°F over an extended period, high leakage currents and semiconductor breakdown can result. To protect the chip, manufacturers will throttle clock speeds at extreme temperatures. This affects processor performance. At more modest temperatures, the fans in the system’s power supplies will ramp up, increasing airflow across the hardware in an effort to cool the unit. While this maintains processing performance, the power consumption of the IT equipment goes up as the fans speed up. This increased energy use can offset the infrastructure efficiencies resulting from the higher setpoint.
A sample system curve is given in Figure 5. For most systems, power draw is roughly constant up to an inlet temperature of 77°F and can increase by 5% to 20% by 105°F for a given processor load.
By suitably modeling both the HVAC infrastructure and the IT loads, a sweet spot for operation can be determined.
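As a sketch of that modeling exercise, the Python below combines a made-up HVAC curve (power falling as the setpoint rises) with an IT fan-power penalty that ramps above 77°F, mirroring the sample curve described above, and scans for the minimum-total-power setpoint. All coefficients are illustrative assumptions, not measured data.

```python
def total_power_kw(setpoint_f: float) -> float:
    """Illustrative total (HVAC + IT) power vs. supply-air setpoint.

    Both curves are hypothetical. HVAC power falls as the setpoint
    rises (better chiller efficiency, more economizer hours); IT
    power is flat until server fans ramp up, modeled here above
    77°F with up to a 20% penalty by 105°F, per the sample curve.
    """
    it_base_kw = 100.0                           # IT load before fan ramp
    hvac_kw = 50.0 - 0.5 * (setpoint_f - 68.0)   # falls with setpoint
    fan_penalty_kw = 0.0
    if setpoint_f > 77.0:
        fan_penalty_kw = it_base_kw * 0.20 * (setpoint_f - 77.0) / (105.0 - 77.0)
    return it_base_kw + hvac_kw + fan_penalty_kw

# Scan candidate setpoints for the minimum-total-power "sweet spot".
sweet_spot_f = min(range(68, 106), key=total_power_kw)
```

With these particular coefficients the optimum falls right at the fan-ramp threshold; a real analysis would substitute measured HVAC and server curves for the facility in question.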
The foremost reason given for not raising ambient temperatures in data centers is the effect on IT reliability. One oft-quoted reliability prediction holds that each 18°F (10°C) rise in operating temperature reduces the lifespan of electronic components by 50%. This model suggests that IT systems should be kept as cool as possible to maximize reliability. It should be noted, however, that the Arrhenius model on which this prediction is based was developed in the 19th century to predict the rates of chemical reactions.
While the Arrhenius model continues to be applied by some to predict the reliability of semiconductor devices, it is perhaps the simplest of the 20+ methods presently used for reliability prediction in electronic products. Moreover, it applies only to the physics of semiconductor failures. Several large-scale studies have shown that hard drive and power supply problems are the leading causes of server failure. (Schroeder & Gibson, A large-scale study of failures in high-performance computing systems, 2006) (Schroeder & Gibson, Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?, 2007) Failure physics models like Arrhenius don’t apply to such malfunctions. Indeed, even more comprehensive models, like MIL-HDBK-217, have limited use because they address only component failure mechanisms while ignoring system factors.
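For concreteness, the Arrhenius acceleration factor behind the “18°F halves the lifespan” rule of thumb can be computed directly. The activation energy used below (0.6 eV) is an assumed, mechanism-dependent value, chosen only because it reproduces roughly a factor of two over a 10°C rise near typical operating temperatures.

```python
import math

K_BOLTZMANN_EV = 8.617e-5  # Boltzmann constant in eV/K

def arrhenius_af(t_use_c: float, t_hot_c: float, ea_ev: float = 0.6) -> float:
    """Arrhenius acceleration factor between two temperatures (deg C).

    AF = exp((Ea / k) * (1/T_use - 1/T_hot)), with T in kelvin.
    Ea is the activation energy of the failure mechanism; real
    mechanisms span roughly 0.3 eV to 1.2 eV, so 0.6 eV is an
    illustrative assumption, not a universal constant.
    """
    t_use_k = t_use_c + 273.15
    t_hot_k = t_hot_c + 273.15
    return math.exp((ea_ev / K_BOLTZMANN_EV) * (1.0 / t_use_k - 1.0 / t_hot_k))

# A 10°C (18°F) rise, e.g. 40°C -> 50°C, roughly doubles the
# predicted failure rate -- i.e., halves the predicted lifespan.
af_10c = arrhenius_af(40.0, 50.0)
```

The sensitivity to the assumed activation energy is one reason this single-mechanism model is a poor predictor of whole-server reliability.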
A number of empirical studies have looked at IT equipment failure as a function of operating temperature. (El-Sayed, Stefanovici, Amvrosiadis, Hwang, & Schroeder, 2012) (Sankar, Shaw, Vaid, & Gurumurthi, 2013) (Pinheiro, Weber, & André Barroso, 2007) Interestingly, these have produced some contradictory results. Looking at the studies in toto, these observations stand out:
• There is a strong correlation between manufacturer, model, and device failures.
• Utilization is not correlated with failure rates.
• Equipment failures increase linearly with operating temperature, at a rate slower than reliability models predict. The increase is small compared with the magnitude of existing failure rates. (One study actually found a negative correlation between these factors.)
• Temperature variability may be more important than average operating temperature.
Dell and Intel have performed their own tests of IT equipment in extreme climates and have seen “negligible differences in failure rates.” However, data center and IT operators are likely more interested in the operating environments manufacturers will actually stand behind. With their warranted environmental ranges, manufacturers are saying “my product can run in this environment throughout its life without operational impact.”
This isn’t to say that operators should blindly change temperature setpoints. Anyone contemplating increasing the setpoint of their data center will want to consider a number of additional factors, including:
• What legacy systems are present? Some older equipment has greater environmental limitations.
• To what extent have best practice airflow management strategies been implemented? As the temperature goes up, any hotspots within the data center will become that much hotter.
• What headroom does the facility have to accommodate peaks or equipment failures? In high density facilities, the rate of temperature rise must also be considered.
• How is temperature controlled? As noted above, the setpoint should be adjusted based on where measurements are taken.
• To what granularity can temperatures within the white space be seen? This visibility can help operators fine-tune facility performance.
Given the pressure most facilities face to cut costs, it makes sense to consider how environmental setpoints impact operations. Higher temperatures save energy. The inflection point of this relationship is higher than the operating setpoint of most facilities. Increased temperatures don’t affect IT operation significantly, as recognized by both manufacturers and industry groups. Armed with this information, perhaps more facility operators will be willing to explore higher supply temperatures.