Figure 1. As early as 1962, researchers at Bell Labs established a relationship between temperature and electronic component reliability (Dodson, 1961).
Since the early days of electronic data processing, operators of computer rooms and data centers have controlled the environmental conditions surrounding the computing equipment in order to improve system reliability and application availability. The American Society of Heating, Refrigerating and Air-Conditioning Engineers (ASHRAE) first published specifications addressing acceptable temperature and humidity ranges in 2004 and updated the specification in 2008. Most data center operators use the ASHRAE specs to define the environmental operating ranges for their facilities.
ASHRAE updated the specifications again in May 2011 to reflect industry movement toward energy efficiency. ASHRAE worked with information technology (IT) equipment manufacturers to develop updated temperature and humidity ranges. The 2011 version classifies computer facilities into six broad categories of environmental control and provides guidance about the balance between potential energy savings and computer equipment reliability. Specifically, ASHRAE extended the range of allowable temperatures and humidity in data centers to match operators' growing desire to take advantage of free cooling opportunities. Operators around the U.S. might be able to use free cooling an average of 18 percent more hours, with some locations able to use 1,600 more hours per year, if they are simply willing to run equipment anywhere in the class A1 “allowable” ranges of the specifications. If data center managers are willing to occasionally run their data center in the class A3 allowable range, virtually every location in the U.S. can achieve 100 percent free cooling.
The financial implication of this adjustment in operations is an average savings of $67,000 per year per 1,000 kilowatts (kW) of IT load, with no capital expenditure and an implementation time measured in days.
The new ASHRAE paper also includes information about the potential increase in temperature-related failures based on IT systems operating for a time at higher inlet temperatures, so that operators can quantify the potential impact of energy savings on their overall reliability. The calculations show that the impact of higher temperatures is hard to detect in a large population of machines, maybe as little as one additional failure per year in a population of 1,000 servers. In other words, there should be no measurable impact on availability while enabling millions of dollars in savings.
Table 1. A comparison of temperature/RH values in the 2004 and 2008 ASHRAE thermal guidelines.
When considering the impact of changes to the recommended and allowable temperature and humidity ranges inside modern data centers, one may ask the basic question, “Why do we condition data centers at all?”
The answer lies back at the beginning of the computer era when big mainframes used large amounts of power, and electronics in the boxes were more fragile. Early computer experts noticed that rooms needed to be cooled in order to keep their big, power-hungry mainframes from overheating. In the 1980s and 1990s, minicomputers and volume servers became more standard in computer rooms, offering better reliability and wider environmental operating ranges, but computer rooms remained cold with highly controlled humidity for fear of upsetting the delicate IT apple cart.
There is a basis for the sensitivity of IT operators to temperature. As early as 1962, researchers at Bell Labs established a relationship between temperature and electronic component reliability (Dodson, 1961). Building on work by chemist Svante Arrhenius, IT manufacturers began using the Arrhenius equation to predict the impact of temperature on the mean time between failures (MTBF) of the electronics. The higher the equipment’s operating temperature, the shorter the time it takes to break down microscopic electronic circuit elements and, ultimately, cause a computer equipment failure.
IT manufacturers use Monte Carlo and other modeling techniques to develop the predicted MTBF for each model of computer vs. the server inlet temperature. Monte Carlo models provide a range of probabilities for the MTBF value and typically have an uncertainty range of ±10 percent. Over the temperature range shown on the chart (see figure 1), the Arrhenius model shows that MTBF at 25ºC inlet temperature could be anywhere between 120,000 and 146,000 hours, or about 15 years of continuous operation. Running the server at 40ºC could change the MTBF to somewhere between 101,000 and 123,000 hours, or about 13 years of operation.
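The Arrhenius relationship can be sketched numerically. The snippet below is a minimal illustration, not the manufacturers' model. It treats server inlet temperature as the Arrhenius temperature (a simplification, since real models work with component temperatures), takes the midpoints of the two MTBF ranges quoted above as nominal values (an assumption), and back-calculates the effective activation energy those midpoints would imply. The function names are hypothetical.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def to_kelvin(celsius):
    return celsius + 273.15


def arrhenius_acceleration(ea_ev, t_low_c, t_high_c):
    """Ratio of the failure rate at t_high_c to the rate at t_low_c."""
    inv_diff = 1.0 / to_kelvin(t_low_c) - 1.0 / to_kelvin(t_high_c)
    return math.exp(ea_ev / BOLTZMANN_EV_PER_K * inv_diff)


def implied_activation_energy(mtbf_low_c, mtbf_high_c, t_low_c, t_high_c):
    """Effective activation energy (eV) implied by two MTBF values."""
    inv_diff = 1.0 / to_kelvin(t_low_c) - 1.0 / to_kelvin(t_high_c)
    return BOLTZMANN_EV_PER_K * math.log(mtbf_low_c / mtbf_high_c) / inv_diff


# Assumption: use the midpoints of the ranges quoted in the text.
mtbf_25c = (120_000 + 146_000) / 2  # hours at 25 C inlet
mtbf_40c = (101_000 + 123_000) / 2  # hours at 40 C inlet

ea = implied_activation_energy(mtbf_25c, mtbf_40c, 25.0, 40.0)
factor = arrhenius_acceleration(ea, 25.0, 40.0)
print(f"Implied effective activation energy: {ea:.3f} eV")
print(f"Failure-rate ratio, 40 C vs. 25 C: {factor:.3f}")
```

Under these assumptions the implied activation energy comes out at roughly 0.09 eV. That figure describes only how weakly the quoted MTBF midpoints depend on inlet temperature and should not be read as a property of any particular component.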
The fact that the ranges of predicted MTBF values at 25ºC and 40ºC overlap means there may be no impact at all on server reliability over this temperature range for this model of server. There have been studies with observational data that support this notion. E. Pinheiro, W. D. Weber, and L. A. Barroso’s (2007) paper, “Failure Trends in a Large Disk Drive Population,” which is a study of more than 100,000 disk drives, found no detectable relationship between operating temperature and disk drive reliability.
Figure 2. The psychrometric chart compares the allowable and recommended operating ranges from the 2004 and 2011 thermal guidelines.
Computer manufacturers have also been making systems more robust, and operating temperature ranges have changed in conjunction with these efforts. Fifteen years ago, systems like the Sun Microsystems Enterprise 10000 system required operating temperatures restricted from 70ºF to 74ºF. Today’s equipment is typically specified to operate in the temperature range from 41ºF to 95ºF (5ºC to 35ºC). Many manufacturers offer equipment that can operate at the even higher temperatures specified by the Network Equipment Building System (NEBS) standards of 41ºF to 104ºF (5ºC to 40ºC). Dell recently changed its operating specification to allow non-NEBS equipment to operate at 104ºF (40ºC) for 900 hours per year, and as high as 113ºF (45ºC) for up to 90 hours per year without impacting system warranty.
ASHRAE’S DATA CENTER TEMPERATURES
ASHRAE’s Technical Committee 9.9 (TC 9.9) developed the first edition of the book, Thermal Guidelines for Data Processing Environments, in 2004. This was a big step forward: with the help of engineers from computer and facility system manufacturers, ASHRAE developed a general guideline that all the manufacturers agreed on, so that data center operators could point to a single reference for their temperature and humidity set points. The 2004 specification resulted in many data centers being designed and operated at 68ºF to 72ºF air temperatures and 40 to 55 percent relative humidity (RH).
ASHRAE updated the specification in 2008 in response to a global movement to operate data centers more efficiently and save money on energy costs. Building a consensus among its members, ASHRAE’s updated guidelines widened the recommended envelope to encourage more hours of free cooling, decreased the strict humidity requirements, and allowed designers and operators to reduce the energy consumption of their facilities’ infrastructure. A comparison of the 2004 and 2008 versions shows the difference in the recommended operating ranges between the two specifications (see table 1).
Also in 2008, ASHRAE defined four classes of computer facilities to help designers and engineers talk about rooms in a common, shorthand fashion. Each class of computer room has a “recommended” and “allowable” range of temperatures and humidity in order to reduce the chance of environmental-related equipment failures. The recommended range for the 2008 specification is the same for all classes: 64ºF to 81ºF (18ºC to 27ºC) dry bulb, 59ºF (15ºC) dew point, and 60 percent RH. Allowable ranges extend the range of conditions a little more, creating opportunities for energy savings by allowing higher inlet temperatures and less strict humidity controls for at least a part of the operating year.
2011 UPDATE TO DATA CENTER TEMPERATURE RANGES
Between 2008 and 2011, attention to data center efficiency increased dramatically, and ASHRAE responded with another updated version of the data center guidelines in May 2011. The biggest changes were in the class A definitions and in the allowable temperatures and humidity for this class of data center spaces. Widespread use of airside economization, also known as free cooling, was one of the primary drivers for this update. The logic is simple: the wider the range of allowable temperatures inside the data center, the more hours that unconditioned outside air can be used to cool it, and the less energy required to make cool air for the computers to consume.
Figure 3. Since failure analysis is a statistical problem, it would be normal to expect the number of failures in the population to be between 0.88*X and 1.14*X.
The 2011 update changes the old numeric designations into alphanumeric ones. Class 1 becomes class A1, class 2 becomes class A2, class 3 becomes class B, and class 4 becomes class C. The new spec splits class A into four different sub-classes, A1 through A4, which represent various levels of environmental control, and thus different levels of capital investment and operating costs. The A1 and A2 classifications are the same as the old classes 1 and 2, but classes A3 and A4 are new, representing conditioned spaces with wider environmental control limits.
According to ASHRAE, the new A3 and A4 classes are meant to represent “information technology space or office or lab environments with some control of environmental parameters (dew point, temperature, and RH); types of products typically designed for this environment are volume servers, storage products, personal computers, and workstations.”
The new classes have the same recommended ranges of temperatures and humidity, but much wider ranges of allowable conditions. Wider ranges mean that data center operators can choose to run their data centers at higher temperatures, enabling higher efficiency in the cooling systems and more hours of free cooling for data centers with economizers as part of the design.
The differences are evident when the ranges are plotted on a psychrometric chart (see figure 2). On the chart, the recommended range is shown, as are the four class A allowable ranges. Class A3 allows inlet air temperatures as high as 40ºC (104ºF), and class A4 allows up to 45ºC (113ºF) for some period of operation.
Table 2. Impact of using the ASHRAE allowable range for class A1 spaces when compared with always keeping the space within the recommended range for class A1.
In this version of the spec, ASHRAE actually encourages operators to venture into allowable ranges when it is possible to enable energy and cost savings by doing so. The white paper released with the updated guidelines states, “it is acceptable to operate outside the recommended envelope for short periods of time (e.g., 10 percent of the year) without affecting the overall reliability and operation of the IT equipment.”
TEMPERATURE AND RELIABILITY: THE X FACTOR
But data center operators still hesitate to increase operating temperatures because they don’t know the impact on reliability, and the risk of unknown impact on reliability vs. the benefit of savings on energy costs is too great for most operators. So ASHRAE introduced the concept of the X-factor. The X-factor is meant to be a way to calculate the potential reliability impact of operating IT systems at different temperatures.
There are four aspects of X-factors that are critical to understand: relative failure rates, absolute failure rates, time-at-temperature impact, and hardware failures vs. all failures. These four elements are vital to applying the information in the ASHRAE guidelines to a specific data center operation, and to unlocking the potential advantages offered by the updated guidelines.
Figure 4. Don Atwood and John Miner (2008) showed failures between 2.45 and 4.46 percent in blade server populations, but did not break out temperature-related failures. “X” would be a maximum of 45 per 1,000 servers per year if all failures were thermally induced, a highly unlikely situation.
First, the X-factor is a relative failure rate normalized to operation at a constant inlet temperature of 68ºF (20ºC). That is, if a population of servers ran 7x24 with a constant inlet air temperature of 68ºF, one would expect the number of temperature-related failures to be some number, X. Since failure analysis is a statistical problem, the table in the paper predicts it would be normal to expect the number of failures in this population to be between 0.88*X and 1.14*X (see figure 3).
If the whole population operated 7x24 for a year at a constant 81ºF (27ºC), the table predicts that annual failures would increase to between 1.12*X and 1.54*X. The chart above shows how this might appear on a graph of X-factor vs. inlet temperature. Because the two X-factor ranges overlap slightly, there is a chance that failure rates at these two inlet temperatures do not differ at all.
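The overlap argument can be checked mechanically. The sketch below uses a hypothetical helper, `overlap`, to intersect two ranges, and applies it to the X-factor ranges quoted here and to the MTBF ranges quoted earlier for 25ºC and 40ºC. An overlap shows only that the quoted ranges cannot rule out equal failure rates. It does not show that the rates are equal.

```python
def overlap(range_a, range_b):
    """Return the overlapping (low, high) interval, or None if disjoint."""
    low = max(range_a[0], range_b[0])
    high = min(range_a[1], range_b[1])
    return (low, high) if low <= high else None


# X-factor ranges quoted in the text
x_at_68f = (0.88, 1.14)
x_at_81f = (1.12, 1.54)
print(overlap(x_at_68f, x_at_81f))  # (1.12, 1.14)

# MTBF ranges in hours quoted earlier in the text
mtbf_at_25c = (120_000, 146_000)
mtbf_at_40c = (101_000, 123_000)
print(overlap(mtbf_at_25c, mtbf_at_40c))  # (120000, 123000)
```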
The second important consideration is what exactly “X” is: the rate of temperature-related failures in IT equipment. Intel’s Don Atwood and John Miner (2008) showed failures between 2.45 and 4.46 percent in blade server populations, but did not break out temperature-related failures (see figure 4). “X” would be a maximum of 45 per 1,000 servers per year if all failures were thermally induced, a highly unlikely situation.
Los Alamos National Labs (LANL) did a study of failures in 4,750 supercomputer nodes over nine years, and categorized over 23,000 failure records. Overall, the study averaged 0.538 failures per machine per year from all causes. Hardware failures ranged from 10 to 60 percent of all failures, or between 0.054 and 0.32 failures per machine per year. In a population of 1,000 machines, this would mean “X” between 54 and 320 hardware failures per year.
Dell’s Shelby Santosh (2002) stated that a Dell PowerEdge 6540 system has an estimated MTBF of 45,753 hours. For a population of 1,000 servers, this would mean 191 hardware failures per year, right in the middle of the range determined by the LANL study.
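The conversion from MTBF to an annual failure count is simple arithmetic. The sketch below assumes a constant failure rate of 1/MTBF, continuous operation, and a population held at a fixed size; the function name is illustrative.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def expected_annual_failures(mtbf_hours, population):
    """Expected failures per year, assuming a constant failure rate of 1/MTBF."""
    return population * HOURS_PER_YEAR / mtbf_hours


failures = expected_annual_failures(45_753, 1_000)
print(round(failures))  # 191
```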
The point is that the failure data are highly variable and difficult to collect. The data are so variable that it is virtually impossible to measure the impact of raising the average inlet temperature on server reliability. There are no field data available that support a decrease in reliability with increasing inlet temperature. On the other hand, it is easy to demonstrate the savings that result from raising inlet air temperature, chilled water temperatures, and free cooling.
The third key consideration is time-at-temperature. ASHRAE also points out that in order to accurately calculate the impact of higher temperatures, the amount of time spent at each temperature must be calculated and summed for overall impact on server reliability. The example included the warning, “If the server ran 7x24 at 68ºF…” When using outside air economizers to cool the data center, it is likely that inlet air temperature will vary with outdoor air temperature. The net X-factor impact for various temperatures can be estimated by adding the proportional amounts of each factor. For example, if a server spent 100 hours at 68ºF inlet temperature (average X-factor 1.0), and 200 hours at 81ºF (average X-factor 1.34), the combined impact on average X-factor would be calculated by the following:
Combined X-factor = (100 hrs * 1.0 + 200 hrs * 1.34) ÷ (100 + 200) = 1.23
They also note that using outside air might cause servers to spend time with inlet temperatures lower than 68ºF, thus increasing reliability. If the server spent 100 hours at 68ºF and 200 hours at 59ºF (average X-factor 0.88), then the calculation would be:
Combined X-factor = (100 hrs * 1.0 + 200 hrs * 0.88) ÷ (100 + 200) = 0.92
Figure 5. IT equipment failures in this Ponemon Institute study accounted for only 5 percent of all unplanned outages.
In other words, the server population's relative failure rate would be expected to be lower than if the servers ran at a constant 68ºF. ASHRAE plots the X-factor impact for a number of cities, estimating that running data centers on outside air in eight of 11 cities examined should have no impact on overall reliability.
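The time-at-temperature calculation is a time-weighted average, which can be sketched as follows. The function name and the list-of-pairs input format are illustrative; the two cases reproduce the worked examples above. For a full-year profile, the hours would sum to 8,760.

```python
def combined_x_factor(periods):
    """Time-weighted average X-factor.

    periods: iterable of (hours, x_factor) pairs.
    """
    periods = list(periods)
    total_hours = sum(hours for hours, _ in periods)
    if total_hours <= 0:
        raise ValueError("total hours must be positive")
    weighted = sum(hours * x for hours, x in periods)
    return weighted / total_hours


warm = combined_x_factor([(100, 1.0), (200, 1.34)])
cool = combined_x_factor([(100, 1.0), (200, 0.88)])
print(f"{warm:.2f}")  # 1.23
print(f"{cool:.2f}")  # 0.92
```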
A final consideration when weighing the risk-benefit of economization and wider temperature ranges is the number of IT hardware failures in a data center vs. failures from all causes. Emerson Network Power published a Ponemon Institute study in 2011 that categorized outages into IT equipment failures, human error, cooling system failures, generators, etc. Emerson generated figure 5, which shows that IT equipment failures in this study accounted for only 5 percent of all unplanned outages. The LANL study cited the wide variation in failure modes in its own data and in the 19 studies referenced in its paper. In the studies referenced, and in LANL’s own data, hardware problems accounted for 10 to 60 percent of the failures, a huge variation that means the process of failures is largely unknown. It would be extremely difficult in this situation to detect a small change in the number of temperature-related failures in a data center.
FREE COOLING HOURS
On the savings side of the equation, the biggest impact on the operation of data centers based on the updated 2011 guidelines is an increase in the number of hours available to data center operators for use of economizers. Economizers provide cooling through use of outside air or evaporative water cooling in order to reduce or eliminate the need for mechanical chiller equipment. The number of hours available to use economizers varies with local weather conditions and the operating environment allowed inside the computer spaces.
The wider ASHRAE ranges mean that more hours are available for airside economization in most locations. In 2009, The Green Grid released free cooling tools for North America, Europe, and Japan that allow data center operators and designers to estimate the number of airside and waterside economizer hours per year that are possible for a given location. Using zip codes in North America and city names in Europe and Japan, the tool allows users to input the operating conditions inside their data centers, then estimates the number of hours per year when 10-year averages for temperature and humidity of outside air would allow economizer use.
Table 2 summarizes the impact of using the ASHRAE allowable range for class A1 spaces when compared with always keeping the space within the recommended range for class A1. This minor shift in operating policy, allowing system inlet temperatures to occasionally run as high as 32ºC (89.6ºF), enables between 100 and 1,600 more hours of airside economization per year, an average increase of 18 percent.
The most important result from this table is the monetary savings it demonstrates. The increase in airside economization results in savings of $999 to $40,000 per megawatt per year vs. using economizers only within the recommended ranges.
If data center operators are willing to let systems run in the class A3 Allowable range, virtually everywhere in the United States is able to run on 100 percent free cooling year round. Using The Green Grid Free Cooling Tool, 10 out of 16 cities showed 8,750 hours of free cooling available (99.9 percent), and all cities were above 8,415 hours per year (96 percent).
The ASHRAE expanded thermal guidelines white paper has a wealth of information, only some of which is covered here. The guidelines open the door to increased use of air and water economization, which will enable average savings of $20,000 to $67,000 per megawatt of IT load per year in the cooling and conditioning of data center spaces, all with virtually no capital investment if economizers are already in place. With this information in hand, data center operators who simply follow the conventional wisdom of keeping servers chilled like processed meat might not be able to hide behind the reliability argument any more. The expanded guidelines are clear about the potential impacts, and now the risks and benefits can be better understood.