Things Are Getting Hotter! A First Look at TC9.9
ASHRAE Technical Committee (TC) 9.9 recently released a white paper titled “2011 Thermal Guidelines for Data Processing Environments-Expanded Data Center Classes and Usage Guidelines” (available as a free download from www.tc99.ashraetcs.org). The key data in this white paper will be included in the third edition of ASHRAE’s Thermal Guidelines for Data Processing Environments. The white paper responds to increasing requests from the industry for ASHRAE to expand its temperature ranges for data centers.
Although the cooling system efficiency and PUE benefits are well known, operating data centers at higher temperatures can also bring IT equipment cost, acoustic, and compute impacts. As a result, ASHRAE focused on providing flexibility and options by increasing the number of data center classes to four. These are designated A1 to A4, with Class A4 having the broadest temperature range (41° to 113°F). The biggest tradeoff is IT equipment reliability vs. the ability of the facility to operate full time in economizer mode (compressor-less/chiller-less).
Other highlights of the newly published TC 9.9 white paper include the effect of expanded envelopes on IT equipment and vendor-neutral data on IT equipment reliability at varying temperatures, data that have never been published before.
Obtaining the endorsement of the major IT equipment manufacturers was critical to TC 9.9’s success in creating universally accepted thermal guidelines. Further, these vendor endorsements include legacy equipment. Since most data centers are a multi-vendor and multi-generation environment from an IT equipment perspective, it is important to publish environmental requirements that include all equipment in the data center. To develop these guidelines, the voting members of TC 9.9 (a cross section of industry representatives such as owners, consultants, and manufacturers) reviewed and approved a set of recommendations provided by a subcommittee of commercial IT equipment manufacturers.
The first two editions of the ASHRAE book Thermal Guidelines for Data Processing Environments established two data center environmental classes (Class 1 and 2). The first edition started with a recommended range for both data center classes of 68° to 77°F. The second edition expanded the recommended range (64.4° to 80.6°F). Both of these ranges applied to both new and legacy equipment and were intended to avoid adverse impacts on server cost, acoustics, form factor, performance, and other aspects.
The thermal guidelines book defined the measurement and monitoring points for auditing facility performance/health. Temperature ranges are measured at the inlet to the IT equipment (see figure 1). Having consistent measurement and monitoring points facilitates the ability to provide succinct guidelines.
The white paper creates two additional data center classes, bringing the total to four data center classes and six classes overall. The nomenclature has been changed to avoid confusion. The data center-specific classes are labeled A1 to A4, and the non-data center classes (previously called Class 3 and 4) are now called B and C. Classes A1 and A2 are the same as the original Class 1 and Class 2. The two new data center classes are called A3 and A4.
(For consistency, all tables referenced in this article utilize the same names as those in the white paper. Further, the tables have been modified for clarity within the context of this article and original footnotes have not been included. The reader is encouraged to read the entire white paper to obtain a full understanding.)
The white paper’s Table 3 (figure 2) compares the 2008 and 2011 classes. A key distinction between the classes is the level of environmental control, which varies from tightly controlled to some control. Another key distinction is the type of server (enterprise server vs. volume server). The equipment environmental specifications are shown in Table 4 (see figure 3).
The following are the definitions of the recommended and allowable ranges:
- Recommended. The purpose of the recommended envelope is to give guidance to data center operators on maintaining high reliability while also operating their data centers in the most energy-efficient manner. The recommended envelope is based on the IT OEMs’ expert knowledge of server power consumption, reliability, and performance vs. ambient temperature.
- Allowable. The allowable envelope is where the IT manufacturers test their equipment in order to verify that the equipment will function within those environmental boundaries.
In all data center classes (A1 to A4), the recommended range is unchanged from the 2008 range published in the second edition of Thermal Guidelines. The allowable range for A1 and A2 (previously Classes 1 and 2) is also unchanged. By contrast, the expanded allowable ranges in the two additional data center classes (A3 and A4) represent a big change.
Prior to this white paper, only a small percentage of data centers had taken advantage of the allowable range. This is probably due to an inadequate understanding of the consequences of operating in the allowable range for any given period of time. To help with that understanding, the white paper defines prolonged exposure with regard to the recommended and allowable ranges as follows:
- Prolonged Exposure. Prolonged exposure of operating equipment to conditions outside its recommended range, especially approaching the extremes of the allowable operating environment, can result in decreased equipment reliability and longevity. Occasional, short-term excursions into the allowable envelope may be acceptable.
The challenge has always been obtaining product (e.g., server) reliability data from the IT OEMs. Each manufacturer considers this information to be highly proprietary and valuable. Conditions and industry pressure facilitated the acquisition and publishing of this highly proprietary reliability data. Timing is everything.
The four data center classes have different tradeoffs that consider climate, business strategy/needs, etc. The white paper includes the following critical data so that informed decisions can be made about using the allowable range and which class (A1, A2, A3, or A4):
- A graph showing how the server power consumption increases at higher ambient temperatures
- A graph showing how the server airflow rate increases at higher ambient temperatures
- A table showing how the sound power level increases at higher inlet temperatures
- A table showing how the relative server failure rate for volume servers increases at higher inlet temperatures
Depending on the class, the allowable range upper temperature limit varies from 89.6° to 113°F. As the upper temperature limit increases, it impacts form factor (more space needed for heat transfer), acoustics, failure rate, and server fan power. The advantage of raising the temperature limit is the possibility of eliminating a significant amount of the cooling equipment (e.g., compressors/chillers) as well as the economizer switchover risk that happens when a facility is not on an economizer 100 percent of the time.
The chart in figure 4 gives a good sense of just how significantly the environmental classes have been expanded. The classes include multiple variables, and the chart essentially shows the magnitude of the envelope differences. A good way to appreciate the significance of the expanded ranges is to compare the areas on the chart for each class. There is a huge difference between the recommended range and Class A4.
Temperature is an important influence on reliability. Table C-1 (see figure 5) identifies the impact of temperature on volume server hardware failure rates in terms of a “Failure Rate X-Factor.” It provides values for the lower, average, and upper bounds.
There are numerous variables associated with reliability. For example, every location has a different climate profile and different air quality profile. Also, every business has a different profile, and every application has a different profile.
To provide some reliability data, TC 9.9 chose to use the X-Factor approach. This approach establishes a baseline failure rate of 1.0 for a data center running continuously at 68°F. An X-Factor below 1 means fewer failures than the baseline, and an X-Factor above 1 means more failures. The key is to focus on the X-Factor being a relative failure rate compared to the baseline. The way to interpret this table is as follows:
- Assume 1,000 servers for a particular data center operating environment have a failure rate of four servers across a one-year period if the operating environment ran continuously at 68°F.
- If the data center continuously operated at 59°F over the entire year, an average X-Factor of 0.72 applies. This would mean four normal failures x 0.72 equals approximately three server failures or a reduction of one server failure per year.
- Conversely, if the data center operated at 113°F over the entire year, an average X-Factor of 1.76 applies. This would mean four normal failures x 1.76 equals approximately seven server failures, or an increase of three server failures per year per 1,000 servers.
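The arithmetic in the three bullets above can be sketched in a few lines of Python. This is only an illustration: the function and variable names are hypothetical, and the only data used are the baseline of four failures per 1,000 servers and the average X-Factors quoted in this article (0.72 at 59°F, 1.00 at 68°F, and 1.76 at 113°F).

```python
# Baseline from the article: 4 failures per 1,000 servers per year
# when the data center runs continuously at 68°F.
BASELINE_FAILURES_PER_1000 = 4.0

# Average X-Factors quoted in the article (inlet temperature in °F -> X-Factor).
AVERAGE_X_FACTOR = {59: 0.72, 68: 1.00, 113: 1.76}


def expected_failures(baseline_failures, x_factor):
    """Scale a baseline annual failure count by a relative failure rate (X-Factor)."""
    return baseline_failures * x_factor


for temp_f, x_factor in AVERAGE_X_FACTOR.items():
    failures = expected_failures(BASELINE_FAILURES_PER_1000, x_factor)
    print(f"{temp_f}°F: X-Factor {x_factor:.2f} -> "
          f"about {round(failures)} failures per 1,000 servers per year")
```

Because the X-Factor is a relative rate, the same function applies to any baseline failure count a facility has actually measured; only the baseline input changes.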
Table 7 (figure 6) is specific to the city of Chicago and provides a good example of the impact on the relative hardware failure rate X-Factor (e.g., when operating full time in an economizer mode).
This table assumes that the economizer system can always deliver a minimum temperature of 59°F (through mixing, etc.). Therefore, the lowest temperature bin considered is 59° to 68°F. A simple proportional analysis, in which the percentage of hours in each bin for Chicago is multiplied by the average X-Factor for that bin and the results are summed, provides the composite net X-Factor for the year.
In this example, the composite net X-Factor for a typical year of climate data in Chicago is 0.99. In other words, there is virtually no difference in the relative hardware failure rate between operating full time in economizer mode and operating at a continuous 68°F with strict control. It is important to note that the values in Table C-1 are very valuable and can create significant opportunities, but this information must be carefully modeled and applied. Commitment to a particular system architecture should be deferred until the modeling has been successfully run and understood.
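The proportional analysis behind a composite net X-Factor amounts to an hours-weighted average, which might be sketched as follows. The bin fractions and per-bin X-Factors below are hypothetical placeholders chosen only to show the mechanics; they are not the Chicago values from Table 7 or the values from Table C-1, and the bin edges above 68°F are likewise assumed. Real values should be taken from the white paper and from climate data for the site.

```python
def composite_x_factor(bins):
    """Return the hours-weighted average X-Factor for a year.

    bins: list of (fraction_of_annual_hours, average_x_factor) pairs,
          one pair per inlet temperature bin. Fractions must sum to 1.
    """
    total_fraction = sum(fraction for fraction, _ in bins)
    if abs(total_fraction - 1.0) > 1e-6:
        raise ValueError(f"bin fractions sum to {total_fraction}, expected 1.0")
    return sum(fraction * x_factor for fraction, x_factor in bins)


# HYPOTHETICAL illustration data, NOT taken from Table 7 or Table C-1.
ILLUSTRATIVE_BINS = [
    (0.70, 0.85),  # 59° to 68°F (lowest bin, per the economizer assumption)
    (0.15, 1.10),  # assumed next bin
    (0.10, 1.30),  # assumed next bin
    (0.05, 1.50),  # assumed warmest bin
]

net_x_factor = composite_x_factor(ILLUSTRATIVE_BINS)
print(f"Composite net X-Factor: {net_x_factor:.2f}")
```

A result below 1.0 would indicate fewer expected failures than continuous operation at 68°F, and a result above 1.0 would indicate more. The check on the fractions is there to catch bin data that does not cover the full year.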
The ASHRAE TC 9.9 response to the industry’s call for more energy-efficient data center operation has resulted in the groundbreaking release of highly proprietary information directly from the major IT OEMs on the impact of higher operating temperatures on IT equipment. A key component of this information is the disclosure of the relative IT equipment failure rates at different inlet temperatures and the methodology for assessing operation in a full-time economizer mode.
Aligned with the overall ASHRAE TC 9.9 mantra, this vendor-neutral information is offered as education and guidance to empower data center operation stakeholders to make informed decisions.