Stranding And Recovering Capacity (And Assets) In The Data Center
Make sure that your data center has enough HVAC capacity.
The “rated” capacity of a critical facility assumes that its physical size, available power, and cooling capacity are matched. Otherwise, the actual capacity of the site is limited by the most restrictive parameter (space, power, or cooling). This balancing act is an inherent challenge that every architectural and engineering (A&E) firm faces when it designs and engineers a new facility or modifies an existing one.
One reason it is so important to match space, power, and cooling capacities is the need to avoid what has come to be called “stranded” capacity. Stranded capacity is installed capacity that cannot be used to support critical load.
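As a rough illustration of that definition, the sketch below expresses space, power, and cooling as the IT load each could support and treats the smallest of the three as the usable capacity. The room figures and the names (Resource, usable_capacity, stranded_capacity) are hypothetical and are not taken from any real site.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    installed_kw: float  # installed capacity, expressed as supportable IT load in kW


def usable_capacity(resources):
    """Usable capacity is set by the most restrictive resource."""
    return min(r.installed_kw for r in resources)


def stranded_capacity(resources):
    """Installed capacity above the usable limit, per resource."""
    limit = usable_capacity(resources)
    return {r.name: r.installed_kw - limit for r in resources}


# Hypothetical room: all three parameters converted to equivalent IT load.
room = [
    Resource("space", 1000.0),
    Resource("power", 800.0),
    Resource("cooling", 600.0),
]

print(usable_capacity(room))    # 600.0
print(stranded_capacity(room))  # {'space': 400.0, 'power': 200.0, 'cooling': 0.0}
```

In this made-up room, cooling is the limiting parameter, so 200 kW of power and 400 kW worth of floor space are stranded until cooling is added.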
In a typical office building this isn’t so difficult. It is fairly straightforward to define (or “program”) how much space is needed, calculate the expected loads (people, “plug loads,” ventilation requirements, solar loads, etc.), and determine the required power and cooling demands. The loads in a typical office building are well understood, do not change significantly over the life of the building, and can be calculated fairly accurately. The resulting power demands are typically in the 3 W/sq ft range. The HVAC calculations are based on these static internal loads plus the external loads associated with solar and outdoor weather conditions, which, even though they change seasonally, are still easily predicted based on over a hundred years of weather data. Special-case areas such as auditoriums, kitchens, and retail spaces need to be addressed, but even the loads for these spaces are predictable.
Data centers are, by their very nature and mission, not so predictable. A data center has the same loads as a typical office (lighting, people, outdoor loads, etc.), but these are a small percentage of its overall load. By far the largest loads in data centers are the “critical loads” associated with the computer rooms and supporting infrastructure, and these critical loads are expected to change (AKA “be refreshed”) pretty much continuously over the life of the facility. Furthermore, these are not like-for-like changes.
The IT equipment associated with the critical loads continues to evolve toward more compact form factors, higher power demands, and higher cooling loads. Even the general IT topology is subject to large-scale change as sites transition between servers and mainframes, tape storage and electronic storage, centralized and distributed network architectures, and possibly, in the not-too-distant future, liquid-cooled (and immersion) electronics. Couple this with the desire to design “modular” (not to be confused with containerized) data centers that can be expanded over time, and the predictions get even more difficult.
Inevitably the balance of space, power, and cooling becomes mismatched and deviates from the original design. With careful planning this may not occur at the macro scale for total available power and cooling, but it is more likely to occur at a local scale, where available power to a room or space does not match the respective cooling capacity. A good real-world example is a computer room I recently visited, where over time the critical loads had been deployed such that half the room required all available HVAC units (four out of four) to run continuously, but the loads in the other half required only one of its four HVAC units. From a power and, to a large extent, space perspective, the room was at around 75% of rated capacity. From an HVAC standpoint, the heavily loaded half was lacking cooling capacity and was operating at “N” redundancy, in that all available units were required to operate to meet the local demand. The other half of the room had two stranded computer room air handlers (CRAHs) based on a client requirement for N+1 units.
The site’s load reports to management were based on overall room capacities and showed the room at 75% capacity, with space, power, and cooling available to accommodate additional IT equipment. The site was rightfully hesitant to add loads to the space as-is, even in the lightly loaded area, to avoid the risk of worsening the situation. This hesitancy also stranded power capacity for the overall computer room.
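The gap between the room-level report and the per-zone picture in this example can be sketched with a few lines of arithmetic. The unit counts come from the example (four CRAHs per half, N+1 required). The helper name spare_units and the way the room-level figure is aggregated are assumptions made for illustration, not the site's actual reporting method.

```python
def spare_units(installed, required, redundant=1):
    """Units left over after meeting the load plus the redundancy requirement.

    A negative result means the redundancy requirement is not met.
    """
    return installed - (required + redundant)


# Figures from the computer room example: four CRAHs per half, N+1 required.
zones = {
    "heavy half": {"installed": 4, "required": 4},
    "light half": {"installed": 4, "required": 1},
}

per_zone = {
    name: spare_units(z["installed"], z["required"]) for name, z in zones.items()
}

# Room-level view: add the two halves together (an assumed way of building
# an aggregate report, for illustration only).
room_installed = sum(z["installed"] for z in zones.values())
room_required = sum(z["required"] for z in zones.values())
room_level = spare_units(room_installed, room_required)

print(per_zone)    # {'heavy half': -1, 'light half': 2}
print(room_level)  # 2
```

The aggregate view shows two spare units and no problem, while the per-zone view shows the heavy half one unit short of N+1 and the light half holding two stranded units.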
The solution in this case is obvious: relocate critical loads within the room to distribute the heat better, “recover” a redundant unit in the heavily loaded space, and pick up the relocated load with a stranded unit. Another possible strategy, one that avoids the need to relocate IT equipment, would be to add localized cooling to the heavily loaded space, such as in-row coolers, overhead coolers, rear-door heat exchangers, or self-contained racks. However, these HVAC devices are typically fed from UPS power, so the balancing act gets even more complicated, and this approach does not address the overcapacity in the low-density space.
Another example was demonstrated by an IT manufacturer’s advertisement in which a room with racks full of servers is refreshed with one large mainframe, resulting in a large, mostly empty room with a mainframe sitting in the middle. Basically, increased densities can result in stranded IT space. Even if the mainframe can do the same computing work as the room full of servers, it still requires similar power and cooling. And even if it uses only 50% of the power and cooling, if it takes up only 10% of the space and two of them are installed, there is still a mismatch between power/cooling and space.
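The arithmetic behind that last scenario is simple enough to write out. The percentages are the ones given in the example and are relative to the original room full of servers. The variable names are made up for this sketch.

```python
# Percentages are relative to the original room full of servers.
POWER_PER_MAINFRAME_PCT = 50  # share of the room's power and cooling per mainframe
SPACE_PER_MAINFRAME_PCT = 10  # share of the room's floor space per mainframe
mainframes = 2

power_used_pct = POWER_PER_MAINFRAME_PCT * mainframes
space_used_pct = SPACE_PER_MAINFRAME_PCT * mainframes
stranded_space_pct = 100 - space_used_pct

print(power_used_pct, space_used_pct, stranded_space_pct)  # 100 20 80
```

Two mainframes consume all of the room's power and cooling while occupying 20% of the floor, leaving 80% of the space stranded.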
The root cause of stranded assets is usually operational. The reasons are many and varied, and usually stem from a myriad of conflicting and competing influences on how space, power, and cooling resources are allocated or managed. Available power can be stranded by an inability to get cooling where it is needed, or when power is allocated to racks and local areas where IT loads fail to materialize.
Power and cooling units get stranded when predicted IT power densities never materialize. Many sites that were designed for 100 W/sq ft or more continue to operate at 50 W/sq ft or less. A representative from the Edison Electric Institute has attended the ASHRAE TC9.9 committee meetings for years, pointing out that utilities continue to install electric substations, transformers, feeders, etc. for data centers based on “rated” capacity that rarely gets used. Couple this with the frequent need for redundant utilities (“2N” requirements) and the stranded assets installed by utilities become significant.
Just as the root cause of the problem is operational, the solution for recovering these stranded assets is also based on operational strategies and decisions. Accurate monitoring and management of critical loads and the associated space, power, and cooling capacities can help maintain the critical balance. This is a valuable potential use of new data center infrastructure management (DCIM) monitoring systems. Following operational “best practices” such as sealing raised floors, using blanking panels, and removing unused cabling and other underfloor obstructions can minimize the imbalance between “usable” power and cooling capacities.
I have seen some sites (especially in Europe and Asia) deploy a variation of the Tier III strategy that uses the dual-power capability of static transfer switches (STS) and feeds only one source from UPS power. The other source is fed from emergency (utility backed up by generator) power. All of the STSs are configured such that the UPS source is the “primary” or “normal” feed and the emergency power source is the secondary feed. Such a configuration can support the same critical load with half the UPS modules, with the remaining UPS system operated at higher load (vs. when the IT loads are equally distributed across redundant UPS systems).
This simultaneously eliminates stranded UPS capacity and increases the energy efficiency of the site. In the rare case of a UPS system outage, the UPS may go to internal bypass or the STS devices may transfer the load to utility power. Regardless, the critical load ends up on utility power, and the STSs provide a means to transfer the critical load over to generator power for added reliability while the UPS system is repaired. The key to using this strategy successfully is to ensure the topology retains the concurrent maintainability required of high-reliability sites. Not only does this strategy reduce the amount of stranded power capacity and assets, it also allows the remaining UPS system and infrastructure to operate more efficiently, since it now operates at higher loads, where most electrical devices (transformers, UPS modules, power supplies, etc.) are most efficient.
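To see why the remaining UPS system runs at a higher load, it helps to compare the load fraction when two redundant systems share the critical load against the case where a single system carries all of it. The 400 kW load and 500 kW rating below are hypothetical, as is the function name. The sketch makes no claim about the efficiency of any particular UPS at a given load.

```python
def ups_load_fraction(critical_load_kw, ups_rating_kw, systems_sharing_load):
    """Fraction of its rating that each UPS system carries."""
    if systems_sharing_load < 1:
        raise ValueError("at least one UPS system must carry the load")
    return critical_load_kw / systems_sharing_load / ups_rating_kw


# Hypothetical figures: 400 kW of critical load, UPS systems rated 500 kW each.
shared = ups_load_fraction(400.0, 500.0, systems_sharing_load=2)  # load split across two systems
single = ups_load_fraction(400.0, 500.0, systems_sharing_load=1)  # STS primary feed on one UPS

print(shared, single)  # 0.4 0.8
```

With the load split evenly, each system in this example sits at 40% of its rating. With the STSs favoring a single UPS source, that system runs at 80% and the second system's capacity is no longer needed.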
A very interesting solution that is just beginning to be employed is for enterprises to reduce the infrastructure redundancy requirements of critical sites. As enterprises embrace “virtual” redundancy schemes such as virtualization, mirrored processing, and redundant data storage in physically separated spaces, the need for highly redundant infrastructure (such as 2N and “2(N+1)” topologies) can be relaxed to lesser redundancies such as N+1 or N+2 topologies. Again, the key is to retain concurrent maintainability. The trade-off is a less expensive facility with the same load capacity, higher efficiency, and less stranded capacity, but one that is not fault tolerant during some maintenance activities (from an infrastructure standpoint). This results in far fewer unused or partially used capacities (and assets), and in existing sites it can allow existing assets to be recovered to support additional critical load, redeployed to other sites, or even resold on the “gray” market.
Many if not most new data centers are engineered and designed such that capacity can be added as needed. If a computer room has power available but needs more cooling, provisions are in place to “plug-and-play” additional HVAC units, and vice versa. If IT power (and heat) densities do not materialize, it is possible to add computer room space to avoid stranding available power and cooling. In other words, space, power, and cooling can be deployed incrementally so as to maintain the balance.
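A back-of-the-envelope sketch shows how much floor space would be needed to absorb power stranded by low densities. It uses the 100 W/sq ft design and 50 W/sq ft operating densities mentioned earlier. The 10,000 sq ft room size and the function name are assumptions chosen only to make the numbers concrete.

```python
def area_to_absorb_power(installed_kw, actual_density_w_per_sqft):
    """Floor area (sq ft) needed to use all installed power at the actual density."""
    return installed_kw * 1000.0 / actual_density_w_per_sqft


existing_area_sqft = 10_000.0  # hypothetical room size
design_density = 100.0         # W/sq ft, design figure cited in the article
actual_density = 50.0          # W/sq ft, operating figure cited in the article

installed_kw = existing_area_sqft * design_density / 1000.0
in_use_kw = existing_area_sqft * actual_density / 1000.0
stranded_kw = installed_kw - in_use_kw

extra_area_sqft = (
    area_to_absorb_power(installed_kw, actual_density) - existing_area_sqft
)

print(stranded_kw, extra_area_sqft)  # 500.0 10000.0
```

Under these assumptions, half of the installed power is stranded, and doubling the computer room space at the actual density would put it back to use.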
These are just a few examples that demonstrate how even the best-designed sites can encounter stranded capacities as they evolve. There are many causes for stranded capacities and as many solutions. Most sites, even the best engineered and designed, will eventually encounter situations where the balance between space, power, and cooling gets mismatched. The degree and severity of the imbalance can be minimized by complying with industry best practices, maintaining coordination and communication between the IT and facilities management departments, and adhering to the original A&E’s operational strategies for deployment of IT equipment and use of the critical space, power, and cooling resources. Resolving these issues and recovering stranded capacities and assets can be complex and difficult. Picking the best solution and minimizing the risk to critical operations during implementation may require specialized skills and expertise available from select A&Es and facility management consulting firms.
The author would like to acknowledge Ethan Thomason, vice president at Primary Integration, for his help with the article.