Who Moved My UPS?
Is LES power back-up better?
When it comes to the basic elements of data center critical infrastructure and especially the power chain, I tend toward the conservative approach. This generally follows the traditional approach of increased levels of redundancy (N, N+1, 2N, and 2[N+1]). Most enterprise and colocation data centers use this to one degree or another as the design basis of the electrical and mechanical system redundancies within the data center, or to follow or meet an organization-based certification. However, I will not belabor the details of how Uptime Institute’s Tier I to IV compare with TIA-942 Level 1-4 or any other guidelines or recommended practices.
Notwithstanding the above, I am also open to new ideas and technologies (which seem to occur far more frequently in data centers lately). For example, if you have been reading my column you know I believe in and promote “free cooling” and improving energy efficiency wherever possible, but not at the expense of availability (or at the very least, not without understanding the potential exposure to downtime).
When it comes to the power chain, the UPS has greatly increased its energy efficiency in the past few years, to the point that major manufacturers are in the mid-to-upper 90% range (even while in the traditional on-line double-conversion mode). Moreover, many UPS units offer an “eco” mode, which effectively bypasses the double-conversion path to deliver 99% energy efficiency. This assumes the utility power is clean and stable; the UPS would ostensibly switch back to the double-conversion mode as soon as it detects any power issues (typically within 4 to 8 milliseconds). While 8 milliseconds is within the “hold up” range of most modern IT power supplies, some data center and IT managers may not be comfortable with running their UPS in the eco mode just to gain a few extra efficiency points. Nonetheless, it is an option that can be activated or deactivated at any time from the UPS front panel.
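To put “a few extra efficiency points” in perspective, the short Python sketch below estimates UPS losses in each mode. It is only an illustration: the 96% double-conversion efficiency (standing in for “mid-to-upper 90s”), the 1,000 kW load, and the assumption that the UPS stays in eco mode all year are placeholders chosen for the example, not manufacturer figures.

```python
HOURS_PER_YEAR = 8760


def ups_loss_kw(it_load_kw: float, efficiency: float) -> float:
    """Power dissipated in the UPS while it delivers it_load_kw to the IT load."""
    if not 0.0 < efficiency <= 1.0:
        raise ValueError("efficiency must be in the range (0, 1]")
    # efficiency = output / input, so input = output / efficiency
    return it_load_kw / efficiency - it_load_kw


def annual_savings_kwh(it_load_kw: float, eff_double: float, eff_eco: float) -> float:
    """Energy saved per year by running in eco mode instead of double conversion."""
    saved_kw = ups_loss_kw(it_load_kw, eff_double) - ups_loss_kw(it_load_kw, eff_eco)
    return saved_kw * HOURS_PER_YEAR


# Assumed values for illustration only: 1,000 kW load, 96% vs. 99% efficiency.
print(f"double conversion loss: {ups_loss_kw(1000, 0.96):.1f} kW")
print(f"eco mode loss:          {ups_loss_kw(1000, 0.99):.1f} kW")
print(f"annual savings:         {annual_savings_kwh(1000, 0.96, 0.99):,.0f} kWh")
```

Under these assumed numbers the eco mode trims roughly 30 kW of continuous loss, which is the figure a manager would weigh against the exposure during the transfer back to double conversion.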
So what about my “LES is more” sub-title? More recently there has been some increased interest in “local energy storage” (LES) for IT power in data centers for a variety of reasons, but it is mostly driven by trying to reduce costs and improve efficiency.
While not considered mainstream for enterprise and colocation data centers, this is not a new concept; it has been widely utilized by the hyper-scalers, starting with Google almost a decade ago. They simply attached commodity low-cost 12V valve regulated lead acid (VRLA) “gel-cells” directly to their custom barebones servers. This allowed Google to eliminate the cost of a centralized UPS, and to remove it as a potential single point of failure (SPOF) for the hundreds of thousands of front-end web servers serving up Google searches (since their barebones servers only had a single power supply).
Because of their fault-tolerant traffic and load-directing software, the server (and the on-board battery) was considered “expendable” and not an SPOF in their IT scheme. In point of fact, all the major hyper-scale operators know that when you have 100,000 servers (or any other devices) there is a statistically expected number of failures that will occur every day, even under the best operating conditions.
In effect, their overall “availability” scheme is not heavily dependent on the electrical and mechanical redundancy of the data center. It relies on software to deal with any loss of computing resources for any reason, by being able to divert user requests and network traffic away from a failed server, or group of servers, to other available resources within the same facility. Their IT architecture can even lose an entire data center and divert traffic to other data centers.
Other giants such as Facebook tried a different LES concept via a localized clustered back-up power approach for their first version of the Open Compute Project (OCP). This took the form of a 6-rack pod, tied to a single rack with 48 VDC batteries as a back-up power source (not a UPS). This fed a secondary 48 VDC back-up input on the custom-built server power supplies, which were normally powered at 277 Volts AC (a single phase of a 3-phase 480 V feed). However, the back-up time was only 45 seconds, by which time they expected the back-up generator to start and pick up the load. This also followed the Google approach in that they could afford to lose a server, a 6-rack pod, or an entire cluster of 100 or more pods, tied to only one of a dozen non-interconnected generators (effectively an “N” power chain). In their scheme, even if one of the generators did not start, only a percentage of data center web server capacity was lost, not the entire site (core systems such as databases and network equipment are backed up by traditional UPS systems).
Microsoft developed an LES design for cloud services, which they contributed to OCP last year. It is based on a hot-swap server power supply similar in size to a typical standard OEM unit, but coupled with small, low-cost commodity lithium ion (Li) batteries to provide short-duration onboard back-up energy. While there are other aspects of this power system (such as 380 Volts DC power distribution) that are also part of their design, they are not directly relevant to where the energy storage is located.
Of course, this design does not translate well to most enterprise customers (and the colocation facilities they utilize). Those data centers are typically designed with multiple UPS units with multiple strings of data center grade VRLA batteries rated for a back-up time of 15 minutes (or more in redundant configurations). One of the issues with VRLA batteries is that their operational life is severely impacted as temperatures rise (the general rule is that life is reduced by 50% for every 18°F above 77°F).
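The rule of thumb above is easy to turn into a quick estimate. The Python sketch below assumes the halving applies smoothly between 18°F steps (an interpolation added for this example; the rule itself only states the 50% per 18°F figure), and the five-year rated life in the usage lines is a hypothetical value, not a vendor specification.

```python
def vrla_life_fraction(temp_f: float,
                       rated_temp_f: float = 77.0,
                       halving_step_f: float = 18.0) -> float:
    """Fraction of rated VRLA service life expected at a sustained temperature.

    Applies the rule of thumb: life halves for every 18°F above 77°F.
    Temperatures at or below the rated point are treated as full life.
    """
    excess_f = max(0.0, temp_f - rated_temp_f)
    return 0.5 ** (excess_f / halving_step_f)


# Hypothetical battery with a 5-year rated life at 77°F.
rated_life_years = 5.0
for temp_f in (77, 86, 95, 113):
    years = rated_life_years * vrla_life_fraction(temp_f)
    print(f"{temp_f}°F: about {years:.1f} years")
```

This is only a planning approximation; actual service life also depends on cycling, float voltage, and manufacturing quality.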
In most large facilities the batteries are kept in a separate battery room with a dedicated HVAC system to ensure operating life and battery capacity. In smaller data centers and server rooms the batteries are commonly located next to the UPS, which may be in the IT equipment area, and exposure to higher temperatures is a frequent cause of shorter battery service life or even early failure during a power outage. This is particularly true of rack-mounted UPS, where the batteries are constantly exposed to high temperatures. This is an ongoing concern from the standpoint of both total cost of ownership (TCO) and failure during a power event.
One of the advantages of Li batteries is the ability to operate at higher operating temperatures, as well as having much higher power and energy densities per size and weight, compared to lead-acid cells (Pb). Some UPS manufacturers are now offering Li-based rack-mounted UPS in the 5 to 6 kW range, with larger sizes to follow. However, one of the drawbacks of Li is its much higher cost compared to Pb. Even with only five minutes of runtime at full load, the Li UPS is over twice the price. While longer overall UPS service life is claimed, presumably lowering the overall TCO, the upfront cost makes it a hard sell for most common rack-mount applications.
We are now at the stage where the one megawatt UPS is almost a commodity-priced item and colocation providers are buying mostly based on price, much to the chagrin of the major UPS manufacturers. As a result, the cost of 15 minutes of batteries can exceed the cost of the UPS itself. It seems these LES designs and some rack-size Li UPS products are beginning to concern the traditional data center lead-acid battery suppliers. I just saw a full-page ad from a well-known battery vendor titled, “Rethink the UPS.” The focus was on reducing costs by reducing battery back-up times to “30 seconds to 5 minutes,” using their brand of lead-acid batteries for the UPS to lower the size and cost of the battery. Seriously, only 30 seconds of battery runtime when new?
Most U.S. mainstream enterprise customers are still looking for the traditional 15 to 30 minutes of battery back-up. Some data centers do have only 30 to 60 seconds of flywheel-based energy storage, which has a proven track record, but unlike a battery’s, a flywheel’s runtime does not change as it ages. The idea of reducing the runtime is to lower the size and cost of the battery, which, in the current data center ecosphere that is highly focused on TCO, helps keep lead-acid batteries “cost competitive.” To be fair, the company also claims to have improved their lead-acid battery technology to tolerate a somewhat higher operating temperature to reduce the cooling requirements.
However, while watt-for-watt Li batteries are more expensive than conventional data center grade VRLA batteries, they do offer space and weight savings and purport to offer substantially greater service life, although the last item is unproven as yet in the data center. If Li costs drop low enough, and they prove to meet the promise of longer life and reliability, they may begin to gain momentum as an option in the data center centralized UPS sector. There is also the issue of recycling. Lead batteries are one of the most recycled battery types, and there is a well-developed logistics and recycling infrastructure to handle data center battery replacement.
I have also seen some discussions that cite designing a data center using LES solutions to reduce the large failure domain that impacts availability for the entire IT load fed from a centralized back-up system such as a UPS. While this is true, it also increases the need to monitor and maintain a large number of distributed batteries. Small rack-based UPS up to 20 kVA have long been commonly available and used in many server rooms and wiring closets, and battery failure during an outage is a fairly common issue with them. Besides the typical lack of individual monitoring for rack-mounted UPS, the VRLA batteries they contain are often operating at 80°F to 90°F or even higher, significantly reducing the battery life to three years or even less.
The result is that battery replacement would become an expensive process with high ongoing purchase and labor costs if rack-based UPS were to be deployed in a large data center. While Li-based rack UPS propose to solve this issue with the ability to operate at higher temperatures and a projected lifetime of seven to 10 years, they substantially increase the total upfront cost of the UPS. Today, a typical 10,000-sq-ft colocation data center hall designed to support one megawatt of critical load and containing 200 racks represents an average power density of 100 watts per sq ft and 5 kW per cabinet.
However, while the average power density may be 5 kW per cabinet, some racks may only draw 2 kW while others may require 8 to 10 kW or more. With a centralized UPS design, it is relatively easy to deploy more power to some racks at 8 to 10 kW (assuming there is sufficient cooling airflow available).
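The hall averages quoted above follow from simple division, sketched here in Python. The function name and parameters are illustrative only; the inputs are the one megawatt, 10,000 sq ft, and 200 rack figures from the text.

```python
def average_density(critical_load_kw: float,
                    floor_area_sqft: float,
                    rack_count: int) -> tuple[float, float]:
    """Return (watts per square foot, kW per rack) for a data center hall."""
    watts_per_sqft = critical_load_kw * 1000 / floor_area_sqft
    kw_per_rack = critical_load_kw / rack_count
    return watts_per_sqft, kw_per_rack


# One megawatt of critical load, 10,000 sq ft, 200 racks.
w_per_sqft, kw_per_rack = average_density(1000, 10_000, 200)
print(f"{w_per_sqft:.0f} W per sq ft, {kw_per_rack:.0f} kW per rack")
```

These are averages only; they say nothing about how unevenly the load is spread from rack to rack.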
A centralized single or redundant UPS scheme would require one (or two) 1,000 kW systems with 15 (or 30) minutes of battery runtime. Conversely, if 5 kW rack-based UPS units were used for 200 cabinets, the initial cost would be three to four times higher. Moreover, while some of the racks may have lower loads (which would become stranded UPS capacity), the remainder of the racks would still be limited to the size of the UPS in the rack (e.g., 5 kW). Even if you consider using the generally homogeneous environment of the OCP designs, utilizing an LES battery to back up a 50 kW cluster of OCP server racks, it still faces the stranded capacity issue, especially when used to support the widely varying per-rack loads found in the heterogeneous device mixture commonly deployed in the enterprise data center.
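The stranded capacity argument can be illustrated with a small Python sketch. The rack mix below is entirely hypothetical (made up so that 200 racks average close to 5 kW); only the 5 kW rack UPS size and the 1,000 kW centralized system come from the discussion above.

```python
def rack_ups_fit(rack_demands_kw, ups_kw=5.0):
    """Return (stranded_kw, unserved_kw) when every rack has its own UPS of ups_kw.

    stranded_kw: UPS capacity that lightly loaded racks cannot lend to others.
    unserved_kw: demand above the per-rack UPS rating that cannot be supported.
    """
    stranded_kw = sum(max(0.0, ups_kw - demand) for demand in rack_demands_kw)
    unserved_kw = sum(max(0.0, demand - ups_kw) for demand in rack_demands_kw)
    return stranded_kw, unserved_kw


# Hypothetical mix of 200 racks: 80 at 2 kW, 60 at 5 kW, 40 at 8 kW, 20 at 10 kW.
demands_kw = [2] * 80 + [5] * 60 + [8] * 40 + [10] * 20

total_kw = sum(demands_kw)
stranded_kw, unserved_kw = rack_ups_fit(demands_kw)

print(f"total demand: {total_kw} kW (centralized UPS rating: 1000 kW)")
print(f"stranded rack UPS capacity: {stranded_kw:.0f} kW")
print(f"demand the 5 kW rack UPS units cannot serve: {unserved_kw:.0f} kW")
```

With this made-up mix the total demand is 980 kW, which a 1,000 kW centralized UPS could carry, while the 5 kW rack units leave 240 kW stranded and 220 kW of demand unserved.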
The Microsoft LES power supply with onboard Li battery design may offer a more interesting alternative to the centralized UPS. The hot-swap power supply form factor is similar to that of most major OEM servers and other IT equipment. It does not face the stranded capacity limitation of rack-based LES. In addition, since the power supply contains the battery, it eliminates the UPS (and its battery) as an SPOF. From a cost perspective, it also eliminates the additional space and electrical panels for the centralized UPS and batteries, as well as the maintenance cost.
Clearly Li battery technology has become mainstream for many consumer products, such as smartphones. However, there has been recent news of Li batteries in a major brand of high-end smartphone catching fire and exploding, apparently without warning and without any apparent exposure to unusual or extreme conditions. The added potential risk of fire from Li batteries crammed into a tiny package and exposed to very high temperatures may deter the major IT manufacturers from offering Li inside the server without long-term testing. This does not preclude them from offering systems with super-capacitors, a safer battery chemistry, or future cost-effective LES technologies.
THE BOTTOM LINE
So where does this leave the centralized UPS in the design of the data center of the future? Clearly there is an existing and increasing divergence in design and operating conditions between the enterprise data center (and the colocation facilities intended for the enterprise customer) and the hyper-scale designs of the search and social media giants, as well as some of the largest cloud service providers. The latter are free to try whatever power (and cooling) scheme or technology they feel is best, at will, as long as they continue to reliably deliver their computing services. Or they can just follow the latest and ever-changing OCP design or the newly formed “Open19” group recently founded by LinkedIn. Hyper-scale cloud providers, such as Amazon, Google, and Microsoft, will continue to have the luxury of designing their IT equipment and the facility as part of a cohesive and integrated design strategy to constantly improve energy efficiency and lower overall data center TCO.
In contrast, other data centers will continue to be designed to support standard OEM IT equipment, which so far has not been offered with on-board batteries (or other energy storage devices such as super-capacitors), and will still need to provide centralized power conditioning and energy storage.
And while enterprise “reliability” requirements are different from those of the search and social media giants, ultimately, the enterprise and colocation data center designers need to pay close attention to the changing characteristics of the IT hardware, and to replication and multisite failover software strategies. When properly utilized as part of a strategic availability scheme, these strategies can help reduce the dependence on high levels of facility infrastructure redundancy, making the enterprise data center more cost competitive with cloud providers.
This year we have seen multiple major airlines experience data center outages that impacted thousands of flights and tens of thousands of customers. While only indirectly related to this discussion, these outages brought to light how highly dependent the airlines were on the centralized data center infrastructure at a single facility and, more importantly, the fact that they did not seem to have any software-based failover systems to provide distributed multiple-site redundancy.
So until the majority of major OEMs offer internal energy storage as low-cost options across their entire product set, I expect to find the UPS in enterprise and colocation facilities for many years to come.