Redundant Rack Power Distribution Can Lead to Failure
Can redundancy create a false sense of security?
Welcome to the new decade, where uber-scale is the new normal. I can’t even keep track of the new data centers announced last year. We seem to have mastered the art of data center power delivery systems at scales that were previously unimaginable, as 20-, 50-, and 100-MW campuses are being built in record time. Conversely, the edge is still being defined by power and size. And all of this is driven by an endless demand for more cloud services and the promise (or threat) of 5G applications.
But facility design still comes down to the basics: providing uninterrupted power to hundreds or thousands of racks full of IT equipment.
Reliable redundant IT power is a basic expectation for virtually every data center application. Yet, internet colossuses, such as Amazon, Facebook, Google, and Microsoft, rely on software-based redirection for user queries in the event of a failure. Nonetheless, they still utilize redundant A-B power circuits for their critical core IT infrastructure (database servers, storage systems, routers, switches, etc.).
Using IT equipment with dual power supplies is common practice in both enterprise data centers and colocation facilities. As a result, redundant A-B power distribution circuits to the rack have become a popular provisioning option in many colos. However, power loss or misconfiguration of power distribution by IT administrators can be a major contributor to downtime.
In most cases, dual-corded IT equipment has redundant dual supplies that share the load at approximately 50% each when both supplies are active. However, if either supply fails or has lost input power, the remaining one must draw 100% of the power required. This is a normal failover scheme, but misunderstandings can lead to improper implementation.
A common issue is an overload of the power strip circuit breaker or of the branch circuit breaker that feeds the power strip’s receptacle. Even with a metered power strip or a manual measurement at the distribution panel, it’s impossible to keep track of the peak current drawn on a branch circuit. The result is like playing Russian roulette whenever a new piece of IT equipment is plugged in. Even if the added server doesn’t immediately trip a breaker, the circuit may be near (or at) capacity. Under heavy computing loads, power draw will increase, and the overload will trip the circuit breaker.
Metered power distribution units (PDUs) may indirectly lead to a false sense of security. Locally metered PDUs have been available for over 20 years, but the need for improved energy efficiency has given rise to “intelligent” PDUs.
Nonetheless, intelligent PDUs in and of themselves cannot help if the information they convey is not fully understood. Circuit loading rules for redundant A-B power at the rack level are often misunderstood. Redundancy is only achieved when the combined total current of the two feeds (A+B) does not exceed 80% of the rating of a single circuit at the maximum projected power draw. Some IT administrators falsely believe they are safe at 50% load per side under average conditions.
For example, a 20-A branch circuit should only be loaded to 16 A. This means that for racks filled with equipment with redundant A-B power distribution, the sum of the two branch circuits should not exceed 16 A. Ideally, the total load should be equally balanced across the A-B set, which leads to the rule of thumb recommendation (each branch circuit should only be loaded to 40%). If those numbers are exceeded, when one side is lost for any reason, the load shift may trip the remaining active circuit breaker, causing a cascade failure.
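The loading rule above can be sketched as a small check. This is a minimal illustration rather than a vendor tool: the function name `failover_check` and the sample readings are hypothetical, and the currents are assumed to be projected peak values, not averages.

```python
def failover_check(amps_a, amps_b, breaker_amps, derate=0.80):
    """Check whether one feed can carry both loads after the other is lost.

    amps_a, amps_b: projected peak current on the A and B circuits.
    breaker_amps:   rating of each individual branch circuit breaker.
    derate:         fraction of the rating allowed (80% per the rule above).
    Returns (surviving_load, limit, ok).
    """
    limit = breaker_amps * derate
    surviving_load = amps_a + amps_b  # everything shifts to the remaining feed
    return surviving_load, limit, surviving_load <= limit


# Two 20-A circuits loaded to 7.5 A each: 15 A total, under the 16-A limit.
print(failover_check(7.5, 7.5, 20.0))

# The "50% per side" misconception: 10 A + 10 A = 20 A on one 20-A breaker.
print(failover_check(10.0, 10.0, 20.0))
```

The second case is the one that looks comfortable on a metered PDU yet fails the moment one side is lost.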
Moreover, blade servers can require as many as six power supplies. Industry trends call for higher power per rack, so it has become more common to use three-phase power for each rack. The same rules apply here: if any phase exceeds the 40% level, redundancy can be lost, leading to a cascade failure. This problem is common because it is difficult to monitor every rack manually. The only way to avoid it is real-time remote monitoring of every branch circuit, with threshold alerts that warn of potential overloads.
Wide swings in IT load current are no longer an unlikely or extreme scenario. The quest for energy efficiency in IT equipment has significantly changed the load profile and the idle-to-maximum power range of modern servers. A few years ago, a typical server would idle at 40% to 60% of maximum power. Newer servers have a much wider ratio and may idle at only 20% to 30% of maximum power (the peak current could be roughly three to five times the idle current). As a result, a manual survey “snapshot” current reading is nearly meaningless without a valid correlation to the servers’ computing load status.
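To see how far a snapshot can understate the real exposure, the idle-to-maximum ratio can be turned into a rough projection. This is only a sketch: the helper name `peak_from_idle` and the 4-A reading are hypothetical, and it assumes the reading was taken with every server idle and that current scales linearly with power at a fixed voltage.

```python
def peak_from_idle(idle_amps, idle_fraction):
    """Project peak current from an idle reading.

    idle_fraction is idle power as a share of maximum power,
    e.g. 0.25 for a server that idles at 25% of its maximum.
    """
    if not 0 < idle_fraction <= 1:
        raise ValueError("idle_fraction must be in (0, 1]")
    return idle_amps / idle_fraction


snapshot = 4.0  # amps read from a metered PDU while servers were idle
for fraction in (0.50, 0.30, 0.20):
    peak = peak_from_idle(snapshot, fraction)
    print(f"idle at {fraction:.0%} of max -> projected peak {peak:.1f} A")
```

The same 4-A reading implies a very different peak depending on the server generation, which is why the reading needs to be correlated with computing load.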
Before we dive into more details, remember that VA (volts x amps) is “apparent power” and watts (volts x amps x power factor) is “real power.” For simplicity, we will ignore the IT power factor, since virtually all modern IT equipment is power factor corrected to about 0.97 or higher.
However, it should be noted that circuit breakers do not understand power (kVA or kW); they can only sense current (amperes). This exposure can be overlooked when running near the rated power limit of a PDU or circuit. There is an allowable voltage drop that inherently occurs in a branch circuit, and it increases as the current increases (IR voltage drop: current x resistance of the conductors). This drop can normally range from 1% to 3%. The maximum total voltage drop for the combination of branch circuit and feeder should not exceed 5%.
For example, if the voltage at the server rack drops by about 4% (208 V to 200 V), the current will rise by about 4%, because IT power supplies draw roughly constant power regardless of input voltage. If the expected maximum current is based only on the “nominal” 208-V circuit voltage and is near breaker capacity, this lower-voltage condition may increase the risk of a breaker trip.
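The effect is easy to check with simple arithmetic. The sketch below treats the IT load as constant power and ignores power factor, as the article does; the 3,200-VA load and the function name `current_at_voltage` are hypothetical values chosen for illustration.

```python
def current_at_voltage(load_va, volts):
    """Current drawn by a constant-power load at a given supply voltage."""
    return load_va / volts


load_va = 3200.0          # hypothetical peak load on one 20-A, 208-V circuit
limit_amps = 20.0 * 0.80  # 80% loading limit for a 20-A breaker

for volts in (208.0, 200.0):
    amps = current_at_voltage(load_va, volts)
    status = "below" if amps < limit_amps else "at or above"
    print(f"{volts:.0f} V -> {amps:.2f} A ({status} the {limit_amps:.0f}-A limit)")
```

A load that fits with margin at the nominal voltage lands on the limit once the voltage sags at the rack.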
Many colocation facilities provide branch circuit monitoring at their own power distribution systems to monitor and protect the equipment and to advise the customer.
Rack power demands in the range of 2 to 5 kW could be supplied by one or two single-phase circuits. Beyond that, it becomes more complex and physically cumbersome to run a high number of single-phase circuits and PDUs in the rack. For example, a 30-A circuit can deliver 24 A, which can supply about 2.9 kVA at 120 V or approximately 5 kVA at 208 V.
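The capacity figures above follow directly from the 80% loading rule. A minimal sketch, with a hypothetical helper name `usable_kva` and power factor ignored as elsewhere in this article:

```python
def usable_kva(breaker_amps, volts, derate=0.80):
    """Usable apparent power of a single-phase circuit at the 80% limit."""
    return breaker_amps * derate * volts / 1000.0


for breaker, volts in ((20, 120), (20, 208), (30, 120), (30, 208)):
    print(f"{breaker}-A circuit at {volts} V: {usable_kva(breaker, volts):.2f} kVA")
```

Remember that for a redundant A-B pair, this is the capacity of the pair as a whole, not of each side.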
Even at this power capacity, the single-phase, 30-A rack PDU has a hidden exposure. While many rack PDUs carry a kVA rating, they require onboard circuit current protection for the receptacles. This typically results in two or more 20-A circuit breakers, one for each bank of outlets. For example, if the IT equipment specified for your rack requires a maximum draw of 20 A at 208 V, a single 30-A circuit is required, because a 20-A circuit should only be loaded to 16 A. This means the maximum power that can be delivered to that rack is approximately 5 kVA. To deliver 9 to 10 kVA redundantly requires four 208-V circuits, which can crowd the back of the rack and block airflow.
Even within higher-capacity rack PDUs, hidden exposures exist. While there are many rack PDUs rated above 5.7 kVA, the per-circuit current protection is limited to 20-A breakers for server receptacles. These receptacles must be segregated into groups, with each group protected by a breaker. Therefore, a 30-A PDU would have two groups of receptacles, each protected by a 20-A breaker and fed by a 30-A input breaker in the PDU (as well as the upstream 30-A branch circuit breaker).
For higher power, consider three-phase power distribution to the rack. This can be configured as Delta (four wire: three phases + ground) to provide 208 V phase-to-phase, or Wye (five wire: three phases + neutral + ground) to provide 120 V phase-to-neutral and 208 V phase-to-phase. Three-phase Delta adds to the exposure, since the PDU has three groups of 208-V circuits that are fed by two-pole breakers, each connected across two legs of the three phases (L1-L2, L2-L3, and L3-L1). While this ensures the breakers protect and limit the maximum current for the individual receptacle groups, each phase conductor is shared by two groups, so the breakers neither balance the load nor prevent the combined current on any given phase from exceeding the maximum.
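The gap between per-group current and per-phase current can be illustrated with a short calculation. This sketch assumes all three receptacle groups have the same power factor, so their currents are 120 degrees apart and the current in a shared phase conductor is the phasor difference of the two group currents it carries. The function name and sample loads are hypothetical.

```python
import math


def delta_line_currents(i_12, i_23, i_31):
    """Phase-conductor currents for three line-to-line load groups.

    i_12, i_23, i_31: group currents on L1-L2, L2-L3, and L3-L1.
    Assumes equal power factor in every group, which gives
    |I_line| = sqrt(a^2 + b^2 + a*b) for the two groups on that line.
    """
    def combine(a, b):
        return math.sqrt(a * a + b * b + a * b)

    return {
        "L1": combine(i_12, i_31),
        "L2": combine(i_12, i_23),
        "L3": combine(i_23, i_31),
    }


# Two groups at 16 A each (within their 20-A breakers), third group empty.
currents = delta_line_currents(16.0, 16.0, 0.0)
for phase, amps in currents.items():
    print(f"{phase}: {amps:.1f} A")
```

In this unbalanced example, both group breakers see an acceptable 16 A, yet L2 carries roughly 27.7 A, which is above the 24-A limit of a 30-A input circuit.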
Rack power levels have risen well beyond 2 to 5 kW per rack. Stacking 40 1U servers in a rack is not uncommon. However, while each server may draw 150 to 250 W under average or idle conditions, it can ramp up to 400 to 500 W at peak utilization. Average rack power demand can range from 6 to 10 kW but may reach 16 to 20 kW under heavy computing loads. The same holds true for four to five blade servers in a rack.
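These rack totals are simply the per-server figures multiplied out, and they can be converted to an approximate three-phase line current. The sketch below assumes a perfectly balanced 208-V three-phase load and ignores power factor; the helper names are hypothetical.

```python
import math


def rack_kw(servers, watts_each):
    """Total rack power in kW for identical servers."""
    return servers * watts_each / 1000.0


def three_phase_line_amps(kw, volts_line_to_line=208.0):
    """Line current for a balanced three-phase load (power factor ignored)."""
    return kw * 1000.0 / (math.sqrt(3) * volts_line_to_line)


for label, watts in (("idle low", 150), ("idle high", 250),
                     ("peak low", 400), ("peak high", 500)):
    kw = rack_kw(40, watts)
    amps = three_phase_line_amps(kw)
    print(f"{label:9s}: {kw:4.1f} kW, about {amps:.1f} A per phase at 208 V")
```

The spread between the idle and peak rows is what a branch circuit has to be sized and monitored for.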
When going above 10 kVA, branch circuits can range from 40 to 60 A (single- or three-phase at 208 V), and the PDUs may require up to six two-pole breakers. Each of these breakers, as well as the main input breaker, must have current monitoring to properly balance and manage the IT loading. This is where data center infrastructure management (DCIM) or PDU monitoring software plays a critical role.
While we have been discussing the typical U.S. distribution voltage (120/208 V), European distribution voltages (220/380 V, 230/400 V, and 240/415 V) provide roughly double the power of a 120-V circuit at the same current. This can save cable and PDU space and costs. Existing IT power supplies can operate at 120, 208, or 240 V but are more efficient at 208 and 240 V.
And lastly, while we have been focusing on power, cooling high-density racks is no small task. As both the number and size of PDUs increase, airflow issues increase. This, coupled with warmer cold aisles (70° to 80°F) and greater delta T (20° to 40°F) for IT equipment, results in much higher temperatures at the back of the rack (100° to 120°F or more). Many older PDUs were only rated to 104°F, and some may contain circuit breakers that trip due to higher temperatures or metering electronics that fail. Vendors now offer next-generation PDUs rated for 140°F. Review your PDUs and consider replacing them at your next IT or facility refresh.
The Bottom Line
The rack PDU may be the last part of the power chain, but it’s a critical link. While facility engineers are responsible for the entire electrical power chain, IT and operational staff typically request redundant branch circuits to the racks, select the PDUs, and install the equipment. Ultimately, most statistics show that human error, not hardware failure, is the root cause of most data center power failures. Smart PDUs, coupled with monitoring software, help maintain operational status and minimize human error.
In most cases, power issues can be traced back to a misunderstanding of redundancy and how it was implemented. As the saying goes, knowledge is power; nonetheless, a little knowledge can be dangerous.