Equipping servers with dual power supplies for improved reliability is a common practice in today’s mission-critical data center environment. However, if improperly implemented, this practice can increase the likelihood of power failure. A Tier IV data center application will include two completely independent power paths, each including the six basic power components:
  1. Utility power source(s) and main power panel(s)
  2. Back-up generator and automatic transfer switch (ATS)
  3. Uninterruptible power supply (UPS) and maintenance bypass panel
  4. Power distribution unit (PDU) (or sub-panel from UPS)
  5. Rack-level PDU
  6. Server’s internal power supply (PS)

Figure 1. A Tier IV fully redundant data center.

Each path and all the items in the path must be capable of supporting 100 percent of the entire data center load. This represents true 2N redundancy, which means that no single point of failure will interrupt the operation of the data center equipment (see figure 1).
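The 2N requirement above can be written as a simple check. The following is a minimal sketch, not a sizing tool: the stage names, the kW capacities, and the helper names `weakest_stage` and `is_2n` are hypothetical, and it assumes each stage's capacity has already been aggregated for the whole path.

```python
def weakest_stage(path):
    """Return the (stage name, capacity in kW) pair with the lowest capacity."""
    return min(path.items(), key=lambda stage: stage[1])


def is_2n(path_a, path_b, total_load_kw):
    """True only if each path, on its own, can carry the entire load."""
    return all(weakest_stage(path)[1] >= total_load_kw for path in (path_a, path_b))


# Hypothetical capacities (kW), aggregated per stage for each power path.
path_a = {"utility": 500, "generator": 400, "ups": 300,
          "pdu": 300, "rack_pdu": 320, "server_ps": 350}
path_b = {"utility": 500, "generator": 400, "ups": 200,
          "pdu": 300, "rack_pdu": 320, "server_ps": 350}

print(is_2n(path_a, path_b, total_load_kw=250))  # False: path B cannot carry 250 kW
print(weakest_stage(path_b))                     # ('ups', 200)
```

A path is only as strong as its smallest stage, which is why the check looks at the minimum rather than the average capacity.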

Of course, not every data center is a Tier IV facility. All operators would like to have complete power system redundancy, but cost usually forces some trade-offs. Typically, although servers have dual power supplies, the rest of the two power paths are not completely independent (see figure 2). The more common scenario is that each of the server’s PS cords is plugged into a different rack-level PDU. This creates a sense of redundancy for most administrators. In reality, this is where the hidden exposure to power problems starts.

This seemingly simple and common practice is a potential cause of power failures in the data center (see figure 3). In most cases, the dual supplies share the server load at approximately 50 percent each when both supplies are active. However, if either PS fails or loses input power, the remaining PS must draw 100 percent of the power required.

Figure 2. Single points of power path failure

Servers are normally installed and operated with both rack-level PDUs available, so each PS typically draws only 50 percent of the server’s power requirement. Under these conditions, the load on each PDU is normally (one hopes) less than the trip value of the circuit breaker that protects it. In fact, even if the PDU has a current meter, most administrators would think they have the capacity to add more servers if they are only at a 60 percent power level (see figure 4).

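This thinking leads to a classic cascade power failure. If both PDUs are at 60 percent and one of them loses power, every dual-corded server shifts its full load to the remaining PDU, which then sees 120 percent of its rating, trips its breaker, and takes the rest of the rack down with it. The same problem results if an additional server or other equipment pushes the load past the tripping point of either PDU.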
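The cascade can be illustrated with a few lines of arithmetic. This is a simplified sketch under stated assumptions: every device is dual-corded and shifts its full draw to the surviving supply, the breaker is treated as tripping as soon as its rating is exceeded (real breakers follow a time-current curve), and the 20 A rating and function names are hypothetical.

```python
def load_after_failover(load_a_amps, load_b_amps):
    """Current on the surviving PDU once the other PDU has lost power.

    Assumes every device is dual-corded and shifts its full draw
    to the surviving supply.
    """
    return load_a_amps + load_b_amps


def cascade_failure(load_a_amps, load_b_amps, breaker_amps):
    """True if losing either PDU would overload the other one."""
    return load_after_failover(load_a_amps, load_b_amps) > breaker_amps


# Hypothetical rack: two 20 A PDUs, each metered at 12 A (60 percent).
print(load_after_failover(12.0, 12.0))    # 24.0 A lands on the survivor
print(cascade_failure(12.0, 12.0, 20.0))  # True: the second PDU trips too
print(cascade_failure(8.0, 8.0, 20.0))    # False: 16 A fits on one PDU
```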

Since many racks do not have metered PDUs, adding servers can be risky because the administrator has no way of knowing if the next server will overload the PDUs.

Figure 3. Better reliability is achieved through redundant PDUs.

The only way to safely implement dual server power supplies with dual rack PDUs is to keep the load on each rack PDU or path below 40 percent of its rated (nameplate) value. In addition, all circuits must always be protected by a circuit breaker. The UL and NEMA mandated codes limit draws to 80 percent of the rated value (see figure 5). In a dual-PDU rack fed by 20-A circuits, for example, the entire equipment load should not exceed 16 A for the rack. Therefore, each PDU should normally carry only an 8-A load in order to avoid a potential cascade overload and the resulting complete rack-level power failure.
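The 40 percent figure is simply the 80 percent limit split across two PDUs. The sketch below works through that arithmetic; the 20 A rating is an assumption inferred from the 16 A example in the text, and the function names are hypothetical.

```python
def max_rack_load(pdu_rating_amps, derating_percent=80):
    """Largest total rack load that a single PDU can carry after a failover."""
    return pdu_rating_amps * derating_percent / 100


def max_load_per_pdu(pdu_rating_amps, derating_percent=80):
    """Normal-operation limit per PDU: half of the failover limit."""
    return max_rack_load(pdu_rating_amps, derating_percent) / 2


def can_add_device(load_a_amps, load_b_amps, device_amps, pdu_rating_amps):
    """True if the rack stays within the failover limit after adding a
    dual-corded device that draws device_amps in total."""
    total = load_a_amps + load_b_amps + device_amps
    return total <= max_rack_load(pdu_rating_amps)


print(max_rack_load(20))                   # 16.0 A for the whole rack
print(max_load_per_pdu(20))                # 8.0 A per PDU, i.e. 40 percent of 20 A
print(can_add_device(7.0, 7.0, 2.0, 20))   # True: exactly 16 A after the addition
print(can_add_device(7.0, 7.0, 2.5, 20))   # False: 16.5 A would exceed the limit
```

Note that the check uses the sum of both PDU readings, because that sum is what one PDU must carry when the other is lost.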

In a multi-phase PDU this is even more important, since it has become very common to use a three-phase 208/120-volt PDU populated with three groups of single-phase 120-volt outlets, fed from a single three-phase breaker. In this scenario, if any phase exceeds the rated current, the breaker will trip and all three phases will be dropped, potentially resulting in a loss of power to the entire rack (see figure 6).
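Because one overloaded phase drops all three, the check has to be made per phase rather than on the rack total. The following sketch assumes per-phase current readings are available; the phase labels, the 20 A breaker rating, and the function names are hypothetical.

```python
def overloaded_phases(phase_amps, limit_amps):
    """Return the names of phases whose current exceeds the limit."""
    return [name for name, amps in phase_amps.items() if amps > limit_amps]


def breaker_trips(phase_amps, limit_amps):
    """A single three-pole breaker drops every phase if any one is over the limit."""
    return bool(overloaded_phases(phase_amps, limit_amps))


# Two hypothetical racks with the same 32 A total but different balance.
unbalanced = {"L1": 21.0, "L2": 6.0, "L3": 5.0}
balanced = {"L1": 11.0, "L2": 11.0, "L3": 10.0}

print(overloaded_phases(unbalanced, 20.0))  # ['L1']
print(breaker_trips(unbalanced, 20.0))      # True: the whole rack loses power
print(breaker_trips(balanced, 20.0))        # False
```

The two racks draw the same total current, yet only the unbalanced one trips, which is why spreading equipment evenly across the outlet groups matters.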

Figure 4. Should one of these PDUs fail, the remaining unit will be overloaded.

As mentioned earlier, even administrators who do have metered PDUs often do not realize that once they go past the 40 percent power level, they are in danger of having a cascade power failure. Moreover, because servers are upgraded and added all the time, it is easy to see how the exposure continues to increase with no warning until a problem occurs. Then everyone involved is baffled about why power was lost, because everyone thought they had “redundant” power.

Figure 5. Properly loaded dual PDUs provide redundancy when one fails.

Data center owners and operators should review the rack-level current draw at each PDU, which may require upgrading to metered PDUs. Remote monitoring (via SNMP and/or web) can reduce the time and cost of manually monitoring dozens or hundreds of PDUs by sending SNMP traps to management software. In addition, thresholds (e.g., 75 percent) could be set in monitoring software to send automatic alerts to administrators, warning them of potential power problems before the circuit rating is exceeded.
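Threshold logic of this kind might look like the sketch below. It assumes the current readings have already been collected by whatever monitoring system is in use (no SNMP calls are shown), and the PDU names, the 20 A rating, and the function name `pdu_alerts` are hypothetical. It uses two thresholds: the 40 percent redundancy limit discussed earlier and the 75 percent example above.

```python
def pdu_alerts(readings_amps, rating_amps, redundancy_percent=40, circuit_percent=75):
    """Classify PDU readings against two thresholds.

    Returns a list of (pdu name, alert kind, percent of rating) tuples:
    "circuit" when the PDU is close to its breaker rating, "redundancy"
    when it could no longer absorb its partner's load after a failover.
    """
    alerts = []
    for name, amps in sorted(readings_amps.items()):
        percent = 100 * amps / rating_amps
        if percent >= circuit_percent:
            alerts.append((name, "circuit", round(percent, 1)))
        elif percent > redundancy_percent:
            alerts.append((name, "redundancy", round(percent, 1)))
    return alerts


# Hypothetical readings in amps from 20 A rack PDUs.
readings = {"rack1-A": 7.0, "rack1-B": 9.0, "rack2-A": 15.5}
for alert in pdu_alerts(readings, rating_amps=20.0):
    print(alert)
# ('rack1-B', 'redundancy', 45.0)
# ('rack2-A', 'circuit', 77.5)
```

The lower threshold is the one that protects redundancy; an alert set only at 75 percent fires long after the rack has lost its ability to survive a PDU failure.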

Figure 6. Typical three-phase PDU circuit protection

In any case, true redundancy requires that either path can sustain 100 percent of the load if the other path fails. Reviewing and documenting the existing load structures, and proactively monitoring and managing the load levels on all PDUs as well as on all the other elements of the power path, is essential. Changing out PDUs can involve some downtime, but the same is true of any power path work where there is no true 2N power path. The choice between some planned, limited downtime and an unplanned surprise shutdown seems clear.