The Perfect Storm
Pressure to substantially reduce energy demand by data centers while they provide increased capacity for data processing is creating a perfect storm for the mission-critical industry. Optimizing efficiency is widely accepted as the first step in the quest for energy demand reduction. The most popular metric is power usage effectiveness (PUE). PUE is calculated by dividing the total power in kilowatts entering a data center by the power in kilowatts used to run the computer infrastructure within it. PUE is therefore expressed as a ratio, with overall efficiency improving as the quotient decreases toward 1.
Members of the Green Grid, an industry group focused on data center energy efficiency, created PUE. Data center infrastructure efficiency (DCIE) is the reciprocal of PUE and is expressed as a percentage that improves as it approaches 100 percent.
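The two metrics can be sketched in a few lines of Python. This is a minimal illustration of the definitions above, not a measurement procedure; the function names and the 1,500 kW and 1,000 kW readings are hypothetical values chosen only to show the arithmetic.

```python
def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """Power usage effectiveness: power entering the facility / power used by IT."""
    if it_equipment_kw <= 0:
        raise ValueError("IT equipment power must be positive")
    if total_facility_kw < it_equipment_kw:
        raise ValueError("total facility power cannot be less than IT power")
    return total_facility_kw / it_equipment_kw


def dcie_percent(total_facility_kw: float, it_equipment_kw: float) -> float:
    """DCIE: the reciprocal of PUE, expressed as a percentage."""
    return 100.0 / pue(total_facility_kw, it_equipment_kw)


# Hypothetical readings: 1,500 kW entering the facility, 1,000 kW at the IT load.
example_pue = pue(1500.0, 1000.0)
example_dcie = dcie_percent(1500.0, 1000.0)
print(f"PUE = {example_pue:.2f}, DCIE = {example_dcie:.1f}%")
```

With these assumed readings, one third of the incoming power goes to something other than the computing load, which is the overhead the rest of the optimization effort targets.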
You can’t manage what you can’t measure, so the first step to gaining control is to specify and install power monitoring and measuring hardware throughout a facility’s electrical backbone. That hardware should be connected to intelligent software and integrated into a globally accessible network for users with a “need to know.”
The three basic steps:
- Identify the energy offenders
- Arrest the culprits by taking steps to optimize systems
- Measure the results and continue to drive improvement
Since everything we do is aimed at making the entire machine we call the data center as efficient and reliable as possible, surviving the perfect storm means meeting availability requirements. Downtime can be planned or unplanned; however, all downtime detracts from efficiency. An arc flash is the last thing we want in such a setting.
Arc flash from a failed bus joint or cable connection is the most egregious cause of unplanned downtime, so much of what we do while maintaining mission-critical electrical infrastructure is intended to avoid unplanned downtime as a result of this problem.
Arc flash is literally a ball of fire caused by an electrical fault. Ralph Lee’s technical paper “The Other Electrical Hazard: Electric Arc Blast Burns” (1982) provides the following graphic definition: “Current passing through a vapor of the arc terminal conductive metal.” Recognition of arc flash has put emphasis on increased safety standards and changed the traditional approach to maintenance and repair. However, serious workplace injuries and fatalities from electrical arc-flash incidents continue to occur each year.
The traditional approach to designing, building, and maintaining mission-critical electrical infrastructure includes the following elements:
- Formal electrical safety programs at facilities coupled with required vendor compliance. Equipping and training personnel in comprehensive safety procedures and the use of personal protective equipment (PPE).
- Annual thermographic (infrared) scans of all connections and bus joints under full load. In order to do this, doors, covers, and internal barriers must be opened or removed. Personnel performing the scan are exposed to the hazard of arc flash and PPE is required, increasing the associated time and cost of the IR scan. IR scanning, like many other tests, is subjective and provides only a “snapshot in time.” Misinterpretation may prevent recognition of a potential failure.
- Annual preventive maintenance relies upon the use of a checklist of observations, measurements, tests, and adjustments to “tune up” the apparatus/system. It is impossible to perform complete preventive maintenance on energized gear due to the requirements of NFPA 70E and the inherent limitations of PPE. Therefore, PPE should only be used to take measurements, operate switches or breakers, lock out/tag out gear, and draw or re-rack breakers for maintenance, etc. In order to provide a safe environment for maintenance personnel, system design must include “concurrent maintainability” (the flexibility that allows sections to be isolated and secured without interrupting critical loads or impacting system functionality).
Unfortunately, many legacy systems do not contain this design feature and cannot be fully maintained. In this case, maintenance options include:
- Don’t perform complete maintenance
- Provide temporary hard-wired wraparound circuits
- Place personnel in harm’s way and hope for the best
Maintenance can also introduce the risk of human failure, which may impact reliability. For example, forgetting to re-close the battery disconnect for a UPS system will cause a failure on the next power outage. Maintenance cannot be taken lightly and must be as carefully planned as a war-time offensive.
I asked Steve Fairfax, president of MTechnologies, if there is a quantifiable percentage of failures attributable to human interaction during maintenance in mission-critical facilities. He said, “Math is not always very helpful after the accident. Math can be used to calculate both the benefits and negative effects of maintenance. Such a calculation can be used as an aid in setting optimal policies. Math alone will never provide solutions to problems concerning human action, but it can contribute.”
Steve also said that every situation is different and that potential risks must be factored into each possible scenario. There are some maintenance practices that may increase risk to a facility. Not doing maintenance certainly reduces the risk of human failure, but it is probably not the wisest decision.
- Specify and install fault-tolerant switchgear. Manufacturers are racing to develop designs that incorporate a variety of solutions and features in order to create market differentiation and mitigate the danger of arc flash. Designs may include guides or shutters designed to direct an arc up and out, safely away from personnel. New arc-flash detectors trip upstream breakers or trigger downstream devices to create a bolted fault. Some breaker manufacturers incorporate a “service” switch on circuit breakers that temporarily overrides normal trip settings so that, should an event occur during maintenance or testing, the subject breaker will trip as quickly as possible. Installation of special view ports enables IR scans to be done without opening covers and doors. However, the best view port lens material is fragile and may not stand up well over time. A more robust material may not offer the most desirable transmission rate of infrared energy and can degrade over time. View ports may also not allow for a thorough scan of all bus and cable joints or connections due to a limited field of view. This is especially true when bus or cables are “stacked” behind one another.
Traditional maintenance practices need to be re-examined. The good news is that necessity is truly the mother of invention. The demand for more data, increased reliability, and reduced energy will yield new technology and applications.
Recently at DatacenterDynamics in New York, I saw a presentation about an emerging technology to provide permanently installed thermal monitoring of bus joints and cable connections, since these are some of the most common causes of failures resulting in arc flash.
There appear to be some advantages to this approach:
- 7x24x365 thermal monitoring provides a constant stream of thermal data as opposed to the “snapshot in time” afforded by traditional infrared scans. Combined with comprehensive power monitoring, trend analysis becomes a new weapon in the arsenal of failure prevention.
- Doors and covers need not be opened, thereby eliminating the need for PPE and reducing personal risk.
- Interpretation of the information is simplified.
- Combined with comprehensive power monitoring, a thermal map can be developed and trends monitored so that anomalies can be addressed long before a failure occurs. Comprehensive power measurement and analysis also pinpoints inefficiencies so they may be corrected and energy demand reduced.
The illustrations demonstrate this concept.
- This technology uses a non-proprietary open protocol for integration with existing BMS, and there is apparently no annual license fee or per-point charge.
- Information is available globally for those responsible for data center management.
- No power supply for thermal sensors means one less point of failure and lower cost per point measured.
- Devices are self-calibrating.
- Direct-contact devices are available where line-of-sight measurement is impractical.
- Depending on the equipment involved, the time periods between traditional maintenance events may be lengthened or, in some cases, events eliminated.
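The trend analysis described in the list above, continuous thermal data combined with power monitoring, might look something like the following sketch. It is not the vendor's method. It assumes, for illustration only, that a joint's temperature rise above ambient scales roughly with the square of load current, so that a steadily growing rise per amp squared hints at increasing joint resistance. The `Reading` type, the constant ambient temperature, the 5 percent per day threshold, and all sample values are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Reading:
    hour: float          # hours since monitoring began
    joint_temp_c: float  # measured joint temperature, deg C
    load_amps: float     # load current at the same moment


def least_squares_slope(xs: List[float], ys: List[float]) -> float:
    """Slope of the best-fit straight line through (xs, ys)."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all readings share the same timestamp")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def is_joint_degrading(readings: List[Reading], ambient_c: float,
                       max_growth_per_day: float = 0.05) -> bool:
    """Flag a joint whose load-normalized temperature rise keeps growing.

    Assumption: rise above ambient is roughly proportional to current squared,
    so (rise / amps**2) stays flat for a healthy joint under varying load.
    """
    hours = [r.hour for r in readings]
    normalized = []
    for r in readings:
        if r.load_amps <= 0:
            raise ValueError("load current must be positive")
        normalized.append((r.joint_temp_c - ambient_c) / r.load_amps ** 2)
    if normalized[0] <= 0:
        raise ValueError("first reading must be above ambient")
    # Fractional growth per day, relative to the first reading.
    growth_per_day = least_squares_slope(hours, normalized) * 24 / normalized[0]
    return growth_per_day > max_growth_per_day


# Hypothetical data: same 400 A load, but the joint runs 2 deg C hotter each day.
degrading = [Reading(0, 45.0, 400), Reading(24, 47.0, 400),
             Reading(48, 49.0, 400), Reading(72, 51.0, 400)]
# Hypothetical data: temperature swings, but only because the load swings.
healthy = [Reading(0, 36.25, 300), Reading(24, 45.0, 400),
           Reading(48, 36.25, 300), Reading(72, 45.0, 400)]

print(is_joint_degrading(degrading, ambient_c=25.0))
print(is_joint_degrading(healthy, ambient_c=25.0))
```

Normalizing by load is the point of pairing thermal data with power monitoring: a single infrared snapshot cannot tell a hot joint from a heavily loaded one, while a trend on the normalized value can.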
This is just one example of new technologies intended to predict and prevent failures and increase reliability while theoretically reducing demand for maintenance downtime.
Our perfect storm may have a holistic silver lining. Like the space program, the mission-critical industry may pioneer solutions for use in other segments. The mission-critical industry will find solutions for seemingly insurmountable problems.