I’ve been a mission-critical engineer for close to 30 years and am still puzzled by some things. We all know what an investment of $500 million buys. We invest this money because we think we are buying reliability and business resiliency. After this kind of investment we are enamored with the infrastructure, and we feel confident that it will perform as designed when called upon.
Among the industries that have zero tolerance for error, the ones that stand out are aviation, rail, nuclear power plants, and, of course, NASA. You can call these industries “mission control” type industries, where error can lead to catastrophes, cascading failures, and loss of life, money, and reputation.
Are we falling short in fields that require this type of intolerance for error? As we are already aware, human error causes approximately 60 percent of all downtime experienced by mission-critical facilities. This number is far too high. Today there are a growing number of DCIM tools that can help reduce downtime, but we are just beginning to scratch the surface in moving toward a significant reduction in downtime. We are still many years away from that ultimate goal of ‘zero downtime.’ There have been many recent examples of human error that have caused fatalities:
- The crash of Air France Flight 447, which killed 228 people and was attributed to a lack of pilot training for surprise situations
- The head-on collision of a Metrolink train near Chatsworth, CA, which was probably caused by an engineer who was texting and which killed 25 people and injured 135
- The actions of the Costa Concordia captain before and after the collision, which led to the deaths of 32 people
- The crash of Colgan Flight 3407, operated as a Continental Connection flight, in the suburbs of Buffalo, which killed all 49 people aboard and one person on the ground
Either character flaws or a lack of training played a role in each of these disasters. All could have been avoided if the right people had been in these positions.
Beyond these man-made disasters, we have natural disasters that are even more difficult to cope with. In the wake of Superstorm Sandy, we are once again reminded of how vulnerable our country’s infrastructure is and how large-scale disasters and catastrophes can produce extended downtime.
Sandy left millions without power in the tri-state area, causing untold chaos and the worst gasoline shortages since the 1970s. There are many ways to defend against these disruptions, from ensuring that refineries have appropriate standby power to deploying microgrids designed to support the critical infrastructure on which our digital way of life depends. How can we expeditiously improve? The critical infrastructure of our country should not be left so unprotected. Given its importance to health and safety as well as to our financial system, it deserves to be as robust as any mission-critical industry in this country. Superstorm Sandy and its impact on transportation (auto, air, and rail) crippled Manhattan and some of New York City’s suburbs for days.
Although everybody did the best they could under the circumstances and the first responders deserve accolades, there is no doubt that the effects could have been mitigated with better disaster planning, closer coordination, and an inventory of the right assets.
The transformation of this industry must start with the workers. But they need the right tools to be successful, and this is where management comes in. The engineers and technicians are the foundation of success for this industry. Where do we get them? How do we train them?
We are the new mission control, and we need to take a page from the nuclear, aviation, and first-responder industries to bring the share of downtime caused by human error down from 60 percent to a figure that approaches zero. There is a lot of collaboration and work to do. How do we make this industry a profession? How do we develop the right character? How do we ensure continuous improvement? Having a college degree or mastering a trade is only part of the equation. What programs do we need to develop in our industry?