Making a data center truly resilient used to require significant hardware redundancy: two (or more) of everything. Two power cords. Two servers. Two power sources. And, generally speaking, two whole data centers. Having a backup (and a backup to the backup) for literally everything probably helped with peace of mind, but definitely not the bottom line. All that extra equipment requires space, power and cooling (not to mention staffing), adding up to twice the capacity you ever plan on using, at a prohibitively high cost.
Within the banking and trading sector, "prohibitively expensive" redundancy may still be affordable, and well worth the expense, when you consider that the cost of downtime at some banks has been estimated at six million dollars a second.
But data is growing relentlessly. Recent figures from ABI Research show that the Internet of Things grew to over 16 billion wirelessly connected devices by the end of 2014, and is expected to reach 40.9 billion devices by 2020. If every data center has to be doubled, that is a lot of expensive capacity that is seldom, if ever, used.
So, how can you reduce the degree of redundancy, and the cost of idle resources, while still meeting the need for resilience and dependability?
Some enterprises find the answer in software rather than hardware. They build fault tolerance into their software to strengthen their data center's resilience through load balancing, virtualization and other techniques, which essentially means moving workloads around to handle failover situations effectively. But what about systems that don't have that kind of flexibility and need 24/7 uptime, no matter what? For these facilities, the ability to run "What If?" simulations is critical.
TechTarget defines data center resilience as "the ability of a server, network, storage system or an entire data center to continue operating even when there has been an equipment failure, power outage, or other disruption."
To assess how resilient you are, you need to be able to answer some hard questions about your data center: how vulnerable is your system to failure? How many single points of failure do you have? If a device goes out of service, whether purposely for planned maintenance or through human or machine error, what will happen? Where will that load go? What else might fail as a result? Where are your weakest links? Your system can probably handle one failure, but what if that one failure starts a cascade?
To answer these questions, you need a data center infrastructure management (DCIM) monitoring system with the ability to run “What If?” scenarios. Empowering a data center manager with the ability to simulate an equipment failure “virtually,” without actually changing the physical environment and possibly endangering the workload or the facility, can reveal the potential impact of that failure on the rest of the system, and allow the user to take proactive steps to avoid the problem. The ability to simulate a power chain catastrophe helps avoid a real one.
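To make the idea concrete, here is a minimal sketch of how such a "What If?" simulation can work under the hood: model the power chain as a graph of devices and their upstream feeds, mark a device as failed, and propagate the outage until it stabilizes. All device names and the topology below are hypothetical; a real DCIM system would read them from its asset and connectivity database.

```python
# Hypothetical power-chain model: each device maps to the set of
# upstream feeds that can power it. An empty set means a utility
# source (omitted here for brevity).
feeds = {
    "ups-a": set(),
    "ups-b": set(),
    "pdu-1": {"ups-a"},
    "pdu-2": {"ups-b"},
    "rack-42": {"pdu-1", "pdu-2"},  # dual-corded: fed from A and B sides
    "rack-43": {"pdu-1"},           # single-corded: a single point of failure
}

def simulate_failure(failed):
    """Return the devices left with no live power path if `failed` go down."""
    dead = set(failed)
    changed = True
    while changed:  # keep propagating until the cascade stabilizes
        changed = False
        for device, upstream in feeds.items():
            if device in dead:
                continue
            # A device dies only when ALL of its upstream feeds are dead.
            if upstream and upstream <= dead:
                dead.add(device)
                changed = True
    return dead - set(failed)

# "What if pdu-1 fails?" -> rack-43 goes dark; dual-corded rack-42 survives.
print(simulate_failure({"pdu-1"}))   # {'rack-43'}
```

The same loop answers the maintenance-planning question: failing "ups-a" virtually shows that both pdu-1 and rack-43 would lose power, flagging the weak link before anyone pulls a real breaker.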
Recognizing the weaknesses in the system lets you maximize uptime by proactively analyzing the results of potential failures and mitigating them before they occur. It also makes planning scheduled maintenance much easier and safer.
But that’s not all data center managers can gain from “What If?” scenarios. When the user can see the entire picture of the power chain, end-to-end, they know what is connected to what. For instance, a manager can look at a server, or a cabinet, and trace where its power comes from, through the various circuits and power distribution units, across different phases, on different “sides” of the power chain. Conversely, a user can look at power equipment and immediately know what it’s feeding. This reveals unused power capacity and simplifies the process of deciding where to place new assets safely. More importantly, it reveals connection inconsistencies and verifies redundancy, enabling informed planning and helping you get the most out of your physical and power capacity.
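That end-to-end tracing amounts to simple graph traversal over the connectivity records. The sketch below, using a hypothetical topology, shows both directions: everything a power device ultimately feeds, and every source a load can draw from, which is how redundancy gets verified.

```python
# Hypothetical connectivity records: device -> the devices it feeds.
feeds_to = {
    "ups-a": ["pdu-1"],
    "ups-b": ["pdu-2"],
    "pdu-1": ["rack-42", "rack-43"],
    "pdu-2": ["rack-42"],
}

def downstream(device):
    """Everything ultimately fed by `device` (what it's feeding)."""
    seen, stack = set(), [device]
    while stack:
        for child in feeds_to.get(stack.pop(), []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

def upstream(device):
    """Every piece of power equipment that can reach `device`."""
    return {parent for parent in feeds_to if device in downstream(parent)}

# Trace from the equipment down: ups-a feeds pdu-1 and both racks.
print(downstream("ups-a"))    # {'pdu-1', 'rack-42', 'rack-43'}
# Trace from the load up: rack-42 is reachable from both A and B sides,
# so its dual-corded redundancy checks out.
print(upstream("rack-42"))    # {'ups-a', 'ups-b', 'pdu-1', 'pdu-2'}
```

A load whose upstream set collapses to a single UPS is exactly the kind of connection inconsistency this view surfaces before it becomes an outage.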
The ideal tool for “What If?” simulations gives mission-critical IT and facility managers the information they need to optimize current operations and avoid critical system failures. And that’s a “must-have” solution data centers can’t afford to ignore.