If the ongoing onslaught of outages at major institutions such as BA and the NHS can teach organizations anything, it’s that great disaster recovery (DR) software can only be as effective in a crisis as the larger plan that is put in place around it. Crisis is a relative term: it can range from catastrophic natural disasters and simple but common power outages to man-made issues stemming from human error or criminal activity such as ransomware. The expert advice is always the same: “organizations need a full DR plan in place to minimize downtime.” But what does this really mean? What should a DR plan look like? Where should an organization even start?

Based on extensive support service case reviews spanning a range of industries across the globe, here are the top five essential elements of effective disaster recovery planning and implementation.

Create a risk profile — know your trigger points

In the event of an IT outage, there is a point at which the company decides that enough of its IT dependencies have been lost to execute its disaster recovery processes. The outage could be the result of a wide range of causes, as noted earlier, but the trigger point should remain a consistent benchmark, one that is established long before a crisis occurs.

The point at which a company puts its DR plan into action is tuned to its own circumstances. To establish its specific trigger points, each organization should perform a risk and business impact analysis. The results of this analysis will determine the impact that each system or application going offline will have on the company as a whole, and at what stage it makes sense to initiate recovery processes.
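As a rough illustration of how a trigger point might be written down once the impact analysis is complete, the sketch below compares the projected loss from offline systems against a pre-agreed threshold. The names `SystemImpact` and `should_trigger_dr`, the hourly impact figures and the threshold are all hypothetical and not taken from any real plan; an actual analysis would also weigh reputational and regulatory factors that a single number cannot capture.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemImpact:
    """One system from the business impact analysis (hypothetical model)."""
    name: str
    hourly_impact: float  # estimated loss per hour offline, illustrative only


def should_trigger_dr(offline, expected_outage_hours, threshold):
    """Return True when the projected loss meets the agreed trigger threshold."""
    projected_loss = sum(s.hourly_impact for s in offline) * expected_outage_hours
    return projected_loss >= threshold


# Illustrative figures only; real values come from the organization's own analysis.
billing = SystemImpact("billing", 5000.0)
intranet = SystemImpact("intranet", 200.0)

if __name__ == "__main__":
    print(should_trigger_dr([billing], expected_outage_hours=4, threshold=10000))
    print(should_trigger_dr([intranet], expected_outage_hours=4, threshold=10000))
```

The point of the sketch is that the threshold is fixed in advance, so the decision in a crisis is a lookup rather than a debate.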

Documentation and authority management

What often becomes apparent in the event of an outage is that an organization relies solely on the knowledge of one person to guide it through the DR plan. Clearly this is a highly vulnerable approach: if that resident expert has resigned, is ill or on holiday, or is otherwise unreachable in a crisis, the potential for prolonging and exacerbating the impact increases considerably.

The answer to avoiding this dependence in a crisis is thorough documentation that includes:

  • What the trigger points are
  • The services covered by the DR plan and their hierarchy of importance
  • Authority management – who is responsible for what during the implementation of the DR plan, including staff and relevant vendors
  • Step-by-step instructions for the recovery process

All members of the IT team should be required to review, and be able to follow, the procedures laid out in the documentation to ensure minimal interruptions.

The size and span of the group involved with the DR plan will vary widely depending on the size of the company in question and its vertical market. For example, many Fortune 100 companies will have staff purely dedicated to business continuity and DR planning. No matter the size of the company, names, titles and contact information for all employees with disaster-declaration and/or disaster-management authority need to be documented, as well as the chain of command to be followed if a disaster is declared.
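One way to catch the single-expert problem described above is to keep the authority list in a structured form and check it automatically. The following is a minimal sketch under that assumption; the `Responsibility` record and the `single_person_dependencies` helper are hypothetical names invented for illustration, and the contact names are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Responsibility:
    """A task in the DR plan and the people authorized to carry it out."""
    task: str
    contacts: list[str] = field(default_factory=list)


def single_person_dependencies(responsibilities):
    """Return the tasks that fewer than two distinct people can perform."""
    return [r.task for r in responsibilities if len(set(r.contacts)) < 2]


# Placeholder entries; a real plan would also record titles and contact details.
plan = [
    Responsibility("declare disaster", ["Person A", "Person B"]),
    Responsibility("restore database", ["Person A"]),
    Responsibility("notify vendors", []),
]

if __name__ == "__main__":
    for task in single_person_dependencies(plan):
        print(f"Needs a backup owner: {task}")
```

Running a check like this whenever the documentation changes gives an early warning that a task has drifted back to depending on one person.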

Prioritization of applications

It is a common misconception that the person who shouts the loudest about their IT issues should take priority in the event of downtime. A fully structured DR plan will outline the correct prioritization of services based on what has the most financial or reputational impact on the business.

Documenting this step is extremely important as no organization has infinite resources, so criteria must be set to determine where to allocate them first. This should come as part of the trigger point identification process, achieved through the risk analysis. Systems and applications should be classified as critical, important and non-essential, and more often than not, the systems that hold first priority will have a domino effect of sorts on the others.
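The combination of tiers and the domino effect can be sketched as an ordering problem: recover a system only after the systems it depends on, and among those that are ready, take the highest tier first. The example below is one possible approach, not a prescribed method. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+), and the function name `recovery_order`, the system names and the dependency map are all hypothetical. It assumes every dependency also appears in the tier map.

```python
from graphlib import TopologicalSorter

# Lower rank is recovered first; the tier names follow the article's classification.
TIER_RANK = {"critical": 0, "important": 1, "non-essential": 2}


def recovery_order(tiers, depends_on):
    """Order systems so dependencies come first, then by tier, then by name."""
    sorter = TopologicalSorter({name: depends_on.get(name, ()) for name in tiers})
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    ready, order = [], []
    while sorter.is_active():
        ready.extend(sorter.get_ready())  # systems whose dependencies are restored
        ready.sort(key=lambda n: (TIER_RANK[tiers[n]], n))
        nxt = ready.pop(0)  # highest-priority system that can be recovered now
        order.append(nxt)
        sorter.done(nxt)
    return order


# Illustrative inventory only.
tiers = {
    "database": "critical",
    "storefront": "critical",
    "email": "important",
    "wiki": "non-essential",
}
depends_on = {"storefront": {"database"}, "wiki": {"database"}}

if __name__ == "__main__":
    print(recovery_order(tiers, depends_on))
```

With this inventory the database is restored first because the storefront cannot run without it, which is the domino effect expressed as data rather than left as tribal knowledge.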

Regular review and testing

Testing of the DR plan is still not a priority for all companies, even though it is the most critical part of the whole plan — what is the point in having a plan if you don’t know that it works?

Ideally, a company will test its plan once per quarter; if this isn’t viable, then twice per year should be the absolute minimum. In highly regulated industries such as healthcare or finance, where compliance is a priority, testing should be undertaken as often as monthly. Additionally, any time a major vendor or supplier is added to or altered within the infrastructure, testing of the new system should take place.
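To make these cadences easy to track, the next due date can be computed from the date of the last test. The sketch below is a small illustration using only the standard library; the cadence labels (`regulated`, `recommended`, `minimum`) and the function `next_test_due` are hypothetical names chosen here to mirror the monthly, quarterly and twice-yearly intervals mentioned above.

```python
import calendar
from datetime import date

# Months between tests for each cadence described in the text (labels are illustrative).
CADENCE_MONTHS = {"regulated": 1, "recommended": 3, "minimum": 6}


def next_test_due(last_test: date, cadence: str) -> date:
    """Return the date the next DR test is due, clamping to the month's last day."""
    months = CADENCE_MONTHS[cadence]
    month_index = last_test.month - 1 + months
    year = last_test.year + month_index // 12
    month = month_index % 12 + 1
    day = min(last_test.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


if __name__ == "__main__":
    print(next_test_due(date(2024, 11, 15), "recommended"))
```

A date-based schedule does not cover the other case in the text: a change to a major vendor or supplier should prompt a test regardless of when the next one is due.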

The key to a good DR test is that the whole infrastructure is involved. Testing small, disjointed elements does not necessarily reflect a real-life outage, in which the entire environment may be affected.

Netflix is a perfect example of a company that can’t afford any downtime and also takes DR testing very seriously. The company is constantly testing its production environment via its Simian Army in order to ensure its systems can survive common issues without any customer impact. Its Chaos Monkey tool randomly terminates production instances during business hours to force the IT team to continually learn about and address the weaknesses of the company’s systems, ultimately enabling them to build more resilient applications and infrastructure.

Always be evaluating

Perhaps most importantly of all, a DR plan should be a dynamic and evolving strategy that is open to continuous improvement and evaluation, in a similar fashion to how Netflix uses its Chaos Monkey. Because IT environments are constantly evolving, continual assessment of the DR plan marks a move away from disaster recovery alone and toward IT resilience, challenging companies not only to be prepared for downtime, but also to embrace change and data mobility.

If the unfortunate happens and a DR plan is activated, a post-mortem review should always take place. This gives the team executing the plan an opportunity to evaluate what worked, what didn’t, what can be improved, and how downtime can be prevented in the future.

In today’s data-dependent world, continuous availability of key applications and data is the only true way for companies to maintain, if not increase, their competitiveness, while keeping customers at the core of their operational focus.