From severe weather to cybercrime, disasters can happen to any data center at any time, causing power outages and headaches for information technology (IT) managers. No matter the threat, uptime is an ongoing priority in mission-critical facilities where downtime can be costly. Disaster recovery procedures require a lot of thought, study, and practice, but a good plan provides a platform to effectively rebuild and/or restore the IT infrastructure after a catastrophe.
For IT and data center managers tasked with developing these strategies, here are 12 essential steps you can take to support business continuity with a strong disaster recovery plan.
Step 1: Start talking. Have a chat with the people who depend on you to keep the network running. Start by asking questions at every level of the company to find out what systems are most important to maintain during any type of outage. Ask, “What systems can you absolutely not go without?” “What are your business function priorities?” “What is the maximum downtime your department can live with before it starts hurting the business?” Answers to such questions will help you create the processes needed to recover critical pieces of your infrastructure.
Step 2: Perform a hardware risk analysis. Know exactly what you need to protect and replace. Create a detailed list of hardware, recording each item’s original cost and today’s replacement cost, including outside vendor delivery and labor, if applicable. Next, rank how critical each piece of equipment would be if it were to go down by assigning a value to each item. The number of values used can depend on infrastructure complexity. Some IT managers use a simple “1, 2, 3” approach whereby 1 is most critical and 3 is least important. Others rank from 1 to 10 or use a color-coding scheme. Pick the one that works best for your team and network.
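As a rough illustration of the inventory described above, the following Python sketch records each item's original and replacement cost and sorts the list using the “1, 2, 3” approach. The class name, field names, and sample entries are hypothetical and only meant to show the shape of such a list.

```python
from dataclasses import dataclass


@dataclass
class HardwareItem:
    name: str
    original_cost: float      # purchase price
    replacement_cost: float   # today's cost, including delivery and labor
    criticality: int          # 1 = most critical, 3 = least important


def rank_inventory(items):
    """Sort most critical first; within a rank, costliest to replace first."""
    return sorted(items, key=lambda i: (i.criticality, -i.replacement_cost))


def replacement_cost_by_rank(items):
    """Total today's replacement cost for each criticality rank."""
    totals = {}
    for item in items:
        totals[item.criticality] = totals.get(item.criticality, 0.0) + item.replacement_cost
    return totals


# Illustrative entries only; the names and costs are made up.
inventory = [
    HardwareItem("office printer", 400.0, 450.0, 3),
    HardwareItem("core switch", 8000.0, 9500.0, 1),
    HardwareItem("file server", 6000.0, 7200.0, 1),
    HardwareItem("branch router", 1200.0, 1500.0, 2),
]

for item in rank_inventory(inventory):
    print(item.criticality, item.name, item.replacement_cost)
```

A spreadsheet serves the same purpose; the point is that every item carries both a cost and a rank so the list can be sorted and totaled quickly.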
Step 3: Diagram your entire infrastructure. Document how the network is configured so that it can be replicated. Identify network switches, cables, PBXs, PDUs, and routers, and make sure backups are available. Keep spare components on hand (and, if possible, at additional offsite locations) for what matters most, such as server rooms, core networking, and large offices, so critical infrastructure isn’t affected during a crisis. It’s vital to have redundant power at the ready, as networks that employ redundant UPS backups are more likely to avoid downtime.
Step 4: Separate information into two performance cycles. As an extension of the hardware risk analysis, further divide your assets into “Must Have/Business Critical” and “Temporary Downtime” buckets. You can’t fix everything at once, and this process helps keep procedures organized and aids in determining recovery timelines. For critical applications, implement either high-availability (HA) or quick-recovery technologies, such as replication or clustering, either between multiple office locations or to an offsite provider.
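One way to picture the two buckets is a small Python sketch that splits ranked assets by a cutoff. It assumes the 1-to-3 scale from the hardware risk analysis; the function name, threshold parameter, and sample assets are hypothetical.

```python
MUST_HAVE = "Must Have/Business Critical"
TEMPORARY_DOWNTIME = "Temporary Downtime"


def split_into_buckets(assets, critical_threshold=1):
    """Split (name, criticality) pairs into the two recovery buckets.

    Assumes 1 is most critical; anything ranked at or below
    critical_threshold is treated as business critical.
    """
    buckets = {MUST_HAVE: [], TEMPORARY_DOWNTIME: []}
    for name, criticality in assets:
        key = MUST_HAVE if criticality <= critical_threshold else TEMPORARY_DOWNTIME
        buckets[key].append(name)
    return buckets


# Illustrative assets only.
assets = [
    ("core switch", 1),
    ("email server", 1),
    ("branch router", 2),
    ("office printer", 3),
]
buckets = split_into_buckets(assets)
print(buckets)
```

Where to draw the cutoff is a business decision; raising `critical_threshold` to 2 in this sketch would move the branch router into the business-critical bucket.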
Step 5: Agree on disaster recovery time parameters. Every disaster is different. Identify disaster types and assign rough response times for each step of the recovery. As a point of comparison, an unplanned power outage might require an hour’s time to notify staff and properly shut down systems. But physical damage to a network due to hurricane flood waters could take days or longer. While there’s no precise way to judge how long it will take to get systems back online, creating a general range helps improve resource and time efficiency.
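To make the idea of a general range concrete, here is a minimal Python sketch that stores a low and high estimate for each recovery step and adds them up per disaster type. All disaster types, step names, and durations are hypothetical placeholders, and the sketch assumes the steps happen one after another.

```python
from datetime import timedelta

# Hypothetical (low, high) estimates for each step, per disaster type.
RESPONSE_TIMES = {
    "power outage": {
        "notify staff": (timedelta(minutes=10), timedelta(minutes=20)),
        "shut down systems": (timedelta(minutes=20), timedelta(minutes=40)),
    },
    "flood damage": {
        "notify staff": (timedelta(minutes=30), timedelta(hours=2)),
        "assess damage": (timedelta(hours=4), timedelta(days=1)),
        "replace hardware": (timedelta(days=2), timedelta(days=7)),
    },
}


def estimate_range(disaster_type):
    """Return the (low, high) total, assuming steps run sequentially."""
    steps = RESPONSE_TIMES[disaster_type]
    low = sum((lo for lo, _ in steps.values()), timedelta())
    high = sum((hi for _, hi in steps.values()), timedelta())
    return low, high


for disaster in RESPONSE_TIMES:
    low, high = estimate_range(disaster)
    print(f"{disaster}: between {low} and {high}")
```

The numbers matter less than the habit of writing them down and agreeing on them with stakeholders before an event.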
Step 6: Know your contacts. There’s no time during an emergency to hunt down contacts. Names, landline and cell phone numbers, email addresses, and anything else you’ll need to refer to at a moment’s notice should be on a list. Record every vendor you currently work with as well as the customer service information for your network hardware. Some IT managers cross-reference this list with the inventory from the hardware risk analysis. Make sure the list is easily accessible during an emergency. Everyone on the IT and HR teams should have a copy. Keep an additional copy offsite as well.
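The cross-referencing mentioned above can be as simple as checking that every piece of hardware maps to a vendor with contact details on file. The following Python sketch shows one possible check; the vendor names, phone numbers, and addresses are invented for illustration.

```python
# Hypothetical data: hardware name -> vendor, and vendor -> contact details.
hardware_vendors = {
    "core switch": "Acme Networks",
    "file server": "Example Systems",
    "UPS": "PowerCo",
}
vendor_contacts = {
    "Acme Networks": {"phone": "555-0100", "email": "support@acme.example"},
    "PowerCo": {"phone": "555-0101", "email": "service@powerco.example"},
}


def missing_contacts(hardware_vendors, vendor_contacts):
    """Return hardware whose vendor has no entry in the contact list."""
    return sorted(
        hardware
        for hardware, vendor in hardware_vendors.items()
        if vendor not in vendor_contacts
    )


print(missing_contacts(hardware_vendors, vendor_contacts))
```

Running a check like this whenever the inventory changes helps catch gaps in the contact list before an emergency does.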
Step 7: Back up data. Rule of thumb: back up everything. Back up both the server itself and, ideally, specialist applications such as Exchange separately. Avoid relying on tape backups. Power management software allows for a graceful shutdown to local disk-based storage and includes tools to migrate live workloads into cloud-hosted backup environments. The key is to ensure that stored data is not tied to your physical places of work because you may need that data to create temporary IT networks.
Step 8: Plan to create temporary offsite networks. Prepare to work out of temporary structures should main places of business require restoration. Explore as many offsite options as possible and have hardware at the ready, even if from third-party providers. Critical to the plan is the transportation of hardware and personnel to temporary networks. Work with transportation vendors both locally and out of state; this is important because a major weather event can take out the transportation businesses close to you as well. Consult with HR to make sure mission-critical staff will be available.
Step 9: Redirect voice. Telecommunication is perhaps most essential during the first hours of a disaster. It may take hours or even days to set up temporary places of work after a serious event, so it’s a good idea to have a system in place that diverts calls to a different location with minimal notice. Consider diverting incoming calls to a third-party provider who can help explain the situation and provide information as to when regular communications will be restored.
Step 10: Virtualize your operating system. Virtualization adds a layer of agility and resiliency to your IT environment because it abstracts computing workloads from the underlying physical resources, so the environment is no longer tied to one location. In this way, “whitespace” is wherever the load runs at the time, making disaster recovery easier to implement, cheaper to run, and more reliable when the time comes. Consider integrated power management software to connect agentless systems to virtualization management platforms. This allows fully automated control at the virtual machine level.
Step 11: Add power management software. Combining virtualization technologies with power management software can help reduce the damage associated with downtime and may even prevent a disaster from happening in the first place. Event-based power management software orchestrates the move of live workloads to safer locations, whether another rack, room, or facility, without interruption to users. A dependable power management software package can also trigger a recovery platform, such as VMware’s Site Recovery Manager, to initiate a fully automated relocation of a primary data center to the backup site without the need for user involvement.
Step 12: Use a network monitoring tool. Mitigate threats before they become full-blown disasters. Because problems can happen at any time of day, it’s important to install network interface cards that connect your UPSs directly to the network for real-time monitoring. These cards allow for UPS control across the network via a standard web browser, SNMP-compliant network management system, or power management software.
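Whatever tool collects the readings, the monitoring logic usually comes down to comparing UPS status against thresholds. The Python sketch below illustrates that idea only; the reading fields, default thresholds, and alert messages are assumptions, and it does not query a real UPS or any particular vendor's software.

```python
def evaluate_ups(reading, min_runtime_minutes=15, min_charge_percent=50):
    """Return a list of alerts for one UPS status reading.

    `reading` is a hypothetical dict with the keys on_battery (bool),
    runtime_minutes (number), and charge_percent (number).
    """
    alerts = []
    if reading["on_battery"]:
        alerts.append("UPS is running on battery")
        if reading["runtime_minutes"] < min_runtime_minutes:
            alerts.append("Remaining runtime is low; begin graceful shutdown")
    if reading["charge_percent"] < min_charge_percent:
        alerts.append("Battery charge is below threshold")
    return alerts


# Illustrative readings only.
healthy = {"on_battery": False, "runtime_minutes": 60, "charge_percent": 100}
failing = {"on_battery": True, "runtime_minutes": 10, "charge_percent": 40}

print(evaluate_ups(healthy))
print(evaluate_ups(failing))
```

In practice the readings would come from the UPS network card via SNMP or the vendor's power management software, and the alerts would feed a paging or ticketing system.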
In a profession where anything can happen, it’s not a matter of if a disaster will strike, but when. That’s why a business continuity plan is so important. It isn’t just about getting everything up and running immediately. Instead, it’s about allowing your business to do enough to function and serve customers without letting on that normal systems are down for the count.
While business continuity strategies require a commitment, they prove essential when devastation hits and plans must be put into practice. Following the disaster recovery steps outlined above will help you manage recovery when disaster strikes, earning confidence from key stakeholders in your ability to protect the business’s infrastructure from unplanned events and keep operations running.