Unplanned downtime is the bane of all IT departments. With the ever-increasing dependence on real-time information and 24/7/365 transaction processing, downtime strikes at the heart of corporate profitability and core mission success. Its highly visible nature inevitably leads to unhappy customers and upper management questions: Why did this happen? How can it be prevented? Don’t we already have procedures in place? When budget time arrives, however, the redundancies, staffing, and training are hard to come by. The systems get more complex, and the goal remains the same: keep the network up.
Understanding how to define high availability, setting achievable goals, and balancing costs form the beginning of a successful plan for maximum uptime.
Measuring Availability & Costs
Plenty of information exists on high-availability measurements and standards. High availability is expressed as a simple equation:
A = MTBF / (MTBF + MTTR)
where: A = Availability (expressed as a percentage)
MTBF = Mean Time Between Failures
MTTR = Mean Time To Restore
The results of this equation are sometimes expressed as ‘five nines,’ or 99.999 percent availability, which equals about 5 minutes of downtime per year.
That metric alone, however, is not enough to set a target for high-availability planning.
Achieving the lofty goal of five nines sounds like it should be every organization’s objective, but any downtime that can be restored in less than five minutes implies a fully automated system. With an outage of any complexity, it is simply impossible for an individual to recognize, analyze, and diagnose a problem, then formulate a plan and implement it, all within five minutes. Just rebooting one server can consume most of the downtime budget for the year.
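To make the downtime budget concrete, the availability equation above can be turned into a short calculation. The following is a minimal Python sketch, not a standard tool: the function names are illustrative, and a 365-day year is assumed.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = MTBF / (MTBF + MTTR), returned as a fraction between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_minutes(avail: float) -> float:
    """Minutes of downtime per year allowed at a given availability."""
    return (1.0 - avail) * MINUTES_PER_YEAR


if __name__ == "__main__":
    levels = [("two nines", 0.99), ("three nines", 0.999),
              ("four nines", 0.9999), ("five nines", 0.99999)]
    for label, avail in levels:
        budget = annual_downtime_minutes(avail)
        print(f"{label:12s} {avail:.3%} {budget:10.1f} min/year")
```

At five nines the yearly budget works out to roughly 5.3 minutes, which is why a single manual reboot can use up most of it.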
Instead, an organization must create a plan that provides reasonable expectations based on:
• How critical the network is
• The negative impact of downtime
• The resources available to increase uptime
When downtime inevitably occurs, remind yourself (and upper management) that it was all part of the plan. A system can be reliable, but failures will eventually happen. Many organizations focus too little attention on MTTR, which measures the time from failure to restored service, including the time needed to detect and diagnose the problem. Improving this part of the availability equation is often much more cost-effective than preventing downtime in the first place.
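The availability equation itself shows why MTTR deserves attention: halving MTTR raises availability by exactly as much as doubling MTBF. The Python sketch below illustrates this with hypothetical figures (1,000 hours between failures, 4 hours to restore). These numbers are assumptions chosen for the example, not measurements.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """A = MTBF / (MTBF + MTTR), returned as a fraction between 0 and 1."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical baseline: a failure every 1,000 hours, 4 hours to restore.
baseline = availability(1000.0, 4.0)

# Option 1: restore twice as fast (for example, through automation or training).
halved_mttr = availability(1000.0, 2.0)

# Option 2: fail half as often (for example, through more reliable hardware).
doubled_mtbf = availability(2000.0, 4.0)

print(f"baseline:     {baseline:.5%}")
print(f"halved MTTR:  {halved_mttr:.5%}")
print(f"doubled MTBF: {doubled_mtbf:.5%}")
```

Both options yield the same availability, so the choice comes down to which one is cheaper to achieve.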
In addition to availability, several other measurements need to be considered:
• Affected users (number of users who will experience a loss of service): Compare an outage that lasts only one minute but affects 1,000 users with an outage that affects one user for 1,000 minutes. Which is worse for your organization?
• Potential affected users (if not all users access the system at all times): If a 10,000-subscriber cable TV system fails but televisions are on in only 10% of subscriber homes, then the number of potential affected users is 10,000 but the number actually affected is only 1,000.
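One way to put these measurements side by side is to record affected users, potential affected users, and user-minutes of lost service for each outage. The sketch below is an illustration only. The Outage class and its field names are hypothetical, and the 30-minute duration of the cable outage is an assumed figure.

```python
from dataclasses import dataclass


@dataclass
class Outage:
    duration_minutes: float
    affected_users: int    # users who actually lost service
    potential_users: int   # users who could have been using the service

    @property
    def user_minutes(self) -> float:
        """Total minutes of service lost, summed over affected users."""
        return self.affected_users * self.duration_minutes


short_and_wide = Outage(duration_minutes=1, affected_users=1000, potential_users=1000)
long_and_narrow = Outage(duration_minutes=1000, affected_users=1, potential_users=1)
cable_failure = Outage(duration_minutes=30, affected_users=1000, potential_users=10000)

for name, outage in [("short and wide", short_and_wide),
                     ("long and narrow", long_and_narrow),
                     ("cable failure", cable_failure)]:
    print(f"{name:16s} {outage.user_minutes:8.0f} user-minutes")
```

The first two outages produce the same 1,000 user-minutes, which shows that no single number settles the question. Each organization still has to judge which pattern hurts it more.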
Calculating Costs
A standard calculation of loss can be summarized as:
L = P x T x (Cr + Cp)
In this equation, L is the expected loss, P is the probability (expressed as a percentage) that a disaster will occur, T is the duration of downtime, and Cr and Cp are the costs of lost revenue and lost productivity per unit of unavailable time. This measurement needs to be applied to each failure point in the system. When all of the possible downtime costs for various scenarios are estimated, the cost of lessening the risks can be compared to the probability and costs associated with those risks.
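The comparison described above can be sketched in a few lines of Python. The example takes the cost per hour as lost revenue plus lost productivity, as defined above. The FailurePoint class, the dollar figures, and the simplifying assumption that a mitigation removes the risk entirely are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailurePoint:
    name: str
    probability: float                 # P: chance of the failure in a year (0 to 1)
    downtime_hours: float              # T: expected duration of the outage
    lost_revenue_per_hour: float
    lost_productivity_per_hour: float
    mitigation_cost: float             # yearly cost of the redundancy being considered

    def expected_loss(self) -> float:
        """L = P x T x C, where C is lost revenue plus lost productivity per hour."""
        cost_per_hour = self.lost_revenue_per_hour + self.lost_productivity_per_hour
        return self.probability * self.downtime_hours * cost_per_hour

    def worth_mitigating(self) -> bool:
        """Simplification: assumes the mitigation eliminates the risk entirely."""
        return self.mitigation_cost < self.expected_loss()


failure_points = [
    FailurePoint("core switch", 0.25, 8.0, 5000.0, 1000.0, 4000.0),
    FailurePoint("print server", 0.5, 2.0, 0.0, 200.0, 1500.0),
]

for fp in failure_points:
    verdict = "mitigate" if fp.worth_mitigating() else "accept the risk"
    print(f"{fp.name:14s} expected loss ${fp.expected_loss():,.0f} -> {verdict}")
```

With these assumed figures, the core switch carries an expected loss of $12,000 against a $4,000 mitigation, while the print server's $200 expected loss does not justify a $1,500 fix.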
A Structured Approach
Many resources are available to help develop a cohesive and comprehensive uptime management plan. The National Institute of Standards and Technology (NIST) has produced a Contingency Planning Guide for Information Technology Systems, which is an invaluable document for any IT organization. It outlines a seven-step approach:
1. Develop the contingency planning policy statement. A formal department or agency policy provides the authority and guidance necessary to develop an effective contingency plan.
2. Conduct the business impact analysis (BIA). The BIA helps to identify and prioritize critical IT systems and components. A template for developing the BIA is also provided to assist the user.
3. Identify preventive controls. Measures taken to reduce the effects of system disruptions can increase system availability and reduce contingency life-cycle costs.
4. Develop recovery strategies. Thorough recovery strategies ensure that systems are recovered quickly and effectively following a disruption.
5. Develop an IT contingency plan. The contingency plan should contain detailed guidelines and procedures for restoring a damaged system.
6. Plan testing and training exercises. Testing the plan identifies gaps whereas training prepares recovery personnel for plan activation; both activities improve the effectiveness of the plan and the overall preparedness of the organization.
7. Plan maintenance. The plan should be a living document that is updated regularly to remain current with system enhancements.
Minimizing Downtime
At each key failure point in an operation (equipment, connectivity, processes, and staffing), built-in redundancies and automated procedures can shave downtime to a minimum. Fault tolerance and redundancies can be built into most systems and processes. Standard techniques such as RAID arrays, high-availability clustering, hot sites, and protection switching can be employed wherever possible to provide alternate resources that can be brought to bear when necessary. Battery backup, standby generators, and diversity routing from multiple telecom providers can also be used to minimize single points of failure.
In considering redundancy, adequate staffing and cross-training are often overlooked. In the event of a regional outage due to a hurricane, local staff may not be able to access the necessary facilities, or they may be consumed with personal issues. In these cases, remote access from outside the affected area can make all the difference. If these redundancies can be achieved without human intervention, critical time can be saved. System and environmental monitoring solutions and services can alert personnel to potential problems before downtime occurs and automatically trigger redundant failover switching when pre-determined conditions are met.
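The idea of triggering failover when pre-determined conditions are met can be illustrated with a small Python sketch. The condition used here, a fixed number of consecutive failed health checks, is an assumption made for the example. The function name and callback are hypothetical and do not represent any particular monitoring product.

```python
from typing import Callable, Iterable


def monitor(checks: Iterable[bool],
            failure_threshold: int,
            on_failover: Callable[[], None]) -> bool:
    """Consume health-check results and fail over after N consecutive failures.

    Returns True if failover was triggered, False if the checks ran out first.
    """
    consecutive_failures = 0
    for healthy in checks:
        if healthy:
            consecutive_failures = 0  # a good check resets the count
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            on_failover()
            return True
    return False


# Simulated results: one isolated failure, then three failures in a row.
results = [True, False, True, False, False, False]
triggered = monitor(results, failure_threshold=3,
                    on_failover=lambda: print("switching to standby"))
print("failover triggered:", triggered)
```

Requiring several consecutive failures keeps a single missed check from causing an unnecessary switchover, at the cost of a slightly slower reaction to a real outage.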
Unfortunately, most companies only recognize the lack of preparation when something goes awry. Today, communications networks are the backbones of business operations. Reliability, availability, and scalability have a direct effect on customer satisfaction, employee productivity and revenue generation.
Reliable Power at the United States Senate
The United States Senate on Capitol Hill relies on a Senate Hearing Room Audio Network from Boulder, CO-based K2 Audio to centrally monitor the audio network devices in all of the remote hearing rooms. Part of the system, however, was known to have operational issues that typically required a power reboot, which meant maintenance personnel had to walk to a given Hearing Room and manually cycle power. Looking to maximize the uptime of the Senate’s system, K2 Audio turned to Dataprobe’s iBootBar technology for a remote power management solution.
Rodrigo Ordonez, a design consultant at K2 Audio, provides an overview of the Senate’s system and explains the equipment challenge the company was looking to overcome. “Each of the audio systems installed in the Hearing Rooms has its own microphones that connect to a ‘break-out box’. This box houses eight additional microphones and the connectivity to the rest of the K2 system. These boxes require routine maintenance for firmware upgrades, which typically results in a power reboot.
“Prior to the implementation of Dataprobe’s iBootBar,” Ordonez elaborates, “anytime one of the break-out boxes required an upgrade or experienced an occasional equipment lock up, maintenance was called to the room in question to open up the rack and investigate. With maintenance staff potentially located in a different building, and with six or seven break-out boxes per Hearing Room, the Senate decided this situation was no longer workable.”
To eliminate the need for in-room technical support, K2 Audio integrated the iBootBar into the Senate’s audio network. The iBootBar is a multi-outlet, multi-user remote reboot power strip designed to remotely monitor, manage and control corporate and personal computing devices and other electronic equipment. The switch allows remote equipment to be rebooted or power controlled over an IP network using an Internet browser or network management system.