Monitoring Averts Catastrophic Computer Room Meltdown
For much of its 23-year existence, Salient Corporation, a software development firm headquartered in Horseheads, NY, relied on a third-party monitoring service to protect its computer test lab from changes in temperature that could lead to costly damage and downtime. But that system couldn’t prevent a situation where the air conditioning system failed and the interior temperature rose to about 130 F, causing more than $25,000 in equipment damage. Afterwards, Salient realized it needed a more dependable in-house solution to avoid similar losses in the future.
As part of its service, the monitoring provider would contact the IT staff when conditions warranted, and Salient’s team would take action. But the service’s capabilities were limited to monitoring temperatures, not other environmental conditions, and it would only contact Salient’s staff after an alarm had sounded. The service alerted them to a high-temperature alarm, but it was unable to pinpoint the current temperature. Although it was a simple arrangement, in some instances Salient felt the information available from its provider was not enough.
In early 2007, the failure of the air conditioning system exposed the flaws of the monitoring service and left Salient with a significant repair bill.
Founded in 1986, Salient Corporation develops business management software for its more than 300 corporate and government clients, including multiple Fortune 500 companies. Salient has more than 35,000 users in 53 countries. The company’s headquarters houses a computer-testing lab where developers load-test new software and its functionality. The company’s mainline computer system is located near the testing lab but in a different room that operates separately. Both of these rooms require monitoring.
“Our backend engineers use the lab all the time to test our software applications being developed,” said Rodney Hall, infrastructure manager for Salient. “It’s a valuable piece of real estate for what we do.”
The lab holds more than 250 pieces of equipment, including more than 50 servers, which Hall refers to as the “backbone to the entire testing system.” It is no wonder that Hall reflects on what happened in early 2007 as “catastrophic.”
Late one Saturday, the testing lab’s dedicated air conditioner failed, and the second air conditioning unit failed as well. But Hall and his colleagues did not receive a call from the monitoring service until the following day. By the time they arrived to respond, it was too late.
“We came in on a Sunday and the room temperature was well over 130 degrees,” Hall said. “The heat damaged many machines, melted some of the machine casings, fan shrouds, and even unsoldered chips on memory boards. It was boiling hot.”
The heat even melted the room’s thermometer, leaving it permanently displaying a temperature well above its 120 F limit.
Hall estimates the company lost more than $25,000 from that single incident. “That’s the bare minimum. To this day, we’re still discovering components that were damaged in some way. It was absolutely catastrophic.”
The company learned from that experience and was able to avoid a similar outcome during recent trouble with its air conditioning units and power supply. Now Salient relies on a more effective remote monitoring system as its first line of defense. Hall and three members of the IT staff supplemented the outside service with the IMS-1000 infrastructure monitoring system from Sensaphone.
Hall’s team first spotted the IMS-1000 at the 2007 Interop tradeshow in Las Vegas. “We wanted more effective coverage, and even with its simplicity, this unit gives us that,” he said. “A lot of the other options are for major infrastructures. We run a lean organization and $100,000 is a big investment to us. This unit is in our price range and does exactly what we wanted it to.”
Because the IMS-1000 combines environmental monitoring, physical security, network monitoring, and data logging into a single system, Hall can monitor much more than temperature change. The stand-alone unit features an internal battery backup system and a variety of alarm delivery options that work independently from a computer network. The IMS-1000 has an internal Web server and is also fully SNMP-manageable. With the IMS, Hall doesn’t have to pay a monthly fee.
The unit uses eight external sensors to monitor the primary environmental culprits behind server malfunctions, including temperature, power, humidity, smoke, fire, and water on the floor. It can also monitor up to 16 network devices or TCP/IP port services, sending ping requests and verifying that services respond. When the unit finds an unresponsive network device or port, or a sensor detects conditions outside its set range, it starts an alarm notification process. IT departments can also select built-in phone modem and voice communication options.
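The two kinds of checks described here, sensor readings against set ranges and network services that stop responding, can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Sensaphone firmware or its API: the sensor names, the limit values, and the functions `out_of_range` and `port_responds` are all hypothetical.

```python
import socket

# Hypothetical acceptable ranges (low, high) per sensor; the values are illustrative only.
LIMITS = {
    "temperature_f": (60.0, 85.0),
    "humidity_pct": (30.0, 60.0),
}


def out_of_range(readings, limits=LIMITS):
    """Return the names of sensors whose reading falls outside its set range."""
    alarms = []
    for name, value in readings.items():
        if name not in limits:
            continue  # no range configured for this sensor, so nothing to check
        low, high = limits[name]
        if value < low or value > high:
            alarms.append(name)
    return alarms


def port_responds(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    # A reading like the 130 F reported in the lab would trip the temperature alarm.
    print(out_of_range({"temperature_f": 130.0, "humidity_pct": 45.0}))
```

In a sketch like this, any non-empty result from `out_of_range`, or a `False` from `port_responds`, would be the trigger for the alarm notification process.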
At Salient, the unit took Hall less than 45 minutes to install, and it began paying dividends almost immediately. Hall installed the IMS in June, and it issued its first alert over the July 4th weekend.
“It paid for itself right then and there,” Hall said. “On July 5th we lost a compressor for an air conditioner in our main server room. The temperature had gone up considerably, but I received a notification in time. I came in and shut down a bunch of servers that we did not need running, and that helped the room return to the proper temperature. We had the problem taken care of before the monitoring service even had to call us. The Sensaphone alerted us well in advance of it becoming a major problem.”
In late July, a second incident occurred, this time a power outage. “We kicked a circuit, and it shut down our entire infrastructure. The switching went down and so did our Internet connection. It was 10:30 on a Monday night when I received an alarm notification. I ran back to the office, reset the circuit, diagnosed the problem, and fixed it to bring everything back up. We had about 20 minutes of downtime versus coming in the next morning to discover it had been out all night.”
Recent experiences aside, Hall said the Sensaphone remote monitoring system outperforms the existing outside monitoring service. The IMS-1000 calls the contact numbers when an alarm condition occurs, and it keeps calling until someone responds. The outside service works differently, Hall said: “The service calls everyone on the list, which means several of us may respond to the same alarm - not exactly efficient.”
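The call-until-answered behavior can be pictured as a simple loop over the contact list. The following Python sketch is an assumption-laden illustration, not the IMS-1000's actual dial-out logic: the function `notify_until_acknowledged`, the `place_call` callback, and the `max_attempts` cap are all hypothetical names introduced for the example.

```python
from itertools import cycle, islice


def notify_until_acknowledged(contacts, place_call, max_attempts=10):
    """Call contacts in turn, repeating the list, until one acknowledges.

    place_call(contact) is a hypothetical callback that returns True when
    the person answers and acknowledges the alarm.
    Returns the acknowledging contact, or None if max_attempts calls
    were placed without an acknowledgment.
    """
    for contact in islice(cycle(contacts), max_attempts):
        if place_call(contact):
            return contact  # stop as soon as one person responds
    return None


if __name__ == "__main__":
    # Simulated run: nobody answers until the fifth call.
    attempts = []

    def fake_call(contact):
        attempts.append(contact)
        return len(attempts) == 5

    print(notify_until_acknowledged(["first", "second", "third"], fake_call))
```

Because the loop stops at the first acknowledgment, only one person ends up responding to a given alarm, in contrast to a service that calls everyone on the list.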
Another advantage, Hall said, is the ability to log in to the IMS-1000 unit and proactively monitor the conditions. If a situation is serious, the system allows Hall to log in and remotely shut down systems if necessary. “I have more control over everything the Sensaphone monitors,” Hall added.
He added that the Sensaphone IMS-1000 has helped the company recover more quickly from the previous disaster, allowing senior executives to concentrate on what’s most important: growing the business. Hall said he fully expects to install the IMS system in a new computer room currently in the planning stages, which would entrust a total of three rooms to Sensaphone’s care.