One issue getting its fair share of media attention these days is the significant cost associated with IT infrastructure failure. With everything from power grids to emergency management systems dependent on data center performance, it’s no wonder there’s heightened awareness of operational and energy costs.

While it doesn’t get the attention it deserves, one of the top causes of IT failure is ineffective thermal management. Back in 2010, Wikipedia experienced significant downtime when overheating forced its servers to shut down. The organization’s European data center went offline for two hours, a failure that affected all of its global sites.

Why thermal management? Because data centers generate a tremendous amount of heat, the standard practice is to overcool in order to keep temperatures safely within normal ranges. Data center managers would rather run temperatures far below the necessary thresholds than risk the costly failures associated with overheated equipment. This is a costly and wasteful practice.

In fact, it’s limited visibility into real-time thermal conditions that leads many IT managers to set ambient temperatures excessively low, just to be on the safe side. While this may seem prudent, the fail-safe strategy is shortsighted: every additional degree of cooling drives up the cost of running the cooling plant. For this reason, real-time visibility into power usage effectiveness (PUE) is crucial.
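For readers unfamiliar with the metric, PUE is simply total facility power divided by the power delivered to IT equipment, with 1.0 as the theoretical ideal. Here is a minimal sketch; the readings are hypothetical illustrations, not figures from the article.

```python
# Minimal sketch: computing power usage effectiveness (PUE).
# The example readings below are hypothetical, for illustration only.

def pue(total_facility_kw: float, it_equipment_kw: float) -> float:
    """PUE = total facility power / IT equipment power (1.0 is the ideal)."""
    if it_equipment_kw <= 0:
        raise ValueError("IT load must be positive")
    return total_facility_kw / it_equipment_kw

# Example: 1,200 kW drawn by the whole facility, 750 kW of that by IT gear.
print(f"PUE: {pue(1200, 750):.2f}")  # -> PUE: 1.60
```

The gap between total facility power and IT power is mostly cooling and power-distribution overhead, which is exactly where overcooling shows up.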

Take the real-world example of a managed IT service provider supporting multiple customer tenants colocated in a single facility. As expected, the data center was excessively cooled, and this safety cushion was driving other problems. Because all energy-related costs are passed on to customers, cooling inefficiencies meant higher operating costs and lower customer satisfaction.

Letting cooler heads prevail, the provider deployed a full-featured data center infrastructure management (DCIM) suite that offered real-time views of power and cooling metrics and allowed the chilled water supply temperature to be raised by one degree each week. A graphical display of the infrastructure’s thermal state against established thresholds provided added assurance that the efficiency gains had no impact on production reliability. The new visibility allowed the chilled water temperature to be increased from 45 to 55 degrees over a six-month period, cutting the related power consumption from 0.53 kW/ton to 0.32 kW/ton.
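To put those kW/ton numbers in perspective, here is a back-of-the-envelope sketch. The efficiency figures come from the example above; the plant tonnage and electricity rate are hypothetical assumptions added for illustration.

```python
# Rough arithmetic behind the chilled water setpoint change described above.
# Only the kW/ton figures come from the article; tonnage and rate are assumed.

before_kw_per_ton = 0.53   # chiller plant efficiency at the original setpoint
after_kw_per_ton = 0.32    # efficiency after raising the setpoint

reduction = (before_kw_per_ton - after_kw_per_ton) / before_kw_per_ton
print(f"Cooling power reduced by {reduction:.0%} per ton of cooling")  # ~40%

# Hypothetical plant: 500 tons of cooling, $0.10 per kWh, running year-round.
tons, rate_per_kwh, hours_per_year = 500, 0.10, 8760
annual_savings = (before_kw_per_ton - after_kw_per_ton) * tons * hours_per_year * rate_per_kwh
print(f"Illustrative annual savings: ${annual_savings:,.0f}")
```

Even with modest assumptions, a roughly 40% drop in cooling power per ton compounds quickly across a facility running around the clock.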

The benefits went even further. Keeping an eye on efficiency, the company undertook a massive data center build-out, increasing floor space by 80% and load densities by 74%. While higher load densities generally push PUE ratios up, the organization was still able to raise environment temperatures, driving PUE down by 17%. More effective thermal management reduced overall data center cooling requirements, creating annual savings of more than $285,000.
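For a sense of what a 17% PUE improvement means in practice, consider the sketch below. The IT load and starting PUE are hypothetical assumptions; only the 17% improvement comes from the example above.

```python
# What a 17% PUE reduction means at a constant IT load.
# The IT load and starting PUE are hypothetical assumptions for illustration.

it_load_kw = 1000.0        # hypothetical IT equipment load
pue_before = 1.80          # hypothetical starting PUE
pue_after = pue_before * (1 - 0.17)

total_before = it_load_kw * pue_before   # total facility draw = IT load x PUE
total_after = it_load_kw * pue_after
print(f"Facility draw before: {total_before:.0f} kW, after: {total_after:.0f} kW")
print(f"Overhead power avoided: {total_before - total_after:.0f} kW")
```

Because the IT load is unchanged, every kilowatt avoided comes out of facility overhead, most of it cooling.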

One thing is clear: companies today are far too focused on the “big picture” threat of catastrophic IT infrastructure failure. In trying to prevent problems and stay ahead of the curve, many do more harm than good. The critical step toward data center efficiency is a real-time view into all core components of the infrastructure, providing a closer look at everything from assets and facilities to power and cooling.

No doubt, DCIM is the surest way for companies to keep their cool!