From the Data Center Trenches: Understanding Mean Time between Failure in the Data Center - Part 1

Lately, I’ve been seeing data center RFPs come through with requests for the MTBF (Mean Time Between Failure) of a product. Nothing much else is specified other than “what is the MTBF?” This is very problematic because MTBF calculations run the gamut from statistically impractical to just plain black magic. Since so much emphasis is placed on MTBF at times, it is critical to understand the true meaning of this value.

MTBF is typically expressed in hours and is defined as the average number of hours the product will operate in service before experiencing a “failure.” I place “failure” in quotes because the definition of what constitutes an actual failure is critical. As an example, in the data center UPS world, some manufacturers may define any state other than the output inverter being on line and supplying the load as a failure. Other manufacturers may accept going to bypass as satisfactory operation and not a true failure. I would hope that all manufacturers consider a UPS-induced load drop a significant failure. Regardless, right at the outset, there is the possibility for an “apples to oranges” scenario.
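To make the “apples to oranges” problem concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the fleet hours and event counts are made up, the names (`FLEET_OPERATING_HOURS`, `EVENT_COUNTS`, `estimate_mtbf`) are hypothetical, and the simple "operating hours divided by failures" estimate is only one of many ways to arrive at an MTBF figure.

```python
# Hypothetical event log for one UPS fleet. Every number here is made up
# for illustration and does not describe any real product.
FLEET_OPERATING_HOURS = 2_000_000  # total unit-hours in service (assumed)

EVENT_COUNTS = {
    "transfer_to_bypass": 12,  # load stayed powered, but not from the inverter
    "load_drop": 2,            # UPS-induced loss of power to the load
}


def estimate_mtbf(operating_hours, failure_count):
    """Point estimate: total operating hours divided by counted failures."""
    if failure_count <= 0:
        raise ValueError("no failures counted; MTBF cannot be estimated this way")
    return operating_hours / failure_count


# Definition A: anything other than "inverter on line" counts as a failure.
strict_failures = EVENT_COUNTS["transfer_to_bypass"] + EVENT_COUNTS["load_drop"]

# Definition B: only a dropped load counts as a failure.
lenient_failures = EVENT_COUNTS["load_drop"]

mtbf_strict = estimate_mtbf(FLEET_OPERATING_HOURS, strict_failures)
mtbf_lenient = estimate_mtbf(FLEET_OPERATING_HOURS, lenient_failures)

print(f"Strict definition:  {mtbf_strict:,.0f} hours")
print(f"Lenient definition: {mtbf_lenient:,.0f} hours")
```

With these made-up counts, the same fleet and the same field history produce MTBF figures that differ by a factor of seven, purely because of what was counted as a failure.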

There are numerous ways to predict MTBF, including procedures and calculations based on military standards. Linking the measured field failure rate to MTBF is another popular methodology. There are more methods than I can list here, but the key point is that each of them has its own pitfalls to avoid. Therefore, it is easy to see how, depending on the definition of failure, the method chosen, the assumptions made, and the extent to which pitfalls are avoided, very different MTBF values can be calculated for the exact same product. The only completely accurate way to calculate MTBF for a product or system is to wait until each and every unit ever placed into operation has failed and then do the calculations. This is obviously impractical, so we are left with estimating MTBF to the best of our ability. This can lead to numbers that are clearly nonsensical but that may have some value in a relative sense. That is, the reliability of two products or systems can be compared IF calculated in EXACTLY the same way and IF ALL the same assumptions are made.
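As a rough illustration of how a field-based estimate can look nonsensical in absolute terms yet still be useful for comparison, consider the sketch below. It is not any manufacturer's method: the function name `field_mtbf`, the two products, and all of the unit and failure counts are hypothetical, and it assumes every unit ran for the full observation window with a constant failure rate.

```python
HOURS_PER_YEAR = 8760


def field_mtbf(units_in_service, years_observed, failures):
    """Estimate MTBF as accumulated unit-hours divided by observed failures.

    Assumes (for illustration only) that every unit ran for the whole
    observation window and that the failure rate is constant over time.
    """
    if failures <= 0:
        raise ValueError("at least one observed failure is required")
    unit_hours = units_in_service * years_observed * HOURS_PER_YEAR
    return unit_hours / failures


# Two hypothetical products, measured the same way over the same window.
product_a = field_mtbf(units_in_service=10_000, years_observed=1, failures=20)
product_b = field_mtbf(units_in_service=10_000, years_observed=1, failures=40)

for name, mtbf in (("Product A", product_a), ("Product B", product_b)):
    print(f"{name}: {mtbf:,.0f} hours ({mtbf / HOURS_PER_YEAR:,.0f} years)")

print(f"A relative to B: {product_a / product_b:.1f}x")
```

Under these assumptions Product A comes out at 4,380,000 hours, or 500 years, which nobody should read as a predicted service life. The factor of two between the products, however, is a fair comparison, but only because both numbers were produced with the same failure definition, the same method, and the same assumptions.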

I’m going to give an example of this in next week’s blog. In order to do so, I will also explain a common method for calculating MTBF using the Annual Failure Rate, or AFR. Hopefully, by the end of next week’s blog you will have a good grasp of MTBF and be able to avoid its pitfalls in your data center. If you would like to look into this topic a little deeper, please check out white paper 78, “Mean Time Between Failure: Explanations and Standards”.

Domenic Alcaro is the Vice President of Mission Critical Services and Software. Please check out Domenic's blog, "From the Data Center Trenches", at http://blog.schneider-electric.com/datacenter/author/domenicalcaro/
