Zinc whiskers have long been a known problem for electronics. Unfortunately, despite significant effort, zinc whiskers haven’t been eliminated in data centers. Many in the industry erroneously thought the only serious source of zinc whiskers was certain types of access floor panels. Facilities without these specific panel types were assumed safe from whisker contamination.
Zinc whiskers have been found on a variety of metal components within all types of facilities, including steel building studs, electrical conduits, suspended-ceiling T-bar grid and hanger wires, and, of course, access floor panels, pedestals, pedestal heads, and stringers. This may be surprising, but it’s not really news.
Zinc Whisker Susceptibility
The real news is that zinc whiskers are discovered every day on cabinets, racks, and the servers and computers themselves. That’s right! Zinc whiskers may be growing on and in computer hardware.
Zinc whiskers aren’t noticeable, because they are thinner than a human hair and roughly 0.5-5.0 millimeters long. Seeing a single whisker is like looking for the proverbial needle in a haystack, so they are usually found when many are growing together in a group.
Zinc whisker contamination should be considered whenever failure rates are abnormally high, whether the failures are catastrophic or less sudden soft failures. The failure rate may peak within 72 hours of invasive maintenance work in or around the equipment. Many factors determine the probability of failures due to zinc whisker contamination. These include but are not limited to:
- Age of the source material and therefore the general length of the whiskers.
- Susceptibility to mechanical actions such as scraping, scuffing, and vibration that can cause whiskers to release from the host surface and migrate freely.
- Susceptibility of equipment to whisker failures.
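One way to check for the 72-hour pattern described above is to compare failure timestamps against maintenance timestamps. The following is only a minimal sketch: the function name, the data layout, and the sample timestamps are hypothetical, and the window is simply the 72-hour figure from the text, not a validated threshold.

```python
from datetime import datetime, timedelta

# Assumed window, taken from the 72-hour figure in the text.
WINDOW = timedelta(hours=72)

def failures_after_maintenance(failures, maintenance_events, window=WINDOW):
    """Return failures that occurred within `window` after any maintenance event."""
    flagged = []
    for failure in failures:
        for event in maintenance_events:
            # Count only failures at or after the maintenance work, up to the window.
            if timedelta(0) <= failure - event <= window:
                flagged.append(failure)
                break
    return flagged

# Hypothetical sample data for illustration only.
maintenance = [datetime(2024, 3, 1, 9, 0)]
failures = [
    datetime(2024, 2, 27, 14, 0),  # before the work: not flagged
    datetime(2024, 3, 2, 3, 30),   # 18.5 hours after: flagged
    datetime(2024, 3, 3, 22, 0),   # 61 hours after: flagged
    datetime(2024, 3, 6, 8, 0),    # 119 hours after: not flagged
]

flagged = failures_after_maintenance(failures, maintenance)
print(f"{len(flagged)} of {len(failures)} failures fall within 72 hours of maintenance")
```

A cluster of flagged failures does not prove zinc whiskers are the cause, but it is a reasonable prompt to inspect the affected equipment.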
Many users wrongly conclude that only power supplies are susceptible to whisker-related failures. This is likely because a dramatically loud ‘pop’ and a system outage accompany a power supply failure.
Unfortunately, power supplies are not the only exposed electronics in a computer system. There are myriad integrated circuits (chips), leads, circuit traces, and other components that may be wholly or partially protected by plastic or solder mask. But not everything is protected, and the uncovered leads are just as susceptible as the power supply.
Zinc whisker bridges and shorts of exposed circuitry have the potential to wreak havoc on a system. What happens if leads on the memory bus are intermittently shorted during the critical setup and latch portion of the clock cycle? Perhaps data will be corrupted. Perhaps the corruption will be detected and corrected by error correction algorithms. Perhaps the affected data is really an instruction for the processor. What if the processor tries to load and execute this corrupted instruction? Will the system fail over or hang?
Any engineer will agree that finding and fixing intermittent failures is one of the hardest things to do. “If you can’t see it, you can’t fix it.” Many system anomalies are not logged or tracked. If a reset clears a situation, the problem is quickly dismissed as annoying but non-critical. Often, these on-the-floor fixes don’t get the visibility of management. Ask an IT manager if equipment needs to be reset and they’ll say, “…no, why do you ask?” Ask an operator if equipment needs to be reset and they’ll answer, “… of course, all the time, why do you ask?”
So, if zinc whiskers are everywhere and affecting equipment, why are they not common knowledge? Most users get their information from personal experience or from trusted sources. If personal experiences are not memorable, it’s human nature to discount and discard them. If resetting a stuck machine is no more memorable than filling a coffee cup, it isn’t remembered. A power supply popping is unusual and memorable.
In the IT world, trusted resources include associates and vendors. Neither one is talking because neither one has an incentive to talk. Users don’t admit they have zinc whisker problems for fear of criticism and repercussions from vendors. Users are supposed to honor their equipment contracts by maintaining suitable computing environments. Zinc whisker contamination does not contribute to a suitable environment. Likewise, vendors aren’t talking for fear of liability. Vendors are supposed to honor explicit and implied warranties that the equipment they produce and sell is free from defects. If the equipment itself is vulnerable to whiskers, producing them, or both, there is a legitimate fear of legal liability. The result of all this silence is customer ignorance about a very serious topic.
What to Do
Evidence suggests that zinc whiskers may affect one or more components in 50 percent or more of the racks and cabinets in any given environment. Historically, manufacturers only tested equipment when someone suspected problems. Users only tested when the manufacturers weren’t providing answers. Recently, large users have been willing to sponsor broader, facility-wide testing. Unfortunately, for the reasons indicated above, the specific results of these tests remain confidential.
Until they reach a certain length, zinc whiskers tend to remain connected until they are liberated by mechanical means such as rubbing and scraping. After they reach a certain length, vibration or airflow can free them from a host. Once dislodged, zinc whiskers are free to migrate within the environment. Zinc whisker failures need not be catastrophic. Bit errors, soft failures, and other anomalies may be attributed to zinc whiskers.
Generally, the accepted cure for zinc whiskers is to remove the root source material and replace it with an uncontaminated version. It is not reasonable to replace every contaminated piece of equipment, from either a logistical or a financial perspective. That doesn’t mean the problem should be ignored.
Zinc whiskers will continue to grow. As they become longer, they become potentially more harmful. Users can’t stop using their equipment, nor can they stop meeting the needs of the business through hardware migrations, moves, and rearrangements. Users who want to address the issue proactively should develop a plan for managing it through staff training, vendor management, and equipment and facility handling procedures.
Addressing Contamination
Broad zinc whisker contamination takes time to eradicate. One approach to dealing with the problem requires a whole set of new procedures. Users should require:
- All persons who enter the site will be informed of the presence of zinc whiskers and be required to sign a nondisclosure agreement. Violators of the NDA may jeopardize their employment or vendor status.
- All staff and visitors who have any business touching any equipment in the room must be trained and tested on zinc whisker awareness.
- All staff and visitors who have any business working on any equipment in the room must be trained and tested on zinc whisker management.
- Upon passing the zinc whisker management training, all staff and visitors will be required to sign the zinc whisker conduct pledge. This pledge will compel staff and visitors to treat zinc whiskers seriously and to take no action that would aggravate the problem. Their actions will reflect the best interests of the user and reliable computing.
- All cabinets will be examined for zinc whiskers. The results of the examination will be posted on the front and rear door of the cabinet.
- Identified zinc whiskers in or on cabinets will be so indicated with colored adhesive markers. The markers will serve to alert staff and visitors where the contamination is most significant.
- Staff and visitors will be expected, by virtue of their training and agreement with the pledge, to work around the contaminated areas to the best of their ability.
- Require, by way of purchase agreement, all new equipment to be free of zinc whiskers for a period of 36 months.
- Work with all vendors to help understand the problem and develop solutions for future designs.
- Seek to replace (either by purchase or through vendor agreement) any equipment that is expected to be on site longer than 18 months.
- Seek to monitor and manage any equipment that is expected to be retired or replaced in less than 18 months.
- Establish a monitoring program for failures.
- Establish test sites with regular sampling to monitor conditions in the room(s).
- Establish a regular cleaning program for the facility.
- Establish a cleaning program for inside racks.
- Continue with the investigative process to locate and eliminate any additional root sources. All cabinets in the data center should be inspected and tested, as needed, to determine where additional sources exist.
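The 18-month rule in the list above lends itself to a simple triage over an equipment inventory. The sketch below is illustrative only: the inventory format and equipment names are hypothetical, and because the text does not say how to treat equipment expected to stay exactly 18 months, placing it in the monitor group is an assumption.

```python
# Threshold from the text: equipment staying longer than 18 months is a
# candidate for replacement; equipment leaving sooner is monitored and managed.
REPLACE_THRESHOLD_MONTHS = 18

def triage(months_remaining_on_site):
    """Classify one piece of contaminated equipment by expected time on site."""
    if months_remaining_on_site > REPLACE_THRESHOLD_MONTHS:
        return "replace"
    # Assumption: exactly 18 months falls into the monitor group.
    return "monitor"

# Hypothetical inventory: equipment name -> expected months remaining on site.
inventory = {
    "rack-A01": 36,
    "rack-A02": 6,
    "server-17": 18,
}

plan = {name: triage(months) for name, months in inventory.items()}
for name, action in plan.items():
    print(f"{name}: {action}")
```

Keeping the threshold in one named constant makes it easy to adjust if a site settles on a different replacement horizon.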
Planning should begin immediately to undertake a thorough investigation, tracking, and remediation program. The program should include:
- Identification of sources
- Management of the sources
- Removal of the root sources, as possible
- Cleaning of the data center to remediate and mitigate the potential impact