Troubleshooting is a skill gained only through experience and operational pain. As such, every IT professional needs a troubleshooting protocol specific to their environment, applications, and dependent services. A basic troubleshooting workflow should look like the following:
1. Define the problem
2. Gather and analyze relevant information
3. Construct a hypothesis on the probable cause for the failure or incident
4. Devise a plan to resolve the problem based on that hypothesis
5. Implement the plan
6. Observe the results of the implementation
7. Repeat steps 2-6
8. Document the solution
Steps 1 and 2 usually lead to a world of pain. First, you have to define the troubleshooting radius: the surface area of systems in the stack that you have to analyze to find the cause of the issue. Then, you must narrow that scope as quickly as possible to remediate the issue. Unfortunately, remediating in haste may not uncover the actual root cause of the issue. And if it doesn’t, you’re going to wind up back at square one.
You need to get to a single point of truth with respect to the root cause as quickly as possible. To do so, it’s helpful to combine a troubleshooting workflow with insights gleaned from tools that let you focus at a granular level. For example, start with the construct that touches everything: the network, since it connects all the distributed systems. In other words, blame the network. Next, factor in the application stack metrics to further shrink the troubleshooting area. These cover infrastructure services, storage, virtualization, cloud service providers, web, and so on. Finally, combine time-series data with subject matter expertise to reduce the troubleshooting radius to zero and identify the root cause of the issue.
If you think of the troubleshooting area as a circle, as the troubleshooting radius approaches zero, you get closer to the root cause of the issue. If the radius is exactly zero, you’ll be left with a single point, and that point should be the single point of truth about the root cause of the incident.
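As a rough illustration of shrinking the troubleshooting radius, here is a minimal Python sketch. Everything in it is hypothetical: the component names, the `on_incident_path`, `latency_ms`, and `changed_before_incident` fields, and the 100 ms threshold are made-up stand-ins for whatever your network monitoring, application stack metrics, and time-series data actually provide. The radius is modeled simply as the number of remaining suspects minus one, so a single suspect corresponds to a radius of zero.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """A hypothetical system in the stack, with made-up evidence fields."""
    name: str
    on_incident_path: bool         # assumed to come from network monitoring
    latency_ms: float              # assumed application stack metric
    changed_before_incident: bool  # assumed to come from time-series data


def narrow(candidates, checks):
    """Apply each check in order and record the radius after every pass.

    Radius here is (number of suspects - 1), so one suspect means radius 0.
    """
    radius_history = [len(candidates) - 1]
    for check in checks:
        remaining = [c for c in candidates if check(c)]
        # If a check rules out everything, it was not informative; keep the
        # previous suspects rather than ending up with an empty set.
        if remaining:
            candidates = remaining
        radius_history.append(len(candidates) - 1)
        if len(candidates) == 1:
            break
    return candidates, radius_history


# Illustrative inventory only; none of these values come from a real incident.
COMPONENTS = [
    Component("core-switch", True, 2.0, False),
    Component("san-array", True, 480.0, True),
    Component("hypervisor", True, 450.0, False),
    Component("web-frontend", False, 30.0, False),
    Component("cloud-dns", False, 15.0, False),
]

# Ordered as in the text: network first, then stack metrics, then time series.
CHECKS = [
    lambda c: c.on_incident_path,
    lambda c: c.latency_ms > 100.0,  # arbitrary threshold for this sketch
    lambda c: c.changed_before_incident,
]

suspects, history = narrow(COMPONENTS, CHECKS)
print([c.name for c in suspects], history)
```

In practice the checks are rarely this clean, and subject matter expertise decides which evidence to trust; the sketch only shows the shape of the narrowing, not a real diagnostic tool.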
Troubleshooting: forcing function on the IT career path
IT troubleshooting efficiency and effectiveness are critical to uncovering the root cause of incidents and negative events in any data center environment. Troubleshooting efficacy is the key performance indicator for fixing issues fast. However, troubleshooting is a tightrope that we dare not walk too often for fear of being blamed for either incompetence or incorrectness.
As IT professionals, we need to be right a lot more often than wrong, especially when it comes to the money-making applications and their corresponding infrastructure services. The IT profession gives no quarter when things go terribly wrong (i.e., the IT blame game). When I joined the IT career path many years ago, one of my first mentors gave me some sage advice from his own IT journey. In many ways, it’s similar to the CEO three envelopes story that IT pros may have heard before. Here it is:
- When you run into your first major problem (i.e., if you can’t solve it, you’ll be fired or replaced), open the first envelope. The first envelope’s message is easy: blame your predecessor.
- When you run into the second major problem, open the second envelope. Its message is to reorganize (i.e., change something, whether it’s your role or your team).
- When you run into the third major problem, open the third envelope. Its message is to prepare three envelopes for your successor, because you’re changing companies, either willingly or unwillingly.
A lifetime of troubleshooting comes with its ups and downs. Looking back, it has spurred changes in my career trajectory. For instance, troubleshooting the lack of performance boost from a technology invented by a multi-billion-dollar global software vendor almost cost me my job, but it also redefined me as a professional. I learned to stand up for myself professionally. As Marvel’s Agent Carter states, “Compromise where you can. Where you can’t, don’t. Even if everyone is telling you that something wrong is something right. Even if the whole world is telling you to move, it is your duty to plant yourself like a tree, look them in the eye, and say, ‘No, you move’” (Captain America: Civil War, 2016). And I was right.
It’s interesting to look back and examine events and their associated time-series data to see how close I got to the root cause signal before being mired in the noise, or vice versa. Troubleshooting is the part of this IT career that I’m addicted and committed to, whether for the change and the opportunity or for all the gains through all the pains.