Troubleshooting is all about finding the root cause of an issue in a timely fashion. The challenge, however, is one of priorities: when a quick remediation has already resolved the symptom and everyone has moved on, time and resource limitations make it easy to leave the true root cause uninvestigated. Virtualization can add layers of complexity to an already stressful situation, and no IT professional has time for complexity. So, the question is: how can a data center professional avoid wasting valuable time and resources shooting blanks at IT troubles?
Three troubleshooting scenarios in practice
By and large, there are three scenarios when it comes to reducing the troubleshooting radius. Keep in mind that I'm assuming a proper monitoring tool is already in place, along with a defined troubleshooting process and protocol, since best practice is to approach both with discipline and rigor. The three scenarios are:
- When you have some insight and experience with the incident at hand. For instance, when a process or service fails, you have existing baseline data as well as expectations of service-level performance to rely on.
- When you have no existing knowledge or experience with the incident. In this case, you have to leverage your expertise to reduce the troubleshooting radius as quickly as possible and rule out false positives.
- When you leverage an existing knowledge base of known best practices to help shoulder the load. Going this route means you still have to do your due diligence to make sure you can trust and verify the dependability and adaptability of the recommendations engine to your unique environment.
Tackling scenario #1: Alert-based thresholding
Addressing the first troubleshooting scenario requires a method that relies on alert-based performance counter thresholds to narrow the troubleshooting radius. Performance data should cover the application as well as the relevant underlying physical subsystems — compute, memory, storage, and network, along with their virtualized equivalents.
Active alerts can be used to proactively mitigate trouble spots, especially in well-understood application stacks: applications whose performance trends behave consistently within established baselines and thresholds. In this case, any alert that arises can be handled using previous behavior as the guide to root-causing the incident.
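To make this concrete, here is a minimal sketch of how baseline-derived thresholds could drive alerts. It is not any specific monitoring product's API; the counter names and sample values are purely illustrative.

```python
import statistics

# Hypothetical historical samples for two performance counters (illustrative only).
baseline_samples = {
    "cpu_ready_ms": [120, 135, 110, 128, 140, 118, 125],
    "datastore_latency_ms": [4.2, 5.1, 3.8, 4.6, 4.9, 4.4, 5.0],
}

# Latest observed values for the same counters.
current_values = {
    "cpu_ready_ms": 310,          # well above its historical baseline
    "datastore_latency_ms": 4.7,  # within its normal range
}

def check_thresholds(baselines, current, sigmas=3.0):
    """Flag any counter whose current value exceeds its baseline mean plus N standard deviations."""
    alerts = []
    for counter, samples in baselines.items():
        mean = statistics.mean(samples)
        stdev = statistics.pstdev(samples)
        threshold = mean + sigmas * stdev
        if current[counter] > threshold:
            alerts.append(f"ALERT: {counter} = {current[counter]} "
                          f"(baseline {mean:.1f}, threshold {threshold:.1f})")
    return alerts

for alert in check_thresholds(baseline_samples, current_values):
    print(alert)
```

In this toy example, only the CPU ready time trips its threshold, so the investigation starts with compute contention rather than storage.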
Tackling scenario #2: I got NULL
The second scenario arises the first time an incident is observed, when there is not enough information to positively identify the root cause. To deal with this scenario, the key is to leverage time-series performance data that can be correlated across the multiple layers and connections of the application stack so you can quickly understand the root cause of the problem. But what if you don't have access to all the information? This is where a proper toolset can help you not only aggregate and visualize key performance indicators, but also share and collaborate with other subject matter experts in their respective areas of expertise.
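As a rough illustration of what that correlation looks like (the metric names and numbers below are hypothetical, not taken from any particular tool), you can rank infrastructure counters by how closely their time series track the application symptom:

```python
import statistics

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length time series."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

# Hypothetical, time-aligned one-minute samples from different layers of the stack.
app_response_ms = [210, 230, 250, 480, 510, 495, 240, 220]
layer_metrics = {
    "vm_cpu_ready_ms":      [110, 120, 130, 140, 135, 130, 120, 115],
    "datastore_latency_ms": [4.0, 4.5, 5.0, 22.0, 25.0, 23.5, 5.5, 4.2],
    "host_memory_pct":      [62, 63, 61, 64, 63, 62, 61, 60],
}

# Rank counters by how strongly they move with the application symptom.
ranked = sorted(
    ((pearson(app_response_ms, series), name) for name, series in layer_metrics.items()),
    reverse=True,
)
for score, name in ranked:
    print(f"{name}: correlation {score:+.2f}")
```

Here the datastore latency series tracks the response-time spike most closely, which points the investigation toward storage rather than CPU or memory. A real tool does this across far more counters, but the principle is the same.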
Tackling scenario #3: Recommendations that are trusted because they’re verified
Troubleshooting is time consuming, and time is a resource that IT professionals cannot afford to squander. The third troubleshooting scenario, therefore, is one in which assistance is provided by a recommendations engine. Recommendations engines leverage existing knowledge bases and known best practices while tailoring the resulting strategies and policies to fit your unique environment. This allows you to troubleshoot your environment quickly without having to do the heavy lifting yourself. Just remember to verify the policies and recommendations, because every environment is unique.
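At its simplest, a recommendations engine can be thought of as a lookup from an incident signature to a set of known-good remediation steps. The sketch below is a hypothetical illustration of that idea, not any vendor's actual engine, and the knowledge-base entries are placeholders.

```python
# Hypothetical knowledge base mapping alert signatures to candidate remediations.
KNOWLEDGE_BASE = {
    ("datastore_latency_ms", "high"): [
        "Check for noisy-neighbor VMs sharing the datastore",
        "Review storage multipathing and queue-depth settings",
    ],
    ("cpu_ready_ms", "high"): [
        "Reduce vCPU over-commitment on the affected host",
        "Right-size oversized VMs competing for scheduler time",
    ],
}

def recommend(counter, severity):
    """Return candidate fixes for an alert signature; every suggestion still needs verification."""
    return KNOWLEDGE_BASE.get(
        (counter, severity),
        ["No known recommendation; investigate manually"],
    )

for suggestion in recommend("datastore_latency_ms", "high"):
    print(f"- {suggestion} (verify against your own environment before applying)")
```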
In summary, troubleshooting can be a laborious and complex job. But proper data center monitoring that lets you leverage proactive alerts, correlated time-series performance data, and a recommendations engine gives you the discipline and rigor needed to quickly surface the root cause from all the noise.