The adoption of cloud services has driven the data center toward what many have termed hybrid IT. Though there are various definitions of hybrid IT, the common theme among them all is that it’s about services being integrated and supplied by both internal IT and third-party service providers. These service providers are typical best-of-breed and may include cloud service providers like Amazon® Web Services (AWS®) and Microsoft® Azure®, or any number of Software as a Service (SaaS) providers.
The challenge with an application stack that spans both technology construct domains as well as service provider domains lies in the complexity of these interdependencies. Ultimately, when an application has performance issues or goes down, the ownership and responsibility to troubleshoot and remediate those issues can get quite obfuscated. The result is a longer time to resolution, which makes the organization less efficient and effective, and at the same time, creates more friction for the endusers to overcome in order to consume the application.
Troubleshooting Hybrid IT
Even though troubleshooting a hybrid IT application involves multiple environments, each with unique variables, the fundamental steps remain the same. The basic eight step process to troubleshooting is:
- Define the problem
- Gather and analyze relevant information
- Construct a hypothesis on the probable cause for the failure or incident
- Devise a plan to resolve the problem based on hypothesis
- Implement the plan
- Observe the results of the implementation
- Repeat steps 2-6
- Document the solution
The problem with these eight steps is that it assumes a data center administrator has unlimited time and resources to determine the root cause of every problem that arises.
Unfortunately, time is a limited commodity for any IT professional. Fortunately, one way to shorten the standard troubleshooting procedural time is to leverage and excel in the skills of discovery, alerting, and remediation. After all, they are the basis for troubleshooting — steps one, two, and three in the troubleshooting workflow are covered by discovery and alerting, while steps four and five are remediation and steps six and seven are exclusive to troubleshooting in action. Pairing these three skills with a proper IT tool set can greatly improve the ability to surface the single point of truth and provide the necessary insights to root-cause any issue in an efficient and effective manner across multiple stacks.
Proper Insights with Correlated Data and Collaboration
With process in place, proper insight is still required to quickly eliminate the variables involved in troubleshooting hybrid IT applications and surface the correct single point of truth. Let’s walk through a hybrid IT troubleshooting scenario:
Take, for example, a tiered e-commerce application, which typically has multiple layers, such as application, web, and database servers. All of these are connected by network services and may rely on infrastructure services like Amazon EC2® and S3, in addition to other services like auto-scaling and elastic load balancing.
When the application is functioning as intended, the business can reap the benefits of agility, availability, and scalability of services, but when a component of that stack fails, it can lead to panic, since every second lost is a second that a sale cannot be closed. But where do you start as you try to narrow the problem’s surface radius? It begins with the proper insight — the connected context of the ecosystems.
This proper insight will eliminate questions like:
- Is it the network?
- Is it the application, web, or database servers?
- Is it the virtualization layer or storage systems?
Having this ability will allow you to zoom in and out of the performance metrics and service events, enabling you to reduce your troubleshooting time. Additionally, time-series data that can be correlated allows you to shine; specifically, shine a light on the root cause of the issue.
Correlated time-series data is still a massive amount of data to wade through, though. In situations like this, only collaboration among subject matter experts can truly take you from paralysis by data analysis into tangible actions from the data set. Corroborated subject matter expertise minimizes the time to resolution and establishes the authoritative baseline. With all of this goodness, surfacing the root cause of any issue becomes intuitive.
Hybrid IT applications are a reality for nearly every IT professional. Instead of playing the IT blame game when it comes to performance issues, you should rely on tried and true troubleshooting fundamentals, since troubleshooting is a foundational skill and one that is a part of monitoring as a discipline. Pair this with time-series data and collaboration, and you can shoot the trouble out of any hybrid IT application in record time.