Information technology executives around the world are seeking out ways to minimize downtime. Part of that includes choosing an artificial intelligence for IT operations (AIOps) toolset that works for their enterprises. The current landscape is flooded with options, all trying to be the tool of choice for implementing IT operations management (ITOM). Tool variety and coverage become essential in digital infrastructure environments, where a single monitoring tool is insufficient to provide visibility and observability across multiple silos.
Today’s Reality
Many organizations use between eight and 10 monitoring tools but are unhappy with their existing operational setup. Despite collecting an overwhelming volume of data, they still fail to turn much of it into usable insight. On top of that, teams frequently deal with alert storms and lack clear event correlation and root cause analysis.
Most AIOps tools on the market still automate the resolution of incidents that have already occurred through orchestration workflows and IT service management (ITSM) integration as part of the larger automation strategy. This leaves troubleshooting teams to rely on dashboards to become aware of issues via events, logs, or alerts after they’ve occurred. However, this reactive approach is often too little, too late.
Evaluating AIOps Tools
All AIOps tools in the market call themselves proactive, implying they use a mix of statistical and analytical techniques to arrive at dynamic thresholds on key performance indicators (KPIs) and can deliver alerts when a metric deviates from a threshold. In addition, these tools may claim they can leverage business rules that allow preprogrammed actions to be taken when anomalous behavior occurs, whether it's a notification, a ticketing and orchestration workflow, or merely putting data onto a dashboard.
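As a rough illustration of what a dynamic threshold can look like, the sketch below flags a KPI sample when it strays more than k standard deviations from the mean of the preceding window. The function name, window size, multiplier, and sample latencies are all assumptions made for this example; commercial tools typically use more elaborate models.

```python
from statistics import mean, stdev


def dynamic_threshold_alerts(values, window=5, k=3.0):
    """Return indices of samples that deviate from the rolling baseline.

    The baseline for each sample is the mean of the previous `window`
    samples; the allowed band is k standard deviations either side.
    (Illustrative only: window and k are assumed tuning parameters.)
    """
    alerts = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        baseline = mean(history)
        band = k * stdev(history)
        if abs(values[i] - baseline) > band:
            alerts.append(i)
    return alerts


# Hypothetical response-time samples in milliseconds.
latencies_ms = [100, 102, 98, 101, 99, 100, 180, 101]
print(dynamic_threshold_alerts(latencies_ms))  # [6]
```

Because the band is recomputed from recent history, the threshold moves with the metric instead of being a fixed number someone chose once.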
These AIOps tools measure their efficacy in terms of mean time to resolve (MTTR) an issue. However, this means that downtime occurred and the issue was resolved after the fact. Now that enterprises have woken up to the tremendous business impact that even a few minutes of downtime can have on customer experience, revenue, and cost of resolution, a new buzzword was adopted in 2020: negative MTTR. This essentially means resolving an issue before it even occurs.
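The arithmetic behind the term can be sketched in a few lines. The example below assumes MTTR is the average of resolution time minus occurrence time, and it assumes that a prevented issue is recorded with its projected impact time as the occurrence, so a fix applied earlier yields a negative value. That reading of "negative MTTR," along with the function name and timestamps, is an assumption for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Average (resolved - occurred) in minutes over (occurred, resolved) pairs."""
    durations = [
        (resolved - occurred).total_seconds() / 60
        for occurred, resolved in incidents
    ]
    return sum(durations) / len(durations)


t0 = datetime(2020, 1, 1, 12, 0)

# Reactive handling: each fix lands after the incident occurred.
reactive = [
    (t0, t0 + timedelta(minutes=30)),
    (t0, t0 + timedelta(minutes=50)),
]

# Preventive handling (assumed model): the fix lands 10 minutes
# before the projected impact time.
preventive = [(t0, t0 - timedelta(minutes=10))]

print(mean_time_to_resolve(reactive))    # 40.0
print(mean_time_to_resolve(preventive))  # -10.0
```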
New advancements in AIOps tools leverage preventive healing solutions. These solutions measure a metric that most AIOps tools do not — the effect of workload on system behavior. These tools can detect when application workload is not following trends of seasonality (time of day, day of month, month of year, etc.) or is anomalous with respect to number of inbound requests of a service and will raise a flag if there is a cause for concern in terms of the effect it may have on service behavior. If models are built to measure and learn this workload-behavior correlation, it is possible to warn of an impending issue ahead of time and correct it before it occurs.
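A minimal sketch of the seasonality check described above might keep a per-hour baseline of inbound requests and flag a count that strays too far from it. The function names, the 50 percent tolerance, and the sample data are assumptions for illustration; a real product would likely model several seasonal cycles and learn the workload-behavior correlation rather than use a fixed ratio.

```python
from collections import defaultdict
from statistics import mean


def hourly_baseline(history):
    """Build {hour_of_day: mean request count} from (hour, count) samples."""
    buckets = defaultdict(list)
    for hour, count in history:
        buckets[hour].append(count)
    return {hour: mean(counts) for hour, counts in buckets.items()}


def is_workload_anomalous(baseline, hour, count, tolerance=0.5):
    """True when count deviates from the hour's baseline by more than
    `tolerance`, expressed as a fraction of the baseline."""
    expected = baseline.get(hour)
    if not expected:
        return False  # no usable baseline for this hour
    return abs(count - expected) / expected > tolerance


# Hypothetical history: busy at 09:00, quiet at 03:00.
history = [(9, 1000), (9, 1100), (9, 900), (3, 50), (3, 70), (3, 60)]
baseline = hourly_baseline(history)

print(is_workload_anomalous(baseline, 9, 1050))  # False
print(is_workload_anomalous(baseline, 3, 1050))  # True
```

The same request count is unremarkable at 09:00 and a red flag at 03:00, which is the point of judging workload against its seasonal trend rather than a single static limit.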
When evaluating the need to shift to preventive healing tools, there are some key questions decision-makers need to address.
- Am I ready to transition to a zero downtime enterprise?
Ensuring high availability is critical to business success. However, keeping your business continuously available means dedicating valuable time and resources to the demanding task of keeping IT infrastructure up and running 24/7. By moving to a preventive healing solution, enterprises can start moving toward a zero downtime/negative MTTR issue resolution paradigm. This means an imminent issue can be flagged and automatically remediated using various techniques, some of which include the following.
- Dynamically optimizing workload on the fly through tools like Cisco WOM to reduce the load on the underlying infrastructure.
- Optimizing infrastructure in a cloud/microservice/containerized setup using tools like Istio.
- Initiating service-centric mechanisms to heal based on time-synchronized contextual data, such as forcefully terminating a nonessential database query that is holding onto session locks and preventing subsequent queries from being executed.
These healing mechanisms can be integrated with the underlying ITSM's orchestration workflows seamlessly through representational state transfer (REST) interfaces. These empower enterprises to gradually move from minimal to zero downtime, thereby reducing the costs of running ITOM, maximizing customer delight, and keeping operations centers lean.
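To make the service-centric example concrete, the sketch below picks nonessential queries that hold a lock and block other queries, then builds a JSON body that could be handed to an orchestration workflow over REST. Every name here (the `ActiveQuery` fields, the `terminate_query` action, the payload keys) is hypothetical and does not reflect any particular ITSM product's API.

```python
import json
from dataclasses import dataclass


@dataclass
class ActiveQuery:
    query_id: int
    essential: bool
    holds_lock: bool
    blocked_count: int  # number of other queries waiting on this one


def pick_queries_to_terminate(queries):
    """Nonessential, lock-holding queries that block others, worst first."""
    candidates = [
        q for q in queries
        if not q.essential and q.holds_lock and q.blocked_count > 0
    ]
    return sorted(candidates, key=lambda q: q.blocked_count, reverse=True)


def build_remediation_payload(query):
    """Hypothetical JSON request body; field names are illustrative only."""
    return json.dumps({
        "action": "terminate_query",
        "query_id": query.query_id,
        "reason": f"blocking {query.blocked_count} queries",
    })


active = [
    ActiveQuery(1, essential=False, holds_lock=True, blocked_count=2),
    ActiveQuery(2, essential=True, holds_lock=True, blocked_count=5),
    ActiveQuery(3, essential=False, holds_lock=True, blocked_count=4),
    ActiveQuery(4, essential=False, holds_lock=False, blocked_count=0),
]

targets = pick_queries_to_terminate(active)
print([q.query_id for q in targets])  # [3, 1]
print(build_remediation_payload(targets[0]))
```

Query 2 blocks the most sessions but is marked essential, so it is left alone. Keeping that selection rule explicit is what makes automated termination safe enough to consider.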
- Can I minimize operational costs while reducing downtime and MTTR?
Despite preventive healing and efforts to move toward a zero downtime enterprise, some issues may slip through the cracks, particularly when they are unrelated to workload and are caused by other factors, like disk crashes, erroneous code, poorly designed queries, or wrongly configured services. In such scenarios, the focus shifts to minimizing the MTTR and reducing downtime as much as possible to keep customer experience from being affected.
Root cause analysis enables ITOM teams to pinpoint the cause of any incident and address it given all information at hand. Implementing an AIOps solution that provides contextual, timely, relevant, and accurate information on the state of the application in a concise, intuitive fashion can help your team perform event correlation and analysis effectively. In ITOM parlance, such dashboards, with all service information and event correlations presented in a unified view, are called a "single pane of glass."
Service data pertinent to an issue that is extracted at the time of an anomalous event can have multiple dimensions, all of which need to be analyzed before the root cause can be determined. Some examples of this data include logs, code traces, query-level database statistics, configuration changes, and diagnostic data called forensics. Identifying and evaluating this information can tell you about the state of the service at the time of the incident to help eliminate future issues.
- Can I optimize infrastructure investments while scaling intelligently and effectively?
To plan for future workloads, your AIOps solution needs to be able to correlate projected workload trends to corresponding infrastructure requirements. In doing so, it is important to highlight not only underprovisioned resources that need to be scaled up but also overprovisioned ones that drain business spending and need to be scaled back. Running a what-if analysis on projected workload to examine corresponding capacity forecasts is a critical step in this process.
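A what-if analysis of this kind can be sketched as simple capacity arithmetic: project the workload, convert it to the number of instances required, and compare that with what is provisioned. The service names, throughput figures, 20 percent headroom, and growth factor below are all assumed values for illustration.

```python
import math


def required_instances(projected_rps, rps_per_instance, headroom=0.2):
    """Instances needed to serve projected_rps plus a fractional headroom."""
    return math.ceil(projected_rps * (1 + headroom) / rps_per_instance)


def what_if(services, growth_factor):
    """Map each service to (instances needed, verdict) under projected growth.

    services: {name: (current_rps, rps_per_instance, provisioned_instances)}
    """
    report = {}
    for name, (current_rps, rps_per_instance, provisioned) in services.items():
        needed = required_instances(current_rps * growth_factor, rps_per_instance)
        if needed > provisioned:
            verdict = "scale up"
        elif needed < provisioned:
            verdict = "scale back"
        else:
            verdict = "ok"
        report[name] = (needed, verdict)
    return report


# Hypothetical fleet: requests per second, per-instance capacity, instance count.
services = {
    "checkout": (800, 100, 8),
    "search": (300, 100, 10),
    "login": (250, 100, 5),
}

# What if workload grows by 50 percent?
for name, (needed, verdict) in what_if(services, 1.5).items():
    print(name, needed, verdict)
```

Under these assumptions the checkout service would need roughly twice its current capacity, while search is carrying instances it would not need even after the projected growth.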
At the end of the day, choosing an AIOps tool with preventive healing capabilities can help IT identify problems before they happen. By making the switch to advanced alerting mechanisms, coupled with contextual data on the state of the system at the time of the anomaly, you can resolve issues before they cause an interruption to your customers and downtime to your organization.