The thought of a time when bean-counting executives look for ways to remove sentient human beings (i.e., experienced staff) from the control panel in the data center may be the stuff of nightmares for many an IT professional. Even so, automation has become a must-have technology, one that even the most paranoid among us quietly hope for, mostly because of the freedom it offers from monotonous remediation. Removing thoughtful, measured oversight from problem resolution, however, can understandably seem daunting. So, how do you get started?
First and foremost, organizations that either don’t have a monitoring tool or are using a cobbled-together monitoring system (looking at you, DevOps) should hold off on implementing automation.
Moreover, if your company is constantly stuck in reactive mode, automation shouldn’t be on the immediate horizon. If your IT department already has its hands full fighting fires, there are likely insufficient resources available to implement and oversee automation in a way that will be successful. All you’ll end up doing is throwing together a half-baked quick fix, failing to test it fully, and realizing only too late that it’s causing more problems than it solves.
Setting workload issues aside, I find it interesting that one of the biggest barriers to automation often comes from a company’s security team. Even without the specter of autonomous scripts running roughshod over systems, the security administrator may already be wary of granting access to business data and metrics to a flesh-and-blood human being. Introduce the idea of scripted remediation, and security is brought face-to-face with the concept of giving high-level access to an unattended account that will be online at all hours of the day, essentially unsupervised.
However, both of those issues have to be resolved at layers 8-10 of the Open Systems Interconnection (OSI) model (money, politics, and governance). For our purposes here, where I want to talk about how to implement automation in your organization, let’s presume your organization has the staffing resources and the blessing of the security team. From there, the good news is that very little, aside from a solid monitoring solution, is needed in order to begin automating your data center. You should, however, think of automation as a reward for having a successful monitoring system in place rather than something to try out of the blue.
So, you have a mature, robust monitoring system in place. Good! Let’s continue.
Once you have monitoring in place, you’re able to respond to certain conditions automatically. Of course, there are some monitoring nice-to-haves to look for before automating, such as the ability to:
Monitor and establish baseline performance metrics across applications or workloads. Baselining is a function of your monitoring tool, rather than something that is done manually. By tracking ongoing data, it can establish what normal is for each system and sub-element. This allows your monitoring and automation solutions to consider whether something is truly abnormal and in need of an alert, or whether a device is typically at 80% on Tuesday mornings, for example. Baselines also allow you to automate the act of capacity planning by using long-term usage data to extrapolate when a resource will be completely consumed. Thus, you can plan more effectively and avoid surprise upgrades.
Automate the response within the same alert that detected the issue. This is predicated on your monitoring solution having built-in, automated responses to common scenarios, such as restarting a service or resetting an application pool, rather than requiring the administrator to script out each action.
Check the system after the automation has run. Some monitoring solutions can execute an action in response to a problem, but that’s as far as they can go. If the problem repeats, so does the automatic response. In these cases, the escalation only happens if the outage is noticed or a second alert is created that checks for the same problem over a longer duration. The preferable alternative is to ensure the monitoring solution can respond to the initial problem with automation, but check again after a short period of time to see if the problem was resolved. If it wasn’t, a secondary action — usually notifying staff — can be taken.
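That respond-then-verify pattern can be sketched in a few lines. This is a minimal illustration, not any particular monitoring product’s API; the health check and fix are passed in as callables, and the fake service below exists only for demonstration:

```python
import time
from typing import Callable


def remediate_and_verify(
    healthy: Callable[[], bool],
    fix: Callable[[], None],
    recheck_delay: float = 60.0,
) -> bool:
    """Respond to an alert with an automated fix, then re-check.

    Returns True if the system is healthy afterward; False means the
    automated response did not stick and staff should be notified.
    """
    if healthy():              # the condition may have self-cleared
        return True
    fix()                      # the automated response (restart a service, etc.)
    time.sleep(recheck_delay)  # give the fix time to take effect
    return healthy()           # the follow-up check, instead of fire-and-forget


# Example: a fake service that comes back up after one restart.
class FakeService:
    def __init__(self):
        self.up = False

    def is_up(self) -> bool:
        return self.up

    def restart(self) -> None:
        self.up = True


svc = FakeService()
resolved = remediate_and_verify(svc.is_up, svc.restart, recheck_delay=0)
print("resolved" if resolved else "escalate to on-call staff")  # prints "resolved"
```

The key design point is the second call to `healthy()`: the escalation decision is made by the automation itself, rather than waiting for someone to notice the outage or for a second, longer-duration alert to fire.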
Before we get into the various stages of automating systems, it’s also important to address the elephant in the room: automation is not necessarily the same as solving a problem. Automatic responses to alerts keep your data center — and business — running even after you’ve gone home, but in the morning, you and your team need to be able to see whether an event occurred, and do the hard work of figuring out why it happened to prevent it in the future. For example, a recurring “disk full” alert can be repeatedly addressed by an automated system, but the root cause of the alert has not been remediated and will be disrupting the overall end-user experience until it’s finally addressed.
To that end, we’re seeing more sophisticated tools become available to help you deal with the “now what?” question that comes after both monitoring and automation implementations. These single-pane-of-glass integration tools let you mix and match key metrics from individual monitoring silos, in a time-synchronized view, to more quickly and accurately identify the root cause of problems in hybrid IT environments. These tools also enhance inter-department collaboration to ensure the time to remediation is as quick as possible, in addition to empowering IT professionals to build better automation. Once you can identify and resolve the root cause of a problem by seeing all the individual pieces that contributed to the overall failure, you can script more intelligent automated responses.
With all of this in mind, here are three initial steps your organization can take to begin its automation journey. Although every business’s path will be slightly different and, as such, certain phases may vary, these three steps are designed to help you slowly but surely begin to take your hands off the wheel and automate systems.
Information is key. You can start thinking about automation by populating your tickets with more information: the device, the target system, the sub-element, and the time at which the failure happened, or even a link to the monitoring tool showing the element affected. Sometimes, the best automation you can do is to provide the initial technician with as much information as possible about the state of the system at the time of the actual problem, versus 10 to 20 minutes later, after they’ve rolled out of bed and fired up their laptop.
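As a sketch of that kind of ticket enrichment, here is a minimal example. The alert fields and the monitoring-tool URL format are hypothetical; substitute whatever your ticketing and monitoring systems actually expose:

```python
from datetime import datetime, timezone

# Hypothetical deep-link format for your monitoring tool's element page.
MONITOR_URL = "https://monitor.example.com/elements/{element_id}"


def enrich_ticket(alert: dict) -> dict:
    """Build a ticket payload carrying the device, sub-element, failure
    time, and a link back to the affected element in the monitoring tool."""
    return {
        "summary": alert["message"],
        "device": alert["device"],
        "sub_element": alert["sub_element"],
        "failed_at": alert.get("timestamp")
        or datetime.now(timezone.utc).isoformat(),
        "monitor_link": MONITOR_URL.format(element_id=alert["element_id"]),
    }


ticket = enrich_ticket({
    "message": "Volume D: over 95% utilization",
    "device": "db-sql-01",
    "sub_element": "disk D:",
    "element_id": "4711",
    "timestamp": "2024-03-12T03:17:00+00:00",
})
print(ticket["monitor_link"])  # https://monitor.example.com/elements/4711
```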
More information is better. In the data center, more information is always a good thing. In this next phase, you should look to add even more information to your alerting systems. This is not just a coy way of saying “more of number one.” Think about ways to automatically gather detailed information that is relevant to the problem but not already in the monitoring system. Examples might include the top 10 processes sorted by CPU usage when CPU crosses a threshold, the top 10 processes sorted by RAM when memory is critical, the number of connections to a web server at the time an IIS™ process hung, or the longest-running queries at the time a database server showed slow response times. The more relevant information you can provide to the first responder, the more power that technician has to surface the actual root cause and address it as quickly as possible. This is especially critical when you suspect a failure that is impossible to observe moments later (you can’t get the top 10 processes if the entire system is hung) but don’t have the visibility to pinpoint it. More information in each ticket also helps any vendor you escalate to paint a clearer picture of the potential issue and begin remediation.
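A toy version of the “top 10 processes” idea looks like this. In practice the snapshot would be captured at alert time by an agent or a tool like `ps`; here the samples are hard-coded purely for illustration:

```python
def top_processes(samples: list[dict], key: str = "cpu", n: int = 10) -> list[dict]:
    """Return the n heaviest processes from an alert-time snapshot,
    sorted by the chosen metric ("cpu" or "rss_mb")."""
    return sorted(samples, key=lambda p: p[key], reverse=True)[:n]


# Hard-coded stand-in for a snapshot captured when the alert fired.
snapshot = [
    {"pid": 812, "name": "sqlservr", "cpu": 71.2, "rss_mb": 4096},
    {"pid": 2044, "name": "w3wp", "cpu": 18.5, "rss_mb": 1210},
    {"pid": 77, "name": "backupd", "cpu": 6.1, "rss_mb": 300},
]

# Attach the heaviest consumers to the alert text for the first responder.
for proc in top_processes(snapshot, key="cpu", n=2):
    print(f'{proc["pid"]:>6} {proc["name"]:<10} {proc["cpu"]:5.1f}%')
```

The point is not the sorting, which is trivial, but the timing: this data is only meaningful if it is collected at the moment the threshold is crossed and stored alongside the alert.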
Start slow, small, and simple. It’s best to begin with the low-hanging fruit: restart a service when it stops, address a full disk, clear out a log file, etc. The goal is to gain experience with automation through low-risk actions; then, using what you’ve learned, you can graduate to bigger, more complicated tasks. All too often I see organizations try to implement automation by starting with something big and flashy, but it’s much more effective to automate the things that have the most impact for the least effort. In fact, trivial but frequent problems like disk-full errors can cost organizations thousands of dollars a year in lost time, opportunities, resources, and materials. This is an easy problem to detect (in most cases, you can even predict it and therefore avoid the actual issue entirely) and an easy one to fix. Small successes like these will help you sell the benefits of automation to your management and pave the way toward larger-scale initiatives.
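To make the “predict it” point concrete, here is a minimal sketch of the linear extrapolation a baselining tool performs when it forecasts a disk filling up. The daily samples and capacity are invented numbers:

```python
def days_until_full(daily_usage_gb: list[float], capacity_gb: float):
    """Fit a least-squares trend line to daily usage samples and
    extrapolate how many days remain until the volume is full.
    Returns None if usage is flat or shrinking."""
    n = len(daily_usage_gb)
    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage_gb) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(daily_usage_gb))
    den = sum((i - mean_x) ** 2 for i in range(n))
    slope = num / den  # growth in GB per day
    if slope <= 0:
        return None
    return (capacity_gb - daily_usage_gb[-1]) / slope


# Four days of invented samples growing 10 GB/day on a 500 GB volume.
print(days_until_full([400, 410, 420, 430], 500))  # 7.0
```

With an estimate like this feeding a ticket, the upgrade or cleanup can be scheduled days in advance, which is exactly the sort of small, low-risk win worth showing management.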
As you become accustomed to accolades and looks of respect from your colleagues and even management, please remember that, even though automation saves you from experiencing the symptom of a problem in the middle of the night, it still requires you and your team to investigate and address the root cause at 9 a.m. the next day.
The initial deployment phases discussed above act as a guide for your organization’s transition from good monitoring to good automation: work with, and share, as much system data and information between teams as possible.
Ultimately, good automation is enabled by and is a result of good monitoring. When done correctly, it’s simple — in fact, it is simply automation the way automation is meant to work.