Avoiding Alert Overload
Improve data center security by ridding yourself of unnecessary data center alerts
Almost everyone I know in IT is inundated with alerts from their management systems. When I talk with IT administrators, I usually see folders in their email applications that contain tens of thousands of unread alert emails. Not only is this a huge distraction, it actually makes it difficult to spot real security issues.
An alert should really only be sent when there is a problem that requires a human response. If you’re getting more than one or two alerts per week, I’ll go out on a limb and say that your data center is either so unstable you shouldn’t be running anything in it, or, more likely, there is too much noise in your environment. Most monitoring systems and tools have good ways to tune alerts, but tuning takes time, and many administrators argue they simply can’t afford to spend it.
What’s my stance on that? I argue that you can’t afford not to! Security is no doubt a top priority for you (if it’s not, you and I need to have a heart-to-heart). Cutting through the clutter so that you see, and can respond to, the truly important alerts that could impact security is key to proactively securing your data center.
So, here are a few steps for decreasing the noise from your data center alerts so you can better focus on those with real implications for the security of your sensitive data.
STEP ONE: SETTING A GOAL
The most important step in getting to a high signal-to-noise ratio with your alerts is to measure them and set specific goals to reduce the volume. You need to understand how many alerts you and your team are currently getting. From there, you should be able to group them into categories: for example, false positives (no alert should have been sent), informational (system was taken down for maintenance), and critical (something is broken and needs immediate attention).
You don’t need to spend a lot of time getting this perfect right now. Just try to get an approximation of where the volume is coming from so you can prioritize your next steps. Once you have binned the recent alerts, say, from the past week, try to automate this process. Simply going from one folder with alerts of all severity jammed into it to three folders with varying levels of severity will help. Most email clients have sufficient rule processing powers to help with this so long as there is structure to your current alerts. If not, well, we’ll get to that in step two.
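If your email client’s rules aren’t up to the job, the binning described above can be roughed out in a few lines of Python. This is only a sketch: the subject-line markers are hypothetical placeholders, not the output of any particular monitoring tool, and anything that matches neither list lands in an "unsorted" bin, because deciding that an alert was a false positive still takes a human look.

```python
from collections import Counter

# Hypothetical subject-line markers (assumptions for illustration);
# replace them with whatever your monitoring tool actually emits.
CATEGORY_MARKERS = {
    "critical": ("CRITICAL", "DOWN", "FAILED"),
    "informational": ("INFO", "MAINTENANCE", "RECOVERED"),
}


def categorize(subject: str) -> str:
    """Return a rough category for one alert subject line."""
    upper = subject.upper()
    for category, markers in CATEGORY_MARKERS.items():
        if any(marker in upper for marker in markers):
            return category
    # Unmatched alerts need manual review before being called false positives.
    return "unsorted"


def count_by_category(subjects) -> Counter:
    """Count alerts per category for a batch of subject lines."""
    return Counter(categorize(subject) for subject in subjects)
```

Feeding it a week of subject lines gives you the per-category counts you need for the next part of this step. Matching on substrings is deliberately crude; it is meant to get an approximation, not a perfect classification.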
Now that you have a general structure and some data about your alerts (the number by category), you can start to set goals to reduce them. If you set a specific goal, it will be much easier to make progress. I recommend starting with a 20% reduction in weekly false positive alerts. Common sources of false positives in a data center include forgetting to set a node or device to maintenance mode before taking it down, having unrealistic thresholds, applying thresholds too broadly, and too many alerts being sent because of an issue with a dependent service. If reducing your false positives by 20% in a few weeks is not realistic, decrease the goal to a smaller number, 10% or even 5%. The point is to set a goal, measure your progress towards that goal, and continue to improve. The value in this approach is that your efforts are cumulative, and the more you do, the fewer distractions you will get.
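The arithmetic behind a reduction goal is simple, but writing it down keeps everyone measuring the same thing. Here is a minimal sketch; the function names are my own, and the baseline is whatever weekly false positive count you measured when you binned your alerts.

```python
import math


def reduction_target(baseline: int, reduction_pct: float) -> int:
    """Highest weekly alert count that still meets the reduction goal."""
    if baseline < 0 or not 0 <= reduction_pct <= 100:
        raise ValueError("baseline must be >= 0 and reduction_pct within 0..100")
    # Round the required cut up so the target never understates the goal.
    return baseline - math.ceil(baseline * reduction_pct / 100)


def percent_reduced(baseline: int, current: int) -> float:
    """Percentage reduction achieved so far relative to the baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - current) * 100 / baseline
```

For example, with a made-up baseline of 400 false positives a week, a 20% goal means getting to 320 or fewer; if you are at 340 this week, you are 15% of the way down and still short of the goal.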
STEP TWO: GET THE RIGHT INFORMATION
The second step for reducing the noise from your alerts is to make sure they include useful information. Go look at the last 10 alerts you received. How many included all of the information you need to even get started on the issue? All too often people receive alert emails and they don’t even know what rule generated the alert. Most tools allow you to specify the alert name in the rule. This simple change will have a profound impact on your ability to process, organize, and respond to alerts.
In addition to including the alert name with the notification, you should also try to include the device or system name; the criticality of the alert; a direct link in your monitoring system to the affected system; current vitals, such as CPU, memory, and disk; location (and not just city or data center name, but actual rack number and placement in the rack); local contact information in case you need to work with a technician on-site; and finally, if possible, related systems and infrastructure.
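One way to make sure no field gets left out is to treat the alert as a structured record and render the email from it. The sketch below is an illustration under my own assumptions: the field names, the subject format, and the body layout are hypothetical, not the schema of any specific monitoring product.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    """First-level information an alert email should carry (hypothetical schema)."""
    rule_name: str        # the rule that generated the alert
    device: str           # device or system name
    criticality: str      # e.g. "critical" or "informational"
    monitoring_url: str   # direct link to the affected system
    cpu_pct: float
    memory_pct: float
    disk_pct: float
    data_center: str
    rack: str
    rack_unit: int        # placement within the rack
    onsite_contact: str
    related_systems: list[str] = field(default_factory=list)


def render_subject(alert: Alert) -> str:
    """Put severity and rule name in the subject so mail rules can sort on them."""
    return f"[{alert.criticality.upper()}] {alert.rule_name} on {alert.device}"


def render_body(alert: Alert) -> str:
    """Render a short plain-text body that is readable on a phone."""
    related = ", ".join(alert.related_systems) or "none listed"
    lines = [
        f"Rule: {alert.rule_name}",
        f"Device: {alert.device}",
        f"Link: {alert.monitoring_url}",
        f"Vitals: CPU {alert.cpu_pct:.0f}%, memory {alert.memory_pct:.0f}%, "
        f"disk {alert.disk_pct:.0f}%",
        f"Location: {alert.data_center}, Rack {alert.rack}, U{alert.rack_unit}",
        f"On-site contact: {alert.onsite_contact}",
        f"Related systems: {related}",
    ]
    return "\n".join(lines)
```

Because every field except the related systems is required, an alert rule that can’t supply the rack location or the on-site contact fails loudly when it is built, which is a better time to find the gap than during an outage.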
You don’t want to have a debug level of detail, but you should have all of the first-level information in the alert so that when you are reading the email on your phone, you can take direct action to begin service restoration or mitigation rather than lose time gathering additional information. When you’ve reduced the volume of alert noise, you might consider sending alerts directly to your primary inbox again and setting the high importance or critical flag in the email. I recommend only doing this once you’ve filtered out the noise and have high confidence in your monitoring systems.
STEP THREE: CONSOLIDATE
Now that you are measuring and setting goals for alert reduction, and are making the alerts actionable by including the right information, you need to deal with the informational alerts. Sometimes these are the most difficult to let go of. These notifications are often a good way for you to find something that’s changing in the environment before there is a problem. Unfortunately, these are often a significant source of distraction and end up having little value.
Most of the time you likely take a look at a small handful of these alerts when you scan your folders or if a pop-up catches your eye. It would be much more efficient to have a daily or maybe even weekly report of informational activities. If you go to daily reporting and approach it in the context of information rather than something that requires action, you will be able to focus on this long enough to understand what’s going on in your environment and free yourself up to focus on the critical system alerts that could impact security.
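If your monitoring system can’t produce such a report on its own, a daily digest is easy to approximate. The following is a minimal sketch, assuming each informational alert is available as a (timestamp, rule name, device) tuple; that tuple shape and the report layout are my assumptions for illustration.

```python
from collections import Counter
from datetime import date, datetime


def daily_digest(alerts, day: date) -> str:
    """Summarize one day's informational alerts as a short plain-text report.

    `alerts` is an iterable of (timestamp, rule_name, device) tuples,
    where timestamp is a datetime.
    """
    counts = Counter(
        (rule_name, device)
        for timestamp, rule_name, device in alerts
        if timestamp.date() == day
    )
    total = sum(counts.values())
    lines = [f"Informational alerts for {day.isoformat()}: {total} total"]
    # Most frequent first, so repeated notifications stand out at a glance.
    for (rule_name, device), count in counts.most_common():
        lines.append(f"{count} x {rule_name} on {device}")
    return "\n".join(lines)
```

Collapsing repeats into a single counted line is the point: fifty identical notifications become one row you can read in a second, and a row whose count jumps from one day to the next is exactly the kind of change in the environment these alerts were meant to reveal.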
Although this might seem daunting given the volume of alerts, if you start small and take a measured approach, you can quickly make progress. A little goes a long way when it comes to alerts, and the value you get from them will increase dramatically if you manage them properly. From a security perspective, you can’t afford to miss a serious threat. If you only receive a few alerts a week, it becomes much more practical to investigate threats and quickly address operational issues. If nothing else, managing your alerts better will decrease your email. I think we can all agree we could use less email.