At some point in your career, you will come to understand the power of customizing your monitoring system’s configuration. Usually, that understanding follows an issue where you say, “if only we had been monitoring this one other thing,” and the result is monitoring 100 extra metrics, just in case.
But you can’t monitor every possible thing. Worse, you don’t want to sift through a pile of useless data, looking for a signal in all that noise. You need to focus on the right things, the core things, that give you the edge you need. With just a few adjustments, you can change from being reactive to proactive. It’s the difference between being a firefighter who responds to alarms and being the architect who designs the house to be fireproof.
Here’s my list of metrics that you should consider monitoring to help you take control of your environment.
Quality of Service (QoS)
Monitoring for QoS is a top priority for today’s distributed hybrid IT enterprise. You need the ability to examine the network as a whole, including metrics gathered from ISPs that are carrying your bits on their wires. Traditional QoS metrics include things like bandwidth, latency, error rates, dropped packets, and uptime.
But QoS can also help you uncover bad actors. A sudden surge of data could be the result of an adversary gaining access to your customer database and grabbing as much data as possible before you shut them down.
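As a rough illustration of spotting that kind of surge, here is a minimal sketch that compares each traffic sample with a trailing baseline. The function name, the sample data, the unit (MB per interval), and the thresholds are all assumptions made for the example, not values taken from any particular monitoring tool.

```python
from statistics import mean, stdev

def find_surges(samples, window=12, threshold=3.0, min_spread=0.05):
    """Return indexes of samples that sit far above their trailing baseline.

    samples: outbound traffic per interval (hypothetical unit: MB per 5 minutes).
    """
    surges = []
    for i in range(window, len(samples)):
        baseline = samples[i - window:i]
        mu = mean(baseline)
        # Floor the spread so a very flat baseline does not flag tiny wobbles.
        sigma = max(stdev(baseline), min_spread * mu)
        if samples[i] > mu + threshold * sigma:
            surges.append(i)
    return surges

# Hypothetical data: steady traffic, then one very large transfer.
outbound_mb = [100, 104, 98, 101, 99, 103, 97, 102, 100, 101, 99, 100, 102, 950]
surge_indexes = find_surges(outbound_mb)
print(surge_indexes)
```

A flagged interval is only a prompt to investigate. A large backup or a product launch can look just like data theft at this level of detail.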
Quality of Experience (QoE)
QoE is all about the end-user experience. And that experience can be summarized with one metric: response times. That’s usually all the end user cares about: the time spent watching a spinning hourglass or beachball on their screen. QoE doesn’t just vary by location; it can vary by cubicle. Two people sitting next to each other can have very different experiences. Something as simple as a network card can affect the rate at which data shows up on a screen.
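Because the experience can differ from one desk to the next, it helps to summarize response times per client rather than as a single average. The sketch below does that with a 95th percentile. The client labels and timings are made up, and the choice of the 95th percentile is an assumption rather than a rule.

```python
from collections import defaultdict
from statistics import quantiles

def p95_by_client(measurements):
    """Group (client, seconds) pairs and return the 95th percentile per client."""
    grouped = defaultdict(list)
    for client, seconds in measurements:
        grouped[client].append(seconds)
    result = {}
    for client, values in grouped.items():
        if len(values) < 2:
            # quantiles() needs at least two points; fall back to the lone value.
            result[client] = values[0]
        else:
            # n=20 yields 19 cut points; the last one is the 95th percentile.
            result[client] = quantiles(values, n=20, method="inclusive")[-1]
    return result

# Hypothetical response times (seconds) for two neighboring desks.
timings = [
    ("desk_a", 0.20), ("desk_a", 0.30), ("desk_a", 0.25), ("desk_a", 0.22), ("desk_a", 0.28),
    ("desk_b", 0.20), ("desk_b", 0.30), ("desk_b", 2.50), ("desk_b", 3.10), ("desk_b", 0.40),
]
p95 = p95_by_client(timings)
for client, value in sorted(p95.items()):
    print(f"{client}: p95 = {value:.2f}s")
```

A percentile is used here instead of a mean because a handful of very slow requests is exactly what the end user remembers, and a mean tends to hide them.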
In my world of databases, I find that concurrency issues are the biggest culprit for poor QoE. Locking and blocking will lead to a bad QoE for the end user, but that doesn’t mean the network is down, or that there is an issue with the database server. Diagnostic tools that show the entire stack of calls made between clients and their data are valuable here, because they help IT pinpoint where to fix things for faster response times.
Resource Utilization
With the rise of virtualization came a need to verify that your resources were actually being used. The Golden Ideal of virtualization was the ability to squeeze every last bit of performance out of the hardware we had purchased. Gone are the days of 10% CPU utilization. Now we expect servers to be 70% to 80% utilized all the time.
But there’s a new caveat in our cloud-first hybrid world. Virtualization introduced us to sprawl. The cloud took sprawl and made it even worse. Right now, you probably have cloud resources, such as VMs, databases, containers, and load balancers, that are lightly used or never used at all. They just sit there, untouched, and add to your monthly bill.
Bottom line: if you bought it or built it, you should be using it. If you aren’t using it, turn it off.
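One way to act on that is to run a periodic report over your inventory. The sketch below assumes you can already export, for each resource, an average CPU figure, the number of days since it was last used, and a monthly cost. The `Resource` fields, the sample inventory, and both thresholds are hypothetical and would need tuning for a real environment.

```python
from dataclasses import dataclass

@dataclass
class Resource:
    name: str
    kind: str
    avg_cpu_percent: float    # average over the reporting period
    days_since_last_use: int  # e.g. last login, query, or request
    monthly_cost: float

def find_idle(resources, cpu_floor=5.0, idle_days=30):
    """Return resources that are lightly used (low CPU) or unused for a long time."""
    return [
        r for r in resources
        if r.avg_cpu_percent < cpu_floor or r.days_since_last_use >= idle_days
    ]

# Hypothetical inventory export.
inventory = [
    Resource("web-01", "vm", 72.0, 0, 140.0),
    Resource("old-test-vm", "vm", 1.5, 3, 95.0),
    Resource("reports-db", "database", 35.0, 1, 210.0),
    Resource("spare-lb", "load balancer", 8.0, 120, 25.0),
]
idle = find_idle(inventory)
wasted = sum(r.monthly_cost for r in idle)
print([r.name for r in idle], f"about {wasted:.2f} per month")
```

The output is a list of candidates to review, not a list of things to delete. A standby or disaster-recovery system is supposed to look idle.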
Error Logs
For many, error logs are seen as a tool to use after an event has happened. Very few shops are actively scanning and mining their logs, looking for events or specific error messages. I’ll be blunt: If you aren’t actively mining your logs, you could be placing your company at a higher level of risk than necessary.
You should start by looking for things that are easy to find, such as login errors. A spike in login failures could indicate that an attack is underway against a server or database. More complex error-log mining looks for events related to SQL injection attacks. SQL injection remains one of the top vulnerabilities exploited by criminals. If you aren’t proactively watching for SQL injection attacks, you should know that the average time to discover such an attack is more than 90 days. That’s right: 90 days of someone grabbing your data before you know about it.
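For the easy case, login failures, a minimal sketch might count failures per minute and flag the busy minutes. The log format here (an ISO timestamp, then an event name such as `LOGIN_FAILED`) is invented for the example, and so is the limit of five failures per minute. Real logs will need their own parsing.

```python
from collections import Counter
from datetime import datetime

def failed_logins_per_minute(lines, marker="LOGIN_FAILED"):
    """Count lines containing the failure marker, bucketed by minute."""
    counts = Counter()
    for line in lines:
        if marker not in line:
            continue
        # Assumed format: the first space-separated field is an ISO timestamp.
        stamp = line.split(" ", 1)[0]
        minute = datetime.fromisoformat(stamp).replace(second=0, microsecond=0)
        counts[minute] += 1
    return counts

def spike_minutes(counts, limit=5):
    """Return the minutes with at least `limit` failures, in time order."""
    return sorted(minute for minute, count in counts.items() if count >= limit)

# Hypothetical log excerpt.
log_lines = [
    "2024-03-01T09:14:02 LOGIN_OK user=alice",
    "2024-03-01T09:15:01 LOGIN_FAILED user=admin",
    "2024-03-01T09:15:09 LOGIN_FAILED user=admin",
    "2024-03-01T09:15:17 LOGIN_FAILED user=root",
    "2024-03-01T09:15:33 LOGIN_FAILED user=admin",
    "2024-03-01T09:15:48 LOGIN_FAILED user=sa",
    "2024-03-01T09:16:30 LOGIN_FAILED user=bob",
]
spikes = spike_minutes(failed_logins_per_minute(log_lines))
print(spikes)
```

A fixed limit is the simplest possible rule. Comparing each minute with the normal failure rate for that time of day would cut down on false alarms.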
Your Monitoring System
I know, this seems like an item that should go without saying. But as recently as 2017, AWS® suffered an outage with S3 and then suddenly found out that the service health dashboard also relied on the very service it was monitoring. So when S3 went down, the health dashboard went down as well.
That means we still have technology professionals who don’t understand the importance of “watching the watcher.” You should have a monitoring system in place that is designed for but one purpose: to report the status of your enterprise monitoring system.
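A very small version of “watching the watcher” is a heartbeat check. The sketch below assumes the enterprise monitoring system records a timestamp at regular intervals somewhere an independent checker can read it, and that the checker runs on separate infrastructure. The function name and the two-minute limit are hypothetical.

```python
import time

def monitor_is_stale(last_heartbeat, now=None, max_age_seconds=120):
    """Return True if the monitoring system's last heartbeat is too old.

    last_heartbeat and now are Unix timestamps in seconds.
    """
    now = time.time() if now is None else now
    return (now - last_heartbeat) > max_age_seconds

# Fixed timestamps are used here so the outcome does not depend on the clock.
stale = monitor_is_stale(last_heartbeat=1_000, now=1_300)  # 300 seconds old
fresh = monitor_is_stale(last_heartbeat=1_000, now=1_060)  # 60 seconds old
print(stale, fresh)
```

The check is only useful if it shares nothing with the system it watches. If both run on the same host, network, or storage, they can fail together, which is the trap described above.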
Summary
Monitoring these areas should not be difficult; there are many tools on the market today that can help you monitor and map your enterprise, giving you the ability to find root causes and prove (or disprove) which systems are responsible. These five areas won’t catch every possible problem, but they can catch enough to make the cost of implementing a solution worth every penny your budget can provide.