When I’m on a larger team, I typically end up becoming the monitoring guy. I’m not really sure what life looks like without a monitoring guy constantly tuning the NMS (Network Monitoring System), but I don’t really want to find out either.
You can be the monitoring guy (or gal) too if you follow my workflow below:
At e-Mayhem, we have a system of continuous improvement that we follow to increase the reliability of the systems we manage. This article will share how we do that.
Too many vs. too few alerts
If you have too many or too few alerts coming in, the obvious fix is to turn some off or add more. Both moves carry obvious risks. Shut off critical alerts and you’ll start missing outages. Turn on the wrong ones and you’ll soon be sifting through a ton of junk email. With too many alerts coming in, your team starts suffering from alert fatigue: people have to ignore some alerts just to get work done, ignoring becomes a habit, and the alerts devolve into background noise. Ideally, you have just enough alerts that responding to every one of them is manageable.
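One rough way to spot alerts that have already turned into background noise is to measure how often each one fires without anybody acting on it. The snippet below is only a sketch under assumptions: the record fields `name` and `acted_on`, the function `ignored_ratio`, and the sample data are all hypothetical, and your NMS or ticketing export will look different.

```python
from collections import Counter


def ignored_ratio(alerts):
    """Return {alert name: fraction of firings that nobody acted on}."""
    fired = Counter()
    ignored = Counter()
    for alert in alerts:
        fired[alert["name"]] += 1
        if not alert["acted_on"]:
            ignored[alert["name"]] += 1
    # Every name in `fired` has a count of at least 1, so no division by zero.
    return {name: ignored[name] / fired[name] for name in fired}


# Hypothetical sample data, not taken from a real NMS export.
sample = [
    {"name": "disk_usage_warning", "acted_on": False},
    {"name": "disk_usage_warning", "acted_on": False},
    {"name": "disk_usage_warning", "acted_on": True},
    {"name": "site_down", "acted_on": True},
]

ratios = ignored_ratio(sample)
for name, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{name}: ignored {ratio:.0%} of the time")
```

An alert that is ignored most of the time is a candidate for retuning or removal, but the ratio is only a starting point. A rarely acted-on alert may still be the one that catches a real outage.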
Each outage or alert is an opportunity to improve the reliability of your system. You do that by following up on the alert with a postmortem.
Which alerts should I change?
To know which alerts to change, consider the impact of the incident behind each alert and the cost of responding to it.
[Figure: quadrant chart of alerts, plotting cost of responding against urgency + importance]
I’ve numbered the quadrants by priority. Ideally, you’d be adding alerts to #1 and trimming them from #4. That’s not always possible, and sometimes you just need to do what you can.
Usually, when I join a new team, I make sure I’m getting all the alerts and spend time assessing the situation to determine which quadrant each alert falls into.
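That assessment can be kept as a simple tally. The snippet below is a minimal sketch, assuming you tag each alert by hand with two yes/no judgments: whether it is urgent and important, and whether it is costly to respond to. The `Alert` class, its field names, and the sample alerts are hypothetical. The quadrants are labeled descriptively rather than with the numbers #1 to #4, since that numbering depends on the chart.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Alert:
    name: str
    high_impact: bool       # urgent and important when it fires
    costly_to_handle: bool  # takes significant time or effort to respond to


def quadrant(alert):
    """Return a descriptive label for the quadrant an alert falls into."""
    impact = "high impact" if alert.high_impact else "low impact"
    cost = "high cost" if alert.costly_to_handle else "low cost"
    return f"{impact}, {cost}"


def tally(alerts):
    """Count how many received alerts landed in each quadrant."""
    return Counter(quadrant(alert) for alert in alerts)


# Hypothetical inbox after a week of receiving every alert.
inbox = [
    Alert("site_down", True, False),
    Alert("backup_failed", True, True),
    Alert("cpu_spike_5s", False, False),
    Alert("cpu_spike_5s", False, False),
    Alert("flapping_link_report", False, True),
]

counts = tally(inbox)
for label, count in counts.most_common():
    print(f"{label}: {count}")
```

Tagging by hand is deliberate here: the judgment about impact and cost comes from the person watching the alerts, and the code only keeps score.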
As with anything, there’s a limit to how far you should take this. I usually make it a priority only when the operational noise is so bad that it distracts from completing projects. Otherwise, I treat it as a filler project for any extra bandwidth the team has.
Thank you for checking this out. Hopefully this blog post gave you some good ideas on how to properly tune your NMS to your advantage so that it can be your friend rather than your enemy!
If you have any comments or questions, or just want to share your thoughts on this article, you can contact me at firstname.lastname@example.org
e-Mayhem helps companies successfully deliver business projects. We also help companies avoid losses associated with IT disruptions and security threats. You can learn more about our services at e-mayhem.com or by emailing email@example.com