Friday, May 25, 2012

The problem with management tools



Management tools, whether they are for networks, systems, nuclear power plants, or anything else, are hard to get right. It's not that they can't report errors; they can. In fact, they can report so many errors that you end up spending all your time chasing alerts and never get to anything else.

Now what some people do is turn everything off, wait until something bad happens, then go back and figure out which alert would have warned them about it. That keeps you from missing the same alert a second time, but it also guarantees you miss it the first time. Not a bad approach, but pretty time consuming as well.

Many systems ship with default warning thresholds, and while they're reasonable, they often don't match my particular environment.

For example, we have a server that generally sits at 97% disk space used. I know the argument: that's too close, you should add more space. But it has always been at 97%. It never fluctuates, because none of the temp files are stored there. I don't need an alert every day telling me it's nearly out of disk space. If it hits 98%, maybe I want to know.
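The per-server exception above can be sketched as a simple override table layered on a default threshold. The server names, the 90% default, and the override value are my own illustrative assumptions, not anything from a real monitoring product:

```python
# Sketch: a default disk-usage alert threshold with per-server overrides.
# Names and numbers are hypothetical.

DEFAULT_THRESHOLD = 90.0  # percent used that normally triggers an alert

# Overrides for servers whose "normal" is unusual by design.
OVERRIDES = {
    "archive-01": 98.0,  # sits at 97% permanently; only alert above that
}

def should_alert(server: str, pct_used: float) -> bool:
    """Alert only when usage exceeds this server's effective threshold."""
    threshold = OVERRIDES.get(server, DEFAULT_THRESHOLD)
    return pct_used > threshold

print(should_alert("archive-01", 97.0))  # quiet: normal for this box
print(should_alert("web-04", 97.0))      # alerts: unusual for a default box
```

The catch, of course, is exactly what the next paragraph says: a table like this has to be filled in and maintained by hand, once per threshold.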

Now I can easily go in and set that particular threshold for that particular server, but there are literally millions of thresholds like it in our environment. I would need a team just to configure them once, let alone keep them up to date. That is obviously not a scalable approach.

What I really want is a smart configuration tool. I want it to run, keep track of all the alerts and what it thought was wrong, and every day or week ask me, "How did things run?" If the answer is "good," it should reset the thresholds to be high enough not to get tripped by the same noise, but low enough to still alert me when something is genuinely wrong.
