Friday, May 25, 2012

The problem with management tools

Management tools, whether they are for networks, systems, nuclear power plants, or anything else, are really hard to get right. It's not that they can't report on errors; they can. In fact, they can report on so many errors that you end up with no time for anything but chasing them.

Now what some people do is turn off everything, wait until something bad happens, and then go back and figure out which alert would have told them about it. This keeps you from missing the problem the second time, but it also means you miss it the first time. Not a bad approach, but pretty time-consuming as well.

Many systems have default warning values, and while they are reasonable, they often don't match my particular environment.

For example, we have a server that generally has 97% of its disk space used. Now I know the argument: that's too close, you should add more space. But it's always been at 97%. It never fluctuates because none of the temp files are stored there. I don't need an alert every day that it is nearly out of disk space. If it hits 98%, maybe I want to know.

Now I can easily go in and set that particular threshold for that particular server, but there are literally millions of those in our environment. I would need a team just to go through and configure them once, let alone keep up with them. This obviously is not a very scalable approach.

What I really want is a smart configuration tool. I want it to run and keep track of all the alerts and what it thought was wrong, and every day or week ask me, "How did things run?" If the answer is "good," it should reset the alert thresholds to be high enough not to get tripped by normal behavior but low enough to still alert me.
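Here is a rough sketch of what that retuning step might look like. It is only an illustration of the idea, not a feature of any real tool: the function name `retune_threshold`, the one-point `margin`, and the 99% `ceiling` are all assumptions I made up for the example.

```python
def retune_threshold(current, observed, ran_well, margin=1.0, ceiling=99.0):
    """Suggest a new alert threshold for one metric on one server.

    current  -- the threshold in use today (percent)
    observed -- values seen during the review period (percent)
    ran_well -- the operator's answer to "How did things run?"
    margin   -- hypothetical headroom added above the observed peak
    ceiling  -- hypothetical hard cap so alerts never disappear entirely
    """
    # If things did not run well, or we have no data, change nothing.
    if not ran_well or not observed:
        return current
    # Sit just above normal behavior, but never above the cap.
    return min(max(observed) + margin, ceiling)


# The server that always sits at 97% would get a 98% threshold.
print(retune_threshold(90.0, [97.0, 96.9, 97.0], ran_well=True))
```

Run once per server per review period, something like this would replace the team of people hand-editing thresholds, with the operator only answering the "how did things run?" question.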

Friday, May 18, 2012

What's your vendor's CLOUD score?

Are you, like almost everyone, looking at moving to the cloud? Before you move your critical applications to a new provider check out their CLOUD score.


Company: Are they financially stable? Is their management team seasoned and well respected? Who invested in them? Are they making money or burning cash? Will they survive a disaster and still be there for you?

Legal: Is the legal contract right for you? Check out this blog post to see if you have added the right clauses to protect yourself. Having a good relationship is key, but if something goes south the contract is what the courts will look at, not what the sales team said over lunch.

Openness: Can you integrate with other systems easily? Can you move your data out if needed? Can they use third-party authentication methods, like LDAP or SAML? Do they have a "trust" site so you can see if they are having performance or reliability issues? Do they share their roadmap so you can plan appropriately? Is their knowledge base available and useful?

Usability: Can users learn the tool quickly? Is online training available? Can you make domain-wide changes with an administrator tool? Can you do bulk uploads? Is it easy to manage the system? Is it easy to work with support?

Development: How easy is it to customize? Can you make meta changes from the user interface, or do you need a coding expert to make all changes? Do your developers know how to code in the language it uses? If not, how hard is it to find qualified developers, or train yours? Are the APIs well written and robust?

If you ask all of these questions and are comfortable with the vendor's answers, you can't go wrong. We use a spreadsheet that asks these questions and weights the scores to give us a result. Using the CLOUD score really helps us make the right decisions and avoid problems.
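If you would rather see the arithmetic than a spreadsheet, a weighted score can be sketched in a few lines. This is not our actual spreadsheet: the 1 to 5 rating scale, the weights, and the name `cloud_score` are assumptions chosen only to show how the weighting works.

```python
CATEGORIES = ["Company", "Legal", "Openness", "Usability", "Development"]


def cloud_score(ratings, weights):
    """Return the weighted average of the five CLOUD category ratings.

    ratings -- category name -> rating (a 1-5 scale is assumed here)
    weights -- category name -> relative importance (hypothetical values)
    """
    missing = [c for c in CATEGORIES if c not in ratings or c not in weights]
    if missing:
        raise ValueError(f"missing categories: {missing}")
    total_weight = sum(weights[c] for c in CATEGORIES)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(ratings[c] * weights[c] for c in CATEGORIES) / total_weight


# Hypothetical vendor, with Legal and Openness counted double.
ratings = {"Company": 4, "Legal": 3, "Openness": 5, "Usability": 4, "Development": 4}
weights = {"Company": 1, "Legal": 2, "Openness": 2, "Usability": 1, "Development": 1}
print(round(cloud_score(ratings, weights), 2))
```

Dividing by the total weight keeps the result on the same scale as the ratings, so vendors stay comparable even if you change the weights later.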

If you want to check out the spreadsheet we use to calculate our vendor's CLOUD score, check it out at

Friday, May 4, 2012

Stop micromanaging your network

Are you a micromanager? I don't mean with your staff; I mean with your network. Everyone knows that micromanaging drains energy and efficiency from teams, but did you know it can do the same to your LAN?

In some networks, management traffic like SNMP, NetFlow, and ICMP can use 30% of the bandwidth and can actually cause some of the issues you are trying to stop. Partly this is because of the broken paradigm we use to manage networks: we configure and manage everything separately.

What we really want is a way to communicate with the network and describe the behavior we want, and then let the devices work together to "make it so".

Imagine that you use SAP. (OK, many of you do use SAP, so that's not too hard to do, right?) SAP runs your company and is obviously very important. Now imagine you could tell your network, the whole network and not device by device, that SAP is important and should be treated as important.

Now maybe peer-to-peer traffic isn't important, so you don't want it taking up all of your resources. But if the resources are just sitting there doing nothing but costing you money, let them be used. Some traffic you may never want on the network at all, either because you don't use it and want to mitigate risk, or because it violates a regulation.

As new, unclassified applications come online and start to get used, the network should be smart enough to let you know: "Hey, I've seen a lot of traffic using a new application called Skype. What do you want me to do with it?" You could then answer: block it, make it important, or make it unimportant. The network would know what that means and configure itself to do that.
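To make that conversation concrete, here is a toy model of it. This is a thought experiment and not how any real product works: the `NetworkPolicy` class, the `Treatment` names, and the idea of a `pending` list are all invented for illustration.

```python
from enum import Enum


class Treatment(Enum):
    IMPORTANT = "important"
    UNIMPORTANT = "unimportant"
    BLOCK = "block"


class NetworkPolicy:
    """One network-wide policy instead of per-device configuration."""

    def __init__(self):
        self.rules = {}    # application name -> Treatment
        self.pending = []  # applications seen but not yet classified

    def classify(self, app):
        """Return the treatment for app, or None if the operator must decide."""
        if app in self.rules:
            return self.rules[app]
        if app not in self.pending:
            self.pending.append(app)  # "What do you want me to do with it?"
        return None

    def decide(self, app, treatment):
        """Record the operator's answer for the whole network."""
        self.rules[app] = treatment
        if app in self.pending:
            self.pending.remove(app)


policy = NetworkPolicy()
policy.decide("SAP", Treatment.IMPORTANT)
policy.classify("Skype")        # unknown, so it is queued for a decision
print(policy.pending)           # the network's question to you
policy.decide("Skype", Treatment.UNIMPORTANT)
```

The point of the sketch is the shape of the interaction: you state intent once, and translating "important" or "block" into device settings is the network's job, which this toy leaves out entirely.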

I think this is the future of network management. What do you think? With things like OneFabric, isaac, and CoreFlow2 switches, we are well on the way to making this vision a reality. Learn more by going to, or ask me.