Tuesday, September 6, 2011

How people want to manage networks

I was watching a cool video with Marc Benioff and Eric Schmidt (CEOs of salesforce.com and Google - you may have heard of them), and one difference they highlighted between Microsoft and Apple was that Apple focused on the consumer experience more than Microsoft did. Now I'm not about to comment on that, but it did get me thinking: how do network managers want to manage networks?

The way it works now, managing networks is a lot of work. Typically you have a ton of devices that you need to configure manually, or at least separately, using a "command line interface" or CLI. The commands are usually pretty cryptic, though most CLIs have a help key, like ?, to make things easier. The problem is that you usually need to repeat this on every device to make a change or see what's going on.

Imagine a scenario where you get a call from a user in a remote site saying "SAP is slow". Generally that prompts a lot of questions, like:

"What site are you in?"
"Are other applications running slowly?"
"Is anyone else having the same issue?"
and my favorite
"Have you rebooted?"

Now even once you get these questions answered, typically a network engineer will need to find their laptop and boot it up (which can take a few minutes), connect to a network or cellular modem, fire up VPN and then start to look around at the site to see what's going on.

If this is an after-hours page, those minutes can seem like a long time (especially at, say, 3:00 AM when you are trying to be quiet so you don't wake the rest of the house). Then you need to start "telnetting" to different routers and switches to figure out what's happening. Many times it's as simple as a bad cable or port, and changing that fixes it, but it's not always the port the user is plugged into. Sometimes it is a port "upstream", which can take longer to find.

Other times, it's a simple issue of too many people using the link, sometimes appropriately, sometimes not. When a virus is involved, the users may not even realize that they are using resources.

I spend a lot of time thinking about how to make these problems easier to find. I'd love to say we don't have them at Enterasys, but we do too. When we have them, we figure out how to fix them, and I mean really fix the underlying issue, like why it takes so long to figure out that someone closed a fiber cable in a door in Ireland.

What we came up with is called isaac. With isaac, instead of having to boot up a laptop, connect in through VPN and then start troubleshooting by going to each device, I get to "chat" with my network. In the scenarios above, the chat is really simple:
"Are any devices down?" I should have already been paged on these, of course, but it's good to double check.
"Is the site experiencing any bandwidth issues?"
       If so, "Who is the biggest user?"
       Then, if I want to, I can stop the user from causing problems with a simple command like:
                    "ratelimit <user>", or
                    "blacklist <user>"
"Are any ports showing errors?"
     If so, where are they? Then I can have a local technician replace them.
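
For anyone who thinks in code, the chat above boils down to a short decision tree. Here is a rough sketch of it in Python. To be clear, this is not how isaac is implemented: the SiteStatus fields and the triage function are names I made up for illustration, and the answers are assumed to arrive already parsed.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SiteStatus:
    """Hypothetical answers to the three chat questions (assumed shape)."""
    devices_down: List[str] = field(default_factory=list)
    bandwidth_issue: bool = False
    biggest_user: Optional[str] = None
    ports_with_errors: List[str] = field(default_factory=list)


def triage(status: SiteStatus) -> List[str]:
    """Walk the questions in order and return suggested next steps."""
    steps: List[str] = []

    # 1. "Are any devices down?"
    for device in status.devices_down:
        steps.append(f"check device {device}")

    # 2. "Is the site experiencing any bandwidth issues?"
    #    If so, look at the biggest user. This sketch suggests the gentler
    #    option (ratelimit); blacklist would be the heavier alternative.
    if status.bandwidth_issue and status.biggest_user:
        steps.append(f"ratelimit {status.biggest_user}")

    # 3. "Are any ports showing errors?"
    for port in status.ports_with_errors:
        steps.append(f"dispatch technician to {port}")

    return steps


# Made-up example: a busy link plus one port with errors.
example = SiteStatus(
    bandwidth_issue=True,
    biggest_user="jsmith",
    ports_with_errors=["ge.1.12"],
)
for step in triage(example):
    print(step)
```

The user name and port name in the example are invented. The point is only that each question narrows down the next action.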

I can actually run these commands from my smartphone, or from anyone else's smartphone that lets me get to Twitter or Chatter.

We think this is a better way. What do you think?  I'd love to get comments on how you want to manage networks. What other commands would you want to see?


Thursday, September 1, 2011

Network "Ghostbusting" with isaac

One of the worst network issues to troubleshoot is the "ghosts in the network". These are the problems that are random and sporadic, and as soon as you look in one place, they seem to move. You know what I mean, right?

It starts with a call or email, "Hey, the network seems slow", and as you start looking at the network closet the user is in, you hear another "Is the Internet down?" from the other side of the building. Next, more calls come in, but everything you monitor shows as up and working fine. Usually these come in at lunch, or even worse, just as you sit down to a nice steak dinner at a fancy restaurant...

You can spend a lot of time tracking down these issues. At least I know we do, or rather, used to. I'm pretty lucky because when we are troubleshooting an issue, we will usually grab a few of our network engineers to help us. Not because we can't figure it out, but because it lets them see what we do to troubleshoot, which often leads to a new idea for a product. Flow setup throttling came up this way a few years back, and some cool isaac commands started this way too.

So imagine you start hearing about ghosts. What can you do? The first thing is to find the user. There's not much point troubleshooting the Andover LAN when the user is travelling and actually in Germany. (Trust me, been there before.) The command is called "find", pretty obvious, right? It gets you all the technical details you need, like which port or access point the user is connected to, what their IP is, what role they are in, etc. It also gives you a regional and local map in case you need to dispatch someone out there. If the name isn't obvious to you, you can create a new name for the command, like "locate", or "vinden" if you want the command in Dutch.
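
Renaming a command is a simple idea: a second name points at the same handler. Here is a rough sketch of that idea in Python. The CommandTable class and the find_user placeholder are assumptions made up for this example, not isaac's actual code or syntax.

```python
from typing import Callable, Dict


class CommandTable:
    """Maps command names, including user-defined aliases, to handlers."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[str], str]] = {}

    def register(self, name: str, handler: Callable[[str], str]) -> None:
        self._handlers[name.lower()] = handler

    def alias(self, new_name: str, existing: str) -> None:
        # An alias is just a second name for the same handler.
        self._handlers[new_name.lower()] = self._handlers[existing.lower()]

    def run(self, line: str) -> str:
        # First word is the command name, the rest is its argument.
        name, _, argument = line.strip().partition(" ")
        handler = self._handlers.get(name.lower())
        if handler is None:
            return f"unknown command: {name}"
        return handler(argument.strip())


def find_user(user: str) -> str:
    # Placeholder only: a real lookup would return port, IP, role and maps.
    return f"looking up {user}"


commands = CommandTable()
commands.register("find", find_user)
commands.alias("locate", "find")
commands.alias("vinden", "find")

print(commands.run("vinden jsmith"))
```

With the aliases in place, "find", "locate" and "vinden" all reach the same lookup, so each engineer can use whichever name comes naturally.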

The next thing is "topintferrs", which shows you the ports in your network that are generating the most errors. This could point you to a bad GBIC, a bad fiber connection or a bad copper cable. I also usually run "topusers", which shows the ports with the most traffic on them. If a link is running close to 100%, start looking there.
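
Both of those commands are, at heart, a "sort and take the top few" over per-port counters. Here is a rough sketch in Python of that ranking. The PortStats fields, the function names, the sample numbers and the 90% cutoff for "close to 100%" are all assumptions for illustration, not what isaac actually does.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PortStats:
    name: str
    errors: int          # interface errors seen in the sampling window
    utilization: float   # fraction of link capacity in use, 0.0 to 1.0


def top_by_errors(ports: List[PortStats], n: int = 5) -> List[PortStats]:
    """Ports with at least one error, worst first."""
    with_errors = [p for p in ports if p.errors > 0]
    return sorted(with_errors, key=lambda p: p.errors, reverse=True)[:n]


def top_by_traffic(ports: List[PortStats], n: int = 5) -> List[PortStats]:
    """Busiest ports first."""
    return sorted(ports, key=lambda p: p.utilization, reverse=True)[:n]


def near_saturation(ports: List[PortStats],
                    threshold: float = 0.9) -> List[PortStats]:
    """Ports at or above the threshold (0.9 is an assumed cutoff)."""
    return [p for p in ports if p.utilization >= threshold]


# Made-up sample data.
ports = [
    PortStats("ge.1.1", errors=0, utilization=0.35),
    PortStats("ge.1.2", errors=4120, utilization=0.10),
    PortStats("ge.2.7", errors=17, utilization=0.97),
]

print([p.name for p in top_by_errors(ports)])
print([p.name for p in near_saturation(ports)])
```

In this sample, the error ranking points at one port and the saturation check at another, which is exactly why it is worth running both.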

If you find a bad cable, you have the information to send the closest technician out to fix it. If you find it's a single user saturating a link, you can issue "ratelimit" to slow them down, or "blacklist" to kick them off the network entirely.

The best thing is that, using isaac, you can do all of this from your iPhone while at dinner, without leaving the table and letting your food get cold....