IT Operations Trench: March 2012

Friday, March 2, 2012

Please no more cloud FUD

Even before I read about the Azure outage I knew that somewhere, someone had a cloud outage. I could tell because there seemed to be an uptick in the number of "I told you cloud wouldn't work" articles, tweets and blog posts. Please no more...

Now I don't work at Microsoft and don't have any secret knowledge about what happened, but reading the public posts, it sort of sounds like a leap year bug caused "service management to go down". I suspect that while this was being fixed, some other issues were introduced that hurt performance and caused intermittent instability.
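For illustration only: the public posts don't say what the actual code looked like, but a typical leap year bug comes from adding one to the year and keeping the month and day. Here is a minimal Python sketch of that pattern. The function names are made up for this example and have nothing to do with Azure's code.

```python
from datetime import date


def naive_one_year_later(d: date) -> date:
    # Hypothetical buggy helper: bump the year, keep month and day.
    # Works 365 days out of 366, then raises ValueError on Feb 29
    # because the following year has no Feb 29.
    return d.replace(year=d.year + 1)


def safe_one_year_later(d: date) -> date:
    # One possible fix (an assumption, not the only choice):
    # fall back to Feb 28 when the target year has no Feb 29.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    print(safe_one_year_later(leap_day))
    try:
        naive_one_year_later(leap_day)
    except ValueError as err:
        print("naive version failed:", err)
```

The naive version passes every test you run on an ordinary day, which is exactly why this kind of bug slips through.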

Now I don't use Azure but "Service management" sort of sounds to me like any existing service would continue to work, but new ones can't be brought up. If that's the case that's not a real big deal. Admittedly it would stink if I was planning to launch my new hot startup on 3/1 and couldn't bring production online, but I've got to think that's pretty rare.

It's hard to argue that a leap year bug should have been missed, but hey, I've let some stuff slip through that in hindsight should have been caught. I mean, who hasn't typed "reboot" or, worse, "shutdown", hit Enter, and then said "Damn, wrong window"? Mistakes happen.

Mistakes happen in our own data centers too, folks. Anyone who honestly has never had an outage either runs a "data center" consisting of an Xbox, a Wii and a Pentium PC in their parents' basement, or makes so few changes that they are still running SunOS 4.1.3 because they aren't done testing that new Solaris stuff.

OK maybe there are a few folks that have been really lucky, but in today's environment we need to move fast. That means mistakes are going to happen.

Last year, or maybe two years ago now, we actually took quite a few servers down in our data center because of power. We have complete power redundancy, so this should never have happened. Dual feeds, dual switch gear, generators, UPSes, etc. The power even takes a different path through most of the building. Each cabinet has 2 (or 4) PDUs.

So what did we do? Well, an administrator plugged each of his servers into 2 PDUs, one in the front and one in the back. Unfortunately the redundancy is left and right. So even though each server was in two PDUs, both were on A power.
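This kind of mistake is easy to check for if you record which PDU each power cord goes to. Below is a rough Python sketch of such a check. The PDU names, the feed mapping and the server names are all made up for this example (it assumes a four-PDU cabinet with A power on the left and B power on the right). It is not our actual inventory or tooling.

```python
# Hypothetical mapping of PDU position to power feed.
# Assumption: left side is A power, right side is B power.
PDU_FEED = {
    "left-front": "A",
    "left-back": "A",
    "right-front": "B",
    "right-back": "B",
}


def servers_without_redundancy(server_pdus, pdu_feed=PDU_FEED):
    """Return the servers whose power cords all land on the same feed."""
    at_risk = []
    for server, pdus in server_pdus.items():
        feeds = {pdu_feed[pdu] for pdu in pdus}
        if len(feeds) < 2:
            at_risk.append(server)
    return sorted(at_risk)


if __name__ == "__main__":
    # Hypothetical cabling records.
    cabling = {
        "web01": ["left-front", "left-back"],   # two PDUs, both on A
        "db01": ["left-front", "right-front"],  # one on A, one on B
    }
    print(servers_without_redundancy(cabling))
```

Counting PDUs is not enough. What matters is how many distinct feeds those PDUs sit on.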

We had to take one of the UPSes offline for maintenance, and since we knew we had redundant power, we did this at noon. Looking back, not a great call. But it also wasn't the end of the world. We learned from it, corrected our mistakes and moved on.

No one said "See, I told you we shouldn't have hired Rich". Well, not that I heard anyway. My point is we all have outages and we all make mistakes. Let's just stop with the silly "See, cloud isn't reliable" every time someone has an outage.