Anatomy of a(n ongoing) Disaster..

Hopefully not THAT bad.

What a three weeks…

As I’m sure most of you already know, we’ve had nothing but troubles, large troubles, for pretty much the last three weeks. A lot of these troubles were our fault, a couple of them were at least ostensibly beyond our control, and they all compounded each other.

Here I’ll try and go into as much detail as possible about what happened, why, and the steps we’re taking to stop this sort of thing from ever happening again. I can’t excuse what happened, just apologize and hopefully elucidate.

Ironically, all the recent disasters stem, at least in part, from our attempt to take some proactive steps to head off any future power outages like the one we experienced last year.

Not THAT bad either..

The Back Story

As some of you may know, we are co-located with Switch and Data in The Garland Building in downtown L.A. To say we’re co-located is a bit misleading though, since we’re now basically 95% of their data center.

Why don’t we have our own data center?

Because, believe it or not, we’re still not big enough for it to make sense. Even now, we only use about 1000 sq ft of data center space.. for it to really start to make sense to get our own space, we’d have to be using around 2500 sq ft. That’s mainly because when you buy a data center, you want one big enough to handle a lot of growth.. and although it’s cheaper per square foot than co-locating, you have to pay for all the space you’re not using yet.

And really, The Garland Building is supposed to be an excellent place for data centers. There are more than a dozen in the building. Companies like iPowerWeb, Media Temple, BroadSpire, and even MySpace (now the most popular website in the whole US!) are in there. It’s got FIVE huge generators, UPS for the whole building, on two separate power grids, and a dedicated engineering staff to make it all work flawlessly. Or so we were all assured.

Around last June though, the building informed all its data center tenants that they had essentially run out of power! Not power altogether, but the “good” power that data centers need.. i.e. UPS- and generator-backed power. Because Wells Fargo, who holds the master lease on the building, wasn’t sure they would renew it when it comes up in three years, they didn’t want to invest the millions of dollars in additional generators and UPS capacity. This is in fact the primary reason we’re still not selling any more dedicated servers.. they use too much power per dollar!

Of course, none of that was supposed to have any effect on their ability to keep the current power going in the case of an outage. On September 12th, 2005, we discovered they actually couldn’t… when two of the five generators failed!

However, since then, the building has repaired and replaced the faulty generators, and given all their tenants numerous assurances that what happened before would never ever happen again.

Not THIS horrible..

Why didn’t we move data centers right then?

That would have been a fairly massive undertaking: it would have meant even more downtime, and it would have been very expensive. We did look around, and there weren’t any really good options for moving… data center space is becoming pretty tight (in the LA area at least), and the Garland Building is still one of the best options, believe it or not. Also, this was the first time something like this had ever happened, and it seemed pretty reasonable that it wouldn’t happen again. We even asked around, and none of the other tenants mentioned above were moving, so I guess people were generally pretty confident it was a one-time freak occurrence.

Nevertheless, we started making contingency plans, searching around for another data center that had some power and would make sense for us. Eventually, we found Alchemy, just down the hall from S+D actually, and began making arrangements for getting some space from them. They had a little bit of power available because they were moving some of their clients out to El Segundo, and because they had gotten permission from the building to install their own generator. With that generator and some UPSes they were able to convert a “dirty” power feed into “clean” (i.e. good for data center use) power.

Pretty bad...

How the troubles began.

All this took a very, very, very long time. After months of searching and negotiating with Alchemy, we still had to get Switch and Data to allow us to put in a cross-connect from their data center over to their competitor down the hall. After even more months and teeth-pulling, we finally got that up and running. In fact, we got the first live server up in Alchemy a little less than a month ago.

All this in an attempt to head off future power problems.

Unfortunately, shortly after setting up the new footprint, we noticed something wasn’t right. Traffic between Switch and Data and Alchemy was losing huge buckets of packets. Just as we were trying to figure out that problem, we started to have problems with one of our file servers.

This resulted in a lot of problems across the board. The web servers that mounted that filer all had problems. The mail servers that mounted that filer all had problems. In fact, one of the mail servers was misconfigured and was logging thousands of errors a second to a remote logging machine… so many, in fact, that it was saturating its switch and clogging up a whole chunk of our network. Which in turn caused other machines to get slow and crashy because they couldn’t reach their filers, and so on and so on.
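
One lesson here is that remote logging needs a ceiling of its own. Here’s a rough Python sketch of the idea, not our actual setup.. the hostname and the 50-per-second budget are made up, but a per-process rate limit on the syslog handler like this would keep one runaway logger from flooding a switch:

```python
import logging
import logging.handlers
import time

class RateLimitFilter(logging.Filter):
    """Drop log records once a per-second budget has been used up."""

    def __init__(self, max_per_second=50):
        super().__init__()
        self.max_per_second = max_per_second
        self.window_start = time.monotonic()
        self.sent_this_window = 0

    def filter(self, record):
        now = time.monotonic()
        if now - self.window_start >= 1.0:
            # New one-second window: reset the budget.
            self.window_start = now
            self.sent_this_window = 0
        self.sent_this_window += 1
        # Returning False tells the logging framework to drop the record.
        return self.sent_this_window <= self.max_per_second

# Ship logs to a remote syslog box, but never more than ~50 records a second.
# "loghost.example.com" is a placeholder, not a real machine of ours.
remote = logging.handlers.SysLogHandler(address=("loghost.example.com", 514))
remote.addFilter(RateLimitFilter(max_per_second=50))
logging.getLogger().addHandler(remote)
```

Anything over the budget just gets thrown away, which is a much better failure mode than taking a chunk of the network down with it.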

It turned out the filer problem seemed to stem from the fact that it had one shelf of 300GB disks and one shelf of 150GB disks attached. Apparently filers aren’t supposed to support that mix, or at least it’s a bad idea. So, this one was entirely our fault. However, we had done the same thing on a number of other filers and never had problems before. Nonetheless, we will never mix disk shelf types on a file server again.

We eventually cleared all this up.

However, the Alchemy connection problems were still ongoing.

After trying all sorts of things, we eventually decided to replace one of our distribution switches that was acting strangely with a new one. This didn’t really seem to fix the problem either. This was on Friday, July 21st.

Never strikes thrice..

On Saturday, July 22nd, the building lost power.

This time, the generators actually worked, but the UPS failed! Honestly, it was much better than last year’s.. but unfortunately, even a brief power outage wreaks havoc on a data center. And this one wasn’t so brief.. here’s the building’s explanation:

At around 5:21pm, on Saturday July 22nd, a brown out occurred due to record high temperatures in downtown Los Angeles. Voltage dropped due to the high demand of electrical current along with equipment failure operated by the Department of Water and Power, City of Los Angeles. This condition caused failure of “ATS-B” switch and to UPS Module #3. Engineering crews were dispatched and began repair of this damaged equipment. A power interruption was required to replace contacts in “ATS-B”.

Repair of “ATS-B” failed contacts was completed on 7-24-06. Power was restored between 4:00am and 4:30am by the Engineering department.

Thank you,
Office of the Building

So, after all the emergency filer work the previous weekend, just about the entire admin team was back again last weekend, working on getting everything back up when power came back on. Even once we had power, it was in a degraded state, and so the cooling wasn’t working. As temperatures rose, file servers automatically shut themselves down rather than risk being damaged by the hostile environment. Apparently, MySpace decided to just keep all their servers off until cooling was restored.

Where are the engineers?!

More network troubles..

After the power outage, we decided to just yank everything back out of Alchemy (they lost power too!) until we could figure out what was going on with the network connection over there. Unfortunately, this didn’t seem to fix things, and our internal (“red”) network was still really fubar. When our red network isn’t working, the panel isn’t working, webmail isn’t working, and our server configuration system starts having problems (basically, anything that connects to our internal databases breaks).

It took us just about all of Monday to figure out (and then fix) that a lot of the file servers had bad routes after being power-cycled.. and so were sending ALL their traffic through the red network, saturating it. These machines are generally pretty stable and a lot of them hadn’t been rebooted since September 12th, 2005.. and some had apparently had their networking set up by hand instead of correctly configured via our database. We’re making sure that doesn’t happen anymore either.
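
The longer-term fix here is basically an audit: regularly compare what each machine’s routing table actually says against what our configuration database says it should say. Here’s a rough Python sketch of the idea, not the actual tool.. the hostnames, gateway addresses, and the assumption that every box is a Linux machine answering “ip route” over SSH are just for illustration:

```python
import subprocess

# Illustration only: what the configuration database says each host's
# default gateway should be (hostnames and addresses are made up).
expected_gateways = {
    "web-101.example.internal": "10.1.0.1",
    "mail-07.example.internal": "10.1.0.1",
}

def live_default_gateway(host):
    """Ask a (Linux) host which default gateway it is actually using."""
    result = subprocess.run(
        ["ssh", host, "ip", "route", "show", "default"],
        capture_output=True, text=True, check=True,
    )
    # Expected output looks like: "default via 10.1.0.1 dev eth0"
    for line in result.stdout.splitlines():
        parts = line.split()
        if parts[:2] == ["default", "via"] and len(parts) >= 3:
            return parts[2]
    return None

for host, expected in expected_gateways.items():
    actual = live_default_gateway(host)
    if actual != expected:
        print(f"{host}: expected default via {expected}, found {actual}")
```

Run from an admin box on a schedule, a check like this would flag a hand-configured machine long before a power cycle gets a chance to expose it.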

More network troubles..

Once that was fixed, things generally got better. Except there was STILL strange stuff going on (causing slowness and high loads around the system, but not an actual system-wide outage), even without NFS traffic going through red, and even without anything at Alchemy. It started to look like there was a problem with one of our core routers. We called our Cisco consultant and opened a trouble ticket with Cisco themselves..

Servers crashing? Not so bad.

More power problems..

On Friday, July 28th, we lost power again. The building wrote:

The Garland Building experienced a dead short which resulted in a brief power outage today, July 28, 2006. The air conditioning, elevators, and the electrical utility have all been restored.

While on generator power, a dead short occurred from one of our internal telecom users. We are investigating where the dead short occurred. A follow-up memo will be sent by the end of the business day reconfirming our transfer at 11:30pm tonight. We are currently on DWP power until further notice.

And then:

The Garland Building UPS System is back on-line supplied by DWP. Diesel generators have returned to an on-call status.

The 11:30pm transfer has been cancelled due to the dead short prematurely returning us to utility power. At 4:30pm the engineers engaged the UPS System to protect all tenants at the Garland Building.

Thank you,
Office of the Building

This time, we were able to get our entire system back up much more quickly and with close to no problems. Of course, it had been less than a week since our last power outage.

Alchemy was the only data center in the building who did not lose power this time.

Could be worse.

More network troubles..

Over the weekend (this last weekend), we kept having the same ongoing weird network problems I mentioned above, and Cisco hasn’t made much progress. Yesterday, we realized the new distribution switch (an Extreme) was causing spanning tree problems with the older Ciscos. Jeremy got it all figured out, but in the process the switch erroneously blocked our “green” (public!) network for a few brief periods, taking everything down again.

Unfortunately, that STILL doesn’t seem to have fixed the ongoing core network problems. We were finally able to get our tickets escalated with Cisco yesterday. It’s starting to look like something may have been damaged during the first power failure, although we’re not sure. It looks like the replacement/repair cost might be around $80,000.

Happier days..

And that’s where things stand today.

Our number one priority right now is getting this nagging network problem understood and fixed. Once that’s done, we should be able to put things back in Alchemy, which at least didn’t lose power on Friday. Once things are going well there, we’ll be able to add new servers and transition old ones over slowly, with little to no downtime.

We’re also going to be buying our own UPSes, since we’ve learned we can’t trust our data center OR our building to keep the power clean. We’ll start by putting the core routers on them, then our internal databases and servers, then our file servers, and finally the hundreds of customer mail, web, and database servers.

The end.

Finally…

We’re very sorry for what happened. We definitely don’t want it to happen again, and we’re trying to take all the practical steps we can to prevent it. We never want to have another July 2006 again.

Ironically, some of the network problems seem to have stemmed from us trying to better protect ourselves from power failures. I also want to say for the record that none of these problems in my opinion stemmed from “overselling”. Rather, I’d say it’s the result of bad luck. And incompetence on our (and the building’s) part.

I don’t know if we’ll be able to change our luck, but hopefully we’ve at least learned something and will be able to become a tiny bit less incompetent in the future.

I hope you’ll all stay with us to find out.