Hey everybody! So, yeah, yesterday was exciting.
I’ll give you a little timeline of what happened when yesterday. But first, to describe the problems in general:
Beyond the obvious LA DWP screwup and the building generator screwup (more on exactly what that was soon), there are some non-trivial problems when all of your stuff shuts down. The biggest is that when any group of computers is unexpectedly powered off, a small percentage don’t come back up. When you’re talking about a few hundred servers, a small percentage becomes a significant number!
You also have to be somewhat careful about how quickly you power things back up, because a popped breaker is just about the last thing you want.
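For the curious, here is a rough sketch of what "powering things up slowly" can look like as a plan. This is not our actual procedure or tooling, just a minimal illustration: the cabinet names, the batch size, and the delay between batches are all made-up assumptions, and the function name `staggered_power_up` is hypothetical. The idea is simply to split the cabinets into small batches and space the batches out so the inrush load never hits the breakers all at once.

```python
from typing import List, Tuple


def staggered_power_up(
    cabinets: List[str], batch_size: int, delay_s: int
) -> List[Tuple[int, List[str]]]:
    """Return a power-up plan as (seconds_from_start, cabinets_in_batch) pairs.

    batch_size and delay_s are illustrative knobs, not recommended values.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if delay_s < 0:
        raise ValueError("delay_s must not be negative")

    schedule = []
    # Walk the cabinet list in chunks of batch_size.
    for start in range(0, len(cabinets), batch_size):
        batch_number = start // batch_size
        # Each batch starts delay_s seconds after the previous one.
        offset = batch_number * delay_s
        schedule.append((offset, cabinets[start:start + batch_size]))
    return schedule


if __name__ == "__main__":
    # Hypothetical cabinet labels; 4 per batch and 120 s apart are assumptions.
    cabinets = [f"cab-{n:02d}" for n in range(1, 41)]
    for offset, batch in staggered_power_up(cabinets, batch_size=4, delay_s=120):
        print(f"t+{offset:>4}s: power on {', '.join(batch)}")
```

In real life you would pick the batch size and delay from the actual breaker ratings and the startup draw of the hardware, and a human would still be watching the load as each batch comes up.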
11:45 Went to lunch at sushi place on Flower St.
12:30 BLACKOUT! We’re walking back and see streetlights out, office tower generators starting up. Our building’s UPS kicks in, generator is running (we can see the smoke from our office).
12:45 My idiotic blog post.
12:46 Building evacuated. Our upstream providers are down, although our servers are still powered up.
1:00(ish) UPS depleted, generator fails, power is down.
1:45 Jason, our datacenter manager, bullies his way in with me while building is still evacuated.
2:00-3:30 We plan for all the fun stuff that happens when the power comes on. Upstreams come back up, although now our servers are down.
3:45 Datacenter power back on, we start powering up cabinets slowly.
4:15 Most of our public network up, firewall busted, private network down (which means our monitoring system is down, DreamHost site down, Web Panel down). Our blog is up because it’s just on a normal shared hosting account.
5:00 Big problems come first: File server replaced, firewall being feverishly worked on, 3 public cabinets (out of 40) still down due to switch problems.
6:00 Firewall fixed (which lets us quickly identify continuing problems via monitoring system).
6:30 All public cabinets up, individual machines/services still wonky. Web Panel, etc. up.
6:30-midnight: Fixing individual servers/services. Some weird Web Panel redirection loop errors fixed. Webmail login errors fixed.
I do have to say, walking around and seeing aaaaaall our servers powered down and quiet was pretty creepy. I’ve been here for a long time, and seen plenty of nasty problems, but this one was particularly freaky.
We would have liked to post something like the above live, but we were literally running and yelling and typing as fast as we physically could from 3:45 to midnight.
We are, of course, already digging into the issue of why power was ever out in this supposedly-very-prepared building in the first place. We’ve had grid outages before and never noticed a blip. There are LOTS of other internet companies in this building (including a bunch of other shared hosting folks) so yeah, we’re all pretty much going crazy about how badly the building handled things. We’ve always been told there are HOURS (not half-hours) of UPS capacity and that the generators are regularly tested and well-maintained.
Also, I bought a lottery ticket on a Red Bull run. I figure karma might balance things out. Keep your fingers crossed.