It All Falls Down

My apologies.

On the off-chance (and judging by that graph of our Level 1 queue, it seems like a pretty good off-chance) that a few of you may have noticed a little problem we had last Thursday afternoon, all the way through Friday morning, I thought I might offer something in the way of an explanation to go along with that apology.

You customers really notice no DNS!

It’s funny how problems cascade.

It all started Wednesday around noon, when we had a sudden and mysterious network problem related to our core 2 router.

There seemed to be some sort of corruption with the ARP tables.. we eventually figured it out, and fixed it thanks to a gazillion sendARPs. Cisco support wasn’t helpful because we weren’t running the latest version of their IOS router operating system. Unfortunately, upgrading is scary stuff since it requires a short network outage, assuming everything works smoothly. We decided we’d do the upgrade Friday night.

Come Thursday at 2pm, exactly 24 hours after our previous outage was fixed, our network started to get wonky again. It seemed like it was most likely due to all the sendARPs from the previous day expiring at the same time. We were pretty much on top of this as soon as it happened though, and re-sent the sendARPs (staggered this time)!

In fact, it wasn’t actually due to an aging issue at all; it was just an IOS bug on the core router. No big deal; we pretty much had things under control should the same problem pop up again on Friday at 2pm, before the planned upgrade Friday night.

One pizza after another, all laid neatly on end.

A Chain of Events

Of course, little did we know, a chain of events had already been set in motion that would ruin everybody’s Friday.. FOREVER.

You see, every hour a little script of ours runs and purges old, dead entries from our active nameserver database. Really, it isn’t the end of the world for us to keep that old stale stuff around, but in the name of being good DNS citizens, I guess it’d been decided a while ago to remove them quickly.

Which is fine, I guess. However, the method by which we decided which entries should be removed was a bit suspect.

We first create a hash of ALL good domains “%domids” from our hosting database. Then, we go through all domains (as “$domid”) in our nameserver database and do:

unless ($domids{$domid}) {
    print "- removing stray records under non-existent domain $domid\n";
    $pdb->do("DELETE FROM records WHERE domain_id=" . $pdb->quote($domid));
}

Which works pretty well, assuming everything is working pretty well.
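
To make the failure mode easier to see, here’s roughly the shape of the whole job. This is a simplified sketch, not our actual script: the DBI connections, the table names (“domains” and “records”), and the credentials are all stand-ins, with $pdb and %domids matching the snippet above.

use DBI;

# Hypothetical connections: $hdb is the hosting database, $pdb is the nameserver database.
my $hdb = DBI->connect("dbi:mysql:hosting",    "dnsjanitor", "secret");
my $pdb = DBI->connect("dbi:mysql:nameserver", "dnsjanitor", "secret");

# Build a hash of every domain id that SHOULD exist, straight from the hosting database.
my %domids;
my $good = $hdb->selectcol_arrayref("SELECT domain_id FROM domains");
$domids{$_} = 1 for @$good;

# Walk every domain id the nameserver knows about and purge anything "stray".
# Note: if that hosting query silently comes back empty, %domids is blank and
# EVERY domain below looks like a stray.
my $known = $pdb->selectcol_arrayref("SELECT DISTINCT domain_id FROM records");
for my $domid (@$known) {
    unless ($domids{$domid}) {
        print "- removing stray records under non-existent domain $domid\n";
        $pdb->do("DELETE FROM records WHERE domain_id=" . $pdb->quote($domid));
    }
}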

Well, everything was not working pretty well on Thursday. Because of the network weirdness, the connection to the hosting database apparently didn’t work, leaving %domids blank.

And, due to the excellent error handling and sanity checking of that script, it did not die at that point, or even so much as raise an eyebrow, as it happily decided to delete every single domain in our DNS database.

I think I can see my site in there..

Now, for better or worse, it didn’t just hose the whole table at once. Instead, it just deleted one domain after another, in order.. which turned out to be a rather slow process on a busy DNS database. In fact, 22 hours later, when we finally found it STILL RUNNING (normally it finishes in under a minute since there’s nothing to delete), it had only deleted a third of the domains in the table.. about 300,000. Hooray!

It actually would have been a lot better if it’d just hosed everything at once. It would have been much easier to detect, and rectify, immediately.

Instead, things worsened gradually. It took over two hours before we even started getting reports from customers that their sites were down. At that point, it seemed like the problem was just some sort of residual effect of the network problem, and re-generating DNS for each person who wrote in fixed it right away, and for good.

As time went on, and the problems kept coming in, we realized there was a pretty major data loss in the nameserver database, and started running some scripts to regenerate it all. Those would take a couple of hours, but when they were done everything would be better, we assumed!
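
For the curious, the regeneration pass is conceptually just the inverse of the purge. The following is purely an illustration rather than the real script, reusing the hypothetical $hdb/$pdb handles from the sketch above plus a made-up regenerate_records_for() helper:

# For every domain the hosting database says should exist, check whether it
# still has any records in the nameserver database; if not, rebuild them.
my $hosted = $hdb->selectcol_arrayref("SELECT domain_id FROM domains");
for my $domid (@$hosted) {
    my ($count) = $pdb->selectrow_array(
        "SELECT COUNT(*) FROM records WHERE domain_id=" . $pdb->quote($domid));
    next if $count;    # this domain survived the purge; leave it alone

    print "+ regenerating DNS records for domain $domid\n";
    regenerate_records_for($domid);    # made-up helper: re-inserts the standard records
}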

It wasn’t until those regeneration scripts finished and we discovered there were still lots of missing domains that it finally dawned on us.. DNS records were continuously being deleted!

And THAT is when we finally found the culprit, fixed the mess, and started trying to make sure this would never happen again!

When it rains, we’re poor.

And where was DreamHost Status for all this?

DreamHost Status was down. (See, if you just read DreamHost Status you would have known that!)

Like they’ve said before, when it rains, it pours.

We thought DreamHost Status was down because of the huge crush of people trying to access it due to the network problems. So, when we could finally get into it, we switched it to a static HTML page to try and lighten the load.

Lighten the load it did not!

Right about then we got a message from our remote data center in San Francisco (both ns2.dreamhost.com and dreamhoststatus.com are kept completely off our main network and in a different city exactly so they wouldn’t be affected by outages like this!):

Your server’s switchport has been de-rated to 10 Mb/s because your server began generating an out-bound storm of packets. This type of event usually indicates a compromise in security.

We have taken this action to mitigate the amount of bandwidth transfer charges incurred by your account related to this activity.

Man, what timing! We did not need a DDoS attack right now.

But wait a second. Somehow that just seemed a little bit TOO Murphy-esque. And, indeed, when we probed them further, they told us:

According to my monitor, it appears you’re being DDoS attacked on your DNS service (UDP 53) specifically to IP 208.96.10.221. At 5a, your traffic peaked our threshold for dangerous amounts of packets going through your switch port which was when your server was de-rated.

That “Distributed Denial of Service” attack was actually just honest DNS requests!

Which was super-high because ns1.dreamhost.com was returning “I don’t have any records for that domain” for a ton of domains, due to the deletion of the DNS database entries, due to the haywire script, due to the network blip, due to the IOS bug, due to us not upgrading as quickly as possible because of the network downtime involved!

After Math is Art!

The Aftermath

Well, we did the IOS upgrade and it looks like it fixed the networking problems.

We also made our crazy script do some sanity checking. But more importantly (and in just two lines of code!), we’ve now set all our internal scripts to just DIE MISERABLY if they ever get any kind of un-good data from an SQL query. Clearly, ’tis better to not do something you were supposed to than to do something you were not supposed to!
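
To give a flavor of it (borrowing the hypothetical $hdb and $good names from the purge sketch earlier; the real check isn’t literally this, but it’s the same idea), the guard boils down to:

# Refuse to purge ANYTHING if the "good domains" query failed or came back empty.
die "hosting db query failed: " . $hdb->errstr unless defined $good;
die "hosting db returned zero good domains -- refusing to purge anything!" unless @$good;

Far better for the purge to skip an hour than to eat the table.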

We’re also going to separate good old DreamHost Status from absolutely everything else DreamHost-related.. even if that means moving it to Blogger or something!

We must break the cycle!