Another Anatomy

X-Rays are used to explain a lot of things at DreamHost.

Okay, nothing silly this time, I promise…

Some of you may have noticed that we’ve been having a problem that, although maybe not the worst in DreamHost history, is definitely in the top 5.

There has been a DreamHost Status post about it, but it’s been going on so long that there clearly needs to be more said.

This wasn't the first disaster.

The History

The events that conspired to cause this horrible performance for everybody in our “Blingy” cluster actually started to take root 19 months ago.

That was when I made this post asking our customers for some suggestions on storage. I made the mistake in that post of mentioning the name of one particular storage vendor who apparently searches for their name in the RSS feeds of all kinds of blogs. I won’t mention their name again here, to test if they REALLY read this blog, but they were the one on the list right after “Netapp”.

Anyway, immediately a sales guy from there was hounding me about how great their product was. It would have super-duper reliability, super-duper performance, and super-duper ease-of-management. It was super-duper expensive compared to our current solution (about 3x the price per GB), so in the end I declined.

But, over the next year he kept hounding me and hounding me, and eventually the price came down to something in line with our current costs, so we decided to try one unit for our new cluster, “Blingy”. After we were satisfied with our internal testing, Blingy went live with the new storage solution in December 2007.

No need for life boats!

Smooth Sailing

At first, everything was fine: performance was great, everybody was hunky-dory. But then, as usage started to go up, the new file system started acting up. Around the same time every night, the system would stop responding to NFS requests for a while, which would immediately break web and mail service for everybody in the entire cluster… thousands of customers.
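(For the curious, here’s roughly the kind of watchdog check that catches a freeze like that — a minimal Python sketch, not our actual monitoring code; the mount point and timeout are made up.)

```python
import os
import sys
from multiprocessing import Process

NFS_MOUNT = "/mnt/fcvol1"  # hypothetical mount point
TIMEOUT_SECS = 10          # how long to wait before calling it "frozen"

def probe(path):
    # stat() against a hung (hard-mounted) NFS export can block forever,
    # so run it in a child process and give up waiting after a timeout.
    os.stat(path)

if __name__ == "__main__":
    p = Process(target=probe, args=(NFS_MOUNT,))
    p.start()
    p.join(TIMEOUT_SECS)
    if p.is_alive():
        p.terminate()
        print(f"ALERT: {NFS_MOUNT} did not respond within {TIMEOUT_SECS}s")
        sys.exit(1)
    print(f"OK: {NFS_MOUNT} is responding")
```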

Our Bad

Now, it can be a big mistake to put live customers on any new system. But honestly, we’d tested it a lot, researched it a ton, and added people very slowly at first, and it performed great.

Our biggest mistake, I believe, had nothing to do with the specific vendor or hardware we went with… it was simply putting so many eggs in one basket!

Even with our Netapps (which are pretty much awesome), there are problems from time to time. However, a typical hosting cluster will have a dozen or so Netapps, which means any one problem is one-twelfth as big.

With Blingy, EVERY customer is on this one “mega” filer, which in theory should make for better performance, reliability, and ease of management. And since we got the clustered solution (in an active-active configuration)… there really is no single point of hardware failure in this thing.

But, as it turns out, there are a lot of non-hardware failures in the world.

Their Bad

Well, the techs at the vendor couldn’t figure out what was causing the NFS freezing, so they recommended we do a major OS upgrade to hopefully fix it.

During this whole time, the Fibre Channel disks were slowly filling up, and we’d been trying to move large files off to the SATA pool (it’s a two-tiered solution, and there’s a feature that automatically moves less-accessed data to lower tiers)… however, the thing couldn’t move the data fast enough. It couldn’t finish a “move job” in a single day, and every day it’d sort of “crash”, which would screw up the move job, and nothing would get moved.
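To give a sense of what one of those tier-down “move jobs” is up against, here’s a simplified Python sketch of the general idea — emphatically not the vendor’s actual implementation; the paths, thresholds, and time budget are all invented:

```python
import os
import shutil
import time

FC_TIER = "/mnt/fcvol1"      # hypothetical fast (Fibre Channel) tier
SATA_TIER = "/mnt/satavol1"  # hypothetical slow (SATA) tier
COLD_AFTER = 30 * 86400      # demote files not read in ~30 days
TIME_BUDGET = 6 * 3600       # give up after 6 hours of moving

def files_coldest_first(root):
    """Return (atime, path) for every file under root, coldest first."""
    found = []
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            found.append((os.stat(path).st_atime, path))
    found.sort()
    return found

def run_move_job():
    started = time.time()
    for atime, path in files_coldest_first(FC_TIER):
        if time.time() - atime < COLD_AFTER:
            break  # everything left was read recently; keep it on FC
        if time.time() - started > TIME_BUDGET:
            print("out of time; move job incomplete")  # the nightly failure mode
            break
        dest = os.path.join(SATA_TIER, os.path.relpath(path, FC_TIER))
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        shutil.move(path, dest)  # a real filer migrates blocks, not files

if __name__ == "__main__":
    run_move_job()
```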

Also, as the disks kept getting more full, performance kept getting worse, creating a vicious cycle. At the end of February we ordered some more Fibre Channel disk shelves to grow the main FC volume, since we couldn’t get things off to SATA; they were supposed to arrive on March 10th and be installed at the same time as the major OS upgrade.

However, the disks didn’t end up getting installed until March 25th, and at that point it turned out we could NOT grow the FC volume with these disks (well, it was technically possible, but their on-site techs recommended VERY VERY heavily against it… it would severely impact performance), which was sort of the whole point. So now we had a new FC volume that we still had to migrate users to.

The Exxon Valdez ain't got NOTHING on us!

Your Bad

Of course, this whole time, new customers just kept signing up, and being added to Blingy. What were you guys thinking?

By this point we knew this was a bad idea, but we didn’t have a new cluster ready (we’d expected Blingy to grow for another couple of months), and we try to never ever grow old clusters again once they’ve been “shut off” from new signups (because in time they stabilize and have very few problems).

However, moving people off to the new FC volume, the original SATA volume, or even the new Netapp we also added to Blingy just wasn’t happening fast enough. So on April 2nd we bit the bullet, switched Blingy off as the “new customer” cluster, and started growing good old “Postal” again. Once we did that, we were finally able to get ahead of the curve, and total usage on our first Fibre Channel volume has been slowly dropping ever since.

At that point we tried to contact the vendor to see if we could just get more drives that WOULD allow us to grow fcvol1, but they said their manufacturers were closed for inventory for a week after the end of the quarter, and we couldn’t get anything until Friday, April 11th at the absolute soonest. Later they said they could find some they could get to us by Tuesday, April 7th, and we preliminarily said we’d take them.

This whole time we had a support ticket open with the vendor about the crashes (the OS upgrade didn’t fix them), and on April 3rd we received notice that they’d fixed the bug they believed was causing them! However, the patch still needed to go through their “QA”. Finally, this Sunday, April 6th, they said it was all ready to be deployed, so last night we deployed it.

What Now

Well, right now performance is still not great on fcvol1… but mail and web should be pretty much working. One thing we’ve noticed is that a website that hasn’t been visited in a long time will still have a big lag on the first visit… but then subsequent reloads/visits seem much faster.
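If you’re on Blingy and want to see the effect for yourself, a rough timing loop like this one shows the cold first hit versus the warmer repeats (the URL is a placeholder — substitute your own site):

```python
import time
import urllib.request

URL = "http://example.com/"  # placeholder: use your own site's URL

for visit in range(1, 4):
    start = time.monotonic()
    with urllib.request.urlopen(URL) as resp:
        resp.read()
    print(f"visit {visit}: {time.monotonic() - start:.2f}s")
# Expect a slow first visit while cold data gets read, then faster repeats.
```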

At least the total disk usage is coming down now, and hopefully by tomorrow it’ll be below 85%, which is supposedly the magic number below which performance is fine. We’re going to keep off-loading it until things are great, though. We’ve got plenty of disk space to move things to; the problem is just that it takes so long to move.
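If you like watching numbers go down too, a trivial check like this tracks a volume against that threshold (a sketch — the mount point is hypothetical, and the 85% figure is just the number from this post):

```python
import shutil

MOUNT = "/mnt/fcvol1"  # hypothetical mount point for fcvol1
THRESHOLD = 0.85       # the "magic number" mentioned above

usage = shutil.disk_usage(MOUNT)
fraction = usage.used / usage.total
status = "OK" if fraction < THRESHOLD else "still too full"
print(f"{MOUNT}: {fraction:.1%} used ({status})")
```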

I guess we’ll also find out tonight whether the NFS freezing bug is fixed by this new patch. Hopefully so.

Apologize this kung-fu kick!

It’s Too Late…

I realize this is probably too little, too late for many of you, but I just wanted to sincerely apologize for this whole big Blingy cluster-f*ck. Also, if you’re on Blingy (you can tell from the panel by clicking “account status” and looking at “Your Email Server”), we’d like to offer you a month’s worth of hosting credit.

To get it, all you need to do is contact support from our panel and make the subject of your message “Blingy Account Credit”. That’s all you have to do, and we’ll credit everybody who asks (and is actually on Blingy!) next Monday (April 14th).

Very funny, Mr. Happy Blingy Customer.