The fight for stable Private Servers

As I’m sure some of you have noticed, the stability of some of our PS servers has been spotty at best since roughly the end of November.  What started out as an emergency kernel upgrade to fix some pretty serious newly-released exploits turned into months of non-stop bug hunting that resulted in the discovery of not one bug, as we’d originally thought, but four!  To make matters even worse, these four bugs were spread across four separately distributed pieces of the kernel, which meant there wasn’t really anyone outside DreamHost who was likely to encounter our particular combination of issues.

The first symptom we noticed was that some hosts (ok, a lot of hosts…on the order of 30/day) were simply rebooting themselves.  The problem was that they were rebooting so quickly that most of the time they hadn’t even stored any logs related to what was going on!  After closer inspection and a bit of luck, we found the dreaded “PANIC” string in their kernel logs: the hosts were panicking because they’d run out of memory.  Here’s the thing: normally when a server runs out of memory, it’s a Really Bad Thing.  When you’re talking about a virtual server, however, things are a bit less “doomsday scenario”.  It turns out that the Linux-Vserver patch we were using was failing to check exactly which part of the system had just run out of memory, so if any single guest ran out, BOOM.  Down went the host (we have them set to automatically reboot in such cases to speed their recovery).

Incidentally, the semi-panic caused by the lack of logging for such an immediate crash prompted us to write a new system that remotely logs all sorts of debugging activity, so we can always be sure it’ll be available for later use.  With any luck, we’ll never again be delayed in fixing a stability issue for lack of information.
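We won’t bore you with every detail of that system here, but the gist is simple: get kernel messages off the box the instant they’re generated instead of hoping they make it to local disk.  Just to illustrate the general idea (this is NOT the actual system we built, and the collector address and port below are made-up placeholders), a bare-bones forwarder could look something like this:

```c
/* Purely illustrative (not the system we actually built): forward kernel log
 * messages to a remote collector the moment they appear, so they survive even
 * if this host dies seconds later.  Reads from /proc/kmsg (root only; klogd
 * normally owns this) and ships each chunk over UDP.  LOG_HOST and LOG_PORT
 * are hypothetical placeholders. */
#include <arpa/inet.h>
#include <fcntl.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#define LOG_HOST "192.0.2.10"  /* hypothetical remote log collector */
#define LOG_PORT 5140          /* hypothetical UDP port */

int main(void)
{
    char buf[4096];
    struct sockaddr_in dst;
    int kmsg = open("/proc/kmsg", O_RDONLY);
    int sock = socket(AF_INET, SOCK_DGRAM, 0);

    if (kmsg < 0 || sock < 0) {
        perror("setup");
        return 1;
    }

    memset(&dst, 0, sizeof(dst));
    dst.sin_family = AF_INET;
    dst.sin_port = htons(LOG_PORT);
    inet_pton(AF_INET, LOG_HOST, &dst.sin_addr);

    /* read() blocks until the kernel has new messages for us */
    for (;;) {
        ssize_t n = read(kmsg, buf, sizeof(buf));
        if (n <= 0)
            break;
        sendto(sock, buf, (size_t)n, 0,
               (struct sockaddr *)&dst, sizeof(dst));
    }

    close(kmsg);
    close(sock);
    return 0;
}
```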

So after fixing the suicidal servers we’d been dealing with (that first bug took about a week to track down and roll out fixes for), we were feeling pretty relieved.  Then we noticed that while we were no longer having 30 machines crash every day, we still had 20!  CRAP, we thought, what else could be wrong here?  Thankfully it didn’t take long to see that it was a bug in one of the security-related patches we use (thanks to the new-fangled remote logging system!).  So off we went to upgrade to the latest release, which had already fixed the bug (how lucky was that???).

And that’s where bug #3 comes in.  On an average PS host, we almost always see around 30,000 file handles in use at any given time (a file handle is basically what’s used by an application to read from or write to anything, be it a regular file, the network, whatever the case may be).  After upgrading we noticed something weird: after just a couple of hours, file handle usage was TEN TIMES the usual.  In order to ease some aspects of management, we decided a while back to boot some of our servers off of network storage, and one of the kernel patches that makes that possible is called AUFS (Advanced multi-layered Unification FileSystem).  That’s where the leak turned out to be.  After much back and forth with its developer, we finally got a patch back that fixed the problem.  That took a couple more weeks (and yes, we’re moving away from that system entirely).
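By the way, if you’re curious what “file handle usage” actually looks like, the system-wide numbers live in /proc/sys/fs/file-nr.  Here’s a quick, generic sketch of reading it (not our actual monitoring code, just an illustration of where the numbers come from):

```c
/* A quick illustration (not our actual monitoring code): print how many file
 * handles are in use system-wide by reading /proc/sys/fs/file-nr, which holds
 * three numbers: allocated handles, allocated-but-unused handles, and the
 * system maximum. */
#include <stdio.h>

int main(void)
{
    unsigned long allocated, unused, max;
    FILE *fp = fopen("/proc/sys/fs/file-nr", "r");

    if (!fp) {
        perror("fopen /proc/sys/fs/file-nr");
        return 1;
    }
    if (fscanf(fp, "%lu %lu %lu", &allocated, &unused, &max) == 3)
        printf("file handles in use: %lu (max %lu)\n",
               allocated - unused, max);
    fclose(fp);
    return 0;
}
```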

Phew, 3 kernel bugs.  What are the chances, right?  After all, we didn’t make THAT big a jump in order to fix the security holes.  We were feeling pretty unlucky, but at least the problems were finally behind us.

That’s when we noticed that we were still having about 10 hosts crash every day (before the upgrade we’d maybe see 2-3 crashes per WEEK).  Unlike the old crashes, we no longer saw any real pattern between the machines that were crashing and the ones that were stable.  Some used the AUFS code we thought might still be buggy, but some didn’t (the split was actually almost perfectly 50/50 every day).  All we knew for sure was that some trigger was spontaneously causing an entire machine to cease being able to process anything at all, requiring a heavy-handed reboot to fix.

We spent weeks talking with the Vserver developers, talking with our own in-house kernel developers (the guys working on the CEPH filesystem), and anyone else who would listen.  The funny thing about bugs in other people’s software is that no matter how much proof you give them that YOU can trigger the bug, they’re rarely willing to put too much effort into fixing it unless you can show THEM how to trigger it themselves.  After a week of late nights and little sleep, we finally came up with a reproducible method of triggering the bug (for the more technically inclined, it involved a malloc() of just a bit more memory than was available to the PS environment, followed by an fread() to fill it up and trigger an OOM; a rough sketch of that kind of reproducer is below).  Even with code in hand that proved the bug was, in fact, in the Vserver kernel patch (or potentially the main kernel, though we weren’t able to trigger it there), it was still another week before anyone was able to figure out exactly what was going on.

One of the things that both made the bug so hard to find and made it so obvious that it was either in the mainline kernel or the Vserver patch was the near-complete rewrite of a lot of the code related to what happens when the server runs out of memory.  As it turns out, when the Linux kernel kills a process in order to free up memory, it gives that process the highest priority it can and (this is the important part) a little bit of extra memory so it can die gracefully.  Yes, when a Linux server triggers its “OMG I’m totally out of memory!” routine, it’s not actually out of memory.  And this is where the Vserver patch comes in.  The way it’s designed, it’s impossible for the dying process to get that little extra bit of memory it sometimes needs.  What happens in that case is you suddenly have a process with access to 100% of one CPU core that simply doesn’t have anywhere to go.  Once that happens, you can pretty much say goodbye to your server (and all the Private Servers it hosts).

The solution from the patch developers?  “Get rid of all our memory management and use the kernel’s built-in Cgroup support.”  And this is why we really like these guys.  A lot of software developers out there would let their egos get in the way and demand to come up with their own solution.  These guys were happy to say “You know what?  The kernel already has a pretty complete mechanism for just this thing and we’d hate to duplicate all the functionality.”  And in case you were wondering, Cgroups are pretty new and didn’t exist when the first Vserver patches were developed.
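For those who want a better picture of what that reproducer looked like, here’s a minimal sketch of the same idea.  It isn’t the exact code we sent the Vserver folks, and MEM_LIMIT_BYTES is just a stand-in for whatever memory limit your PS happens to have:

```c
/* A minimal sketch of the kind of reproducer described above (not the exact
 * code we used).  MEM_LIMIT_BYTES is a placeholder for the guest's memory
 * limit.  malloc() a bit more than that limit (this usually succeeds because
 * the pages aren't actually committed yet), then fread() from /dev/zero to
 * touch every page and force the out-of-memory handling to kick in.
 * Only run this on a disposable guest! */
#include <stdio.h>
#include <stdlib.h>

#define MEM_LIMIT_BYTES (256UL * 1024 * 1024)  /* hypothetical PS memory limit */
#define OVERSHOOT_BYTES (16UL * 1024 * 1024)   /* ask for a bit more than that */

int main(void)
{
    size_t size = MEM_LIMIT_BYTES + OVERSHOOT_BYTES;
    char *buf = malloc(size);
    FILE *zero;

    if (!buf) {
        fprintf(stderr, "malloc failed before we even got started\n");
        return 1;
    }

    zero = fopen("/dev/zero", "r");
    if (!zero) {
        perror("fopen /dev/zero");
        return 1;
    }

    /* Filling the buffer commits real pages; once usage crosses the guest's
     * limit, the OOM killer has to step in, which is exactly where the buggy
     * memory accounting fell over. */
    if (fread(buf, 1, size, zero) != size)
        perror("fread");

    fclose(zero);
    free(buf);
    return 0;
}
```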

We’re still rolling out upgrades to some hosts on an as-needed basis, but the results are extremely promising.