Recently, DreamObjects suffered a prolonged outage. The service was largely inaccessible for a period of approximately six hours on Friday, June 13th, and again for a second period of approximately four hours on Saturday, June 14th. We wanted to give you some insight into the what and why of the outage.
First, for context, let’s discuss how DreamObjects works. DreamObjects exists as a cluster of servers, where each server has several disk drives. Most servers in the cluster run a piece of software named Ceph, which is a distributed, fault-tolerant storage system.
A distributed storage system is one that stores data across multiple drives, and frequently those drives are spread across multiple servers. A fault-tolerant storage system is one which can handle the loss of one or more drives, or even whole servers, or racks of servers, without losing data. Ceph (and thus DreamObjects) is both distributed and fault-tolerant. In a nutshell, anything you send to us gets written to three drives to protect the integrity and durability of your data.
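To make the "three drives" idea concrete, here is a minimal sketch of replica placement. It is not Ceph's actual CRUSH algorithm. It swaps in a much simpler technique (rendezvous hashing), and the cluster layout, server names, and function names are all made up for illustration. It shows two properties described above: each object lands on three drives in three different servers, and losing a server does not disturb the copies on the surviving servers.

```python
import hashlib

# Hypothetical layout: server name -> drive names (not DreamObjects' real topology).
CLUSTER = {
    "server-a": ["a1", "a2"],
    "server-b": ["b1", "b2"],
    "server-c": ["c1", "c2"],
    "server-d": ["d1", "d2"],
}


def _score(object_name, drive):
    """Deterministic pseudo-random score for an (object, drive) pair."""
    digest = hashlib.sha256(f"{object_name}/{drive}".encode()).hexdigest()
    return int(digest, 16)


def place_replicas(object_name, cluster, replicas=3, failed_servers=()):
    """Pick `replicas` drives, each on a different healthy server.

    Rendezvous hashing: every server offers its best-scoring drive, and the
    highest-scoring servers win. This is a stand-in for CRUSH, not CRUSH itself.
    """
    candidates = []
    for server, drives in cluster.items():
        if server in failed_servers:
            continue  # a lost server simply drops out of the ranking
        best = max(drives, key=lambda d: _score(object_name, d))
        candidates.append((_score(object_name, best), server, best))
    if len(candidates) < replicas:
        raise ValueError("not enough healthy servers for the requested replica count")
    candidates.sort(reverse=True)
    return [(server, drive) for _, server, drive in candidates[:replicas]]


placement = place_replicas("photo.jpg", CLUSTER)
# Simulate losing the first server that holds a copy.
after_failure = place_replicas("photo.jpg", CLUSTER, failed_servers=(placement[0][0],))
```

In this toy model, `after_failure` still contains the two surviving copies from `placement`, plus one new location that would need to be filled by copying data. That copying is the kind of internal data movement discussed later in this post.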
That’s not to say that every machine in the cluster has the job of storing your data. A few – the gateways – are the machines that you actually talk to via the S3 or Swift HTTP APIs. These gateways then issue requests on your behalf to the machines which actually store data.
And that brings us to the root cause of the outage. In short, the gateways couldn’t talk to the back-end storage servers, so the service appeared to be “down” even though the underlying storage system was actively working to ensure that your data was safe.
On Friday, changes were made to the “CRUSH map,” which tells Ceph how data should be split up between the servers and drives which make up DreamObjects. These changes needed to happen, but they shouldn’t have been made all at once. Applying them together caused so much data to move around inside the cluster backend that requests from the gateways couldn’t be handled in a timely manner. For the safety of data, which is the first job of any storage system, Ceph gives internal movement of data priority over requests from the gateways. Our engineers had effectively, and quite inadvertently, created an internal denial-of-service attack against DreamObjects. The changes were made in response to an urgent issue with the Ceph cluster itself. They were, however, made in haste and should have been rolled out more slowly during a prescheduled (and announced) maintenance window.
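The effect described above can be illustrated with a toy model. This is a rough sketch, not a model of Ceph's real scheduler: the capacity figures, the tick-based loop, and the function name `simulate` are all assumptions chosen for illustration. The only rule taken from the description above is that internal data movement is served first and client requests get whatever capacity is left.

```python
def simulate(total_moves, capacity_per_tick, moves_released_per_tick,
             client_demand_per_tick):
    """Return the number of client requests served in each tick.

    Toy assumptions: the cluster can do `capacity_per_tick` operations per
    tick, and a map change releases `moves_released_per_tick` data moves per
    tick until `total_moves` have been triggered.
    """
    if moves_released_per_tick <= 0:
        raise ValueError("moves_released_per_tick must be positive")
    not_yet_released = total_moves  # moves no map change has triggered yet
    backlog = 0                     # moves triggered but not yet completed
    served_per_tick = []
    while not_yet_released > 0 or backlog > 0:
        released = min(moves_released_per_tick, not_yet_released)
        not_yet_released -= released
        backlog += released
        # Data movement has priority over client requests.
        recovery_done = min(backlog, capacity_per_tick)
        backlog -= recovery_done
        leftover = capacity_per_tick - recovery_done
        served_per_tick.append(min(client_demand_per_tick, leftover))
    return served_per_tick


# One big change: all 1000 moves are triggered in the first tick.
all_at_once = simulate(1000, 100, 1000, 60)
# Staged change: only 40 moves are triggered per tick.
staged = simulate(1000, 100, 40, 60)

print(sum(all_at_once), "client requests served over", len(all_at_once), "ticks")
print(sum(staged), "client requests served over", len(staged), "ticks")
```

With these made-up numbers, the all-at-once change finishes sooner but serves no client requests while it runs, whereas the staged change takes longer and keeps serving clients throughout. That trade-off is why a slower rollout in a maintenance window would have been the better choice.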
Friday’s outage was worsened by a bug in the version of Ceph which we were running at the time. This bug repeatedly made individual drives inaccessible, which drastically increased the time needed to recover from the map change. About midway through the outage we decided to upgrade Ceph. That’s ultimately what allowed the cluster to return to health much more quickly than it otherwise would have.
On Saturday, the stress of continued cluster recovery uncovered another bug, which also caused us to lose access to individual drives repeatedly. Our friends at Inktank graciously rolled a custom build of Ceph for us to take care of the issue.
In short, the DreamObjects outage was the result of an overly aggressive change to our Ceph configuration, and was then worsened by latent bugs within Ceph. In the end we were able to quickly resolve all issues with Inktank’s help, and Ceph itself was strengthened thanks to our findings. It is also important to note that the underlying storage system worked as designed, ensuring that customer data was safe and never at risk.
We are absolutely committed to ensuring high quality and availability of all our services. The DreamObjects team has identified a number of changes to our code, processes, and procedures to ensure that this kind of outage will not happen again. Specifically, the operations engineers will put “CRUSH map” changes through a more rigorous review process, including external review by Inktank, the creators of Ceph. In addition, changes of this scale will be announced well in advance and performed during off-peak hours to limit risk.
One last thing.
All DreamObjects customers will receive 10% off their bill for the month of June 2014.
Many thanks are due to our colleagues at DreamHost in Datacenter Operations, Technical Support, and Cloud Operations, and to the great people over at Inktank, for working tirelessly to restore availability and protect customer data.