Last week DreamHost experienced a widespread system outage that impacted service for a great number of our customers with services hosted in our “PDX1” data center. Many of our systems were unreachable, and many of our customers’ sites experienced periods of downtime.
Service was largely restored within 12 hours, and now that the dust has settled we wanted to provide you with an update as to what happened, why it happened, and what our plans are to prevent it from happening again.
In short: On the morning of Thursday, November 2nd, one of the data centers that houses a large number of our servers lost power and its redundant power systems failed.
This should not have happened. Our data center in Hillsboro, Oregon (“PDX1”) is run by Flexential, a proven leader in data center construction, management, and operations. Flexential is responsible for providing power to our servers in this facility.
To their credit, their operational plan for dealing with power issues follows industry best practices and their redundant power systems are a key component to what is, by all accounts, a state-of-the-art facility. However, as the events of last week have shown, the reality of an unexpected power event can have unforeseen implications and a ripple effect that can reverberate across the Internet.
A standard and common configuration for power redundancy within most data centers is to build in two fully redundant power systems. Each system obtains its power from a utility via a unique, redundant path. Each system also contains its own bank of UPSs (uninterruptible power supplies – aka “emergency batteries”) and a fleet of diesel generators sits onsite to power the entire facility if need be.
While a full report from Flexential is forthcoming, what we saw from our end was a partial loss of power followed by a complete loss of power to our fleet of servers. We want to be clear – this should have been an “impossible” condition and we had all assurances that it was, including a 100% power availability service level agreement (SLA). These power systems are tested regularly and undergo regular, scheduled maintenance to ensure they will perform as intended.
During a typical data center power outage (planned or unplanned), UPS batteries kick into action automatically just long enough for the facility to activate its diesel-powered generators.
It is unclear why the UPS system failed, why the generators failed, or how both of these automated, redundant, independent power systems managed to fail so spectacularly at the same time. We believe this to have been a combination of a utility failure and failures of both the generator and UPS systems. A full investigation is ongoing and we expect to receive results shortly.
Regardless of the cause, our focus and our priority during this event was to bring our machines back online and to restore service to our customers.
We were first alerted to an outage at 4:41am local time on Thursday, November 2nd, via our own offsite monitoring tools. We immediately dispatched members of our Data Center Operations team to the facility to begin the process of bringing services back online. We published a status post shortly thereafter to help customers follow along with service restoration efforts.
Once we realized the full scope of this outage, our entire executive team was paged and placed on alert, while every specialist on our Infrastructure team (both local to the data center and those working remotely) was brought in to bring systems back online.
At some point during this response, the building’s access control system also lost power, making it a bit of a challenge for our team to gain entry. When full power was finally restored to our portion of the data center at 6:08am, the redundancy that we’d built into our own internal power infrastructure worked as designed and as expected.
Unexpected hard reboots and loss of power – at any scale – can cause both hardware failures and unexpected behavior in software. As expected, we saw plenty of both.
While a single desktop PC or laptop may be able to gracefully recover from an unexpected loss of power, that is unfortunately not the reality within the context of a large data center installation. With thousands of servers and dozens of switches installed at this location, it was a careful process (well documented and executed) to bring systems back online, test each of them for anomalous behavior, and take corrective action as needed.
While no customer data was ever at risk, we did have to replace more than a few hard drives and sticks of RAM throughout our fleet of servers. The unexpected power cut also caused some network switches to revert to older versions of their firmware, requiring upgrades and restorations from previously saved configurations.
After a long day of cleanup and many long hours put in by our technical teams, we were able to finally mark all major systems as restored, and we continued working into the night to identify and repair any additional systems that needed attention. We resolved this incident at 4:49pm on Thursday, just under 12 hours from the initial power disruption.
Many of our customers saw service fully restored in under an hour. Others had to wait much longer. It was truly an all-hands-on-deck day for us in the data center, and we appreciate the patience and grace that many of you have shown in your messages to our Support team.
We’re in conversation with Flexential this week to understand where the failure(s) happened and what their plans are to prevent this exact scenario from occurring in the future.
If you ever experience issues with your DreamHost-hosted sites and suspect a wider system outage may be the cause, be sure to make https://www.dreamhoststatus.com/ your first stop for information. Updates on our system status are also cross-posted to @dhstatus on X.
If you were impacted by the events of last Thursday, you have our sincere apologies.
We realize that you chose DreamHost, not a data center, to be your trusted online partner. You shouldn’t have to worry about who provides services to your website “further upstream”. While we wanted to provide clarity into this event, we understand that the buck stops with us.
We’re sorry for the very real inconvenience that this has caused to your sites, your businesses, and your online reputation. We will do everything in our power to ensure that an event like this does not recur.