Uptime is everything in this business.
Unfortunately, Monday was kind of a rough day for us. You may have noticed that your sites were unreachable for most of the day.
The last time something so wide-reaching and disruptive happened was less than a year ago when we discovered, quite out of the blue, that we were hosting the website for “Draw Mohammed Day.” That was…fun. And educational. Unfortunately, the technical lessons we learned during that experience simply were not applicable to what happened on Monday.
Technically, every website we host was up and running on Monday.
However, they were mostly unreachable for much of the day.
We’ve just completed our own internal review and want you to know exactly what happened, why it happened, and why we’re taking great pains to ensure that it won’t happen again.
So let’s start with the timeline.
Network connectivity becomes…quirky. Our core network begins to exhibit latency issues. We begin to investigate possible causes.
The main Cisco switch at our datacenter locks up and becomes unresponsive.
It’s clear the switch isn’t going to come back up on its own. We reboot it, and Cisco’s VSS (Virtual Switching System) fails over all of its traffic to our secondary switch. At this point partial connectivity has been restored. Sites are reachable, if somewhat slow.
It becomes clear that our primary switch’s configuration has been wiped clean – spontaneously – and attempts to recover it are failing.
In an attempt to recover some utility from our primary switch, we clone the config from our secondary switch with some tweaks to avoid a dual-active situation. Unfortunately that does not work.
We scrap the primary switch’s configuration – on purpose this time – and begin to rebuild it from scratch.
We enable port-channel VSL (Virtual Switch Link) on two channels. This causes our secondary switch to reload itself and makes our primary switch the active one – wiping out our configuration files again. Technically speaking, this should not happen.
We restore configs to both switches, say a few Hail Marys, crack our necks from side to side, and then discover that the morning’s chaos took down the link to our El Segundo datacenter.
Techs arrive at El Segundo and begin troubleshooting.
The link between datacenters is restored.
At this point things are looking up. Partial core connectivity is restored and we begin looking for smaller fires to put out.
Outage continues. The config on our primary switch appears to be corrupted.
Still more connectivity issues are reported, and config files are repeatedly found to be mangled. Restoring from backups doesn’t help, as they turn out to be mostly out of date.
Not out of the woods yet, but the end is in sight. All network interfaces are audited and restored. Routing and switching are repaired. The majority of issues have been resolved.
Out of the woods. Full functionality is restored.
So what does all of this mean? It means that a combination of factors, set into motion by what we believe to be a hardware failure in a key part of our network, caused many customers’ sites to be unreachable for much of Monday. Those that were reachable worked intermittently and slowly.
The fact that our core switch supervisor wiped its own configuration spontaneously – and continued to do so even after we restored and rebuilt that config manually – told us that the switch was not operating to spec and was a prime candidate for replacement.
We also witnessed other anomalous switch behavior during the recovery process that, according to Cisco, simply should not be possible. We’re going through the RMA process with Cisco now.
Don’t we have redundancy? We do. We rely heavily on Cisco’s Virtual Switching System (VSS) architecture to provide fault-tolerant network redundancy for situations just like this one. On paper, and based on our specific network environment, Monday’s problems should not have happened.
VSS should have stepped up to route traffic around the troubled hardware. It didn’t. We believe our network configuration is solid – and that VSS did not behave as it should have. In fact, the VSS behavior we saw was unexpected and inconsistent with what VSS claims to be. We’ll be working with Cisco to determine the nature of the failure.
We would have loved to reach out to every customer individually, but with over one million domains hosted, that could – quite literally – have taken all year. We’d have loved to email you too, but well, we had this little network problem blocking emails.
Please bookmark and continue to check http://www.dreamhoststatus.com/ at the first sign of trouble. That domain is hosted offsite and we use it as our primary means of communication in cases of any and all planned – and unplanned – service interruptions.
You may also want to follow @dhstatus on Twitter, as it syndicates the post titles from dreamhoststatus.com.
Why weren’t there more frequent status updates during the outage? Unfortunately, there really wasn’t much news to share as we constantly created, installed, and reloaded switch configuration files only to have them crumble and disappear before our eyes.
Our network admins were, quite understandably, wound pretty tightly on Monday and under the gun to get things back to normal again as quickly as possible. When they did stick their heads out of the bunker long enough to pass on status updates, we piped that out to dreamhoststatus.com immediately. You knew what we knew – as soon as we knew it.
Why didn’t we publish this post-mortem sooner? We wanted to provide as much context and detail as we could to support the events already posted on the status blog. That meant doing some serious research and analysis.
We were busy doing our own internal assessment of the situation: figuring out exactly what happened and working on a detailed accounting of what failed and why it took so long to get things back to normal. That report is now complete, and that’s why we’re able to share our findings with you now.
We learned some key things on Monday:
1. First and foremost, you want to be kept more in the loop. You want to feel as if you’re staring over our network admins’ sweaty shoulders, watching pages of text scroll by on the console. We get that. And you certainly deserve it.
We provided as much information as we could on our status blog throughout the day. That’s not enough. During future large-scale service disruptions, we’ll post more frequently, even if there’s really nothing new to report. The nature of Monday’s problems meant that we couldn’t even guess at an ETA for a resolution, and we didn’t want to make promises that we weren’t prepared to keep.
2. We’ll work to refine our network topology so that our datacenters aren’t so dependent on each other and operate more like freestanding units and less like an interconnected web of services.
3. We’re going to be better about keeping our network config backups up to date. We’ll keep several versions on file so that we can roll back to a last-known-good configuration if need be.
4. Finally, we’ll be beefing up our network monitoring situation and implementing a centralized ‘Network Health’ page that any employee can use to get a bird’s-eye view of our networking situation at any time.
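To make the backup lesson above concrete, here’s a minimal sketch of what versioned config snapshots with pruning might look like. This is illustrative Python, not our actual tooling; the file layout and the `keep` count are assumptions for the example.

```python
import shutil
import time
from pathlib import Path

def snapshot_config(config_path, backup_dir, keep=5):
    """Copy the current config into backup_dir under a timestamped
    name, then prune all but the newest `keep` snapshots."""
    backup_dir = Path(backup_dir)
    backup_dir.mkdir(parents=True, exist_ok=True)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dest = backup_dir / f"{Path(config_path).name}.{stamp}"
    shutil.copy2(config_path, dest)
    # Timestamped names sort chronologically, so a reverse lexical
    # sort puts the newest snapshot first.
    snapshots = sorted(backup_dir.iterdir(), reverse=True)
    for old in snapshots[keep:]:
        old.unlink()
    return dest

def last_known_good(backup_dir):
    """Return the newest snapshot, or None if there are none."""
    snapshots = sorted(Path(backup_dir).iterdir(), reverse=True)
    return snapshots[0] if snapshots else None
```

The point is less the script than the habit: snapshots taken on every change, several generations retained, and a one-call answer to “what’s the last config we trusted?”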
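And for the “Network Health” idea, here’s a sketch of the kind of probe such a page could aggregate: a simple TCP reachability check per device. The endpoint names and addresses are made up (192.0.2.x is a documentation-only range), and a real version would likely lean on SNMP or ICMP rather than bare TCP connects.

```python
import socket

# Hypothetical inventory; a real page would pull this from the
# actual switch/datacenter device list.
ENDPOINTS = {
    "core-switch-primary": ("192.0.2.10", 22),
    "core-switch-secondary": ("192.0.2.11", 22),
}

def check_endpoint(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def network_health(endpoints):
    """Map each named endpoint to 'up' or 'down'."""
    return {
        name: ("up" if check_endpoint(host, port) else "down")
        for name, (host, port) in endpoints.items()
    }
```

A dashboard that renders this map is exactly the bird’s-eye view described above: any employee can see at a glance which devices are answering.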
Do you deserve compensation for the downtime? You’ve earned it! You pay for 365 days of service – not 364.375. Contact our technical support team and we’ll do what we can to make it right.
We were saving the most important part for the end.
We let you down, and truly, each and every one of us here behind the blue curtain gets a knot in the stomach thinking about what happened on Monday.
Some of you weren’t able to post to your blogs.
Some of you weren’t able to work on (or submit) projects for classes.
Some of you had nothing to show at SXSW.
Some of you weren’t able to accept online orders.
Some of you got yelled at by your boss.
Some of you weren’t able to get any business done.
All of that, and perhaps more, is our fault.
We appreciate your business and value the relationship that we’ve worked hard to build with each and every one of you.
It’s our hope that you’ll stick with us as we work to regain your trust by once again providing solid, dependable hosting services.