Anatomy of a Disaster, Part 2

For the past several weeks many of you have been faced with slow or unusable websites and email. The original cause of that series of issues was detailed in Josh’s great Anatomy Of An Ongoing Disaster post. The network issue we were left with once the power outage problems were mostly resolved ended up being an especially nasty one. We were essentially caught with our pants down at just the wrong time and we’ve been taking our lumps for it.

Sour Face

The evidence we were seeing all pointed to one of the two routers as the primary troublemaker, so we focused on that one. Configurations were changed with some improvement, but without resolving the main issue. Ultimately, six separate Cisco support engineers and a Cisco Certified Internetwork Expert were all unable to determine the cause of the errors we were seeing on our routers. That, along with the recent power outages, eventually led everyone to believe there was a hardware fault somewhere within the router. That started our process of replacing and/or upgrading every component. Once that was done and the main problem was still there, we were finally able to pinpoint the source of the network congestion and resolve it, and that’s where we are now.

The problem ended up being the connection between the two routers. Our network was set up so that one router was primarily responsible for some of our servers, and the other router was primarily responsible for the rest. Both routers are connected to outside network connections, and they share those roles to provide wide-area network redundancy, but the inside of our network (our LAN) relied on both routers working together and passing bits back and forth. Some of you did not experience the problems because all of the servers your service relies on were on the same core router and were not bottlenecked by that inter-router link. Once one of the routers was fully upgraded, we were able to move all traffic to that single router, removing the bottleneck and restoring service completely for everyone.
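
If you want a rough feel for the math, here’s a little back-of-the-envelope sketch in Python. The link speed, server names, and traffic numbers are all invented for illustration (they are not our actual figures); the point is just that any server-to-server flow whose endpoints sit on different core routers has to cross that one inter-router link, and it doesn’t take many of those flows before the link saturates.

```python
# Hypothetical back-of-the-envelope model of the inter-router bottleneck.
# The link speed, server names, and traffic numbers are invented for
# illustration -- they are not our actual figures.

INTER_ROUTER_LINK_MBPS = 1000  # capacity of the link between the two core routers

# Pretend each of these server-to-server flows pushes this much LAN traffic.
flows_mbps = {
    ("web-1", "mysql-1"): 300,
    ("web-2", "mysql-2"): 250,
    ("web-3", "file-1"): 400,
    ("mail-1", "file-2"): 350,
}

# Which core router each server hangs off of in the old split setup.
router_of = {
    "web-1": "A", "web-2": "A", "web-3": "A", "mail-1": "A",
    "mysql-1": "B", "mysql-2": "B", "file-1": "B", "file-2": "B",
}

def cross_link_traffic(flows, placement):
    """Sum the traffic that has to traverse the router A <-> router B link."""
    return sum(mbps for (src, dst), mbps in flows.items()
               if placement[src] != placement[dst])

split = cross_link_traffic(flows_mbps, router_of)
consolidated = cross_link_traffic(flows_mbps, {server: "A" for server in router_of})

print(f"Servers split across both routers: {split} Mbps over a "
      f"{INTER_ROUTER_LINK_MBPS} Mbps inter-router link")
print(f"Everything on one router: {consolidated} Mbps over that link")
```

With everything consolidated on one router, none of that server-to-server traffic touches the inter-router link at all, which is exactly why moving to a single upgraded router made the congestion disappear.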

Our routers were not redundant and that hurt us. If our routers had been redundant we could have much more easily moved all traffic to one router or the other just to eliminate some variables. Having that option would have saved us a lot of time and you a lot of painfully slow service.

Ugly Dog

Establishing Power Redundancy

In searching for a solution we wasted a lot of time uncertain about the integrity of our equipment. Whenever a piece of electrical equipment suddenly loses power there is always a chance of some component failing, and when you’re dealing with a device as complex as a router that’s a lot of components to worry about! If our data center’s UPS and generator setup had worked properly and the routers had not lost power, we could have instead focused on the new evidence at hand, confident that nothing else had changed. Knowing that, and knowing the track record of our data center, we are already in the process of adding an additional layer of power redundancy for our most critical (and expensive to replace!) equipment. The DC-powered equipment housed in our data center is backed by a secondary UPS system and did not lose power throughout the recent power fluctuations. To take advantage of that ourselves, we are converting the core of our network to DC power. The power supplies are sitting and waiting to be installed; we’re just waiting for DC power to be wired into the racks that need it.

We are also expanding our space in our Alchemy Communications Data Center. Alchemy has set up its own UPS-backed power feed and was not hit as hard by the power outage that took us down. All of our future data center expansion is going into Alchemy.

Big Batteries

Establishing Network Redundancy
Looking back, our worst mistake of this ordeal was allowing our network hardware to end up in a state where we could not redirect all of the traffic to one router or the other. Having that option earlier on in the process would have allowed us to debug the problems more easily and ultimately we would have solved the problem faster. There’s no doubt about that.

When our two current core routers were originally deployed, either one of them was able to handle the full load of the network. They were set up to share networking duties, and we could have redirected traffic to one or the other if that ever became necessary. Unfortunately, the routers were not upgraded when they should have been, and we ended up in a state where one of them was no longer able to handle the full load of the network. That situation, combined with the problems that began with the power outages, led to the nasty network congestion that was so difficult for us to diagnose and resolve.

Currently we are using a single router at the core of the network. Every component has been replaced and most of them have been upgraded, so it is essentially brand new and more than able to handle our network traffic for the time being. We are in the process of re-establishing core router redundancy now and expect that to be done in the next few weeks. Going forward, we will ensure that one of the two routers is always handling the full load of the network while the second stands by idle as a hot spare, should the need for it arise.
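
For anyone curious what “hot spare” means in practice, here’s a minimal conceptual sketch, in Python just to keep it readable. The router names, the is_healthy() check, and the polling interval are hypothetical placeholders; in reality the routers’ own redundancy features handle this job, not a script like this.

```python
# A minimal sketch of the hot-spare idea: one core router carries all of the
# traffic, the other sits idle and only takes over if the active one stops
# responding. Names, the health check, and the interval are placeholders.

import time

def is_healthy(router: str) -> bool:
    """Placeholder health check -- imagine a ping or SNMP poll here."""
    return True

def monitor(active: str, standby: str, checks: int = 3, interval_s: float = 5.0):
    """Poll the active router; if it ever fails a check, swap roles."""
    for _ in range(checks):
        if not is_healthy(active):
            print(f"{active} is down; failing all traffic over to {standby}")
            active, standby = standby, active
        time.sleep(interval_s)
    return active, standby

# With a healthy pair, core-a keeps carrying traffic and core-b stays idle.
print(monitor("core-a", "core-b", checks=1, interval_s=0.0))
```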

Redundancy

Into the Future
While investigating this issue we have been forced to look more closely at our network than we have in a long time. That has uncovered more issues that could become larger problems for us down the road, and we are already working on a large-scale network reorganization to both improve overall performance and make network issues easier to detect and troubleshoot. If there’s a silver lining to this dark cloud, this may be it.

Our primary local area network setup is really two separate networks: one for traffic that never leaves our network (the private network) and the other for traffic that mostly does leave our network (the public network). When you access your website, traffic has to cross both the public and private networks (possibly multiple times) before the page comes up in your browser. During our network problems it was primarily the private network that was responsible for the high server loads, slow website load times, and slow email access.

The first step we are taking to improve our network setup is to completely separate the private network from the public network. That will immediately reduce the amount of traffic going through our core routers and make it easier to track down problems. More equipment will be involved, but network traffic will be more isolated. As part of this process we will also be rearranging network links to come as close to an optimal layout as possible, to further isolate traffic and improve performance. Unfortunately, due to limitations in our current network architecture, the best we can do is about 30% of optimal, and it’s likely we will not even do that well.
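
As a rough illustration of what that separation buys us, here’s a tiny sketch. The traffic volumes are invented for illustration only; the point is that once server-to-server traffic has its own switches, only traffic headed to or from the outside world still has to touch the core routers.

```python
# A rough sketch of why splitting the private network off helps the core
# routers. The traffic volumes here are invented for illustration only.

traffic_mbps = {
    "public": 600,   # traffic to and from the outside world (web pages, mail delivery, ...)
    "private": 900,  # server-to-server traffic (web <-> mysql, web <-> file servers, ...)
}

def core_router_load(traffic, private_has_own_switches):
    """Mbps that still has to pass through the core routers."""
    load = traffic["public"]          # public traffic always transits the core
    if not private_has_own_switches:
        load += traffic["private"]    # today, private traffic rides the core too
    return load

print("Before separation:", core_router_load(traffic_mbps, False), "Mbps through the core")
print("After separation: ", core_router_load(traffic_mbps, True), "Mbps through the core")
```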

Less Than Optimal

So, the next step in the process is a complete rethinking of how we have been deploying our servers in our data center. For ease of deploying servers and efficient use of data center space, we had architected our network to essentially allow any type of server (web server, email server, file server, MySQL server, etc.) to live anywhere on the network. That sort of setup has worked well for us for a while, but we are now starting to see the early signs of network bottlenecks arising. For future server deployments, we will be assigning physical areas in the data center to different types of servers to facilitate a more optimal network layout between them. That will localize network traffic as much as possible and allow us to continue scaling for quite some time into the future. Overall network flow will be reduced as well, better utilizing the available throughput. This step is currently being planned and will first be implemented for the next set of servers we deploy.
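
To give a concrete (and entirely hypothetical) example of the kind of win we’re after, here’s a toy model. The server types, flows, and numbers are all made up; the idea is just that once each type of server lives in a known physical area, we can give the chattiest pairs of areas their own direct links and keep that traffic off the core.

```python
# A toy model of the planned layout change. Server types, flows, and numbers
# are all hypothetical; the point is that grouping server types into known
# physical areas lets us wire the chattiest pairs of areas together directly,
# keeping that traffic local instead of sending it through the core.

# Mbps of traffic between server types (invented numbers).
type_flows = {
    ("web", "mysql"): 400,
    ("web", "file"): 300,
    ("mail", "file"): 200,
}

# Today: any server can be anywhere, so there are no dedicated links between
# the types that talk the most -- all of this traffic rides the core.
core_traffic_today = sum(type_flows.values())

# Planned: each type gets its own area, and say we can afford direct links
# between the two heaviest-talking pairs of areas.
direct_links = sorted(type_flows, key=type_flows.get, reverse=True)[:2]
core_traffic_planned = sum(mbps for pair, mbps in type_flows.items()
                           if pair not in direct_links)

print(f"Core traffic with servers scattered anywhere: {core_traffic_today} Mbps")
print(f"Core traffic with types grouped and linked:   {core_traffic_planned} Mbps")
```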

All told, we will be investing somewhere in the neighborhood of $300,000 in our network upgrades, not to mention all of the human time involved in planning and implementing these changes. Now that we have this issue behind us, we are fully committed and prepared to maintain network stability and to do the work needed to improve network performance and continue to scale with our growth.

We are very sorry for all of the headaches this has caused everyone. Believe me, no one wanted this problem resolved more than we did. Providing sub-par service is no fun and isn’t the way we like to spend our time. This problem took longer to resolve than it should have, but coming out of it we are in a much stronger position as we look ahead.