As the end of the decade approaches, it’s a perfect time to look back and reflect on the past. What has gone wrong, and what has gone right? We have a way of making waves, even if not always in the way we might like. Here are some reflections on those waves.
Starting about four years ago, we were battling power constraints at our data center. That left us generally unable to provision data center space the way we wanted, and we had to get very creative to keep our systems running. The biggest problems from that time were two unplanned power outages of the entire building, followed by an emergency planned outage eight months later. Power is one of the lifelines of a data center, and not being able to rely on it can be a major distraction. It was during that period that we established our off-network status site (dreamhoststatus.com), which has proven to be a great asset. Those issues are long behind us now, and looking forward we have plenty of power capacity.
Pretty much in the middle of the power situation, we were also hit with a networking problem between our two core routers that was causing serious website slowness. We had grown so quickly that nobody really understood exactly how the network devices were interacting. That, combined with the distraction of the power situation, meant it took way too long to resolve the problem. We learned the hard way what we needed to do to improve and maintain the network. Things have improved, but we were still recently hit with a couple of network outages. They were caused by human error and weaknesses in the procedures we had in place. We have already refined those procedures, and future improvements to the networking infrastructure will also help reduce the chance of human error resulting in outages. This is still very much a work in progress, but huge steps have been taken already and more are still to come.
The next major hurdle we faced involved our data storage infrastructure. In the early days we had migrated from disks inside each server to network-attached storage, both for the added redundancy and to let us better utilize our available storage. At that time hard drives were only 9 gigabytes, and our users were generally using a hundred megabytes of space or less. Boy, have things changed! We now have users with multiple terabytes of data, and even everyday websites sometimes have multiple gigabytes of files. The huge growth of online video and the popularity of digital photography increased the demands on our storage infrastructure by a couple of orders of magnitude over the years. To accommodate that growth, our simple system of file servers had grown into a large network of over 100 individual servers that we had to constantly juggle data between. The cost per gigabyte of our Network Appliance-based system had also not come down nearly as fast as per-user storage requirements had gone up, but we had become reliant on network storage for things such as backups, rapid recovery from server failures, and seamless sharing of data between separate hosting and email accounts. We were addicted to network storage, and our next couple of major performance problems came from gloriously failed experiments with other storage products. One of them was cheap and unreliable, and the other was expensive and unreliable. We couldn’t win! (Note that we haven’t used either of those for a while, so both of them may work better now than they did for us a couple of years ago.)
Through the years that our addiction to network storage had developed, a shift had happened that we hadn’t noticed. First, individual hard drives had dropped like a rock in price and skyrocketed in capacity. Second, users were consuming data at such a rate that evenly utilizing our available storage was no longer a problem. Switching back to locally installed storage was the answer we had been looking for! We started experimenting with the new server architecture and developing a backup strategy, and once the pieces were all in place, we started moving forward with it. The storage bottleneck was resolved and we had a clear path forward, but we were also throwing out a lot of the knowledge we had gained over the preceding decade and starting over from scratch in many ways. Every technology shift brings with it a new set of problems, and this one was no exception. It’s been about a year and a half since then, and we’re already on our fourth revision of the server configuration (three different RAID cards with two different configs). The current hardware has been working out quite well, but we’re doing some testing to see if it can be optimized further.
Looking ahead to next year, our core technology systems are under control, and we’ll be able to focus more on improving the service than we have in years. The future looks very bright… but more on that tomorrow!