The DreamHost Firefighting Brigade: Effectively Communicating Internal Issues

No web host is perfect. Every host tries to do its absolute best for its customers, and a larger customer base means larger responsibilities, so DreamHost takes a unique approach to managing problems on our network.

We love to hire nerds, or at least people who are passionate about what they do. Everyone at DreamHost is given an opportunity to learn, and even the newest techs are encouraged to safely fix specific issues as they crop up, using a variety of scripts we developed. Any team member can submit scripts for review by our development department, and in many cases, they’re put into production.

Our support team also has an interesting metric to work with: customers! In certain cases, Tech Support detects trends in incoming support requests. The information we get from customers supplements the data from our monitoring software, helping us quickly find reports of widespread trouble, or its cause. Our handy support tracking system shows submitted tickets in a list that can be sorted by many parameters, which makes spotting trending issues a snap.
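As a rough illustration of that kind of trend-spotting (a sketch only, not DreamHost's actual tooling), the idea can be expressed in a few lines of Python. The ticket fields `server` and `opened`, the function name `trending_servers`, the 30-minute window, and the threshold of three tickets are all assumptions invented for this example.

```python
from collections import Counter
from datetime import datetime, timedelta


def trending_servers(tickets, now, window=timedelta(minutes=30), threshold=3):
    """Return (server, count) pairs for servers named in at least
    `threshold` tickets opened within `window` before `now`."""
    recent = Counter(
        t["server"]
        for t in tickets
        if timedelta(0) <= now - t["opened"] <= window
    )
    # most_common() sorts by count, so the busiest server comes first
    return [(server, n) for server, n in recent.most_common() if n >= threshold]


# Hypothetical sample data: three recent tickets for one server,
# one recent ticket for another, and one old ticket outside the window.
now = datetime(2012, 1, 1, 12, 0)
tickets = [
    {"server": "web-01", "opened": now - timedelta(minutes=5)},
    {"server": "web-01", "opened": now - timedelta(minutes=10)},
    {"server": "mail-02", "opened": now - timedelta(minutes=15)},
    {"server": "web-01", "opened": now - timedelta(minutes=20)},
    {"server": "web-01", "opened": now - timedelta(minutes=90)},
]

print(trending_servers(tickets, now))
```

A single ticket about a server is probably one customer's problem; several in a short window hint at something wider, which is the cue for a human to take a closer look.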

Luckily for DreamHost customers, the support team isn’t the sole source of stability reports. Our Datacenter Operations department (DCOps for short) and our Systems Engineering team keep up on reports from our extensive monitoring software. Together they form our emergency response team, like a specialty firefighting brigade (minus the cool foil suits, calendar shoots, and brawny mustaches). Oh, and muscles.

The DreamHost Emergency Response Team

Day and night, Tech Support is in constant communication with our basement buddies via instant messenger and, if needed, by phone. Most customer-reported issues are investigated and passed on to the DCOps or SysAdmin department if a solution is not quickly or easily found. Chances are, the monitoring software is already reporting some symptom of the problem, but there are many cases where our support team brings more immediate attention to the troubles at hand, thus expediting a fix. This communication, along with email, is how our support team keeps up to date on current events.

In the event of a widespread outage that affects many customers, the VP of every department is notified immediately, no matter what time of day or night. Having them online and active during an outage ensures that tasks are delegated properly. We also treat outages as a learning experience, and expect every department to reflect on how it was affected and on what preventative measures can be taken to ensure the most pleasant and stable hosting experience for customers.

With a large customer base come lots of servers, and with lots of servers comes the potential for thousands of individual services to report at any given time. Our DCOps team does a fantastic job keeping the short-term number of alerts down, while the Systems Engineering team keeps things stable in the long run through the magic of kernel and package configuration, security patches, and a pinch of old-world magic. Our front-line Tech Support squad adds an extra layer of human recognition to our already comprehensive reporting service, giving customers the most expedient solutions to problems.