Update on Email Service Interruptions

Filed by Rackspace Blogger | June 13, 2007 7:46 pm

I want to take the opportunity to apologize to our customers and partners for the two periods of poor email performance you have experienced in the last 10 days. We have been in the email business for several years now and there has never been a period of time where we have let our customers and partners down in such a fashion.
Some of you have been with us for a long time and know that this is not typical. Some of you are new customers or partners and are probably thinking about leaving us. I cannot blame you. I only hope you will give us a chance to live up to the standard of service you deserve. I am as committed as ever to exceeding your expectations, and I will not rest until we do—and neither will my team.
I also want to emphasize that the events we have experienced in the last 10 days have been very isolated. They are in no way, shape, or form an indication of the health of our technical infrastructure or our company as a whole. In fact, we have invested quite heavily in our infrastructure in the last several years, and aside from these isolated events, those investments have been paying off in a big way.
Let me first give you the technical details behind what happened:
1. First, on Monday, June 4th, a traffic surge overloaded one of our load balancers in our Dulles, VA data center. Fortunately we have several of these load balancers, and a backup load balancer automatically kicked in. That temporarily relieved the first load balancer but overloaded the second, so a third load balancer kicked in. Customer traffic was bounced from load balancer to load balancer repeatedly, putting added strain on our load balancing system and compounding the problem. This prevented customers from checking email until the load on our system settled down (a simplified sketch of this cascade follows this list).
2. Second, on Wednesday, June 13th, while fixing a minor issue on our beta mail system, an engineer made a mistake that caused a problem with a cluster of live databases used for routing mail traffic internally within our network. Our system automatically reacted by holding all inbound mail on our gate servers to ensure delivery once the issue was resolved (a sketch of this hold-and-retry behavior also follows this list). During this time customers experienced login problems because their POP, IMAP, and webmail connections could not be routed properly. The problem was resolved for more than 99% of our users within 20 minutes, and service for the remaining users was restored over the course of the next 15 minutes.
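For the technically inclined, here is a minimal sketch of the first failure mode. This is purely illustrative toy code in Python, not our actual load balancing software, and the capacities and counts are made-up numbers; the point is only to show why shifting an entire surge from one overloaded machine to the next compounds the problem instead of relieving it.

```python
# Illustrative only: a toy model of the cascading failover in incident #1.
# Names and numbers are hypothetical, not real traffic figures.

CAPACITY = 1000          # connections one load balancer handles comfortably
BALANCERS = ["lb1", "lb2", "lb3"]

def naive_failover(surge):
    """Shift the entire surge to the next balancer whenever one overloads."""
    for lb in BALANCERS:
        if surge <= CAPACITY:
            return f"{lb} absorbed the surge"
        print(f"{lb} overloaded ({surge} > {CAPACITY}); bouncing all traffic onward")
        surge += 50      # retries and reconnects add load with every bounce
    return "every balancer overloaded; logins time out until traffic settles"

def spread_load(surge):
    """The alternative: share the surge across all balancers instead of bouncing it."""
    per_lb = surge / len(BALANCERS)
    return f"each balancer carries {per_lb:.0f} connections ({per_lb / CAPACITY:.0%} of capacity)"

if __name__ == "__main__":
    print(naive_failover(1500))   # lb1, lb2, lb3 overload one after another
    print(spread_load(1500))      # each balancer carries 500 (50% of capacity)
```

The toy example simply shows that moving a whole surge from machine to machine adds reconnect traffic with every bounce, which is why the fixes described below focus on spreading traffic and adding headroom rather than on failover alone.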
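The second incident involved a different pattern: when the routing databases became unavailable, the system held inbound mail at the gate servers rather than bouncing or dropping it. Here is a rough sketch of that hold-and-retry behavior, again purely illustrative Python with hypothetical names, not our actual mail code.

```python
# Illustrative only: the "hold mail at the gate servers" behavior from
# incident #2, modeled as a simple park-and-retry queue.
from collections import deque

routing_available = False    # flips back to True once the database issue is fixed
held_mail = deque()          # messages parked on the gate server while routing is down

def route(recipient):
    """Stand-in for the internal routing lookup; fails while the database is down."""
    if not routing_available:
        raise ConnectionError("routing database unavailable")
    return "mailbox-server-7"   # hypothetical destination

def deliver(server, message):
    print(f"delivering {message['id']} via {server}")

def accept_inbound(message):
    """Gate server behavior: never drop or bounce mail; hold it if routing fails."""
    try:
        deliver(route(message["to"]), message)
    except ConnectionError:
        held_mail.append(message)

def flush_held_mail():
    """Run after the routing issue is resolved to deliver everything that was held."""
    while held_mail:
        accept_inbound(held_mail.popleft())

if __name__ == "__main__":
    accept_inbound({"id": "msg-1", "to": "customer@example.com"})  # parked, not lost
    routing_available = True
    flush_held_mail()                                              # now delivered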
Though these two incidents were completely unrelated to one another, customers experienced the same symptom – they couldn’t access their email. There are several things we are changing in order to prevent future incidents. Here is what we are doing:
1. Bill Boebel (our chief technology officer) and his team have a complete understanding of why these events occurred. They are continuing to review both incidents, learning from them, and putting policies, procedures, and technology in place to make sure they do not happen again.
2. After the first incident, we made three significant changes to our load balancing system in order to prevent anything like this from happening again. First, we eliminated the possibility of traffic bouncing multiple times between load balancers, so that an issue with one load balancer does not compound itself, as happened on Monday. Second, we extensively reviewed the traffic logs from Monday and made optimizations to the load balancing software that effectively double the capacity of our hardware. Third, we reorganized the traffic managed by the load balancers so that each machine will operate at or below 25% capacity when traffic surges to the level seen on Monday (a small worked example of this headroom target follows this list). Additional load balancers will be added as necessary as we grow in order to maintain this level of performance.
3. We are making a significant change to our beta mail system. We will be putting a strong new layer of separation between the beta system, which requires frequent changes and tuning, and the live system, which must change very infrequently in order to keep the service ultra-reliable for our customers.
4. Kevin, our chief software architect, is going to temporarily reorganize his software development teams to focus on enhancing our customer ticketing and communication tools. These isolated events triggered such an influx of customer support requests that our current tools simply broke down. This caused our communication with customers to falter at a critical time. As most of you know, we are a very customer-centric organization, and while our people have been working harder than ever to help customers, our supporting technology has failed them. We are going to re-architect these tools and improve the way we provide you with critical information and updates.
5. As I write this, I am on a plane to San Antonio with Bill. We are going to visit Rackspace, our managed hosting provider. Tomorrow and Friday we plan to review these two incidents, show them our growth plans, go over the new technology and processes we are putting in place, and get their feedback. Rackspace is an expert when it comes to infrastructure hosting and we always get a lot of value out of these meetings.
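As a rough illustration of the headroom target in item 2 above, here is the arithmetic behind "at or below 25% capacity." The numbers are hypothetical examples chosen for clarity, not our real traffic figures.

```python
# Illustrative arithmetic for the 25% headroom target in item 2 above.
# All numbers are hypothetical examples, not real traffic figures.
import math

def balancers_needed(peak_connections, per_balancer_capacity, target_utilization=0.25):
    """How many load balancers keep each machine at or below the target
    utilization when traffic peaks at `peak_connections`."""
    usable_per_balancer = per_balancer_capacity * target_utilization
    return math.ceil(peak_connections / usable_per_balancer)

if __name__ == "__main__":
    # Example: a Monday-style surge of 20,000 connections against machines
    # that each handle 10,000. Doubling effective software capacity (the
    # second change in item 2) would halve this requirement.
    print(balancers_needed(20_000, 10_000))   # -> 8 balancers at <= 25% each
```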
Lastly, I want to reiterate that our system is as stable as ever. We plan to learn from these isolated incidents and provide you with better email services and customer communications than ever before.
I appreciate your support,
Pat
CEO, Webmail.us, Inc.

Source URL: http://www.rackspace.com/blog/update_on_email_service_interr/