Rackspace Open Sources Dreadnot

At Rackspace we pride ourselves on Fanatical Support in all we do. The public face of Fanatical Support is having knowledgeable people ready to help customers when they need it. But behind the scenes, Fanatical Support means leveraging technology to improve every aspect of the customer experience.

The Rackspace Cloud Monitoring team has been working on ways to improve the lives of our customers by developing tools that will not only deliver a better support experience, but will allow us to deliver the features that our customers need quickly and reliably while avoiding service interruptions. Today we are open sourcing Dreadnot, a piece of technology that enables the continuous deployment of software.

Continuous Deployment

Most people who come from a background in large-scale software have a horror story about a failed deployment.  As a reaction to this, many companies have procedures for releasing new builds and manually testing them.  The Cloud Monitoring team chose a different path.

Rather than deploying less frequently with more manual testing, we deploy more frequently, relying upon a culture of test-driven development, code review and extensive quality assurance automation to catch bugs early and minimize service interruptions.  Our maxim is that a new engineer should be able to push code into production on their first day on the job.

This helps us bring the Fanatical Experience to our product by keeping the time between a reported bug and its production fix as short as possible, which ensures a continuous flow of improvement to our software. One might assume that the extra time spent building more sophisticated testing and infrastructure is wasted; however, it pays off with more efficiency in other areas.

For example, slower deployments carry a constant overhead from diverging release and development branches, which becomes more complicated as more team members work on new features simultaneously.

As the change surface of a release grows, it becomes more difficult to ensure that any pushed build contains the desired fixes and features without breaking old functionality. Smaller incremental builds make it easier to track down production problems because each contains only a small number of changes; we don’t have to work out which of the twenty features in a release actually broke something.

Multi-Region Rolling Deployments

Continuous deployment doesn’t have to mean constantly dumping our code to all our servers and waiting for something to break. Since its inception, the Cloud Monitoring product has been designed to withstand widespread system failure. Our motto is “First to know. Last one standing.” Monitoring can’t be allowed to fail during a major data center failure – disaster scenarios are when our customers need monitoring to work the most. Rackers around the globe work to ensure major failures are as rare as possible, but the monitoring system must always be prepared.

Much of Cloud Monitoring’s resilience comes in the form of cross-region redundancy. Data points are gathered from five data centers around the globe, and every data point is independently processed in three. This gives us valuable options when it comes to deploying. We can take a data center offline with no customer impact, upgrade services running there, then bring it back online while carefully monitoring the impact of the upgrade.

Of course, this doesn’t mean we can push broken code without causing problems; it merely increases our chances of detecting certain classes of issues before they impact customers.

One key to maintaining our developer culture is having a single canonical way of doing things. There should be one correct way to run tests and one correct way to bundle our code. That is not to say each of these processes is simple; a cross-region rolling deployment is a many-step process. But as engineers, what do we do when we encounter repetitive multi-step processes?

Automation

The Cloud Monitoring team started by using Etsy’s Deployinator, but it didn’t meet our needs perfectly. Deployinator was developed for a single-region product and took some shortcuts, but the basic ideas were sound. We were also looking at using Deployinator for multiple products inside Rackspace, and each team faced making many customizations to Deployinator to fit the models we wanted. So we developed a new project, which we are open sourcing today, called Dreadnot.

Dreadnot Your Deployments!

Dreadnot is a relatively simple Node.js application built on top of the Express web framework and Twitter’s Bootstrap JavaScript and CSS utilities. It provides a control mechanism for the deployment process and an easily accessible view into it.

In Dreadnot there is the concept of a Stack, a series of tasks to deploy a specific piece of software.  The Stack defines how a deployment works and what code is being deployed.  For example, in Cloud Monitoring we have one stack for our monitoring pollers and another for our API services.

Under each Stack is a set of Regions. For each Region we track the currently deployed version of the Stack and compare it with the most recent version available on GitHub.

Under each Region you can see the complete history of a deployment in that region.  For an individual deployment, you can go back and view the entire log with all the details, or view the diff link for the changes that happened in Git.
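The Stack, Region and deployment-history relationship described above can be sketched as a small data model. This is only an illustration in Python, not Dreadnot's actual Node.js code; every class, field and region name below (`Stack`, `Region`, `Deployment`, `ord`, `dfw`) is a hypothetical choice made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Deployment:
    """One deployment attempt in a region (hypothetical record shape)."""
    from_revision: str
    to_revision: str
    log: List[str] = field(default_factory=list)
    succeeded: bool = False


@dataclass
class Region:
    """A region tracks what is deployed there and its deployment history."""
    name: str
    deployed_revision: str
    history: List[Deployment] = field(default_factory=list)

    def is_behind(self, latest_revision: str) -> bool:
        # True when the region runs something other than the latest revision.
        return self.deployed_revision != latest_revision


@dataclass
class Stack:
    """A deployable piece of software and the regions it runs in."""
    name: str
    latest_revision: str  # assumed to be the newest commit SHA in Git
    regions: Dict[str, Region] = field(default_factory=dict)

    def regions_needing_deploy(self) -> List[str]:
        return sorted(
            name for name, region in self.regions.items()
            if region.is_behind(self.latest_revision)
        )


# Example: one region is current, the other is one revision behind.
api = Stack(name="api", latest_revision="b2c3d4")
api.regions["ord"] = Region("ord", deployed_revision="b2c3d4")
api.regions["dfw"] = Region("dfw", deployed_revision="a1b2c3")
pending = api.regions_needing_deploy()
```

Here `pending` lists only the region whose deployed revision differs from the latest one.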

When a deployment is started, the log is streamed to all active clients using Socket.IO, and notifications are sent to plugins, which can include IRC or email.
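The fan-out of log lines to several consumers can be pictured as a simple publish/subscribe structure. The sketch below is Python and does not use Socket.IO; the `DeploymentLog` class is hypothetical, it treats browser clients and notification plugins as the same kind of listener, and replaying earlier lines to a late subscriber is an assumption made for the example rather than documented Dreadnot behavior.

```python
from typing import Callable, List

Listener = Callable[[str], None]


class DeploymentLog:
    """Keeps every log line and pushes new lines to all subscribers."""

    def __init__(self) -> None:
        self.lines: List[str] = []
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        # Assumption: a late subscriber first receives the lines written so far.
        for line in self.lines:
            listener(line)
        self._listeners.append(listener)

    def write(self, line: str) -> None:
        # Store the line, then deliver it to every current subscriber.
        self.lines.append(line)
        for listener in self._listeners:
            listener(line)


# Example: a browser client is connected from the start, an IRC plugin joins late.
log = DeploymentLog()
browser: List[str] = []
irc: List[str] = []

log.subscribe(browser.append)
log.write("draining ord")
log.subscribe(irc.append)
log.write("running chef-client")
```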

Git, Buildbot, Apache HTTPD and Chef Integrations

We built in a deep integration with our infrastructure to ensure both a high-quality product and a seamless user experience during the deployment of a Stack. Dreadnot finds the target revision SHA we wish to deploy from Git, and then talks to Buildbot about that specific revision. It then ensures that all test cases have passed in Buildbot, and that a tarball has been generated for this revision. If these builds haven’t started, Dreadnot will trigger Buildbot to build them, then wait for Buildbot to complete the tests and make sure the release tarball is available.
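The check, trigger and wait logic just described might look roughly like the following. This is a minimal Python sketch, not Buildbot's real API: `ensure_build`, its `get_status`, `trigger` and `wait` callables, the status strings, and `FakeBuildServer` are all hypothetical stand-ins, and a single status is assumed to cover both the test run and the tarball build.

```python
from typing import Callable, Dict, List, Optional


def ensure_build(
    revision: str,
    get_status: Callable[[str], Optional[str]],
    trigger: Callable[[str], None],
    wait: Callable[[], None],
    max_polls: int = 10,
) -> bool:
    """Return True once the build for `revision` has succeeded.

    Triggers a build if none exists, then polls up to `max_polls` times.
    """
    if get_status(revision) is None:
        trigger(revision)
    for _ in range(max_polls):
        status = get_status(revision)
        if status == "success":
            return True
        if status == "failure":
            return False
        wait()  # in practice this would sleep between polls
    return False  # gave up waiting


class FakeBuildServer:
    """In-memory stand-in for a build server, used only for this example."""

    def __init__(self, polls_until_done: int) -> None:
        self._remaining: Dict[str, int] = {}
        self._polls_until_done = polls_until_done
        self.triggered: List[str] = []

    def status(self, revision: str) -> Optional[str]:
        if revision not in self._remaining:
            return None  # no build has been started for this revision
        if self._remaining[revision] > 0:
            self._remaining[revision] -= 1
            return "running"
        return "success"

    def trigger(self, revision: str) -> None:
        self.triggered.append(revision)
        self._remaining[revision] = self._polls_until_done


server = FakeBuildServer(polls_until_done=2)
build_ok = ensure_build("abc123", server.status, server.trigger, wait=lambda: None)
```

Passing the build server operations in as callables keeps the waiting logic independent of whichever CI system is actually used.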

Once the build is tested and ready, Dreadnot reconfigures the load balancers in the target region. Using the balancer-manager feature of mod_proxy, it drains requests from the local API servers and sends all traffic to API servers located in a different region. This temporarily increases the request latency for some customers, but they experience zero downtime at the HTTP API level.

Dreadnot then updates a data bag on our Chef server to reference the revision built for this deployment. It then uses parallel SSH to execute chef-client on the machines in the desired region. Dreadnot uses a triggered chef-client run instead of daemon mode because we wanted it to control exactly when other non-code changes are made. Both code and configuration management changes introduce risk into the environment. The Cloud Monitoring team believes the best time to roll out Chef recipe changes is when customer traffic is already shifted to another region, so we wanted to treat recipe changes like a code deployment.
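Running chef-client on every machine in a region at once can be approximated with a thread pool. The sketch below is Python and is not how Dreadnot itself is implemented; `converge_region`, `failed_hosts` and the host names are hypothetical, and the default runner assumes that `ssh <host> sudo chef-client` works in your environment. The example swaps in a fake runner so that nothing is actually executed.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_chef_client_over_ssh(host: str) -> int:
    """Run chef-client once on `host` through the ssh CLI; return its exit code."""
    return subprocess.run(["ssh", host, "sudo", "chef-client"]).returncode


def converge_region(
    hosts: List[str],
    run: Callable[[str], int] = run_chef_client_over_ssh,
    max_workers: int = 8,
) -> Dict[str, int]:
    """Run `run` against every host in parallel and collect exit codes."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        exit_codes = list(pool.map(run, hosts))  # map preserves host order
    return dict(zip(hosts, exit_codes))


def failed_hosts(results: Dict[str, int]) -> List[str]:
    """Hosts whose run exited non-zero."""
    return sorted(host for host, code in results.items() if code != 0)


# Example with a fake runner: pretend the run on api2 fails.
def fake_run(host: str) -> int:
    return 1 if host == "api2" else 0


results = converge_region(["api1", "api2", "api3"], run=fake_run)
failures = failed_hosts(results)
```

Because the runs are triggered rather than left to a daemon, the caller decides exactly when configuration changes land, which matches the reasoning given above.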

Our Chef recipe downloads the remote tarball from our build servers, extracts it, updates a symbolic link and begins restarting services.  Once all of the chef-client runs are complete, Dreadnot runs tests against the upgraded servers and validates that the new version is running successfully.
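The extract-then-switch-symlink step is a common release pattern, and it can be illustrated outside of Chef. The following Python sketch is not the recipe itself: `activate_release` and the `releases`/`current` layout are hypothetical, the download step is skipped (the tarball is assumed to already be on disk), and the atomic swap relies on POSIX rename semantics.

```python
import os
import tarfile
import tempfile


def activate_release(tarball_path: str, releases_dir: str,
                     revision: str, current_link: str) -> str:
    """Extract a release tarball and point `current_link` at it."""
    target = os.path.join(releases_dir, revision)
    os.makedirs(target, exist_ok=True)
    with tarfile.open(tarball_path) as archive:
        archive.extractall(target)

    # Build the new link under a temporary name, then rename it over the old
    # one so readers never observe a missing `current` link.
    tmp_link = current_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(target, tmp_link)
    os.replace(tmp_link, current_link)
    return target


# Example: build a tiny tarball in a temp directory and activate it.
with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "src")
    os.makedirs(src)
    with open(os.path.join(src, "VERSION"), "w") as f:
        f.write("abc123")

    tarball = os.path.join(root, "abc123.tar.gz")
    with tarfile.open(tarball, "w:gz") as archive:
        archive.add(os.path.join(src, "VERSION"), arcname="VERSION")

    current = os.path.join(root, "current")
    activate_release(tarball, os.path.join(root, "releases"), "abc123", current)

    with open(os.path.join(current, "VERSION")) as f:
        active_version = f.read()
    link_is_symlink = os.path.islink(current)
```

Keeping each revision in its own directory also means the previous release stays on disk, so pointing the link back at it is a quick manual rollback.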

If any of these steps fails, Dreadnot stops and waits for human intervention while continuing to direct traffic to another region. Dreadnot was developed to assist with the most common multi-region deployments. However, for complicated deployments, or for deployments that hit a fatal error, you can proceed manually without interference from Dreadnot.

Assuming everything worked, Dreadnot then reconfigures the load balancers to bring traffic back to the region it just upgraded. This process is then repeated for the remaining regions.
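Put together, the region-by-region flow is a loop that drains a region, upgrades it, verifies it, and only then restores traffic, stopping at the first failure. The Python sketch below illustrates that control flow only; `rolling_deploy`, its four callables and the region names are hypothetical, and Dreadnot's real implementation is a Node.js application with its own task definitions.

```python
from typing import Callable, List


def rolling_deploy(
    regions: List[str],
    drain: Callable[[str], None],
    upgrade: Callable[[str], bool],
    verify: Callable[[str], bool],
    restore: Callable[[str], None],
) -> List[str]:
    """Upgrade regions one at a time; return the regions fully completed.

    On the first failure the region is left drained (traffic stays elsewhere)
    and the remaining regions are not touched, so a human can step in.
    """
    completed: List[str] = []
    for region in regions:
        drain(region)
        if not (upgrade(region) and verify(region)):
            break
        restore(region)
        completed.append(region)
    return completed


# Example: the upgrade fails in the second region.
events: List[str] = []


def record(step: str, ok: Callable[[str], bool] = lambda region: True):
    def run(region: str) -> bool:
        events.append(f"{step} {region}")
        return ok(region)
    return run


completed = rolling_deploy(
    ["ord", "dfw", "lon"],
    drain=record("drain"),
    upgrade=record("upgrade", ok=lambda region: region != "dfw"),
    verify=record("verify"),
    restore=record("restore"),
)
```

In this run only the first region completes; the second stays drained and the third is never started.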

We handle staging and other environments by giving each its own completely separate Dreadnot instance. We chose this partly for security reasons and partly to prevent accidents, so that testing and staging infrastructure are completely isolated from production.

Fork It

Dreadnot is open sourced under the Apache License version 2.0, and we hope it can be useful in deploying your own projects.  Rackspace has started using it on two different product teams inside the company, so while there are still some areas that could be made more generic and more features to add, we believe Dreadnot is at a good starting point.  Our team would love to see ideas from the community and pull requests to help make Dreadnot more helpful to everyone.

Conclusion

The Rackspace Cloud Monitoring team is fanatical about continuous deployment. We love being able to iterate quickly on our product and believe that our customers will get the best experience possible as a result. If you would like to try out the Rackspace Cloud Monitoring product, be sure to fill out the Private Beta application survey. Additionally, we are hiring folks who are interested in solving these kinds of problems from the inside.

Paul Querna is an engineer on the Rackspace Cloud Monitoring team.  Be sure to check out Paul’s blog and follow @pquerna on Twitter.

About the Author

This is a post written and contributed by Paul Querna.



5 Comments

Great article, and really glad to see you guys open-sourcing tools for the world to use.

Having said that, could you PLEASE make the annoying javascript ‘live chat’ button on the left margin of this blog go away. Move it to the empty right margin, or at least move the left margin of the blog over to the right? Basically anywhere but covering the words you want me to read. That is exactly in the spot where I normally keep my eyes to read a page, and I had to either scroll up or down to read the article, and force myself to read above or below where I am comfortable. It’s horrible UX to cover the content people come here to read, without even the option to dismiss it.

Keep up the great work with OSS.

Michael W on January 5, 2012

Michael –

Thanks for your feedback. That is a good point and we will look into getting the slider moved so it is not covering the important stuff.

-Angela

Angela Bartels [Racker] on January 6, 2012

Thank you for sharing how the professionals execute release management in an effective automated fashion for a high availability architecture.

Have your intensive managed hosting teams or critical sites teams been trained on using these tools? Would they be able to assist end users to configure this type of process for websites hosted with that offering?

Steve Stonebraker on January 6, 2012

Hi! Nice stuff :) Any plans to add Jenkins integration in addition to Buildbot?

Cheers!
-jay

Jay Pipes on January 13, 2012

Puppet integration would be a great feature as well.

James Martin on January 18, 2012

©2014 Rackspace, US Inc.