Protecting against and recovering from a host server down (HSD) issue

The Rackspace Cloud is supported by thousands of hardware Host Servers that work together to create the virtual space we refer to as the Cloud. Each of these Host Servers supports a small piece of the Cloud. Your Cloud Server resides on one of these Host Servers.

Hardware failures are inevitable. When a Host Server experiences a hardware failure that takes it offline from the Cloud, the event is known as a host server down (HSD) issue. When this happens, all Cloud Servers that reside on that Host Server go offline with it. Most of the time, the hardware issue can be corrected and the Host Server placed back online in a short period of time.

Rackspace configures Host Servers with a redundant disk configuration (RAID) to help protect your data. Sometimes, however, a hardware failure is so severe that bringing your Cloud Server back online takes an extended period of time, or the data on the Host Server is corrupted beyond recovery and your Cloud Server is lost along with all the data on it. This article addresses two main points: protecting your solution from HSDs and other types of outages, and recovering from an HSD and other types of outages.


Protecting your solution from HSDs and other types of outages

If your solution relies on a single Cloud Server, then your solution goes offline for any issue that causes the server to be unreachable from the Internet. Such issues can vary widely: an HSD issue, a networking issue, or an issue on the Cloud Server itself such as faulty code or a misconfigured environment.

The answer to protecting your solution from going offline is redundancy. Redundancy is having multiple pieces of your solution doing the same thing at the same time so that if one of those pieces goes offline the other pieces continue to carry the load and your customers continue to get service.

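Consider the following diagram:

[Diagram: a load balancer in front of two web servers, which connect to a primary and a secondary database server]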

This diagram shows a load-balanced solution with two web servers and two database servers. This is a minimal configuration that protects both your web presence and your data.

The load balancer acts as the point of entry into your solution and balances the incoming customer traffic among all of your web servers. As the point of entry, the IP address assigned to your load balancer is used for your solution, so it is the only IP address that you need for your DNS. Redundancy is already built into our load balancers, so one is all you need for your configuration. Another advantage to using a load balancer is that the web tier (the number of web servers) can be expanded or contracted as needed to handle increased and decreased traffic.

In this configuration, your web servers are all identical, so which server your customer uses does not matter. If one of your web servers goes offline, the other server or servers continue to handle the customer traffic. This redundancy protects your web presence.
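The behavior described above can be illustrated with a small simulation. This is a minimal sketch, not how Rackspace load balancers are implemented: the `RoundRobinBalancer` class, the round-robin policy, and the server names `web1` and `web2` are all assumptions made for illustration.

```python
from itertools import cycle


class RoundRobinBalancer:
    """Toy balancer (hypothetical): rotates through servers, skipping offline ones."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.online = set(self.servers)
        self._ring = cycle(self.servers)

    def mark_offline(self, server):
        # Simulate a server becoming unreachable (for example, during an HSD).
        self.online.discard(server)

    def mark_online(self, server):
        if server in self.servers:
            self.online.add(server)

    def pick(self):
        # Look at each server at most once per request.
        for _ in range(len(self.servers)):
            candidate = next(self._ring)
            if candidate in self.online:
                return candidate
        raise RuntimeError("no web servers are online")


balancer = RoundRobinBalancer(["web1", "web2"])
first_two = [balancer.pick() for _ in range(2)]      # traffic alternates

balancer.mark_offline("web1")                        # one web server goes down
after_failure = [balancer.pick() for _ in range(3)]  # remaining server takes all traffic
```

With both servers online, requests alternate between them. After `web1` is marked offline, every request goes to `web2`, so customers continue to get service.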

The database servers are separate from the web servers so that all of the web servers have real-time access to the same data at the same time. The second database server mirrors the primary database server and can become the primary database server if the current primary database server goes offline. This redundancy protects your data as well as your online presence. Additionally, separating your database servers from your Internet traffic creates a more secure environment and provides added protection for your data.
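The mirroring and promotion idea can be sketched in a few lines. This is a simplified model under stated assumptions: the `DatabasePair` class and the names `db1` and `db2` are hypothetical, and writes are mirrored synchronously, whereas a real database replication setup might lag behind the primary.

```python
class DatabasePair:
    """Toy model (hypothetical) of a primary database with one mirror."""

    def __init__(self, primary, replica):
        self.primary = primary
        self.replica = replica
        # Each server holds its own copy of the data.
        self.rows = {primary: {}, replica: {}}

    def write(self, key, value):
        # Write to the primary and mirror the change to the replica, if any.
        self.rows[self.primary][key] = value
        if self.replica is not None:
            self.rows[self.replica][key] = value

    def read(self, key):
        return self.rows[self.primary][key]

    def fail_primary(self):
        # Simulate losing the primary: its data is gone, the mirror is promoted.
        if self.replica is None:
            raise RuntimeError("no replica left to promote")
        del self.rows[self.primary]
        self.primary, self.replica = self.replica, None


pair = DatabasePair("db1", "db2")
pair.write("order-1", "paid")
pair.fail_primary()                  # db1 is lost; db2 becomes the primary
surviving = pair.read("order-1")     # the data is still available
```

After the failover the pair has no mirror left, so in this model a new secondary would need to be added to restore redundancy.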

You might be thinking that this solution is more expensive than a single-server solution, and you are right. But consider this: how much does an hour of downtime cost you, in both money and reputation? Only you can answer that and determine whether keeping your solution online and protecting your data is worth the additional expense.
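One way to frame the question is as a break-even calculation. The sketch below is only an illustration: the function name is hypothetical, and the dollar figures are placeholder inputs, not Rackspace pricing or measured downtime costs. It also ignores reputation, which is harder to put a number on.

```python
def break_even_downtime_hours(extra_monthly_cost, downtime_cost_per_hour):
    """Hours of avoided downtime per month at which redundancy pays for itself."""
    if downtime_cost_per_hour <= 0:
        raise ValueError("downtime_cost_per_hour must be positive")
    return extra_monthly_cost / downtime_cost_per_hour


# Placeholder figures for illustration only; substitute your own numbers.
hours = break_even_downtime_hours(
    extra_monthly_cost=300.0,        # assumed added cost of the redundant servers
    downtime_cost_per_hour=1200.0,   # assumed revenue lost per hour offline
)
print(f"Redundancy pays for itself after {hours} hours of avoided downtime per month")
```

With these placeholder inputs the result is 0.25, meaning the redundant configuration would pay for itself if it prevents 15 minutes of downtime in a month.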

In addition to the previous solution, it’s always a good idea to maintain a current backup copy of your data outside the Cloud and maintain a recent or daily image of your web and database servers here at Rackspace. Solutions such as Rackspace Cloud Backup and Cloud Server Snapshots are very inexpensive.


Recovering from an HSD and other types of outages

When you have a single Cloud Server solution, recovery can be problematic. When your server is down from an HSD issue, or for any other reason, your main concern is getting your solution back online to serve your customers. Our operations team works as fast as possible to get your solution back online, and normally this happens quickly.

If time is of the essence and you need to get your server back online as fast as possible, or if your HSD resulted in total data loss, here are some helpful steps you can follow:

    1. Rebuild your server from your most recent backup image.
      If you are not using a load balancer or if you need to keep the same IP address, contact Rackspace Support (1-800-961-4454) before rebuilding the server. Tell Support that you need to keep the same IP address for the server in the rebuild process. If the issue is close to being resolved, Support might not perform the rebuild.
    2. If you are using a load balancer, configure the load balancer to use the new server’s IP address.
    3. Restore your data from your most recent backup copy of your data. If you use Rackspace Cloud Backup, this process is easy. If you use another method, you will probably have to copy your data to your server manually.
    4. Bring your server online.

      The preceding steps might raise the following questions:

      • What if you don’t maintain backup images of your server?
        You can create a new server from scratch and load your software and data onto that server. Configure your server as needed. Keep backup images in the future.
      • What if you don’t have a backup copy of your data?
        This is bad news, and we don’t have a solution for you. Your data is the heart of your business, and you should protect it. Consider Cloud Backup or other backup solutions to regularly back up your data outside the Cloud environment.
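The recovery steps and the two questions above can be summarized as a simple decision checklist. This is a sketch for planning purposes only: the `recovery_plan` function and its parameters are hypothetical, it does not call any Rackspace API, and it only assembles the steps described in this article in order.

```python
def recovery_plan(has_server_image, has_data_backup, uses_load_balancer,
                  must_keep_ip=False):
    """Return the ordered recovery steps for a lost server (illustrative only)."""
    steps = []

    # Without a load balancer, or when the IP address must be kept, talk to Support first.
    if not uses_load_balancer or must_keep_ip:
        steps.append("Contact Rackspace Support before rebuilding to keep the same IP address")

    if has_server_image:
        steps.append("Rebuild the server from the most recent backup image")
    else:
        steps.append("Create a new server from scratch, then install and configure the software")

    if uses_load_balancer:
        steps.append("Configure the load balancer to use the new server's IP address")

    if has_data_backup:
        steps.append("Restore data from the most recent backup copy")
    else:
        steps.append("No data backup exists: the data cannot be restored")

    steps.append("Bring the server online")
    return steps


# Best case: image and data backup exist, and a load balancer fronts the server.
best_case = recovery_plan(True, True, True)

# Worst case: single server, no image, no data backup.
worst_case = recovery_plan(False, False, False)

for number, step in enumerate(worst_case, start=1):
    print(f"{number}. {step}")
```

Running the function for your own situation before an outage happens shows quickly which safeguards are missing.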

As always, you can call Rackspace Support for advice and assistance at 1-800-961-4454.

After a recovery of this magnitude, you probably have an excellent idea of what an hour of downtime costs you and your customers. Before it happens again, consider the suggestions outlined in the "Protecting your solution from HSDs and other types of outages" section of this article.

© 2015 Rackspace US, Inc.

Except where otherwise noted, content on this site is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License
