
Implementing High Availability (HA) For Rackspace Private Cloud

At MySQL Connect 2013, I spoke about and demoed High Availability (HA) of MySQL for OpenStack, and it was very well received. I’ve heard from many customers who view HA as too complex and do not want to touch it with a 10-foot network cable: there are a number of interrelated concepts, and implementing and testing them all is a pain. On the other hand, the consequences of not implementing High Availability can be disastrous for a business.

I am really impressed by the simplicity and elegance with which HA is implemented in the Rackspace Private Cloud and I decided to write about it.  In this article, I’ll try to demystify some of the concepts of HA and explain how they are implemented in the Rackspace Private Cloud.

Of course, the beauty of OpenStack is that if you prefer an alternate HA method you can go ahead and implement it; many of the concepts will still be applicable no matter what implementation you use. Once you understand these concepts, you can make the applications that run on top of OpenStack highly available: build a solution yourself, or evaluate and buy a third-party solution, depending on your needs.

I am scheduled to deliver an HA talk on this topic at the upcoming OpenStack Summit in Hong Kong next month. Feedback is strongly encouraged and welcome!

HA Concepts

HA involves:

  1. Minimization of data loss and
  2. Minimization of system down time

HA is measured in terms of 9s. For instance, a system that is 99.99 percent (four nines) available is down for less than one hour per year.
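The downtime budget behind each number of nines is simple arithmetic: minutes of downtime per year = 525,600 × (1 − availability). A quick shell sketch:

```shell
# Yearly downtime budget per availability level:
# minutes/year = 525600 * (1 - availability)
for a in 0.99 0.999 0.9999 0.99999; do
    awk -v a="$a" 'BEGIN { printf "%s -> %.1f minutes/year\n", a, 525600 * (1 - a) }'
done
```

Four nines works out to roughly 53 minutes of downtime per year, hence the "less than one hour" figure above.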

HA is achieved by eliminating Single Points of Failure (SPOF) through redundancy of resources such as nodes, switches, routers, storage, power, cooling facilities and so on. Here, we look specifically at application-level HA.

The migration of a service from a primary resource to a secondary resource is referred to as failover. The migration of the service back to the primary resource is referred to as failback. A switchover is when the migration is initiated manually.

Redundant infrastructure can run in Active-Active mode, in which all resources serve simultaneously, so the failure of a single resource usually results only in degraded service. In an Active-Passive mode infrastructure, the services are continually monitored and a backup or secondary resource assumes the role of the primary resource in the event that the primary fails. This failover is generally referred to as a “speed bump,” since it takes a finite amount of time to happen.

Services can be stateless or stateful. In a stateless service there is no dependency between requests; in a stateful service, requests depend on one another. HA for a stateful service is more complex, since it usually goes hand in hand with data replication.

Since redundant nodes are involved, data needs to be replicated between them. There are different forms of replication, including master-master, master-slave and multi-master. The choice between these forms dictates the consistency model and how scale-out can be achieved.
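As a concrete illustration, master-master MySQL replication between two nodes is typically configured with staggered auto-increment settings so both masters can accept writes without generating colliding keys. The my.cnf fragments below are a hand-written sketch using stock MySQL options, not the configuration any particular tool generates:

```
# Node A (/etc/mysql/my.cnf)
[mysqld]
server-id                = 1
log-bin                  = mysql-bin
auto-increment-increment = 2   # step by the number of masters...
auto-increment-offset    = 1   # ...each node starting at a distinct offset

# Node B (/etc/mysql/my.cnf)
[mysqld]
server-id                = 2
log-bin                  = mysql-bin
auto-increment-increment = 2
auto-increment-offset    = 2
```

With these settings, node A generates odd auto-increment IDs and node B even ones, so writes on either master replicate cleanly to the other.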

To implement HA, the capabilities identified in this diagram need to be provided on a per-service basis.

In this example, there is a redundancy of nodes. The Replication Services deal with data replication (as in MySQL master-master replication). The Health Check and Cluster Communication modules monitor the heartbeat of the service (as in Keepalived and HAProxy) and manage failover or failback as required.

Let’s look at how these concepts relate to the Rackspace Private Cloud.

HA for Rackspace Private Cloud

The Health Check module is implemented via Keepalived, which is based on the Linux Virtual Server (IPVS) kernel module and provides layer 4 load balancing. It maintains health checkers to monitor the health of a service and takes corrective action should that service fail. The Virtual Router Redundancy Protocol (VRRP) eliminates the SPOF by assigning a Virtual IP (VIP) to each service and binding it to the service instance running on the functioning controller; Keepalived uses VRRP to move the VIPs around. The MySQL and RabbitMQ services are managed via Keepalived. HAProxy provides load balancing for HTTP- and TCP-based applications and load balances the API services. The following diagram illustrates how this works.
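To make the VRRP piece concrete, a Keepalived instance for the MySQL VIP might look roughly like the fragment below. This is an illustrative sketch, not the configuration the Chef recipes actually generate; the interface name, priority and advertisement interval are assumptions:

```
# Hypothetical /etc/keepalived/keepalived.conf fragment (illustrative only)
vrrp_instance vi_mysql_db {
    state MASTER
    interface eth0              # interface carrying the management network
    virtual_router_id 12        # matches the "vrid" chosen for this VIP
    priority 100                # the peer controller would use a lower priority
    advert_int 1                # VRRP advertisement interval, in seconds
    virtual_ipaddress {
        192.168.210.198         # the MySQL VIP
    }
}
```

The peer controller runs a matching instance with the same virtual_router_id and a lower priority; when VRRP advertisements from the master stop, the peer claims the VIP.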

It’s worth noting that Keepalived, VRRP and HAProxy are not fundamentally new. They are used in concert to implement HA on Linux systems based on the concepts that were discussed earlier. Rackspace Private Cloud HA for the API services, which are stateless, is Active-Active and HAProxy will load balance between the nodes. MySQL and RabbitMQ, which are stateful services, are Active-Passive. MySQL uses master-master replication and RabbitMQ uses the built-in replication feature of RabbitMQ Cluster. Rackspace Private Cloud installs and configures the requisite modules using Chef cookbooks and recipes.
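For the stateless API services, the usual HAProxy shape is one listener per service bound to the VIP, balancing across both controllers. The fragment below is an illustrative sketch (the port, server names, addresses and health-check option are assumptions, not the generated configuration):

```
# Hypothetical /etc/haproxy/haproxy.cfg fragment (illustrative only)
listen keystone-service-api
    bind 192.168.210.197:5000        # the shared API VIP
    balance roundrobin               # spread requests across both controllers
    option httpchk                   # drop a backend whose health check fails
    server controllerhaone 192.168.210.103:5000 check
    server controllerhatwo 192.168.210.104:5000 check
```

Because both backends are active, a controller failure simply removes one server from rotation rather than interrupting the API.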

If an application uses the VIP instead of the IP address of the respective controller, the service will still be available in the event of the failure of the original controller node.

Now, let’s look at how to implement HA on the Rackspace Private Cloud.

Implementing HA for the Rackspace Private Cloud

In the blog post installing Rackspace Private Cloud in 20 minutes or less, I outlined how to install Rackspace Private Cloud on the open cloud without HA. I followed the same (or similar) steps to install Rackspace Private Cloud on a virtualized environment on my laptop with HA.

Add the VIP entries in the environment file rpcs.json in the override_attributes section as shown below. In this example, the API services are bound to 192.168.210.197, MySQL is bound to 192.168.210.198 and RabbitMQ is bound to 192.168.210.199. This set of IP addresses was randomly chosen from the IP addresses in the DHCP server range (192.168.210.0/24).

    "vips": {
            "cinder-api": "192.168.210.197",
            "glance-api": "192.168.210.197",
            "glance-registry": "192.168.210.197",
            "horizon-dash": "192.168.210.197",
            "horizon-dash_ssl": "192.168.210.197",
            "keystone-admin-api": "192.168.210.197",
            "keystone-service-api": "192.168.210.197",
            "keystone-internal-api": "192.168.210.197",
            "nova-api": "192.168.210.197",
            "nova-ec2-public": "192.168.210.197",
            "nova-novnc-proxy": "192.168.210.197",
            "nova-xvpvnc-proxy": "192.168.210.197",
            "mysql-db": "192.168.210.198",
            "rabbitmq-queue": "192.168.210.199",
            "config": {
               "192.168.210.197": {
                   "vrid": 11,
                   "network": "management"
               },
               "192.168.210.198": {
                   "vrid": 12,
                   "network": "management"
               },
               "192.168.210.199": {
                   "vrid": 13,
                   "network": "management"
               }
            }
        },
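One property worth noting in the config map above: each VIP carries its own VRRP router ID (vrid), and VRIDs must be unique on a given network segment, or separate VRRP instances will fight over the same virtual router. A quick, hypothetical sanity check over an inline copy of the example map:

```shell
# VRIDs must be unique per segment; this checks an inline copy of the
# example "config" map above for duplicates.
cat > /tmp/vips.json <<'EOF'
{ "192.168.210.197": { "vrid": 11, "network": "management" },
  "192.168.210.198": { "vrid": 12, "network": "management" },
  "192.168.210.199": { "vrid": 13, "network": "management" } }
EOF
vrids=$(grep -o '"vrid": [0-9]*' /tmp/vips.json | awk '{print $2}')
dups=$(echo "$vrids" | sort | uniq -d)
[ -z "$dups" ] && echo "VRIDs $(echo $vrids) are unique" || echo "duplicate VRIDs: $dups"
```

In a real deployment the same uniqueness rule applies to any other VRRP users (hardware routers included) sharing the management network.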

After you have modified rpcs.json, install two controller nodes using Chef. The following command installs a primary controller on the node named controllerhaone:

knife bootstrap controllerhaone -E rpcs -r 'role[ha-controller1]'

The following command installs a secondary controller on the node named controllerhatwo:

knife bootstrap controllerhatwo -E rpcs -r 'role[ha-controller2]'

The second command also discovers the first controller and installs the recipes required to monitor the services and to fail over a service, via its VIP, from one controller to the other in case of a controller failure.

After installing the controller nodes, executing the ip addr command shows that the VIPs are bound to the primary controller (controllerhaone):

root@controllerhaone:~# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:61:ff:95 brd ff:ff:ff:ff:ff:ff
    inet 192.168.210.103/24 brd 192.168.210.255 scope global eth0
    inet 192.168.210.198/32 scope global eth0
    inet 192.168.210.197/32 scope global eth0
    inet 192.168.210.199/32 scope global eth0
    inet6 fe80::20c:29ff:fe61:ff95/64 scope link 
       valid_lft forever preferred_lft forever

In the event of the failure of the primary controller, the VIPs will “move” to the secondary controller (controllerhatwo) via the services installed as part of the Rackspace Private Cloud HA setup. Before the failure, running the ip addr command on the secondary controller shows that no VIPs are bound to it:

root@controllerhatwo:~# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:e8:dd:0d brd ff:ff:ff:ff:ff:ff
    inet 192.168.210.104/24 brd 192.168.210.255 scope global eth0
    inet6 fe80::20c:29ff:fee8:dd0d/64 scope link 
       valid_lft forever preferred_lft forever

After a failure of the primary controller, the VIP is now bound to the secondary controller as shown in the following example:

root@controllerhatwo:~# ip addr
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN 
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 scope host lo
    inet6 ::1/128 scope host 
       valid_lft forever preferred_lft forever
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UP qlen 1000
    link/ether 00:0c:29:e8:dd:0d brd ff:ff:ff:ff:ff:ff
    inet 192.168.210.104/24 brd 192.168.210.255 scope global eth0
    inet 192.168.210.198/32 scope global eth0
    inet 192.168.210.197/32 scope global eth0
    inet 192.168.210.199/32 scope global eth0
    inet6 fe80::20c:29ff:fee8:dd0d/64 scope link 
       valid_lft forever preferred_lft forever

The services continue to work even after a failure. That’s all there is to it!

Summary

HA on the Rackspace Private Cloud is implemented using Keepalived, VRRP and HAProxy. As we discussed in this article, making OpenStack services Highly Available is as simple as providing the requisite failover information in the environment file and installing the appropriate roles on the controller nodes.

This implementation relies on master-master MySQL replication and is therefore limited to two controller nodes. When more than two controller nodes are involved, an alternate form of HA is required. The OpenStack High Availability Guide is an excellent resource and discusses alternate methods of HA such as Galera, DRBD, Pacemaker and Corosync. The Rackspace Private Cloud Knowledge Center goes more in-depth into the installation process.

If you are at the OpenStack Summit in Hong Kong, stop by my talk and say hi!

About the Author

This is a post written and contributed by Raghavan "Rags" Srinivas.

Raghavan "Rags" Srinivas works as a solutions architect at Rackspace, where he finds himself constantly challenged by everything from low-level networking to high-level application issues. His general focus area is distributed systems, with a specialization in cloud computing and big data. He worked on Hadoop, HBase and NoSQL during their early stages. He has spoken on a variety of technical topics at conferences around the world, conducted and organized hands-on labs and taught graduate classes in the evening.

Rags brings with him over 20 years of hands-on software development and over 10 years of architecture and technology evangelism experience. He has evangelized and influenced the architecture of a number of technology areas. He is also a repeat JavaOne rock star speaker award winner.

Rags holds a Masters degree in Computer Science from the Center of Advanced Computer Studies at the University of Louisiana at Lafayette. He likes to hike, run and generally be outdoors, but most of all he loves to eat.

