DevOps: How to Monitor Virtual Server Resize and Migration Progress with the API

By Lee Kimber

This guide shows how to use the Rackspace Cloud Servers API to better track the progress of virtual server resizes. It shows how to better predict when a resized virtual server will go offline and come back online in its resized state. Very useful information to have when you resize production servers!

You can extend this API-based tracking method to semi-automate resize progress-tracking or to script your own resize progress-monitoring solutions.

First, a quick review of what Rackspace Cloud Servers – and OpenStack – do once a customer initiates a resize. Note, these same principles apply to virtual server moves and migrations initiated by Rackspace Cloud's Support and Operations teams when they move customers' virtual machines during a migration. The only difference in the case of moves and migrations is that the virtual servers will not change size.

A resize progresses through these steps:

Step 1. A copy of the virtual server's filesystem is created on another hypervisor. This creates a duplicate – and bootable – virtual server image. The original virtual server is left running during this process.

Step 2. The original virtual server is brought offline and any changes that have occurred since the original filesystem was first copied are now also copied over.

Step 3. The hypervisor's configuration file for the virtual server is copied over from the original hypervisor to the hypervisor that hosts the second copy of the virtual server.

Step 4. On completion of the second copy, the Rackspace Cloud boots the second filesystem. It retains the IP address of the original server.

Step 5. The Control Panel now reports that the virtual server is up and is in 'Verify Resize' mode. In other words, the system is waiting for the customer to confirm that the new larger/smaller virtual server is up and running OK before it deletes the original filesystem on the source hypervisor.

In the REACH Control Panel, all the above steps are displayed as just two steps:

1. Building (Step 1 of 2)

2. Verify (Step 2 of 2)

The step labelled as “Building (Step 1 of 2)” wraps the first four steps into just one action. You can add detail to your view of what is happening during this stage if you use the 'curl' tool to make HTTP requests against the Rackspace Cloud Servers API. You cannot – unfortunately – unpack this “Building” stage into each of its four steps but you can separate step one from steps two, three and four. That will give you enough extra information to produce a more accurate prediction of when the virtual server will go offline and come back online.

Here's how:

Pre-requisites:

You will need the HTTP querying tool 'curl'. It is available for Windows and Linux. Rackspace describes curl-installation at:

http://www.rackspace.com/knowledge_center/article/cloud-files-curl-cookbook

Get Your Data Together:

To use the API with curl, you first need to query the API with your Rackspace Cloud account username and API key to get an authentication token that you can use for subsequent API queries. How to obtain your API key and the ins and outs of obtaining an authentication token are described in detail at:

http://docs.rackspacecloud.com/servers/api/v1.0/cs-devguide-20091015.pdf

Distilled to its command-line curl essentials, you submit your API-key and account username to the API in the format:

curl -H "x-auth-key:RACKSPACE_API_KEY" -H "x-auth-user:RACKSPACE_USERNAME" https://lon.identity.api.rackspacecloud.com/v1.0 -i

which looks like this on the command-line:
$ curl -H "x-auth-key:d4750dkvkdft9dkdf9erdfd19df5ge2e3" -H "x-auth-user:testcustomer" https://lon.identity.api.rackspacecloud.com/v1.0 -i

All being well, you will receive a multi-line result similar to the one below, which contains your token in the X-Auth-Token header five lines from the bottom:

HTTP/1.1 204 No Content
Server: nginx/0.8.55
Date: Fri, 30 Nov 2012 09:34:41 GMT
Connection: keep-alive
X-Storage-Token: d4750dkvkdft9dkdf9erdfd19df5ge2e3
X-Storage-Url: https://storage101.lon3.clouddrive.com/v1/MossoCloudFS_d76dcd9f-fe24-4c58-a279-cfc0543082aa
X-Server-Management-Url: https://lon.servers.api.rackspacecloud.com/v1.0/10002194
response-source: cloud-auth
X-CDN-Management-Url: https://cdn3.clouddrive.com/v1/MossoCloudFS_d76dcd9f-fe24-4c58-a279-cfc0543082aa
X-Auth-Token: 9d11eef0-aafc-407e-a854-d0110e4703a7
vary: Accept, Accept-Encoding, X-Auth-Token, X-Auth-Key, X-Storage-User, X-Storage-Pass, X-Auth-User
Cache-Control: s-maxage=61451
VIA: 1.0 Repose (Repose/2.3.5)
Front-End-Https: on
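
If you would rather script this step than read the token off the screen, the header response can be parsed as plain text. The Python sketch below is one possible way to do it, not an official client: the function name `parse_auth_headers` is hypothetical, and it assumes you have captured the output of the `curl -i` command above as a string. The sample is a shortened copy of the response shown above.

```python
def parse_auth_headers(raw_response):
    """Return the headers from 'curl -i' output as a dict.

    Header names are lower-cased because HTTP header names are
    case-insensitive. The status line and any line without a colon
    are skipped.
    """
    headers = {}
    for line in raw_response.splitlines():
        if line.startswith("HTTP/") or ":" not in line:
            continue
        # Split on the first colon only, so URLs in the value survive.
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


# Shortened sample of the authentication response shown above.
sample = """HTTP/1.1 204 No Content
X-Server-Management-Url: https://lon.servers.api.rackspacecloud.com/v1.0/10002194
X-Auth-Token: 9d11eef0-aafc-407e-a854-d0110e4703a7
Front-End-Https: on
"""

headers = parse_auth_headers(sample)
token = headers["x-auth-token"]
print(token)
```

Looking the token up by header name rather than by its position in the output means the script keeps working if the API adds or reorders headers.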

You also need your account number and the long identifier – the UUID – of any server about which you want to acquire data. Both are visible in Rackspace Cloud's Control Panel: the account number is in the top right by your username, and each server's UUID appears when you mouse over the server's name in the server list.

Query the API:
For this example, assume you have your token, your account number, and the UUID of the server you wish to resize, as below:

X-Auth-Token: 9d11eef0-aafc-407e-a854-d0110e4703a7
Server UUID: c4e6f56c-1f1a-4ba0-8af7-a60427ade306
Account Number: 20009196

You will substitute these values into the following curl command:

curl -H "X-Auth-Token: TOKEN" https://lon.servers.api.rackspacecloud.com/v2/ACCOUNT_NUMBER/servers/UUID

to create a command that looks like:

curl -H "X-Auth-Token: 9d11eef0-aafc-407e-a854-d0110e4703a7" https://lon.servers.api.rackspacecloud.com/v2/20009196/servers/c4e6f56c-1f1a-4ba0-8af7-a60427ade306

However, if you issue the command like this, you will have to re-issue it repeatedly to catch the reported changes as the resize progresses. Instead, automate your update requests to run every five seconds by prefixing the command with 'watch -n 5'. Your fully-formatted command now looks like:

watch -n 5 'curl -H "X-Auth-Token: a44c0e51-88c1-4336-87dd-925b213fb59f" https://lon.servers.api.rackspacecloud.com/v2/10002194/servers/b4e6f56c-1f1a-4ba0-8af7-a60427acf306'

The resulting output is long and looks like:

{"server": {"status": "ACTIVE", "updated": "2012-11-28T13:58:59Z", "hostId": "de9fd64b798ec1967f83ff0ec515ae240b6f04d1759fe40b755ec26c", "addresses": {"public": [{"version": 4, "addr": "5.79.19.117"}, {"version": 6, "addr": "2a00:1a48:7804:0110:0b05:d430:ff08:21e2"}], "private": [{"version": 4, "addr": "10.178.133.199"}]}, "links": [{"href": "https://lon.servers.api.rackspacecloud.com/v2/10002194/servers/b4e6f56c-1f1a-4ba0-8af7-a60427acf306", "rel": "self"}, {"href": "https://lon.servers.api.rackspacecloud.com/10002194/servers/b4e6f56c-1f1a-4ba0-8af7-a60427acf306", "rel": "bookmark"}], "image": {"id": "0bad246a-f4e7-44d6-9d81-d54833470309", "links": [{"href": "https://lon.servers.api.rackspacecloud.com/10002194/images/0bad246a-f4e7-44d6-9d81-d54833470309", "rel": "bookmark"}]}, "OS-EXT-STS:task_state": null, "OS-EXT-STS:vm_state": "active", "flavor": {"id": "2", "links": [{"href": "https://lon.servers.api.rackspacecloud.com/10002194/flavors/2", "rel": "bookmark"}]}, "id": "b4e6f56c-1f1a-4ba0-8af7-a60427acf306", "rax-bandwidth:bandwidth": [{"interface": "public", "bandwidth_outbound": 145067, "bandwidth_inbound": 16821299, "audit_period_start": "2012-11-28T00:00:00Z", "audit_period_end": "2012-11-28T17:05:30Z"}, {"interface": "private", "bandwidth_outbound": 552, "bandwidth_inbound": 11550470, "audit_period_start": "2012-11-28T00:00:00Z", "audit_period_end": "2012-11-28T17:05:30Z"}], "user_id": "793", "name": "lb_test1", "created": "2012-11-15T10:37:14Z", "tenant_id": "10002194", "OS-DCF:diskConfig": "AUTO", "accessIPv4": "5.79.19.117", "accessIPv6": "2a00:1a48:7804:0110:0b05:d430:ff08:21e2", "progress": 100, "OS-EXT-STS:power_state": 1, "metadata": {}}}

But you are only interested in a couple of pieces of data in this output, and their positions in the output make them easy to find. Almost right at the start, you see:

"status": "ACTIVE".

You will want to watch changes to that “ACTIVE” flag, which gives us a broad indication of the virtual server's status.

Secondly, right in the centre of the screed of output is:

"OS-EXT-STS:task_state": null, "OS-EXT-STS:vm_state": "active", "flavor": {"id": "2"

Various parts of this center-string will change as a resize progresses. They are:

“task_state” - indicates what customer-requested action is in progress for this virtual server.
“vm_state” - indicates a more detailed view of the action in progress for this virtual server.
“id” – the id inside the “flavor” block ("2" in the output above), which indicates the virtual server's package size, as described in http://docs.rackspacecloud.com/servers/api/v1.0/cs-devguide-20091015.pdf
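
Rather than hunting for these values by eye, you can pull them out of the JSON with a few lines of script. Here is a minimal Python sketch; the function name `summarise` and the key names of the dictionary it returns are my own choices for illustration, while the JSON keys it reads are the ones in the output above. The sample is a cut-down version of that output.

```python
import json


def summarise(raw_json):
    """Extract the fields worth watching from a server details response."""
    server = json.loads(raw_json)["server"]
    return {
        "status": server["status"],
        "task_state": server["OS-EXT-STS:task_state"],
        "vm_state": server["OS-EXT-STS:vm_state"],
        "flavor_id": server["flavor"]["id"],
    }


# Cut-down copy of the response shown above.
sample = (
    '{"server": {"status": "ACTIVE", '
    '"OS-EXT-STS:task_state": null, '
    '"OS-EXT-STS:vm_state": "active", '
    '"flavor": {"id": "2", "links": []}}}'
)

summary = summarise(sample)
print(summary)
```

Note that a JSON `null` arrives in Python as `None`, so a script should test `task_state is None` rather than compare it with the string "null".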

For easier reading, I've shown how they change through a resize in the table below.

| Resize Stage | status | task_state | id |
|---|---|---|---|
| Resize not initiated | ACTIVE | null | 2 |
| Resize request issued in control panel, first copy under way | RESIZE | resize_migrating | 2 |
| Second copy under way | RESIZE | resize_finish | 3 |
| Resized slice booted | VERIFY_RESIZE | null | 3 |

This table shows that while the Control Panel conflates the first and second filesystem copies into one 'Resize' stage – reflected in the “status” variable – the “task_state” variable differentiates between the two copies, labelling the first as “resize_migrating” and the second as “resize_finish”. Importantly for production systems, you can use this extra granularity to predict approximately when the slice will go offline as part of the second filesystem copy.
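
The table above maps naturally onto a small lookup keyed on the “status” and “task_state” values. The Python sketch below is one way to express it; the names `RESIZE_STAGES`, `resize_stage` and `about_to_go_offline` are hypothetical, and it only knows about the combinations listed in the table, so anything else is reported as "unknown".

```python
# (status, task_state) -> stage, taken from the table above.
# A JSON null task_state is represented as None in Python.
RESIZE_STAGES = {
    ("ACTIVE", None): "resize not initiated",
    ("RESIZE", "resize_migrating"): "first copy under way",
    ("RESIZE", "resize_finish"): "second copy under way",
    ("VERIFY_RESIZE", None): "resized server booted",
}


def resize_stage(status, task_state):
    """Name the resize stage, or 'unknown' for unlisted combinations."""
    return RESIZE_STAGES.get((status, task_state), "unknown")


def about_to_go_offline(previous_task_state, task_state):
    """True only on the poll where task_state first becomes resize_finish."""
    return (task_state == "resize_finish"
            and previous_task_state != "resize_finish")


print(resize_stage("RESIZE", "resize_migrating"))
print(about_to_go_offline("resize_migrating", "resize_finish"))
```

Comparing the current “task_state” with the previous one means the warning fires once, at the switch to “resize_finish”, instead of on every poll during the second copy.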

OpenStack instructions are not yet fully exposed so you do not see specific notice of the virtual server shutdown command that precedes the second copy. However, when the “task_state” variable switches from “resize_migrating” to “resize_finish”, the system will take a few seconds to set up the two hypervisors ready for a second copy. Similarly, it will take a few seconds to issue a shutdown command to the virtual server. These actions will be queued, acted upon, their results passed back to the controller, and the next steps then requested. The precise timing of these steps depends on how busy the system's queues are, but as a rule of thumb you can expect at least 30 seconds to pass between seeing the system switch to “resize_finish” and the point at which the original virtual machine has been issued with the “shutdown” command, the second copy made, and the new virtual machine booted up and brought online.

Also, the higher the activity on the virtual server – that is, the larger the number of services, connections, and open files to close down – the longer it will take for the original server to shut down. These same factors influence how much time the new virtual server will require to boot and become operational again. These customer-dependent variables are, of course, not visible to Rackspace and so their effect on shutdown and restart times is not predictable.

However, if you have left the virtual server able to respond to ping requests, or can test its ability to accept incoming connections, you may be able to script tests that indicate for you when the server goes offline and the new, resized server comes online.
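
As a rough illustration of such a test, the Python sketch below checks whether a server accepts a TCP connection on a given port and describes the change between two successive checks. The function names are hypothetical, and which port to probe is an assumption you will need to make for your own server (for example, whichever port your SSH or web service listens on).

```python
import socket


def accepts_connections(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def describe_change(was_up, is_up):
    """Describe the transition between two successive checks, if any."""
    if was_up and not is_up:
        return "server went offline"
    if not was_up and is_up:
        return "server came online"
    return None


print(describe_change(True, False))
```

Calling `accepts_connections` every few seconds and passing the previous and current results to `describe_change` gives you a timestampable record of when the original server stopped answering and when the resized one started.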

I've included the “id” indicator in the table above because it enables you to double-check that the new size delivered was the one you ordered!

Track Resize-Reversion Progress:

Sometimes you will want to revert a resized virtual server back to its original size.

A resize-reversion is very simple. It progresses through these steps:

Step 1. The back-end system powers down the second – or copied – virtual server.

Step 2. When the second virtual server is powered down, the back-end system powers up the original virtual server.

The Control Panel reports the progress of a resize-revert very simply: it shows the status of the virtual server as “reverting”. When the original virtual server begins to power up, the Control Panel reports the reversion as complete.

Viewed through queries to the API, this reversion process presents only two conditions to watch for:

1. That the reversion is in progress
2. That the reversion is complete

These conditions show up in the same three status and state keys as you use for tracking resize progress.

Shown in table form for clarity, they are:

| Resize-Reversion Stage | status | task_state | vm_state | id |
|---|---|---|---|---|
| Resize-revert not initiated | VERIFY_RESIZE | null | resized | 3 |
| Reversion in progress | REVERT_RESIZE | resize_reverting | resized | 3 |
| Reversion complete | ACTIVE | null | active | 2 |
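
The same values can be checked in a script. The Python sketch below is only an illustration: the function names `reversion_stage` and `reverted_to` are hypothetical, and the status is normalised so that either the hyphenated or the underscored spelling is accepted.

```python
def reversion_stage(status, task_state, vm_state):
    """Classify a resize-reversion from the values in the table above."""
    # Accept "VERIFY-RESIZE" as well as "VERIFY_RESIZE".
    status = status.replace("-", "_")
    if status == "REVERT_RESIZE" or task_state == "resize_reverting":
        return "in progress"
    if status == "ACTIVE" and task_state is None and vm_state == "active":
        return "complete"
    if status == "VERIFY_RESIZE" and vm_state == "resized":
        return "not initiated"
    return "unknown"


def reverted_to(flavor_id, original_flavor_id):
    """True if the server is back on its original flavor id."""
    # The API returns the id as a string, so compare as strings.
    return str(flavor_id) == str(original_flavor_id)


print(reversion_stage("REVERT_RESIZE", "resize_reverting", "resized"))
print(reverted_to("2", 2))
```

Because the "complete" combination is the same one an ordinary running server reports, it is only meaningful once you have actually requested a reversion; checking the flavor id as well confirms that the server really is back at its original size.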

And that's it!

Unfortunately these techniques do not apply to FirstGen servers because the FirstGen API updates much less dynamically as resizes and resize-reversions progress.

However, for OpenCloud virtual servers, you can incorporate these techniques with shell or high-level language-based scripts to develop the manual and semi-automated progress checks shown here into more advanced automation scripts.
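
As a starting point for that kind of script, here is a minimal Python sketch that replaces the 'watch' approach with a polling loop which reports only when something changes. It is a sketch under stated assumptions rather than a finished tool: the function names `fetch_server` and `watch_states` are hypothetical, it uses only the standard library, and it assumes the same URL format and X-Auth-Token header as the curl commands earlier in this guide.

```python
import json
import time
import urllib.request


def fetch_server(url, token):
    """GET the server details and return the decoded JSON as a dict."""
    request = urllib.request.Request(url, headers={"X-Auth-Token": token})
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


def watch_states(fetch, interval=5, max_polls=None, sleep=time.sleep):
    """Call fetch() repeatedly; yield (status, task_state) on each change.

    max_polls=None polls forever; a number stops after that many polls.
    """
    previous = None
    polls = 0
    while max_polls is None or polls < max_polls:
        server = fetch()["server"]
        current = (server["status"], server["OS-EXT-STS:task_state"])
        if current != previous:
            yield current
            previous = current
        polls += 1
        # No need to wait after the final poll.
        if max_polls is None or polls < max_polls:
            sleep(interval)


# Example use (not run here because it needs a real token and server):
#   url = "https://lon.servers.api.rackspacecloud.com/v2/ACCOUNT_NUMBER/servers/UUID"
#   for status, task_state in watch_states(lambda: fetch_server(url, "TOKEN")):
#       print(time.strftime("%H:%M:%S"), status, task_state)
```

Passing the fetch step in as a function keeps the loop independent of the HTTP call, so you can try it out against canned responses before pointing it at a production server.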
