Part 2: How To Build Fault Tolerant Cross-Region AWS Virtual Private Cloud Communication
jerryhargrove
In my last post, I spent some time setting up and experimenting with cross-region VPN connectivity.
I wanted to create a secure communication channel between VPCs in different regions and ended up choosing the software and hardware to build it. In one region, I set up an OpenSwan VPN appliance on an EC2 instance and connected it to a VGW in another region. This connection allowed EC2 instances in both regions to communicate with each other using private IP addresses. Awesome! And it took less than 30 minutes to set up, start to finish, and looked something like this:
In my dev environment, I’ve had this setup running for quite a while with no issues. The software was easy to configure and operation is transparent to other components in the system.
But thinking longer term, if this were to go into a production environment, would I be satisfied with it? Probably not. I’m sure it would continue to operate fine, but as AWS CTO Werner Vogels says, “everything fails, all the time.”
This means at some point this will fail; it’s just a matter of time. I can leverage the built-in redundancy and availability provided by the VGW and dual VPN tunnels (I’m only using one now), but the single OpenSwan instance is still a big single point of failure. To achieve any form of high(er) availability, I’ll need to address that.
Availability
What does it mean to be “available?” Does it mean your application is fully functional, top to bottom, from web app to database? Or does it simply respond to a ping request?
What about “highly available?” It seems a subjective and relative term. What it means to be highly available to me may not quite suit your needs. I might be able to live with five minutes of downtime a day, whereas you may only be able to tolerate five minutes of downtime a year. Two nines (99%) is highly available for me; five nines (99.999%) may be how you define it.
Regardless of definition, in order to maintain any desired level of availability, I’ll need some way of determining current availability. In the case of my cross-region connection, this includes knowing if instances in both regions are able to communicate with each other.
How do I know if the VPN connection is working properly? How do I know that instances in both regions can actually communicate? When I set up the connection between regions, I was able to answer both of those questions to verify the system was working correctly and was available for use.
First, I used the AWS console to confirm that the VPN tunnel status was ‘UP’, indicating the OpenSwan instance was connected to the VGW and a VPN tunnel had been established.
Second, I used a simple ICMP ping to verify that EC2 instances in both regions could see each other and that route tables and security groups had been configured correctly.
Turns out I can use the same two methods to monitor the status of the VPN connection after setup, albeit with a bit of automation and scripting to help out.
Monitoring
The first metric I want to collect is the status of the VPN tunnel itself, indicating connectivity between the OpenSwan instance and the VGW.
Most AWS services come with a useful set of metrics for monitoring the health and performance of the service or resource. Unfortunately, VPN tunnels aren’t among those resources and don’t have specific metrics associated with them.
During setup, I had to use the console to check the status of the tunnel; fortunately, the same can be done via the AWS CLI or any one of the AWS SDKs. Monitoring the status of the tunnel then becomes a matter of periodically polling for the status, then publishing it as a custom CloudWatch metric.
This can easily be done as a cron job running on an EC2 instance or as a periodic Lambda function. Both of these require an IAM role containing the correct permissions. Examples for each, as well as the IAM permissions, are provided below.
Scripted Cron Job (Ruby)
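A minimal sketch of what this script might look like follows; it assumes the aws-sdk gem (v2), a VPN connection ID supplied via an environment variable, and a Custom/VPN metric namespace, all of which are placeholder choices of mine rather than details from the original setup.

```ruby
#!/usr/bin/env ruby
# Sketch: poll the VPN tunnel status and publish it as a custom CloudWatch metric.
# The connection ID, region, and namespace below are illustrative placeholders.
require 'aws-sdk'

REGION    = ENV.fetch('AWS_REGION', 'us-west-2')
VPN_CONN  = ENV.fetch('VPN_CONNECTION_ID')   # e.g. 'vpn-12345678'
NAMESPACE = 'Custom/VPN'

ec2 = Aws::EC2::Client.new(region: REGION)
cw  = Aws::CloudWatch::Client.new(region: REGION)

begin
  resp = ec2.describe_vpn_connections(vpn_connection_ids: [VPN_CONN])
  # Count how many of the (up to two) tunnels report a status of 'UP'.
  tunnels_up = resp.vpn_connections.first.vgw_telemetry.count { |t| t.status == 'UP' }

  cw.put_metric_data(
    namespace: NAMESPACE,
    metric_data: [
      { metric_name: 'TunnelsUp', value: tunnels_up, unit: 'Count',
        dimensions: [{ name: 'VpnConnectionId', value: VPN_CONN }] },
      # Supporting metric: the collection itself succeeded.
      { metric_name: 'TunnelCheckSuccess', value: 1, unit: 'Count' }
    ]
  )
rescue StandardError => e
  # Supporting metric: the collection failed (API error, throttling, etc.).
  cw.put_metric_data(
    namespace: NAMESPACE,
    metric_data: [{ metric_name: 'TunnelCheckFailure', value: 1, unit: 'Count' }]
  )
  warn "VPN status check failed: #{e.message}"
end
```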
Lambda Function (Python)
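A comparable sketch of the same check as a Lambda handler using boto3; again, the connection ID, namespace, and metric names are placeholders of mine.

```python
# Sketch: the tunnel-status check as a Lambda function.
import boto3

VPN_CONNECTION_ID = 'vpn-12345678'   # placeholder; use your own VPN connection ID
NAMESPACE = 'Custom/VPN'

ec2 = boto3.client('ec2')
cloudwatch = boto3.client('cloudwatch')


def lambda_handler(event, context):
    metric_data = []
    try:
        resp = ec2.describe_vpn_connections(VpnConnectionIds=[VPN_CONNECTION_ID])
        telemetry = resp['VpnConnections'][0]['VgwTelemetry']
        # Count how many of the (up to two) tunnels report a status of 'UP'.
        tunnels_up = sum(1 for t in telemetry if t['Status'] == 'UP')
        metric_data.append({
            'MetricName': 'TunnelsUp',
            'Dimensions': [{'Name': 'VpnConnectionId', 'Value': VPN_CONNECTION_ID}],
            'Value': tunnels_up,
            'Unit': 'Count',
        })
        # Supporting metric: the collection itself succeeded.
        metric_data.append({'MetricName': 'TunnelCheckSuccess', 'Value': 1, 'Unit': 'Count'})
    except Exception:
        # Supporting metric: the collection failed.
        metric_data.append({'MetricName': 'TunnelCheckFailure', 'Value': 1, 'Unit': 'Count'})

    cloudwatch.put_metric_data(Namespace=NAMESPACE, MetricData=metric_data)
    return metric_data
```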
You’ll notice that I’m not just collecting a single metric (the one I’m actually interested in), but am collecting supporting metrics as well. Any time I publish custom metrics, I also publish a corresponding set of success/fail metrics, indicating whether or not the actual collection of the metric (in this case, the number of tunnels in UP status) was successful.
At 2 am, after my pager goes off and I’m looking for a root cause, these additional metrics can go a long way in determining where to look for the smoking gun. I always measure and report latency. If overall latencies in my system are increasing, I like being able to look at individual services and determine where the increased time is coming from.
IAM Role Permissions
These permissions are in addition to those required for a standard Lambda function.
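As a sketch, assuming the script only needs to describe VPN connections and publish custom metrics, the additional policy might look like this:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "ec2:DescribeVpnConnections",
        "cloudwatch:PutMetricData"
      ],
      "Resource": "*"
    }
  ]
}
```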
Lambda or script, which do you choose? I love the idea of a serverless solution, so Lambda is very appealing. However, Lambda functions can only be scheduled at five-minute intervals or greater.
For finer-grained metrics and monitoring, I used a self-healing t2.nano instance that checks and reports status every minute. Note that the EC2 instance running the script must have access to the internet, either directly or through a NAT instance, in order to make AWS API calls.
The second metric I want to collect and monitor is whether EC2 instances in both regions are able to communicate with each other. At setup, I used a simple ICMP ping to verify connectivity, performed manually by SSH-ing into one of the instances and pinging the other. I can do the same here, running as a cron job on the same system that runs the VPN tunnel job above. Very simply, it can be done as a shell script using the AWS CLI.
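A sketch of such a script, with the remote private IP address, region, and metric names as placeholders of mine, might look like this:

```bash
#!/bin/bash
# Sketch: ping the private IP of an instance in the peer region and publish
# connectivity (and latency, when reachable) as custom CloudWatch metrics.
REMOTE_IP="10.1.0.10"     # placeholder: private IP of the instance in the other region
NAMESPACE="Custom/VPN"
REGION="us-west-2"

# Send 3 pings with a 2-second timeout per reply; capture the output for parsing.
PING_OUTPUT=$(ping -c 3 -W 2 "$REMOTE_IP" 2>/dev/null)
if [ $? -eq 0 ]; then
  CONNECTED=1
  # Average round-trip time in ms, taken from the summary line, e.g.
  # "rtt min/avg/max/mdev = 70.1/71.2/72.3/0.5 ms"
  LATENCY=$(echo "$PING_OUTPUT" | awk -F'/' '/rtt/ {print $5}')
  aws cloudwatch put-metric-data --region "$REGION" --namespace "$NAMESPACE" \
    --metric-name TunnelLatency --unit Milliseconds --value "$LATENCY"
else
  CONNECTED=0
fi

aws cloudwatch put-metric-data --region "$REGION" --namespace "$NAMESPACE" \
  --metric-name TunnelConnectivity --unit Count --value "$CONNECTED"
```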
I set up a cron job to run this script every minute. The IAM role created previously contains the necessary permissions for the script to run and make the put-metric-data CLI call.
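For reference, the crontab entry might look something like this (the script path is a placeholder):

```bash
# Run the connectivity check every minute
* * * * * /home/ec2-user/vpn-ping-check.sh
```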
Note: Watch for a future post on creating a simple, but effective and economical Lambda Canary function to accomplish the same. With the recently announced VPC support for Lambda, it’s surprisingly easy to implement and monitor.
Wrap up
That’s it. Now that both metrics are in place and being published, it’s a simple matter to configure CloudWatch alarms for each metric and publish a notification to an SNS topic when a metric is in breach. For tunnel status, I created an alarm that triggers when the number of tunnels with UP status is less than one. For tunnel connectivity, I created alarms that trigger when the tunnel connectivity test fails and when tunnel latency exceeds 200 ms. Any time one of these alarms is triggered, an SMS message is sent to my phone and I can investigate and resolve any issues.
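As an illustration of the tunnel-status alarm, a CLI sketch like the following would trigger when fewer than one tunnel reports UP for five consecutive one-minute periods; the alarm name, namespace, metric name, and SNS topic ARN are placeholders of mine.

```bash
aws cloudwatch put-metric-alarm \
  --alarm-name vpn-tunnels-down \
  --namespace Custom/VPN \
  --metric-name TunnelsUp \
  --statistic Minimum \
  --period 60 \
  --evaluation-periods 5 \
  --threshold 1 \
  --comparison-operator LessThanThreshold \
  --alarm-actions arn:aws:sns:us-west-2:123456789012:vpn-alerts
```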
This is a good start towards high availability. At least now I know when something isn't working and can investigate. But there are better approaches. I consider this the first step on the path to high availability, determining when you’re not available and letting someone know. And for some, this may be sufficient. We all have our own tolerances and availability goals.
Next time, I’ll raise the bar and increase the availability of this system to keep my pager quiet. Stay tuned!
Visit http://www.rackspace.com/aws for more information about Fanatical Support for AWS and how it can help your business.