We're incredibly excited now that Cloud Servers has launched. It represents a major piece of our cloud suite and lays a solid foundation for lots of goodies and powerful compute cloud capabilities we have in the works. Now that Cloud Servers is here, I'd like to examine a fundamental design difference between Cloud Servers and EC2 that has dramatic downstream ramifications: persistence.
A persistent system is one that does not go away when failures occur. That is, system information is generally recoverable once the system is restored. Cloud servers are persistent. Contrast that with an ephemeral or transient system which lives only as long as the system is running. When failures occur in an ephemeral system, system information is lost. EC2 instances are ephemeral. You can think of this like non-volatile and volatile memory. Non-volatile memory is persistent and doesn’t lose data if powered off. Volatile memory is ephemeral and can only retain information when the power is on. Cloud Servers is like the non-volatile SD memory in your digital camera while EC2 is like the volatile RAM in your laptop.
That's a significant functional difference that has huge implications for how the system is used, how it's supported, and how it's designed. Let's take a look at each.
When you launch an EC2 instance, the virtual machine (VM) and included storage are ephemeral. That means if your instance fails for any reason, you’ve lost your entire VM as well as all data in that VM. If you use Elastic Block Store (EBS), that helps as it provides persistent storage, but the instance itself is still ephemeral. That means although EBS data can survive an instance failure, the server, its configuration, log data for troubleshooting, completed work that has not been offloaded, etc. are all lost. For transitory/batch workloads (e.g. video transcoding), ephemeral system failures generally aren’t too significant because even though data may have been lost, those workloads can just be restarted. But for persistent workloads (e.g. a web site) that you want to be available, you need to design and build your application on top of EC2 to expect and respond to underlying system failures. You need to do things like monitor for instance failure, dynamically rebuild and reconfigure new instances on the fly, ensure data you don’t want to lose is always replicated or on EBS, be able to roll back and recover from lost work when an instance fails, etc. While these are considered best practices under internet-scale design principles, they are more complex than traditional design principles and are not how the majority of web applications are built today.
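To give a feel for the extra work this implies, here is a minimal sketch of the "monitor for failure, then rebuild and reconfigure" loop an application on ephemeral instances would need. It is an in-memory simulation only: `Instance`, `Fleet`, `configure`, and `reconcile` are hypothetical names invented for illustration and do not correspond to any real EC2 API.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import Callable, Dict, List


@dataclass
class Instance:
    """Hypothetical stand-in for an ephemeral VM (not a real EC2 object)."""
    instance_id: str
    healthy: bool = True
    configured: bool = False


@dataclass
class Fleet:
    """Hypothetical in-memory fleet that should always hold `desired` instances."""
    desired: int
    instances: Dict[str, Instance] = field(default_factory=dict)
    _ids: count = field(default_factory=lambda: count(1))

    def launch(self) -> Instance:
        # A brand-new instance starts blank: nothing from a failed one carries over.
        inst = Instance(f"i-{next(self._ids):04d}")
        self.instances[inst.instance_id] = inst
        return inst


def configure(inst: Instance) -> None:
    """Placeholder for reinstalling software and restoring configuration."""
    inst.configured = True


def reconcile(fleet: Fleet, configure_fn: Callable[[Instance], None]) -> List[str]:
    """Run one monitoring pass and return the IDs of any replacements launched."""
    # Monitor: a failed ephemeral instance is simply gone, so forget it.
    failed = [iid for iid, inst in fleet.instances.items() if not inst.healthy]
    for iid in failed:
        del fleet.instances[iid]

    # Rebuild and reconfigure until the fleet is back at its desired size.
    launched = []
    while len(fleet.instances) < fleet.desired:
        inst = fleet.launch()
        configure_fn(inst)
        launched.append(inst.instance_id)
    return launched
```

In a real deployment something like `reconcile` would run continuously, and the data-replication and rollback steps mentioned above would sit on top of it. None of that machinery is needed when the server itself is persistent.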
By contrast, when you launch a cloud server, the virtual machine and storage are persistent. That means if a cloud server fails, the problem is fixed and your cloud server is brought back online. Your cloud server and associated data don't go away. As such, you don't have to think about your cloud server any differently than you do a traditional server in how you use it and build apps on top. While Cloud Servers can support internet-scale applications, we don't impose those design principles and associated complexity by making virtual machines ephemeral. Of course, you are free to employ those principles if you'd like, but you don't have to. That means you have flexibility with Cloud Servers. You can start using the standard design constructs you're familiar with, and only after you become the next Twitter or Facebook (congrats!) do you need to embrace more complex principles.
Ephemeral EC2 instances are easier to support because essentially, there is nothing to support. When ephemeral instances fail, you are out of luck. You, as a customer, should expect that kind of failure and the onus for dealing with and responding to failures is on you. If you had important data on your instance that wasn't offloaded somewhere, it's gone. Amazon has no obligation to get your instance back up or recover your data.
Persistent cloud servers are much more difficult to support. That's because when a host fails, the onus is on Rackspace to get your cloud server back up for you. That's more work for us, but we'd rather assume the burden so you can sleep better at night. That's what Fanatical Support is all about and is but one example of how it's being translated to the cloud.
The persistent vs. ephemeral natures of Cloud Servers and EC2 also have significant ramifications for SLAs, but I’ll save that for a future post where we can focus on it specifically.
Amazon purportedly built EC2 for internal use and later opted to expose it as an external service. It is therefore designed to support systems on the scale of Amazon, eBay, Hotmail, etc., embracing large scale system design best practices. One such tenet prescribes a commodity hardware approach where applications deal with underlying infrastructure failures rather than trying to make the infrastructure itself more resilient (I call this application-centric availability). The ephemeral nature of EC2 is likely driven by this commodity approach as a commodity platform itself can make little guarantee of availability. That is, although overall system availability can be guaranteed at a macro level, individual instances at a micro level cannot. Practically, that means, for example, that hosts in EC2 do not have underlying RAID protection on local disks, and as a result, a host hard drive failure could take out one or more instances.
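The macro-versus-micro point can be made concrete with a little arithmetic. The sketch below assumes instance failures are independent and that every instance is up with the same probability `p`; both are simplifying assumptions, and any value plugged in (such as 0.99) is purely illustrative, not a measured figure for EC2 or Cloud Servers.

```python
def availability_of_at_least_one(p: float, n: int) -> float:
    """Probability that at least one of n instances is up.

    Assumes independent failures, with each instance up with probability p.
    """
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    # All n instances are down together with probability (1 - p) ** n.
    return 1.0 - (1.0 - p) ** n
```

Under these assumptions, three instances that are each up 99% of the time give roughly 99.9999% availability for "at least one is up", even though no single instance is any more reliable than before. That gain only materializes if the application is written to fail over between instances, which is the crux of application-centric availability.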
We designed Cloud Servers to support not just large-scale apps but small-scale ones as well. Rackspace agrees with and has embraced large-scale design principles for systems where the infrastructure is abstracted away and we provide a higher level service. But with Cloud Servers, the infrastructure IS the service and we didn't feel like we could take a full commodity approach. Instead, we followed a "mostly commodity" approach, being judicious about cost so we could provide a well-valued solution, but investing in infrastructure availability where it made sense. To continue with the example above, RAID is one such area where we made that investment. Hard drives are THE most likely component to fail, and offering a quality, persistent compute service meant designing in hard drive redundancy (RAID10 specifically) to protect against that type of failure.
We believe persistence is the right approach (although note that ephemeral VMs could fairly easily be offered on a persistent system while the converse is not true - implementing persistence as an option in a system designed to be ephemeral is non-trivial). It's the way most people in the world think about servers and we don't think deriving the benefits of the cloud should mean having to embrace more complex design principles that may not apply. Persistence is harder to achieve but well worth it for the customer. It offers simplicity and peace of mind for some but can also be a viable platform for those wanting to build more complex, dynamic applications. It's the best of both worlds.
Chief Architect, Rackspace Cloud