Filed in Cloud Industry Insights by Jonathan Ellis | September 21, 2011 9:10 am
Jonathan Ellis is the CTO and co-founder of DataStax[1]. DataStax is the developer of DataStax Enterprise, a distributed, scalable, and highly available database platform.
Apache Cassandra[3] has come a long way since I first started writing about it at Rackspace[4]. Since then, I started DataStax[5] to commercialize Cassandra, we’ve had six major releases[6], run not one[7] but two[8] summits with hundreds of attendees, and now we’re about to release Cassandra 1.0 on October 8.
Cassandra was born as a hybrid of the best of Amazon’s Dynamo[9] and Google’s Bigtable, but has now moved beyond its parents in several ways:
The cloud is about providing infrastructure as a commodity[19]: scaling up and down at will, paying for what you actually need instead of having to build out capacity for your largest spikes, and offloading datacenter operations to specialists.
However, the cloud has had trouble supporting a full traditional application stack. It’s easy to spin up a thousand web servers, for instance, since each can work independently. But most applications need to maintain some kind of durable state, and the relational databases that (until recently) have been our go-to choice for that don’t scale out across independent machines that way[20].
One solution is to use a hybrid cloud: companies like Rackspace that offer both cloud and traditional hosting give you the flexibility of cloud for stateless computation, and specialized, more powerful hardware for database servers and similar core tasks. This is the approach GitHub took with their move to Rackspace[21].
The other solution is to start using a database that scales across the kind of commodity machines that you find in the cloud. This is the route Netflix took when they moved off their own datacenters and Oracle to EC2[22] and Cassandra[23].
As a side benefit, when you take this approach you can leverage cloud APIs to reduce your ops complexity even further. For example, Cassandra provides pluggable seed provider[24] and snitch[25] APIs that can be pointed at cloud services to tell the Cassandra cluster “who are my peers in the cluster” and “where are they located,” respectively, so you don’t have to maintain this information by hand in configuration files.
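To make the idea concrete, here is a minimal Python sketch of what "pluggable" means here. It is not Cassandra's actual API (Cassandra is written in Java and its real interfaces differ); the class names, the `list_instances` callable, and the `fake_cloud_api` inventory are all hypothetical stand-ins for a cloud provider's instance-listing service.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Location:
    """Where a node lives; the field names are illustrative assumptions."""
    datacenter: str
    rack: str


class StaticSeedProvider:
    """The manual approach: seeds are listed by hand, as in a config file."""

    def __init__(self, seeds: List[str]):
        self._seeds = list(seeds)

    def get_seeds(self) -> List[str]:
        return list(self._seeds)


class CloudSeedProvider:
    """Answers "who are my peers?" by asking a (hypothetical) cloud API."""

    def __init__(self, list_instances: Callable[[], List[Dict[str, str]]]):
        self._list_instances = list_instances

    def get_seeds(self) -> List[str]:
        instances = self._list_instances()
        return sorted(i["address"] for i in instances if i["role"] == "seed")


class CloudSnitch:
    """Answers "where are they located?" from the same (hypothetical) API."""

    def __init__(self, list_instances: Callable[[], List[Dict[str, str]]]):
        self._list_instances = list_instances

    def locate(self, address: str) -> Location:
        for inst in self._list_instances():
            if inst["address"] == address:
                # Assumption: region maps to datacenter, zone maps to rack.
                return Location(inst["region"], inst["zone"])
        raise KeyError(address)


def fake_cloud_api() -> List[Dict[str, str]]:
    """Stand-in for a real cloud inventory call; the data is made up."""
    return [
        {"address": "10.0.0.1", "role": "seed", "region": "us-east", "zone": "1a"},
        {"address": "10.0.0.2", "role": "node", "region": "us-east", "zone": "1b"},
        {"address": "10.1.0.1", "role": "seed", "region": "us-west", "zone": "2a"},
    ]
```

Because both seed providers expose the same `get_seeds()` method, the rest of the cluster code does not care whether the answer came from a hand-edited list or from the cloud inventory, which is the point of making these components pluggable.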
In the early days of relational databases, query volumes and data sets were small enough that you could handle your realtime application needs and your analytics with the same database. But these two workloads are different enough that optimizing a single system for both is impossible, so separate systems evolved: OLTP[26] for the former and OLAP[27] for the latter, although the terminology around analytics is less well defined[28] and also includes data warehousing, business intelligence, data mining, and others. From this we got systems like MySQL that focus on realtime workloads, others like Teradata that focus on data warehousing, and complex ETL[29] processes to move data between the two.
Today, you still see this split, with scalable NoSQL[30] databases like Cassandra for realtime workloads, Hadoop for big data analytics, and ETL between the two. Maintaining and integrating two different systems adds a lot of operational complexity.
To address this complexity, DataStax is launching DataStax Enterprise[31], which marries the scalability and reliability of Cassandra with analytics and requires no ETL. It uses Cassandra’s advanced replication to keep the two workloads separate but seamlessly connected. With this approach, analytical work doesn’t slow realtime processing, but the analytical side and the realtime side can each see changes made by the other as they happen.
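The separation-by-replication idea can be illustrated with a toy Python model. This is only a sketch under simplifying assumptions, not how DataStax Enterprise or Cassandra is implemented: `ToyCluster` and its replica group names are hypothetical, and replication here is an immediate in-memory copy, whereas a real cluster replicates over the network with configurable consistency.

```python
from typing import Dict, Optional, Tuple


class ToyCluster:
    """Toy model: every write reaches both replica groups, and each read
    is served only by the group that matches its workload."""

    def __init__(self, groups: Tuple[str, ...] = ("realtime", "analytics")):
        # One independent copy of the data per replica group.
        self._replicas: Dict[str, Dict[str, str]] = {g: {} for g in groups}
        # Count reads per group to show that the workloads stay separate.
        self.reads_served: Dict[str, int] = {g: 0 for g in groups}

    def write(self, key: str, value: str) -> None:
        # Replication keeps the groups connected: no ETL step is needed
        # for the analytics side to see a realtime write, or vice versa.
        for store in self._replicas.values():
            store[key] = value

    def read(self, key: str, workload: str) -> Optional[str]:
        # Reads never touch the other group, so a heavy analytical scan
        # does not compete with realtime requests for the same replicas.
        self.reads_served[workload] += 1
        return self._replicas[workload].get(key)
```

In this model, a value written by the realtime application is readable from the analytics group right away, while the realtime group's read counter stays untouched by analytical queries.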
This is a guest post; the opinions of the author may not reflect those of Rackspace.
Source URL: http://www.rackspace.com/blog/cassandra-1-0-the-cloud-and-the-future-of-big-data/
Copyright ©2013 The Official Rackspace Blog unless otherwise noted.