Cassandra 1.0, the cloud, and the future of big data

Filed in Cloud Industry Insights by Jonathan Ellis | September 21, 2011 9:10 am

Jonathan Ellis is the CTO and co-founder of DataStax[1]. DataStax is the developer of DataStax Enterprise, a distributed, scalable, and highly available database platform.

[2]

Apache Cassandra[3] has come a long way since I first started writing about it at Rackspace[4]. Since then, I started DataStax[5] to commercialize Cassandra, and we’ve had six major releases[6], run not one[7] but two[8] summits with hundreds of attendees, and are now about to release Cassandra 1.0 on October 8.

What’s new in Cassandra 1.0?

Cassandra was born as a hybrid of the best of Amazon’s Dynamo[9] and Google’s Bigtable, but has now moved beyond its parents in several ways:

The cloud and Cassandra

The cloud is about providing infrastructure as a commodity[19]: scaling up and down at will, paying for what you actually need instead of having to build out capacity for your largest spikes, and offloading datacenter operations to specialists.

However, the cloud has had trouble supporting a full traditional application stack: it’s easy to spin up a thousand web servers, for instance, since each can work independently. But most applications require maintaining some kind of durable state, and the relational databases that (until recently) have been our go-to choice for that don’t work that way[20].

One solution is to use a hybrid cloud: companies like Rackspace that offer both cloud and traditional hosting give you the flexibility of cloud for stateless computation, and specialized, more powerful hardware for database servers and similar core tasks. This is the approach GitHub took with their move to Rackspace[21].

The other solution is to start using a database that scales across the kind of commodity machines that you find in the cloud. This is the route Netflix took when they moved off of their own datacenters and Oracle to EC2[22] and Cassandra[23].

As a side benefit, when you take this approach you can leverage cloud APIs to reduce your ops complexity even further. For example, Cassandra provides pluggable seed provider[24] and snitch[25] APIs that can be pointed at cloud services to tell the Cassandra cluster “who are my peers in the cluster” and “where are they located,” respectively, so you don’t have to maintain that information by hand in configuration files.
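To make the two questions concrete, here is a minimal conceptual sketch in Python. It is not Cassandra's actual seed provider or snitch API (those are Java interfaces configured in the node's configuration); the class names, the `fake_cloud_api` function, and the instance fields are all hypothetical and stand in for whatever inventory call your cloud provider offers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical shape of one record returned by a cloud inventory call.
Instance = Dict[str, str]


@dataclass(frozen=True)
class Location:
    datacenter: str
    rack: str


class CloudSeedProvider:
    """Answers "who are my peers in the cluster" by asking a cloud API."""

    def __init__(self, list_instances: Callable[[], List[Instance]], max_seeds: int = 2):
        self._list_instances = list_instances
        self._max_seeds = max_seeds

    def get_seeds(self) -> List[str]:
        # Sort so that every node derives the same seed list from the same inventory.
        addresses = sorted(inst["address"] for inst in self._list_instances())
        return addresses[: self._max_seeds]


class CloudSnitch:
    """Answers "where are they located" from cloud metadata."""

    def __init__(self, list_instances: Callable[[], List[Instance]]):
        self._list_instances = list_instances

    def locate(self, address: str) -> Location:
        # Map the cloud's region/zone onto a datacenter/rack pair.
        for inst in self._list_instances():
            if inst["address"] == address:
                return Location(datacenter=inst["region"], rack=inst["zone"])
        raise KeyError(address)


def fake_cloud_api() -> List[Instance]:
    # Stand-in for a real cloud inventory call (an assumption for this sketch).
    return [
        {"address": "10.0.0.3", "region": "us-east", "zone": "1b"},
        {"address": "10.0.0.1", "region": "us-east", "zone": "1a"},
        {"address": "10.0.0.2", "region": "us-west", "zone": "2a"},
    ]


seeds = CloudSeedProvider(fake_cloud_api).get_seeds()
where = CloudSnitch(fake_cloud_api).locate("10.0.0.3")
```

The point of the design is that both answers come from the cloud's own inventory, so adding or replacing a machine does not require editing a file on every node.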

The future of Big Data

In the early days of relational databases, query volumes and data sets were small enough that you could handle your realtime application needs and your analytics with the same database. But these two workloads are different enough that optimizing a single system for both is impossible, so separate systems evolved: OLTP[26] for the former and OLAP[27] for the latter. Terminology around analytics is less well-defined[28] and includes data warehousing, business intelligence, data mining, and others. From this we got systems like MySQL that focused on realtime workloads, others like Teradata that focused on data warehousing, and complex ETL[29] processes to move data between the two.

Today, you still see this split with scalable, NoSQL[30] databases like Cassandra for realtime workloads, Hadoop for big data analytics, and ETL between the two. Maintaining and integrating two different systems causes a lot of operational complexity.

To address this complexity, DataStax is launching DataStax Enterprise[31], which marries the scalability and reliability of Cassandra with analytics, with no ETL required. It uses Cassandra’s advanced replication to keep the two workloads separate but seamlessly connected. With this approach, analytical work doesn’t slow realtime processing, but both the analytical side and realtime side can see changes made by the other as they happen.
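The idea of separating workloads through replication can be illustrated with a toy model. The sketch below is an assumption-laden simplification, not how DataStax Enterprise or Cassandra is implemented: `TwoWorkloadCluster`, its node names, and its hash-based placement are hypothetical. It shows only the principle that every write lands on a replica in each workload group, while each read is served by the group that issued it.

```python
import zlib
from collections import Counter
from typing import Dict, List, Optional


class TwoWorkloadCluster:
    """Toy model: one replica per workload group for every key."""

    def __init__(self, groups: Dict[str, List[str]]):
        # groups maps a workload name to the nodes dedicated to it.
        self.groups = groups
        self.data: Dict[str, Dict[str, str]] = {
            node: {} for nodes in groups.values() for node in nodes
        }
        self.reads: Counter = Counter()  # reads served, per node

    def _replica(self, group: str, key: str) -> str:
        # Deterministic placement of a key on one node within a group.
        nodes = self.groups[group]
        return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]

    def write(self, key: str, value: str) -> None:
        # Replicate the write to one node in every group, so no ETL step
        # is needed to make it visible to the other workload.
        for group in self.groups:
            self.data[self._replica(group, key)][key] = value

    def read(self, workload: str, key: str) -> Optional[str]:
        # Serve the read only from the caller's own group.
        node = self._replica(workload, key)
        self.reads[node] += 1
        return self.data[node].get(key)


cluster = TwoWorkloadCluster({"realtime": ["rt1", "rt2"], "analytics": ["an1"]})
cluster.write("user:42", "clicked")

# A heavy analytical scan touches only the analytics node.
scan = [cluster.read("analytics", "user:42") for _ in range(1000)]
realtime_reads = sum(cluster.reads[n] for n in cluster.groups["realtime"])
```

In this model the analytics group sees the realtime write immediately, and a thousand analytical reads add no load to the realtime nodes.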

This is a guest post; the opinions of the author may not reflect those of Rackspace.

Endnotes:
  1. DataStax: http://www.datastax.com/
  2. [Image]: http://cassandra.apache.org/
  3. Apache Cassandra: http://cassandra.apache.org/
  4. writing about it at Rackspace: http://www.rackspace.com/cloud/blog/2009/09/23/the-cassandra-project/
  5. DataStax: http://datastax.com/
  6. six major releases: https://svn.apache.org/repos/asf/cassandra/branches/cassandra-1.0/NEWS.txt
  7. one: http://www.datastax.com/dev/blog/slides-and-videos-cassandra-summit-2010
  8. two: http://www.datastax.com/events/cassandrasf2011/presentations
  9. Dynamo: http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
  10. creating and querying indexes on columns: http://www.datastax.com/dev/blog/whats-new-cassandra-07-secondary-indexes
  11. Cassandra’s wide row support:
  12. Cassandra Query Language: http://www.datastax.com/dev/blog/what%E2%80%99s-new-in-cassandra-0-8-part-1-cql-the-cassandra-query-language
  13. NoSQL systems: http://www.rackspace.com/cloud/blog/2009/11/09/nosql-ecosystem/
  14. domain-specific language: http://en.wikipedia.org/wiki/Domain-specific_language
  15. consistently: http://www.edwardcapriolo.com/roller/edwardcapriolo/entry/ycsb_cassandra_0_7_6
  16. show: http://blog.cubrid.org/dev-platform/nosql-benchmarking
  17. DataStax developer blog soon: http://www.datastax.com/dev/blog
  18. compression: http://www.datastax.com/dev/blog/whats-new-in-cassandra-1-0-compression
  19. providing infrastructure as a commodity: http://www.rackspace.com/knowledge_center/whitepaper/moving-your-infrastructure-to-the-cloud-how-to-maximize-benefits-and-avoid-pitfalls
  20. don’t work that way: http://www.slideshare.net/jbellis/what-every-developer-should-know-about-database-scalability
  21. move to Rackspace: https://github.com/blog/493-github-is-moving-to-rackspace
  22. EC2: http://www.slideshare.net/adrianco/netflix-on-cloud-combined-slides-for-dev-and-ops
  23. Cassandra: http://www.slideshare.net/adrianco/migrating-netflix-from-oracle-to-global-cassandra
  24. seed provider: http://www.datastax.com/docs/0.8/configuration/node_configuration#seed-provider
  25. snitch: http://www.datastax.com/docs/0.8/configuration/node_configuration#endpoint-snitch
  26. OLTP: http://en.wikipedia.org/wiki/Online_transaction_processing
  27. OLAP: http://en.wikipedia.org/wiki/Online_analytical_processing
  28. less well-defined: http://www.dbms2.com/2011/01/03/the-six-useful-things-you-can-do-with-analytic-technology/
  29. ETL: http://en.wikipedia.org/wiki/Extract,_transform,_load
  30. NoSQL: http://www.rackspace.com/cloud/blog/2009/11/09/nosql-ecosystem/
  31. DataStax Enterprise: http://www.datastax.com//2011/09/datastax-launches-datastax-enterprise

Source URL: http://www.rackspace.com/blog/cassandra-1-0-the-cloud-and-the-future-of-big-data/