You may have heard about the Cassandra distributed database in recent articles or at conferences. I’d like to explain what advantages Cassandra offers over traditional relational databases like MySQL or Oracle and why Rackspace has committed resources to the Cassandra project.
The Cassandra project was started by Facebook in 2007 to scale its internal applications, particularly Inbox Search. Earlier this year, Facebook released it to the Apache Incubator, where people from the wider community could get involved and start contributing. This has allowed the project to move in a more general direction that serves the public rather than only Facebook’s needs.
In March, I became the first outside committer to this Apache Incubator project. Eric Evans from Rackspace and Jun Rao from IBM Research soon followed, and we recently added Chris Goffinet from Digg. The community has grown from 5 people in the IRC channel in December to over 60.
Distributed vs. Relational Databases
Traditional relational databases are 30 years old, well understood, and have a huge ecosystem of tools around them. For that reason, they are a compelling option when building your application. Postgres, MySQL, and Oracle are all relational databases that model a schema as entities and the relations between those entities. That’s a good, powerful programming model with interesting theoretical properties. But companies with large amounts of data have already gone past what you can reasonably fit on a single machine, even on high-end hardware, and it’s provably impossible to keep every guarantee of the traditional relational model, in particular the ACID properties, while also staying available as you scale across multiple machines that can lose contact with one another (this result is known as the CAP theorem). Even if you’re willing to give up availability, scaling reads (via caching and replication) is difficult with relational databases, and scaling writes by partitioning is very expensive, very painful from an application programming and operations standpoint, or both.
Cassandra takes this approach: since you're going to have to give up some parts of the relational model to scale anyway, let's start over and rethink things. Let's add things like transparent replication and failover, built-in partitioning and load balancing, multiple data center support, and the ability to add capacity without ever disturbing applications running against the database.
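To make the partitioning idea concrete, here is a minimal sketch in Python. It is not Cassandra's code. It assumes one well-known placement technique, consistent hashing, and every name in it (Ring, token, the node-a through node-e labels, the user:N keys) is hypothetical. It compares hand-rolled "hash modulo number of machines" sharding with a hash ring when a fifth node is added, and shows how walking the ring yields replica locations.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a ring position (MD5 is used only as a stable hash)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """A toy consistent-hash ring with virtual nodes and simple replication."""

    def __init__(self, nodes, vnodes=64):
        self.vnodes = vnodes
        self._tokens = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node):
        # Each node claims several positions so load spreads more evenly.
        for i in range(self.vnodes):
            t = token(f"{node}#{i}")
            self._owner[t] = node
            bisect.insort(self._tokens, t)

    def replicas(self, key, n=1):
        """Walk clockwise from the key's position and collect n distinct nodes."""
        start = bisect.bisect_right(self._tokens, token(key))
        found = []
        for step in range(len(self._tokens)):
            position = self._tokens[(start + step) % len(self._tokens)]
            node = self._owner[position]
            if node not in found:
                found.append(node)
                if len(found) == n:
                    break
        return found


def moved_fraction(before, after, keys):
    """Fraction of keys whose primary location differs between two placements."""
    changed = sum(1 for k in keys if before(k) != after(k))
    return changed / len(keys)


def modulo(node_count):
    """Hand-rolled sharding: shard index = hash(key) mod number of nodes."""
    return lambda key: token(key) % node_count


keys = [f"user:{i}" for i in range(10_000)]

# The same cluster before and after adding a fifth node.
ring4 = Ring(["node-a", "node-b", "node-c", "node-d"])
ring5 = Ring(["node-a", "node-b", "node-c", "node-d", "node-e"])

modulo_moved = moved_fraction(modulo(4), modulo(5), keys)
ring_moved = moved_fraction(
    lambda k: ring4.replicas(k)[0],
    lambda k: ring5.replicas(k)[0],
    keys,
)

print(f"modulo sharding, 4 -> 5 nodes: {modulo_moved:.0%} of keys move")
print(f"hash ring,       4 -> 5 nodes: {ring_moved:.0%} of keys move")
print("replicas for user:42:", ring5.replicas("user:42", n=3))
```

With modulo sharding, most keys change machines when the cluster grows, which is the expensive and painful reshuffle described above. On the ring, the only keys that move are the ones the new node takes over, so existing nodes keep serving everything else while capacity is added.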
The original Facebook team has been busy elsewhere, so the community has had to step up and take the initiative in moving Cassandra forward. Cassandra is open source and I don't want to downplay others' contributions, including those from IBM Research, Digg, and Twitter as well as other companies and individuals, but I'm proud that Rackspace's support has been instrumental in adding many important new features, fixing bugs, and getting out new releases.
Here are 3 reasons why Rackspace has committed resources:
1- As stated in previous posts by Erik Carlin, we are committed to an Open Cloud. With Amazon’s SimpleDB or Google App Engine’s datastore, you’re locked in. Cassandra presents an open alternative: you can write against Cassandra and deploy anywhere. That’s important.
2- We have a suite of Cloud products that are productized beyond just the raw Cloud Servers. Cassandra is interesting to us because we can use it under the hood to improve Cloud Sites and Cloud Files. And people are already starting to ask, "When can I just go to Rackspace and deploy a preconfigured Cassandra cluster?" It's still early, but that's definitely something we're looking at.
3- Rackspace itself has a ton of data that we generate from our switches and routers and the rest of our infrastructure. Right now we are getting by with traditional monitoring and logging technologies, searching those logs and so forth. Cassandra will help us a lot with that as our volumes continue to increase. Our Mail & Apps product teams are also very interested in using Cassandra to store mail messages and other data.
Finally, I want to emphasize that Cassandra is not a magic bullet. You can't just take your SQL app, put it on Cassandra, and expect it to work. It's a different programming model: instead of modeling entities and relationships and then adding indexes to get performance, you need to think at a more basic level, asking "What information do I need to retrieve from each query?", and model your Cassandra schema accordingly. It's a different way of thinking and it does require new code to be written. It's very much for people who have a lot of data that doesn't fit on a single machine and who are feeling the pain from traditional approaches to scaling that.
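As a rough illustration of that query-first way of thinking, here is a small Python sketch that uses plain in-memory dictionaries rather than any Cassandra client API. The mail example, the two queries, and all names (inbox_by_user, messages_by_thread, store_message, and so on) are assumptions made up for the illustration. Each query gets its own structure, keyed by exactly what that query looks up, and a write stores the data once per query it has to serve.

```python
import bisect
from collections import defaultdict

# Query 1: "show the newest N messages in a user's inbox"
# Query 2: "show every message in one thread, oldest first"

# user_id -> sorted list of (timestamp, message_id, subject)
inbox_by_user = defaultdict(list)
# thread_id -> sorted list of (timestamp, message_id, body)
messages_by_thread = defaultdict(list)


def store_message(user_id, thread_id, message_id, timestamp, subject, body):
    """Write the message once per query it must serve (deliberate duplication)."""
    bisect.insort(inbox_by_user[user_id], (timestamp, message_id, subject))
    bisect.insort(messages_by_thread[thread_id], (timestamp, message_id, body))


def newest_in_inbox(user_id, limit):
    """One key lookup plus a slice: no join and no secondary index."""
    rows = inbox_by_user.get(user_id, [])
    return rows[::-1][:limit]


def thread(thread_id):
    """All messages in a thread, already stored in timestamp order."""
    return list(messages_by_thread.get(thread_id, []))


store_message("alice", "t1", "m1", 100, "Lunch?", "Want to get lunch?")
store_message("alice", "t1", "m2", 105, "Re: Lunch?", "Sure, noon works.")
store_message("alice", "t2", "m3", 110, "Invoice", "Invoice attached.")

print(newest_in_inbox("alice", 2))
print([message_id for _, message_id, _ in thread("t1")])
```

The trade-off is visible in store_message: writes do more work and data is duplicated, but each read touches a single key whose contents are already in the order the query wants.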
We plan to write more posts in the future detailing what a switch to Cassandra might look like for some sample applications.