Unprecedented data volumes are driving businesses to look at alternatives to the traditional relational database technology that has served us well for over thirty years. Collectively, these alternatives have become known as “NoSQL databases.”
The fundamental problem is that relational databases cannot handle many modern workloads. There are three specific problem areas: scaling out to data sets like Digg’s (3 TB for green badges) or Facebook’s (50 TB for inbox search) or eBay’s (2 PB overall), per-server performance, and rigid schema design.
Businesses, including The Rackspace Cloud, need to find new ways to store and scale large amounts of data. I recently wrote a post on Cassandra, a non-relational database we have committed resources to. Other non-relational databases are being worked on as well, and collectively we call this the “NoSQL movement.”
The “NoSQL” term was actually coined by a fellow Racker, Eric Evans, when Johan Oskarsson of Last.fm wanted to organize an event to discuss open source distributed databases. The name and concept both caught on.
Some people object to the NoSQL term because it sounds like we’re defining ourselves based on what we aren’t doing rather than what we are. That’s true, to a degree, but the term is still valuable because when a relational database is the only tool you know, every problem looks like a thumb. NoSQL is making people aware that there are other options out there. But we’re not against relational databases when one really is the best tool for the job; it’s “Not Only SQL,” rather than “No SQL at all.”
One real concern with the NoSQL name is that it’s such a big tent that there is room for very different designs. If this is not made clear when discussing the various products, it results in confusion. So I’d like to suggest three axes along which to think about the many database options: scalability, data and query model, and persistence design.
I have chosen 10 NoSQL databases as examples. This is not an exhaustive list, but the concepts discussed are crucial for evaluating others as well.
Scalability
Scaling reads is easy with replication, so when we’re talking about scaling in this context, we mean scaling writes by automatically partitioning data across multiple machines. We call systems that do this “distributed databases.” These include Cassandra, HBase, Riak, Scalaris, Voldemort, and more. If your write volume or data size is more than one machine can handle, then these are your only options if you don’t want to manage partitioning manually. (You don’t.)
There are two things to look for in a distributed database: 1) support for multiple datacenters and 2) the ability to add new machines to a live cluster transparently to your applications.
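To make “automatically partitioning data” and “adding machines to a live cluster” a bit more concrete, here is a minimal sketch of one common partitioning technique, consistent hashing. This is an illustration only, not the implementation used by any database on the list; the `HashRing` class, the node names, and the key format are all made up for the example.

```python
import bisect
import hashlib


def _position(name: str) -> int:
    # Stable hash so every client maps a given name to the same ring position.
    return int(hashlib.md5(name.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring: a key belongs to the first node at or after it."""

    def __init__(self, nodes=()):
        self._ring = []  # sorted list of (position, node_name)
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        bisect.insort(self._ring, (_position(node), node))

    def node_for(self, key: str) -> str:
        if not self._ring:
            raise ValueError("ring has no nodes")
        idx = bisect.bisect_right(self._ring, (_position(key), ""))
        return self._ring[idx % len(self._ring)][1]  # wrap around the ring


ring = HashRing(["node-a", "node-b", "node-c"])
keys = [f"user:{i}" for i in range(1000)]
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")  # grow the cluster while it is "live"
after = {k: ring.node_for(k) for k in keys}

moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all to node-d")
```

The point of the design is that adding a node only relocates the keys that now belong to the new node; everything else stays where it was, which is what lets a cluster grow without the application noticing.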
Non-distributed NoSQL databases include CouchDB, MongoDB, Neo4j, Redis, and Tokyo Cabinet. These can serve as persistence layers for distributed systems; MongoDB provides limited support for sharding, as does a separate Lounge project for CouchDB, and Tokyo Cabinet can be used as a Voldemort storage engine.
Data and Query Model
There is a lot of variety in the data models and query APIs in NoSQL databases.
The column family model shared by Cassandra and HBase is inspired by the one described in section 2 of Google’s Bigtable paper. (Cassandra drops historical versions, and adds supercolumns.) In both systems, you have rows and columns like you are used to seeing, but the rows are sparse: each row can have as many or as few columns as desired, and columns do not need to be defined ahead of time.
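A rough way to picture sparse rows is a map of maps: row key to column name to value. The sketch below is a toy in-memory model, not the Cassandra or HBase API; the `ColumnFamily` class and its method names are hypothetical.

```python
from collections import defaultdict


class ColumnFamily:
    """Toy sparse table: each row holds only the columns actually written to it."""

    def __init__(self, name):
        self.name = name
        self._rows = defaultdict(dict)  # row key -> {column name: value}

    def insert(self, row_key, column, value):
        # No schema check: any column name is accepted on any row.
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        row = self._rows.get(row_key, {})
        if column is None:
            return dict(sorted(row.items()))  # whole row, columns in sorted order
        return row.get(column)  # None if this row never stored that column


users = ColumnFamily("Users")
users.insert("alice", "email", "alice@example.com")
users.insert("alice", "city", "Austin")
users.insert("bob", "twitter", "@bob")  # a column "alice" does not have

print(users.get("alice"))
print(users.get("bob", "email"))
```

Here “alice” and “bob” live in the same column family but share no columns, and nothing had to be declared ahead of time.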
The key/value model is the simplest and easiest to implement, but it is inefficient when you are only interested in querying or updating part of a value. It’s also difficult to implement more sophisticated structures on top of distributed key/value.
Document databases are essentially the next level of key/value, allowing nested values associated with each key. Document databases support querying those nested values more efficiently than simply returning the entire blob each time.
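The difference between the two models shows up as soon as you want to touch one field. The sketch below contrasts an opaque-blob key/value store with a toy document store that understands nested fields. Both are in-memory stand-ins I made up for illustration (`kv_update_city`, `DocumentStore`, and the dotted-path syntax are assumptions, not any product’s API).

```python
import json

# Key/value: the store sees only an opaque blob, so the client must
# fetch, decode, modify, and rewrite the entire value to change one field.
kv_store = {
    "user:42": json.dumps({"name": "Ann", "address": {"city": "Austin"}}),
}


def kv_update_city(store, key, city):
    doc = json.loads(store[key])   # read the whole blob
    doc["address"]["city"] = city
    store[key] = json.dumps(doc)   # write the whole blob back


class DocumentStore:
    """Toy document store: it understands nesting, so it can address one field."""

    def __init__(self):
        self._docs = {}

    def put(self, key, doc):
        self._docs[key] = doc

    def get_field(self, key, path):
        node = self._docs[key]
        for part in path.split("."):
            node = node[part]
        return node

    def set_field(self, key, path, value):
        *parents, leaf = path.split(".")
        node = self._docs[key]
        for part in parents:
            node = node[part]
        node[leaf] = value


kv_update_city(kv_store, "user:42", "Dallas")

docs = DocumentStore()
docs.put("user:42", {"name": "Ann", "address": {"city": "Austin"}})
docs.set_field("user:42", "address.city", "Dallas")
print(docs.get_field("user:42", "address.city"))
```

In a real system the savings come from not shipping the whole value over the network for every partial read or update; the toy version only shows the shape of the two interfaces.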
Neo4j has a unique data model, storing objects and relationships as nodes and edges in a graph. For queries that fit this model (e.g., hierarchical data), it can be thousands of times faster than alternatives.
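To give a feel for the kind of query that fits the graph model, here is a small sketch of hierarchical data stored as nodes and labeled edges, with a breadth-first walk to find everything beneath a node. This is plain Python, not Neo4j’s API; the `Graph` class, the `MANAGES` relationship, and the org chart are invented for the example.

```python
from collections import defaultdict, deque


class Graph:
    """Toy property-free graph: nodes are names, edges carry a relationship label."""

    def __init__(self):
        self._out = defaultdict(list)  # node -> [(relationship, neighbor)]

    def relate(self, src, rel, dst):
        self._out[src].append((rel, dst))

    def descendants(self, start, rel):
        # Breadth-first traversal that follows only edges labeled `rel`.
        seen, queue, order = {start}, deque([start]), []
        while queue:
            node = queue.popleft()
            for label, neighbor in self._out.get(node, ()):
                if label == rel and neighbor not in seen:
                    seen.add(neighbor)
                    order.append(neighbor)
                    queue.append(neighbor)
        return order


org = Graph()
org.relate("ceo", "MANAGES", "vp_eng")
org.relate("ceo", "MANAGES", "vp_sales")
org.relate("vp_eng", "MANAGES", "dev1")
org.relate("vp_eng", "MANAGES", "dev2")

print(org.descendants("ceo", "MANAGES"))
# ['vp_eng', 'vp_sales', 'dev1', 'dev2']
```

Each hop here is a direct lookup from a node to its neighbors, whereas the same question in a relational schema typically needs one self-join per level of the hierarchy.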
Scalaris is unique in offering distributed transactions across multiple keys. (Discussing the trade-offs between consistency and availability is beyond the scope of this post, but that is another aspect to keep in mind when evaluating distributed systems.)
Persistence Design
By persistence design I mean, “how is data stored internally?”
The persistence model tells us a lot about what kind of workloads these databases will be good at.
In-memory databases are very, very fast (Redis achieves over 100,000 operations per second on a single machine), but cannot work with data sets that exceed available RAM. Durability (retaining data even if a server crashes or loses power) can also be a problem; the amount of data you can expect to lose between flushes (copying the data to disk) is potentially large. Scalaris, the other in-memory database on our list, tackles the durability problem with replication, but since it does not support multiple data centers your data will still be vulnerable to things like power failures.
Memtables and SSTables buffer writes in memory (a “memtable”) after writing to an append-only commit log for durability. When enough writes have been accepted, the memtable is sorted and written to disk all at once as an “sstable.” This provides close to in-memory performance since no seeks are involved, while avoiding the durability problems of purely in-memory approaches. (This is described in more detail in sections 5.3 and 5.4 of the previously-referenced Bigtable paper, as well as in The log-structured merge-tree.)
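The write path described above can be sketched in a few dozen lines. This is a deliberately tiny model of the idea, not how Cassandra or HBase implement it: the `TinyLSM` class, the JSON-lines file format, and the flush threshold are assumptions for the example, and real systems add compaction, indexes, and crash recovery from the log.

```python
import json
import os
import tempfile


class TinyLSM:
    """Toy commit log + memtable + sstable store (no compaction, no recovery)."""

    def __init__(self, directory, flush_threshold=3):
        self.directory = directory
        self.flush_threshold = flush_threshold
        self.memtable = {}   # in-memory buffer of recent writes
        self.sstables = []   # paths of flushed files, oldest first
        self.log_path = os.path.join(directory, "commit.log")
        self._log = open(self.log_path, "a", encoding="utf-8")

    def put(self, key, value):
        # 1) Sequential append to the commit log for durability.
        self._log.write(json.dumps([key, value]) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        # 2) Buffer the write in memory.
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self._flush()

    def _flush(self):
        # Sort the memtable and write it out in one sequential pass.
        path = os.path.join(self.directory, f"sstable-{len(self.sstables):04d}.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            for key in sorted(self.memtable):
                f.write(json.dumps([key, self.memtable[key]]) + "\n")
        self.sstables.append(path)
        self.memtable = {}
        # The flushed writes are now on disk, so the log can start over.
        self._log.close()
        self._log = open(self.log_path, "w", encoding="utf-8")

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for path in reversed(self.sstables):  # newest sstable wins
            with open(path, encoding="utf-8") as f:
                for line in f:
                    k, v = json.loads(line)
                    if k == key:
                        return v
        return None

    def close(self):
        self._log.close()


with tempfile.TemporaryDirectory() as d:
    db = TinyLSM(d, flush_threshold=2)
    db.put("b", 1)
    db.put("a", 2)   # second write fills the memtable and triggers a flush
    db.put("a", 3)   # newer value for "a" sits in the fresh memtable
    flushed = len(db.sstables)
    latest_a = db.get("a")
    old_b = db.get("b")
    db.close()

print(flushed, latest_a, old_b)
```

Note that every disk write here is either an append or a single sequential dump, which is the property that keeps seeks out of the write path.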
B-Trees have been used in databases since practically the beginning of time. They provide robust indexing support, but performance is poor on rotational disks (which are still by far the most cost-effective) because of the multiple seeks involved in reading or writing anything.
An interesting variant is CouchDB’s append-only B-Trees, which avoid the overhead of seeks at the cost of limiting CouchDB to one write at a time.
The NoSQL movement has exploded in 2009 as an increasing number of businesses wrestle with large data volumes. The Rackspace Cloud is pleased to have played an early role in the NoSQL movement, and continues to commit resources to Cassandra and to support events like NoSQL East.
NoSQL conference announcements and related discussion can be found on the Google discussion group.