For over thirty years, relational database technology has been the gold standard. Modern workloads and unprecedented data volumes, however, are driving businesses to look at alternatives to the traditional relational database. This “NoSQL movement” has given rise to a host of non-relational database technologies designed for large-capacity storage and scalability. Some businesses may find that the best solution is a combination of relational and non-relational databases, using whichever tool is best for each job. In this regard, “NoSQL” is probably better read as “Not Only SQL” than as “No SQL at all.”
NoSQL technologies vary widely, but they can be evaluated based on three key features: scalability, data and query model, and persistence design. Below we will investigate ten popular NoSQL database options.
In this context, “scalability” refers to scaling writes by automatically partitioning data across multiple machines. Systems that do this are called “distributed databases,” and they include Cassandra, HBase, Riak, Scalaris, Voldemort, and others. If your write volume or data size is more than one machine can handle, then these are your only options if you don’t want to manage partitioning manually. (You don’t.)
When choosing a distributed database, look for:
1) support for multiple datacenters and
2) the ability to add new machines to a live cluster transparently to your applications.
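To make automatic partitioning and live cluster growth more concrete, here is a minimal sketch of consistent hashing, one common way to spread keys across machines. It is an illustration only, not the implementation used by any of the databases named above; the class name HashRing, the node names, and the choice of 64 virtual nodes per machine are all assumptions made for the example.

```python
import bisect
import hashlib


def ring_hash(key: str) -> int:
    """Map a string to a position on the ring (MD5 is used only as a stable hash)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring; each node owns several points ("virtual nodes")."""

    def __init__(self, nodes=(), vnodes=64):
        self.vnodes = vnodes
        self._points = []  # sorted ring positions
        self._owner = {}   # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        """Place the node's virtual points on the ring."""
        for i in range(self.vnodes):
            point = ring_hash(f"{node}#{i}")
            self._owner[point] = node
            bisect.insort(self._points, point)

    def node_for(self, key: str) -> str:
        """Return the node owning the first point clockwise from the key."""
        if not self._points:
            raise ValueError("ring has no nodes")
        idx = bisect.bisect_right(self._points, ring_hash(key)) % len(self._points)
        return self._owner[self._points[idx]]


# Hypothetical three-machine cluster that grows to four while "live".
keys = [f"user:{i}" for i in range(1000)]
ring = HashRing(["node-a", "node-b", "node-c"])
before = {k: ring.node_for(k) for k in keys}

ring.add_node("node-d")
after = {k: ring.node_for(k) for k in keys}

# Only keys now owned by the new machine change location.
moved = [k for k in keys if before[k] != after[k]]
print(f"{len(moved)} of {len(keys)} keys moved, all to node-d")
```

The point of the design is that adding a machine relocates only the keys the new machine takes over, rather than reshuffling the whole data set, which is what lets a cluster grow without the application noticing.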
Non-distributed NoSQL databases include CouchDB, MongoDB, Neo4j, Redis, and Tokyo Cabinet, and they can serve as persistence layers for distributed systems. MongoDB provides limited support for sharding, as does a separate Lounge project for CouchDB, and Tokyo Cabinet can be used as a Voldemort storage engine.
Data and Query Model
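NoSQL databases vary widely in their data models and query APIs. The query APIs of the ten databases covered here are: Cassandra (Thrift), CouchDB (map/reduce views), HBase (Thrift), MongoDB (cursor), Neo4j (graph), Redis (collection), Riak (nested hashes), and Scalaris, Tokyo Cabinet, and Voldemort (get/put).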
Here are some highlights:
- With the columnfamily model, you have rows and columns like you would expect, but the rows are sparse, meaning each row can have as many or as few columns as desired, and columns do not need to be defined ahead of time.
- The key/value model is simple and easy to implement, but it is inefficient when you are only interested in querying or updating part of a value. It’s also difficult to implement more sophisticated structures on top of a distributed key/value store.
- Document databases are essentially the next level up from key/value, allowing nested values to be associated with each key. They can query those nested values more efficiently than by returning the entire blob each time.
- Neo4j uses a graph data model, unique among the databases covered here, storing objects and relationships as nodes and edges in a graph. For queries that fit this model (e.g., hierarchical data), it can be thousands of times faster than the alternatives.
- Unlike the others, Scalaris offers distributed transactions across multiple keys. There are, however, trade-offs between consistency and availability that you should keep in mind.
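The differences between the first three models can be sketched with plain Python dictionaries. This is a toy illustration, not the API of any database mentioned here; the helper names (kv_update_city, doc_set, doc_get), the keys, and the sample records are all made up for the example.

```python
import json

# Key/value: the store only sees an opaque blob, so a partial update means
# reading the whole value, changing it in the client, and writing it all back.
kv_store = {"user:42": json.dumps({"name": "Ada", "address": {"city": "London"}})}


def kv_update_city(store, key, city):
    value = json.loads(store[key])   # fetch and decode the entire blob
    value["address"]["city"] = city
    store[key] = json.dumps(value)   # rewrite the entire blob


# Document: the store understands the nested structure, so a single field
# can be addressed by a dotted path (hypothetical helpers).
doc_store = {"user:42": {"name": "Ada", "address": {"city": "London"}}}


def doc_get(store, key, path):
    node = store[key]
    for part in path.split("."):
        node = node[part]
    return node


def doc_set(store, key, path, new_value):
    *parents, leaf = path.split(".")
    node = store[key]
    for part in parents:
        node = node[part]
    node[leaf] = new_value


# Columnfamily: rows are sparse, so each row carries only the columns it
# needs, and a new column can appear without any schema change.
column_family = {
    "row1": {"name": "Ada", "email": "ada@example.com"},
    "row2": {"name": "Bob"},
}
column_family["row2"]["last_login"] = "2010-01-01"

kv_update_city(kv_store, "user:42", "Paris")
doc_set(doc_store, "user:42", "address.city", "Paris")
print(doc_get(doc_store, "user:42", "address.city"))
print(sorted(column_family["row2"]))
```

In the key/value case the client does all the work of decoding and re-encoding the value, whereas the document helpers touch only the addressed field; that is the efficiency difference the list above describes.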
Persistence design, in this instance, refers to how the data is stored internally. By evaluating the persistence model, you can determine the best fit for your workload.
- In-memory databases are extremely fast (Redis achieves over 100,000 operations per second on a single machine), but they cannot work with data sets that exceed available RAM. Durability (i.e., retaining data even if a server crashes or loses power) can also be a problem; the amount of data you can expect to lose between flushes (copying the data to disk) is potentially large. Scalaris tackles the durability problem with replication, but since it does not support multiple data centers, your data will still be vulnerable to failures that affect a whole site, such as a power outage.
- Memtables and SSTables: writes are buffered in memory (in a “memtable”) after first being written to an append-only commit log for durability. When enough writes have been accepted, the memtable is sorted and written to disk all at once as an “sstable.” This provides close to in-memory performance, since no seeks are involved, while avoiding the durability problems of purely in-memory approaches. (To learn more, see sections 5.3 and 5.4 of Google’s Bigtable paper, as well as the paper “The Log-Structured Merge-Tree.”)
- B-Trees have been used in databases for decades because of their robust indexing support. Their performance is poor, though, on rotational disks (which are still by far the most cost-effective) because of the multiple seeks involved in reading or writing anything.
CouchDB’s append-only B-Trees provide an interesting variant by avoiding the overhead of seeks, unfortunately at the cost of limiting CouchDB to one write at a time.
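The memtable/sstable write path described in the list above can be sketched in a few lines. This is a simplified, in-memory illustration under stated assumptions, not how Cassandra, HBase, or Bigtable actually implement it: the class name TinyLSM and the memtable_limit parameter are invented for the example, Python lists stand in for the commit log and sstable files on disk, and compaction and crash recovery are left out.

```python
import bisect


class TinyLSM:
    """Toy log-structured store: commit log + memtable + sorted sstables."""

    def __init__(self, memtable_limit=3):
        self.memtable_limit = memtable_limit
        self.commit_log = []  # stands in for an append-only file on disk
        self.memtable = {}    # recent writes, buffered in memory
        self.sstables = []    # sorted (key, value) lists, oldest first

    def put(self, key, value):
        self.commit_log.append((key, value))  # 1. append to the log first
        self.memtable[key] = value            # 2. then buffer in memory
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        # Sort the memtable and write it out in one sequential pass.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}
        self.commit_log = []  # flushed writes no longer need the log

    def get(self, key):
        if key in self.memtable:
            return self.memtable[key]
        # Search newest sstable first so the latest write wins.
        for table in reversed(self.sstables):
            i = bisect.bisect_left(table, (key,))
            if i < len(table) and table[i][0] == key:
                return table[i][1]
        return None


db = TinyLSM(memtable_limit=2)
db.put("b", 1)
db.put("a", 2)   # second write reaches the limit and triggers a flush
db.put("a", 3)   # newer value for "a" stays in the memtable for now
print(db.get("a"), db.get("b"), db.get("missing"))
```

Every write here is an append or an in-memory update, and each flush is a single sorted, sequential write, which is why this design avoids the seeks that make B-Trees slow on rotational disks.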
In 2009 the NoSQL movement hit the ground running, and it has continued to grow as more and more businesses wrestle with ways to process large volumes of data. Conferences, announcements, and discussion about the NoSQL movement can be found on the Google discussion group.