Big Data On OpenStack

Filed in Cloud Industry Insights by Raghavan "Rags" Srinivas | August 29, 2013 1:00 pm

Big Data and OpenStack are probably the two most mentioned terms in the data center today. It has become increasingly apparent that organizations have to deal with such massive amounts of data that traditional methods of scaling up with a bigger server, more memory or a larger disk are no longer feasible.

Nature of the problem

The Big Data problem is typically characterized by the "three Vs" (volume, velocity and variety), each of which is largely self-evident and somewhat interrelated.

A non-disruptive solution to this problem is to "scale up." However, as the following graph illustrates, the time and cost to process the data then grow exponentially.


To process the data within reasonable resources and time, a scale-out approach offers superior results, as illustrated below. The goal is to flatten both the cost and the application response time curves.
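The contrast can be sketched with a toy cost model; the growth rates below are illustrative assumptions, not measured figures:

```python
# Toy model contrasting scale-up and scale-out processing time.
# All constants and growth rates are illustrative assumptions.

def scale_up_time(data_tb):
    """On a single server, processing time grows super-linearly with
    data size (memory pressure, disk seeks, contention)."""
    return 1.0 * data_tb ** 2

def scale_out_time(data_tb, tb_per_node=1):
    """With one commodity node per unit of data, per-node work --
    and hence wall-clock time -- stays roughly flat."""
    nodes = max(1, data_tb // tb_per_node)
    return 1.0 * data_tb / nodes

for tb in (1, 4, 16, 64):
    print(tb, scale_up_time(tb), scale_out_time(tb))
```

The point of the sketch: doubling the data quadruples the single-server time, while the scale-out curve stays flat because nodes are added in proportion to the data.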

Hadoop takes this approach to Big Data, relying on commodity servers and providing an infrastructure that replicates data and tolerates the faults inherent in distributed systems.
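Hadoop's MapReduce programming model can be illustrated with a minimal, in-process sketch; plain Python stands in for Hadoop's Java API here, and the "nodes" are simulated rather than distributed:

```python
from collections import defaultdict
from itertools import chain

def map_phase(chunk):
    # Each "node" emits (word, 1) pairs for its own chunk of the input.
    return [(word, 1) for word in chunk.split()]

def reduce_phase(pairs):
    # Shuffled pairs are summed per key, as a reducer would do.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)

# Simulate three commodity nodes, each holding one chunk of the data.
chunks = ["big data big", "data on openstack", "big openstack"]
mapped = chain.from_iterable(map_phase(c) for c in chunks)
print(reduce_phase(mapped))  # {'big': 3, 'data': 2, 'on': 1, 'openstack': 2}
```

Because each map call touches only its own chunk, adding more chunks (and nodes) scales the work out horizontally, which is exactly the property the graphs above describe.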

Other Requirements and Ecosystem

Processing data in a timely fashion is a foremost requirement. Although batch processing will meet the needs of many organizations, there are other critical requirements as well.

This has led to an explosion of an ecosystem of vendors and products that embrace horizontal scaling to meet the requirements outlined above. The downside of these approaches is that they eschew time-tested development techniques such as data normalization, joins and transactions (ACID properties).

The taxonomies of these approaches include (but are not limited to) MapReduce-style batch frameworks and the various NoSQL data stores.

Hadoop is by far the most popular, but not the only, elephant in the room. These technologies all take the scale-out approach and are governed by the CAP theorem (or Brewer's conjecture)[1]. Unlike the relational database, which was a universal data platform until recently, none of these technologies or products meets all of the business requirements outlined above.
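The CAP trade-off can be sketched with a toy two-replica store that, during a network partition, must either refuse writes (consistency over availability) or accept divergence (availability over consistency). The class names and modes are illustrative, not drawn from any particular product:

```python
class Replica:
    def __init__(self):
        self.data = {}

class PartitionedStore:
    """Two replicas with a simulated network partition.
    mode='CP' sacrifices availability; mode='AP' sacrifices consistency."""
    def __init__(self, mode):
        self.a, self.b = Replica(), Replica()
        self.partitioned = False
        self.mode = mode

    def write(self, replica, key, value):
        if self.partitioned:
            if self.mode == "CP":
                # Consistent-but-unavailable: refuse the write.
                raise RuntimeError("unavailable during partition")
            # Available-but-inconsistent: accept locally, replicas diverge.
            replica.data[key] = value
        else:
            # No partition: replicate the write to both sides.
            self.a.data[key] = self.b.data[key] = value

store = PartitionedStore(mode="AP")
store.partitioned = True
store.write(store.a, "x", 1)  # accepted on replica a only
# store.b.data has no "x": the replicas have diverged.
```

An AP system resolves such divergence later (eventual consistency), while a CP system simply stops answering on the minority side; neither choice gives an application everything a single relational database once did.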

Irrespective of the data platform, there is a need to stand these systems up in a cloud, which is where OpenStack and the private clouds derived from it (such as the Rackspace Private Cloud[2]) can help standardize operations in the data center and speed up development. For example, data security and privacy might be critical requirements that can be met only by a private cloud.
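Standing up such a cluster on OpenStack can be scripted, for example, with a Heat orchestration template. The sketch below provisions a master and a worker as plain Nova servers; the image, flavor and keypair names are placeholders, not Rackspace defaults:

```yaml
heat_template_version: 2013-05-23

description: Minimal sketch of a small Hadoop cluster on OpenStack.

resources:
  hadoop_master:
    type: OS::Nova::Server
    properties:
      name: hadoop-master
      image: ubuntu-12.04     # placeholder image name
      flavor: m1.large        # placeholder flavor
      key_name: my-keypair    # placeholder keypair

  hadoop_worker_1:
    type: OS::Nova::Server
    properties:
      name: hadoop-worker-1
      image: ubuntu-12.04
      flavor: m1.large
      key_name: my-keypair
```

Scaling out is then a matter of adding (or templating) more worker resources, rather than ordering a bigger server.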

There are open questions about virtualization overhead and keeping the data local to the compute, but most approaches involve multiple technologies installed on an elastic infrastructure running on OpenStack or in a hybrid cloud[3].

Example Architecture(s)

An example of multiple products and nodes running on a Rackspace Hybrid Cloud is illustrated below and in a talk on this topic presented at the OpenStack Design Summit[4].

Multiple products and technologies running in this hybrid cloud mine information from the Rackspace public cloud[5] and use that intelligence to monitor and tune the public cloud.


The Big Data problem is a multi-faceted one. The traditional relational database approach has given way to a scale-out approach. To manage this elastic infrastructure and meet the varying needs of data, a multi-technology, hybrid approach based on OpenStack may be best suited, with an eye to the future.

  1. CAP theorem (or Brewer’s conjecture):
  2. Rackspace Private Cloud:
  3. hybrid cloud:
  4. OpenStack Design Summit:
  5. Rackspace public cloud:
