
Big Data On OpenStack


Big Data and OpenStack are probably the two most mentioned terms in the data center today. It has become increasingly apparent that organizations have to deal with such massive amounts of data that traditional methods of scaling up (a bigger server, more memory or more disk) are no longer feasible.

Nature of the Problem

The Big Data problem is typically characterized by the “three Vs,” each of which is self-evident and somewhat interrelated.

  • Volume
  • Velocity
  • Variety

A non-disruptive solution to this problem is to “scale up.” The trouble with scaling up, however, is that the time and cost to process the data grow steeply as the data grows.

Scalability

To process the data within reasonable resources and time, a scale-out approach offers superior results. The goal is to flatten both the cost curve and the application response time.
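
The shape of this argument can be sketched with made-up numbers: under scale-up, cost typically grows faster than capacity, while under scale-out, adding commodity nodes keeps cost roughly linear. The cost functions below are purely illustrative, not measurements from any real system.

```python
def scale_up_cost(units):
    # Bigger single boxes cost disproportionately more per unit of capacity.
    return round(units ** 1.5)

def scale_out_cost(units, node_capacity=10, node_cost=12):
    # Commodity nodes: cost grows linearly with the number of nodes.
    nodes = -(-units // node_capacity)  # ceiling division
    return nodes * node_cost

for units in (10, 100, 1000):
    print(units, scale_up_cost(units), scale_out_cost(units))
# 10 32 12
# 100 1000 120
# 1000 31623 1200
```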

Hadoop takes this approach to Big Data: it relies on commodity servers and provides an infrastructure that replicates data and tolerates the faults that are inherent in distributed systems.
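
The MapReduce model at the heart of Hadoop is easy to sketch. The toy, single-process example below shows only the shape of the computation (map emits key/value pairs, a shuffle groups them by key, reduce aggregates each group); it is not Hadoop’s actual API, which distributes these phases across the cluster.

```python
from collections import defaultdict

def map_phase(lines):
    # Map: emit a (key, value) pair for every word.
    for line in lines:
        for word in line.split():
            yield word, 1

def shuffle(pairs):
    # Shuffle: group values by key, as the framework does between phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values independently (hence in parallel).
    return {word: sum(counts) for word, counts in groups.items()}

corpus = ["big data on openstack", "big data at scale"]
print(reduce_phase(shuffle(map_phase(corpus))))
# {'big': 2, 'data': 2, 'on': 1, 'openstack': 1, 'at': 1, 'scale': 1}
```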

Other Requirements and Ecosystem

Processing data in a timely fashion is the foremost requirement. Although batch processing will meet the needs of many organizations, there are other critical requirements as well. For instance:

  • Minimize data movement
  • Ability to encrypt data for compliance reasons (a minimal sketch follows this list)
  • Ability to work with a variety of data loaders (ETL)
  • Data archival
  • Ability to run real-time queries
  • Ability for business (non-technical) users to write queries
  • Ability to integrate with existing databases and data warehouses
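
As an example of the encryption requirement above, here is a minimal sketch of encrypting records before they land on shared storage, using the third-party `cryptography` package. Key management, which any real deployment needs, is deliberately elided; none of this is from the original post.

```python
from cryptography.fernet import Fernet

# In practice the key comes from a key-management service, not
# generated inline; this is purely illustrative.
key = Fernet.generate_key()
cipher = Fernet(key)

record = b'{"user": 42, "event": "login"}'
token = cipher.encrypt(record)          # safe to write to HDFS/object storage
assert cipher.decrypt(token) == record  # round-trips losslessly
```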

These requirements have led to an explosion of vendors and products that embrace the horizontal scaling approach. The downside of these approaches is that they eschew time-tested development techniques such as data normalization, joins and transactions (ACID properties).

The taxonomies of these approaches include (but are not limited to):

  • Hadoop: Based on a file system called the Hadoop Distributed File System (HDFS) and related technologies such as MapReduce
  • NoSQL: MongoDB, Cassandra, CouchDB, Couchbase and so on (see the sketch after this list)
  • NewSQL: InnoDB-based scaling solutions such as ScaleBase, and newer technologies like NuoDB
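
The document model that the NoSQL stores above share is easiest to see in code. Below is a minimal sketch using MongoDB through the `pymongo` driver; the connection URL, database and collection names are illustrative, not from the original post.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
events = client.analytics.events  # database "analytics", collection "events"

# Documents in the same collection need not share a schema.
events.insert_one({"user": 42, "action": "login"})
events.insert_one({"user": 42, "action": "search", "terms": ["openstack"]})

# Queries are expressed as documents, too.
for doc in events.find({"user": 42}):
    print(doc)
```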

Hadoop is by far the most popular, but it is not the only elephant in the room. All of these take the scale-out approach and are governed by the CAP theorem (Brewer’s conjecture). Unlike the relational database, which was the universal data platform until recently, none of these technologies or products meets all of the business requirements outlined above.

Irrespective of the data platform, there is a need to stand these systems up in a cloud. This is where OpenStack, and the private clouds derived from it (such as the Rackspace Private Cloud), can help standardize operations in the data center and speed up development. For example, data security and privacy might be a critical requirement that can be met only by a private cloud.
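
As a concrete illustration, here is a hypothetical sketch of scripting a small cluster of data nodes with the `openstacksdk` library. The cloud name, image, flavor and network names are placeholders for your own environment, and this is one way to do it today rather than anything described in the original post.

```python
import openstack

# Credentials come from a clouds.yaml entry named "my-private-cloud".
conn = openstack.connect(cloud="my-private-cloud")

image = conn.compute.find_image("my-ubuntu-image")   # placeholder names
flavor = conn.compute.find_flavor("m1.large")
network = conn.network.find_network("data-net")

for i in range(3):  # three worker nodes for a small Hadoop cluster
    server = conn.compute.create_server(
        name="hadoop-worker-%d" % i,
        image_id=image.id,
        flavor_id=flavor.id,
        networks=[{"uuid": network.id}],
    )
    conn.compute.wait_for_server(server)  # block until the node is ACTIVE
```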

There are prevailing questions about virtualization and about keeping data local to the compute, but most approaches involve multiple technologies installed on an elastic infrastructure running on OpenStack or a hybrid cloud.

Example Architecture(s)

An example of multiple products and nodes running on a Rackspace Hybrid Cloud was presented in a talk on this topic at the OpenStack Design Summit.

Multiple products and technologies running in a hybrid cloud are able to mine information from the Rackspace public cloud and use that intelligence to monitor and tune the public cloud.

Summary

The Big Data problem is a multi-faceted one. The traditional relational database approach has given way to a scale-out approach. To manage this elastic infrastructure and meet the varying needs of the data, a multi-technology, hybrid approach based on OpenStack might be the best fit, with an eye to the future.

About the Author

This is a post written and contributed by Raghavan "Rags" Srinivas.

Raghavan "Rags" Srinivas works as a solutions architect at Rackspace where he finds himself constantly challenged from low level networking to high level application issues. His general focus area is in distributed systems, with a specialization in Cloud Computing and Big Data. He worked on Hadoop, HBase and NoSQL during its early stages. He has spoken on a variety of technical topics at conferences around the world, conducted and organized Hands-on Labs and taught graduate classes in the evening.

Rags brings with him over 20 years of hands-on software development and over 10 years of architecture and technology evangelism experience. He has evangelized and influenced the architecture of a number of technology areas. He is also a repeat JavaOne rock star speaker award winner.

Rags holds a Master's degree in Computer Science from the Center for Advanced Computer Studies at the University of Louisiana at Lafayette. He likes to hike, run and generally be outdoors, but most of all he loves to eat.


2 Comments

Hi Rags,

True, when the two technologies, Big Data and Cloud, merge, there is huge potential.
I am experimenting with low-latency stream computation, such as Storm, on top of OpenStack. Is there any research happening at Rackspace, and is an architecture available?

Thanks,
Shankar.

Shankar Ganesh P J on August 30, 2013

It would be great if you would collaborate with us on introducing NuoDB to your readership! I am confident that Rackspace users would be very interested in learning more about how NuoDB fits into the Rackspace ecosystem.

Michael Waclawiczek on September 16, 2013
