
Know MapReduce? No Problem!

In an ever-evolving competitive landscape, it’s often hard to keep the skills inside your organization at the level that rapidly changing technology demands.

You have taken care in choosing the right people for your company. These are people whom you trust to grow your business and keep your customers happy. Too often, disruptive technology emerges that sends you looking for professionals with additional skills. If you are on the hirable side of this equation, and have the time and flexibility to learn the most in-demand coding language or cutting-edge technology, you will most likely reap the financial benefits of being constantly “in-demand.” If you are on the other side of the equation, you might find yourself struggling to affirm your value to the company and showcase your talents in a way that is visible to your peers. The refreshing news is that good companies understand the value of committed and passionate professionals, and tools are being created that allow IT professionals to work within their comfort zones and still leverage new technology.

In the Big Data space, a traditional database administrator is now tasked with understanding a new programming model and technology called MapReduce. A recent search on glassdoor.com turned up more than 382 open MapReduce positions. MapReduce is a framework for processing queries over large datasets. It involves taking a large amount of seemingly varied data and dividing the work into tasks performed locally on independent nodes. The entire process consists of two steps:

  • Map – The process where a master node takes the job you are trying to perform and divides it into multiple tasks that are then executed locally on several worker nodes.
  • Reduce – The process where the “answers” from the worker nodes are assembled and combined to produce a reduced output.
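
The two steps above can be sketched in a few lines of Python. This is only an illustration running on a single machine, not how Hadoop itself is implemented: the "workers" are simulated by splitting a list into chunks, and the function names (split_into_chunks, map_chunk, reduce_pairs, word_count) are made up for this example. It counts words, a common first MapReduce exercise.

```python
from collections import defaultdict


def split_into_chunks(records, num_workers):
    """Divide the input into one chunk per (simulated) worker node."""
    return [records[i::num_workers] for i in range(num_workers)]


def map_chunk(lines):
    """Map step: a worker emits a (word, 1) pair for every word in its chunk."""
    return [(word.lower(), 1) for line in lines for word in line.split()]


def reduce_pairs(all_pairs):
    """Reduce step: combine the workers' pairs into one total per word."""
    totals = defaultdict(int)
    for word, count in all_pairs:
        totals[word] += count
    return dict(totals)


def word_count(lines, num_workers=3):
    chunks = split_into_chunks(lines, num_workers)
    # On a real cluster each chunk would be processed in parallel on its own node.
    mapped = [map_chunk(chunk) for chunk in chunks]
    flat = [pair for worker_output in mapped for pair in worker_output]
    return reduce_pairs(flat)


if __name__ == "__main__":
    sample = ["big data is big", "data is everywhere", "big ideas"]
    print(word_count(sample))
```

The point of the split is that map_chunk only ever looks at its own chunk, so adding more workers does not change the result, only how the work is spread out.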

This is the voodoo behind how applications like Hadoop™ use many servers to break large workloads into more manageable local tasks that scale horizontally with built-in redundancy. Running on commodity hardware, this lets you analyze and digest large amounts of data at a fairly low infrastructure cost, which makes it extremely compelling to explore. While the servers and software are available at low cost or for free, the hidden costs of exploring MapReduce technology can be great, and organizations can find it difficult to adopt without hindering other tracks of business or consuming valuable internal resources.

This introduces the often-cumbersome challenge of transitioning people to new technologies. MapReduce technology has only been in existence for a handful of years. Some would argue that it became a legitimate need in most organizations only in the last three years. When a technology is new, it is often hard to find people with the right training and understanding to tackle the new challenge. It is even harder to find anyone with extensive experience in doing it. If the job postings are any indication, there is a healthy competitive market for these professionals. However, what about the people inside your organization whom you trust with your entire data strategy? These people are tasked with keeping your entire system of record and every dollar and piece of inventory in check. Why would you replace these valuable people just because they don’t speak the language of your new technology? Luckily, with open community-built tools you don’t have to!

The Apache user community has developed some great tools to help you harness the power of Big Data platforms like Apache Hadoop without any knowledge of writing MapReduce. These programming tools allow you to use a familiar coding language to specify the dataset and operation, which greatly reduces the learning curve:

  • Apache Pig – A tool developed by Yahoo and contributed back to the Apache Hadoop community, Apache Pig allows you to write code in a very SQL-friendly manner that is then translated into a MapReduce process. The language used for this SQL-like code is humorously called “Pig Latin.” This allows you to apply constructs from relational databases in environments with heavily non-relational data.
  • Apache Hive – A slightly different animal (excuse my bad zoo puns), Apache Hive is a data warehouse system that runs on top of Hadoop to provide summarization of data. Hive lets you query data stored in Hadoop or HBase using a SQL-like language called HiveQL, which it translates into MapReduce jobs. It also adds some very handy capabilities to index and query datasets faster.
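
To give a feel for what these tools save you from writing, here is a rough Python sketch of the map and reduce work behind a simple SQL-style aggregation, something like SELECT region, SUM(amount) FROM sales GROUP BY region. It is a conceptual illustration only and does not show the actual jobs Pig or Hive generate; the sales records, field names, and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical sales records; the field names are invented for illustration.
SALES = [
    {"region": "east", "amount": 120},
    {"region": "west", "amount": 80},
    {"region": "east", "amount": 30},
    {"region": "south", "amount": 50},
    {"region": "west", "amount": 20},
]


def map_record(record):
    """Map: turn each row into a (key, value) pair, here (region, amount)."""
    return record["region"], record["amount"]


def shuffle(pairs):
    """Group values by key, the step that sits between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_group(key, values):
    """Reduce: collapse each key's values into a single row, here a sum."""
    return key, sum(values)


def total_by_region(records):
    pairs = [map_record(r) for r in records]
    grouped = shuffle(pairs)
    return dict(reduce_group(k, v) for k, v in grouped.items())


if __name__ == "__main__":
    print(total_by_region(SALES))
```

With Pig or Hive, the one-line query takes the place of all three hand-written functions, which is why a database administrator can be productive without learning to write MapReduce directly.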

It is important to note that while both offerings provide a similar and familiar framework to operate in, they are not identical to SQL and will require some learning. However, they both vastly reduce the time someone with no previous Hadoop experience needs to ramp up on the platform.

At Rackspace, we aim to empower you with the tools and assistance to explore these technologies. We have partnered with Hortonworks to provide an on-demand, fully configured Hadoop offering with no capital commitment (coming soon). In addition to the infrastructure, we will also let you interface with the platform using Pig and Hive. We hope this will allow more companies to leverage these types of technologies with little risk to the business and allow you to utilize the powerful people capital that has made your business a success.

About the Author

This is a post written and contributed by Sean Anderson.

Sean is a tenured infrastructure scaling and cloud strategy consultant with a strong focus on strategic partnerships and innovative hybrid technology. A five-year Racker, Sean is focused on developing deep relationships across the technology landscape. Previously, Sean held several sales director positions at startups and also headed up a team of product specialists for a global retailer. He began his Rackspace journey managing key applications and growth for enterprise customers in the Big Data and Business Analytics sectors. Sean’s focus quickly turned to Big Data Solutions and new platforms like Hadoop, MongoDB and Cassandra. His ever-evolving interest in tackling the data problems of tomorrow has made him a key resource to companies looking to harness the power of innovative data and analytics platforms. Sean is currently the product marketing manager for cloud big data solutions at Rackspace.


©2014 Rackspace, US Inc.