This summer, we brought several interns aboard at our San Francisco Office (SFO). In this blog series, these interns share tales of their times as Rackspace summer interns.
Intro to Blueflood
One of the most unique products in the Cloud Monitoring portfolio is Blueflood, the database backend for the monitoring system at Rackspace. It is designed and built to withstand bombardments of data from many of our Scribe nodes. Blueflood is generally used in a demanding production environment. It scales easily to keep up with ever-increasing workloads, and it is expected to operate smoothly when facing unexpected data center failures.
Although the team that oversees Blueflood is tasked with tons of responsibilities, we are the smallest group within the Cloud Monitoring product team at Rackspace. We recently open sourced Blueflood. Here’s an excellent blog post about it.
The Need for Search
The Blueflood team and I are working on implementing Blueflood’s Metrics Discovery feature. Our current capability is very Cloud Monitoring specific. The existing API allows users to ask for all metrics for a given entityId and checkId. Blueflood’s internal use of Cassandra as its database backend does not lend itself well to search. Our implementation is interesting in that Blueflood stores clients’ locators (strings in the form of accountId.entityId.checkId) and metrics as rows and columns in Cassandra.
We would also like to support metrics discovery via glob matching. So for example, a query of the form acctId.*.chkId.* would return all metrics with the specified accountId and checkId regardless of their entities. This feature is useful in supporting a Graphite-like querying, where we could ask for CPU utilization rates across all data centers and nodes with a simple one-line query, as opposed to asking for that metric for each individual node.
What and Why?
It’s pretty easy to think of ElasticSearch as a group of Apache Lucene workers. Lucene is an open source full text search library. Full text search allows you to do things like search for the number of occurrences of a certain word in an essay. ElasticSearch implements other useful functionalities on top of Lucene, such as a HTTP interface, nodes discovery or workload management among Lucene workers. These extra functionalities allow ElasticSearch to be used as a distributed search server, providing a scalable search solution.
We are using ElasticSearch in our pilot experiment to determine the feasibility of implementing Metrics Discovery feature in Blueflood. For the final product that will eventually be used in production, we are also considering Apache Solr.
As the backend for the monitoring system, we need a service offering high availability. As a SaaS product, we need to make it easy to scale our service. A cluster with a good configuration offers these benefits. The challenge is that configuring an ElasticSearch cluster is not a trivial task.
For scalability, we want to avoid the situation when a search request travels across multiple nodes before hitting the right one. Routing that search request into specific index and shard can help avoid this. For example, we can route all metric data associated with the same user into the same index and shard. This makes sense since the most common use case involves searching for data that all belongs to a single account. We achieve a performance gain during retrieval by specifying the correct index and shard. In our experiment, we have determined that routing requests correctly can speed up the process by a staggering 30-50%.
We have also found that there are issues with not pre-initializing indices before indexing documents. ElasticSearch will automatically create an index for you if that index does not already exist. This can be prohibitively expensive as hundreds of requests can arrive at the cluster simultaneously and cause the cluster to go down. It is much more effective to pre-initialize indices. In our experiment, we have found that it takes about one hour to initialize 5,120 shards (128 indices, 20 shards each, 1 replica, 2 nodes). Moreover, we have found that our nodes were very much CPU-bound during initialization.
There are many deployment-related issues that still need to be considered, such as the number of nodes needed to support our traffic and the cluster’s topology. We also need a chef recipe. We also have not fully investigated the impact of tagging metrics with arbitrary annotation on ElasticSearch’s performance. In theory, arbitrary annotations can create a large number of fields, which could cause performance issues.
The Rackspace San Francisco Internship Program develops technical skills in interns while also supporting integration into the office culture. Want to join the team? Rackspace San Francisco is now accepting résumés for Summer 2014 Internships. Email your résumé to SFjobs@rackspace.com or join us at one of these career events: Oregon State University School of Electrical Engineering & Computer Science Senior Dinner October 23rd; Oregon State University Engineering Career Fair October 24th; UC Berkeley Engineering and Science Career Fair September 18th.