When I started as a software engineer, forever ago in the early 2000s, the world was simpler.
But today, with Big Data all the rage, it's hard to know where to begin to make sense of the possible architectures, let alone the tools. If you attend a tech meetup, you're likely to overhear the strangest sentences in conversation:
- "... yeah we're running Hive on Hadoop and writing to Cassandra ..."
- "... oh have you looked into Mongo? We are having some success with Couch though …"
- "... it's a graph so we're looking at Hama or maybe Neo4J …"
In particular, there is still confusion about what to think of the biggest names thrown around in Big Data -- Hadoop and NoSQL. Do I store my data in Hadoop? Or is it that NoSQL makes my database queries faster? Or can MapReduce speed up my web server? (No, to all three, by the way.)
In two articles, I will try to survey these two quite different domains -- Hadoop and NoSQL -- and offer a few interesting insights about what to make of the most popular open-source tools in each of them. After this, you'll be able to name-drop with the best of the big data geeks at your next cocktail party!
This survey isn't comprehensive: there are even more projects out there, some gaining a great deal of momentum -- not to mention a number of excellent proprietary products. In two years the landscape will no doubt be quite different. We're undergoing a Cambrian explosion of ideas, projects and tools -- we don't yet know what the surviving "species" in this world are likely to be. But among the dominant creatures in 2011, and the topic of this first part, is …
Apache Hadoop is often described as an implementation of MapReduce, which is a distributed computation paradigm popularized and honed inside of Google. Many articles and tutorials have been written about MapReduce itself; I won't repeat them here.
But in reality, Hadoop is more than MapReduce. It sits at the center of a miniature solar system of open-source Apache projects that orbit the core Hadoop sub-projects. A diagram may explain it better than anything:
Hadoop itself is, in fact, at least two sub-projects: MapReduce and HDFS. Hadoop MapReduce manages computation in the MapReduce paradigm: it handles starting computation tasks and overseeing their progress. It does not, by itself, have anything to do with storing the data that is input to or output from the MapReduce job.
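To make this more concrete, here is a minimal sketch of what a Hadoop MapReduce job looks like in Java -- the canonical word count, which tallies word occurrences across a pile of text files. The class names and paths are placeholders, and details vary a bit between Hadoop versions.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // The map phase: emit (word, 1) for every word in every input line.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);
        }
      }
    }
  }

  // The reduce phase: sum the counts emitted for each word.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) {
        sum += value.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");          // Job.getInstance(conf, ...) in newer Hadoop
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an HDFS directory of text files
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

Notice that nothing here says where the input files live; MapReduce hands that problem to a storage layer, which brings us to HDFS.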
This has traditionally been the role of Hadoop HDFS, or the Hadoop Distributed File System. As its name implies, HDFS acts like a file system, but one that is by nature distributed over many machines. Chunks of data are replicated across several computers, for reliability and performance. HDFS, like other file systems, represents files as many chunks of bytes; in HDFS's case, these chunks are huge (64MB or more) compared to those of your computer's file system, since it stores files whose size is measured in terabytes.
HDFS is good for rapid, sequential reads through big files -- which is conveniently exactly how Hadoop MapReduce's workers like to read and write data. HDFS files can't be changed after being written; they are write-once and append-only. HDFS is the default and perhaps best choice for storing data that will be used with MapReduce. It excels at distributed storage of massive unstructured data like log files.
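For a sense of what working with HDFS looks like from code, here's a small sketch using Hadoop's Java FileSystem API; the path is made up, and the cluster address comes from whatever core-site.xml is on the classpath.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    // Picks up the NameNode address and other settings from core-site.xml on the classpath.
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Write a new file; HDFS files are write-once, so this creates a fresh file.
    Path path = new Path("/logs/example.txt");   // hypothetical HDFS path
    FSDataOutputStream out = fs.create(path);
    out.writeBytes("hello, distributed world\n");
    out.close();

    // Read it back as a sequential stream -- the access pattern HDFS is built for.
    BufferedReader reader =
        new BufferedReader(new InputStreamReader(fs.open(path)));
    String line;
    while ((line = reader.readLine()) != null) {
      System.out.println(line);
    }
    reader.close();
  }
}
```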
Apache HBase is also a storage system, with roots in Hadoop, from which it gets its "H". Though HBase uses HDFS for underlying storage, HBase is designed much more for fast and frequent access to blobs of binary data. It is an example of what most would call a NoSQL column-oriented store; it holds semi-structured values for keys. More on this in the next article; storage is a topic unto itself, as are the relative merits of each platform and what you might use them for.
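Just to give a flavor of that access pattern, here's a rough sketch against the classic HBase client API -- the table, column family, and row key are invented, and newer HBase releases have since reorganized this API around Connection and Table classes.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    // Reads the ZooKeeper quorum and other cluster settings from hbase-site.xml.
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "users");                  // hypothetical table

    // Store a small blob of semi-structured data under a row key.
    Put put = new Put(Bytes.toBytes("user42"));
    put.add(Bytes.toBytes("profile"), Bytes.toBytes("email"),  // column family : qualifier
            Bytes.toBytes("someone@example.com"));
    table.put(put);

    // Random-access read of that same row -- the kind of access HBase is designed for.
    Result result = table.get(new Get(Bytes.toBytes("user42")));
    byte[] email = result.getValue(Bytes.toBytes("profile"), Bytes.toBytes("email"));
    System.out.println(Bytes.toString(email));

    table.close();
  }
}
```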
Apache Cassandra is a prominent and popular fixture of the Hadoop landscape. It originated at Facebook, and in turn has its roots in Amazon's Dynamo project. Architecturally, it has more in common with something like HBase than with HDFS. That is, it is not a distributed file system, but rather another NoSQL-style store that specializes in quick access to relatively small pieces of data. In comparison to HBase, Cassandra puts more emphasis on tolerating failures, such as network partitions. Cassandra is also a column-oriented store, and again -- this and more deserves its own discussion, next time.
Sitting in between some of these storage systems are two separate projects with a similar purpose: Apache Avro and Apache Thrift. These are not servers or paradigms but rather serialization systems. They provide an easy way to serialize compact data types to bytes for storage in HDFS (or, perhaps, other NoSQL stores) and processing in Hadoop. Both are analogous to Google's Protocol Buffers. If you are storing or transmitting complex, structured data types in your Big Data system -- not merely primitive types like integers or strings -- you will probably enjoy the convenience of letting a system like Avro or Thrift manage the details of moving those objects around for you, rather than writing your own serialization and deserialization code.
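As an illustration, here is roughly what writing and reading a structured record looks like with Avro's generic Java API; the schema, field names, and file name are invented for the example.

```java
import java.io.File;

import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroExample {
  public static void main(String[] args) throws Exception {
    // A structured record type, declared as an Avro schema (JSON).
    String schemaJson =
        "{\"type\":\"record\",\"name\":\"PageView\",\"fields\":["
      + "{\"name\":\"url\",\"type\":\"string\"},"
      + "{\"name\":\"durationMs\",\"type\":\"long\"}]}";
    Schema schema = new Schema.Parser().parse(schemaJson);

    // Serialize one record into a compact Avro data file.
    GenericRecord view = new GenericData.Record(schema);
    view.put("url", "http://example.com/");
    view.put("durationMs", 1234L);

    File file = new File("pageviews.avro");
    DataFileWriter<GenericRecord> writer =
        new DataFileWriter<GenericRecord>(new GenericDatumWriter<GenericRecord>(schema));
    writer.create(schema, file);
    writer.append(view);
    writer.close();

    // Deserialize: the schema travels with the file, so readers need no extra code.
    DataFileReader<GenericRecord> reader =
        new DataFileReader<GenericRecord>(file, new GenericDatumReader<GenericRecord>());
    for (GenericRecord record : reader) {
      System.out.println(record.get("url") + " " + record.get("durationMs"));
    }
    reader.close();
  }
}
```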
Apache Pig and Hive are two projects that layer on top of Hadoop and provide a higher-level language for using Hadoop's MapReduce library. Apache Pig provides a scripting language for describing operations like reading, filtering, transforming, joining, and writing data -- exactly the operations that MapReduce was originally designed for. Rather than expressing these operations in thousands of lines of Java code that uses MapReduce directly, Pig lets users express them in a language not unlike a Bash or Perl script. Pig is excellent for prototyping and rapidly developing MapReduce-based jobs, as opposed to coding MapReduce jobs in Java itself.
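For flavor, here's a small sketch of a few Pig Latin statements, driven from Java through the PigServer class; the log file, field layout, and output path are made up, and in practice most people simply run such a script with the pig command.

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class PigExample {
  public static void main(String[] args) throws Exception {
    // MAPREDUCE mode runs the statements as Hadoop jobs; LOCAL mode is handy for testing.
    PigServer pig = new PigServer(ExecType.MAPREDUCE);

    // Each statement is a line of Pig Latin: load, filter, group, count.
    pig.registerQuery("logs = LOAD '/logs/access.log' AS (ip:chararray, url:chararray, bytes:long);");
    pig.registerQuery("big = FILTER logs BY bytes > 10000;");
    pig.registerQuery("byUrl = GROUP big BY url;");
    pig.registerQuery("counts = FOREACH byUrl GENERATE group AS url, COUNT(big) AS hits;");

    // Kicks off the underlying MapReduce jobs and writes results back to HDFS.
    pig.store("counts", "/output/big-requests");
  }
}
```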
If Pig is "scripting for Hadoop", then Hive is "SQL queries for Hadoop". Apache Hive offers an even more specific, higher-level language for querying data by running Hadoop jobs, rather than scripting the operation of several MapReduce jobs step by step. The language is, by design, extremely SQL-like. Hive is still intended as a tool for long-running, batch-oriented queries over massive data; it's not "real-time" in any sense. Hive is an excellent tool for analysts and business development types who are accustomed to SQL-like queries and Business Intelligence systems; it will let them easily leverage your shiny new Hadoop cluster to perform ad-hoc queries or generate report data from the storage systems mentioned above.
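As a sketch, a query like the one below can be submitted from Java over Hive's JDBC driver. The table and columns are invented; newer HiveServer2 deployments use the org.apache.hive.jdbc.HiveDriver class and a jdbc:hive2:// URL instead of the classic driver shown here.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveExample {
  public static void main(String[] args) throws Exception {
    // The original HiveServer JDBC driver and URL scheme.
    Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
    Connection conn =
        DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");

    // An ordinary-looking SQL query; under the covers it compiles to MapReduce jobs.
    Statement stmt = conn.createStatement();
    ResultSet rs = stmt.executeQuery(
        "SELECT url, COUNT(*) AS hits FROM access_logs "
      + "WHERE bytes > 10000 GROUP BY url");
    while (rs.next()) {
      System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
    }

    rs.close();
    stmt.close();
    conn.close();
  }
}
```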
Higher-level still are two projects that build specific applications on top of Hadoop's MapReduce infrastructure. Apache Mahout implements machine learning algorithms at large scale, including clustering, classification and collaborative filtering. It provides complete, ready-to-run Hadoop job implementations of these techniques (as well as some implementations that do not use Hadoop). Mahout is also an excellent way to extract some value out of that data you've been hoarding, and that new Hadoop cluster you brought online: businesses use it to sell more intelligently to customers, for example. Apache Chukwa provides a means to collect, store and analyze logs using Hadoop -- enough said!
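Backing up to Mahout for a moment, here's a sketch of its non-distributed recommender ("Taste") API -- one of those implementations that doesn't need Hadoop at all. The ratings file is assumed to hold one userID,itemID,rating triple per line.

```java
import java.io.File;
import java.util.List;

import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class MahoutExample {
  public static void main(String[] args) throws Exception {
    // Each line of the (hypothetical) CSV is: userID,itemID,rating
    DataModel model = new FileDataModel(new File("ratings.csv"));

    // User-based collaborative filtering: find similar users, then recommend what they liked.
    UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
    UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
    Recommender recommender =
        new GenericUserBasedRecommender(model, neighborhood, similarity);

    // Top 3 recommendations for user 42.
    List<RecommendedItem> items = recommender.recommend(42L, 3);
    for (RecommendedItem item : items) {
      System.out.println(item.getItemID() + " : " + item.getValue());
    }
  }
}
```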
Apache Zookeeper is best described as a coordination server. It's useful when processes across many different machines need to coordinate access to a shared resource. In this sense, think of it as the distributed analog of Java's "synchronized" primitives and concurrency libraries. It can also be used to share small amounts of data for configuration purposes. It is frequently used in systems involving Hadoop, and is used by HBase directly. Zookeeper should probably be used by more distributed systems out there; anytime one component needs to signal another to do something, such as update data, Zookeeper can be useful.
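Here's a minimal sketch of that configuration-sharing use case with the ZooKeeper Java client; the ensemble address and znode path are placeholders.

```java
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class ZooKeeperExample {
  public static void main(String[] args) throws Exception {
    // Connect to a ZooKeeper ensemble; the watcher is notified of connection and data events.
    ZooKeeper zk = new ZooKeeper("zkhost:2181", 30000, new Watcher() {
      public void process(WatchedEvent event) {
        System.out.println("ZooKeeper event: " + event);
      }
    });

    // Publish a small piece of shared configuration as a "znode".
    String path = "/config";
    if (zk.exists(path, false) == null) {
      zk.create(path, "feature-on".getBytes(),
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    }

    // Another process can read it, and set a watch (true) to be signaled when it changes.
    byte[] value = zk.getData(path, true, null);
    System.out.println(new String(value));

    zk.close();
  }
}
```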
And finally, Apache Giraph is an early-stage project in Apache's incubator system that, like Hadoop MapReduce, offers a distributed computing paradigm -- but a graph-oriented one. (Apache Hama is a different project which implements the same sort of graph-oriented view of distributed computation.) Imagine a massive network of small nodes, each of which can conceptually run some computation, emit messages to its neighbors in the network, and receive messages from them. Some algorithms are simply much easier to express in a graph-oriented paradigm like this than in MapReduce -- Google's PageRank is an example. Giraph and Hama are more exotic tools, but are especially appropriate and powerful in cases where your data forms a graph of some kind -- a social network, for example. Expressing the processes and algorithms that analyze these networks is likely far more natural using these frameworks.
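The paradigm itself is easy to sketch without either project's API: each vertex repeatedly folds in messages from its neighbors, updates its own value, and sends new messages out, superstep after superstep. This toy, single-machine PageRank over a hard-coded four-page graph only shows the "think like a vertex" style -- it is not Giraph or Hama code.

```java
import java.util.Arrays;

public class VertexStylePageRank {
  public static void main(String[] args) {
    // A tiny web graph: page i links to the pages listed in links[i].
    int[][] links = { {1, 2}, {2}, {0}, {0, 2} };
    int n = links.length;
    double damping = 0.85;

    double[] rank = new double[n];
    Arrays.fill(rank, 1.0 / n);

    // Each superstep, every "vertex" sends rank/outDegree to its neighbors,
    // then folds the messages it received into a new rank. Giraph and Hama run
    // this same loop with vertices spread across a cluster and messages passed
    // over the network.
    for (int superstep = 0; superstep < 30; superstep++) {
      double[] incoming = new double[n];          // messages received this superstep
      for (int v = 0; v < n; v++) {
        double share = rank[v] / links[v].length; // message sent along each out-edge
        for (int neighbor : links[v]) {
          incoming[neighbor] += share;
        }
      }
      for (int v = 0; v < n; v++) {
        rank[v] = (1 - damping) / n + damping * incoming[v];
      }
    }

    for (int v = 0; v < n; v++) {
      System.out.printf("page %d: %.4f%n", v, rank[v]);
    }
  }
}
```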
This concludes our fly-by of the Apache Hadoop ecosystem's orbit. We didn't even visit some of the interesting smaller asteroids, like Apache Vaidya. Hopefully it is clear that only a small slice of Hadoop concerns data storage. In the follow-on article, we'll take a look at the many "NoSQL" stores out there ready to store your Big Data.