The road so far….

October 29, 2012

Let's Crunch big data

Filed under: java | Rahul Sharma @ 2:39 pm

As developers, our focus is on simple, effective solutions, and so one of our most valued principles is “Keep it simple, stupid”. With Hadoop MapReduce, however, it was a bit hard to stick to this. If we evaluate data across multiple MapReduce jobs, we end up with code that is less about the business problem and more about infrastructure. Most non-trivial business data processing involves quite a few MapReduce tasks. This means longer development times and solutions that are harder to test.

Google presented a solution to these issues in their FlumeJava paper, and the ideas in that paper were adapted to implement Apache Crunch. In a nutshell, Crunch is a Java library that simplifies the development of MapReduce pipelines. It provides a set of lazily evaluated collections whose operations are carried out as MapReduce jobs.
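To make "lazily evaluated" a bit more concrete, here is a small plain-Java analogy using java.util.stream. It is not Crunch code, and the class name LazyAnalogy is made up for illustration. The idea it shares with Crunch is only this: describing the operations does no work, and the work happens when a result is finally asked for.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Plain-Java analogy for lazy evaluation; this is NOT the Crunch API.
class LazyAnalogy {
    // Counts how many times the mapping function has actually run.
    static final AtomicInteger calls = new AtomicInteger();

    // Builds a lazy "plan": no element is processed in this method.
    static Stream<String> plan(List<String> lines) {
        return lines.stream().map(line -> {
            calls.incrementAndGet();
            return line.toUpperCase();
        });
    }

    public static void main(String[] args) {
        Stream<String> planned = plan(Arrays.asList("a", "b", "c"));
        System.out.println("calls after planning: " + calls.get());

        // The terminal operation is what triggers the actual work.
        List<String> result = planned.collect(Collectors.toList());
        System.out.println("calls after running: " + calls.get());
        System.out.println("result: " + result);
    }
}
```

In this sketch the counter stays at zero until collect is called. Crunch defers work in a similar spirit, except that what gets triggered is a set of MapReduce jobs rather than an in-memory loop.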

Here is what Brock Noland said in one of his posts introducing Crunch:

Using Crunch, a Java programmer with limited knowledge of Hadoop and MapReduce can utilize the Hadoop cluster. The program is written in pure Java and does not require the use of MapReduce specific constructs such as writing a Mapper, Reducer, or using Writable objects to wrap Java primitives.

Crunch supports reading data from various sources such as sequence files, Avro, text, HBase, and JDBC through a simple read API:

<T> PCollection<T> read(Source<T> source)

You can import data in various formats such as JSON, Avro, and Thrift, and perform efficient join, aggregation, sort, cartesian product, and filter operations. Additionally, any custom operation over these collections is quite easy to cook up: all you have to do is extend the simple, to-the-point DoFn class. You can unit test your DoFn implementations without any MapReduce constructs.
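One way to keep that testing simple (a sketch, not the only approach) is to put the per-record logic in a plain static method and have the DoFn delegate to it. The class WordSplitter and the method splitWords below are hypothetical names chosen for illustration; only the commented-out process method refers to Crunch's DoFn and Emitter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper; WordSplitter is not part of Crunch.
class WordSplitter {
    // Pure per-record logic: split a line into lower-cased words, skipping blanks.
    static List<String> splitWords(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token.toLowerCase());
            }
        }
        return words;
    }

    // In a Crunch job, a DoFn<String, String> subclass could delegate to it:
    //   public void process(String line, Emitter<String> emitter) {
    //     for (String word : WordSplitter.splitWords(line)) {
    //       emitter.emit(word);
    //     }
    //   }

    public static void main(String[] args) {
        // Exercises the logic with no Hadoop cluster and no Mapper or Reducer.
        System.out.println(splitWords("  Lets crunch   BIG data "));
    }
}
```

Because splitWords takes a String and returns a List, it can be checked with ordinary assertions in any unit test framework, and the DoFn itself stays a thin wrapper.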

I am not including a full Crunch example here. Using it is quite simple, and examples can be found on the Apache Crunch site.

Alternatively, you can generate a project from the available crunch-archetype, which also generates a simple WordCount example. The archetype can be selected using:

mvn archetype:generate -Dfilter=crunch-archetype

The project has quite a few examples covering its different aspects, and is also available in Scala.

So now let's CRUNCH some data!



  1. Hi Rahul,
    I found a presentation you gave at Indicthreads that referenced some Avro Crunch examples. Any chance you can post them someplace?

    Comment by Raheem Daya — December 31, 2012 @ 2:04 am

  2. Reblogged this on HadoopEssentials.

    Comment by Nitin Kumar — August 17, 2014 @ 7:19 am
