The road so far….

August 4, 2014

Introducing Hadoop Development Tools

Filed under: java — Tags: , — Rahul Sharma @ 12:54 pm

A few days back, Apache Hadoop Development Tools (HDT) was released. The project aims to provide Eclipse plugins that simplify development on the Hadoop platform. This post gives an overview of a few great features of HDT.

Single Endpoint:

The project can act as a single endpoint for your HDFS, ZooKeeper and MR cluster. You can connect to your HDFS or ZooKeeper instance and browse or add data. You can submit jobs to the MR cluster and see the status of all running jobs. (more…)

October 29, 2012

Let's Crunch Big Data

Filed under: java — Tags: , — Rahul Sharma @ 2:39 pm

As developers, our focus is on simpler, effective solutions, and thus one of our most valued principles is “Keep it simple, stupid”. But with Hadoop MapReduce it was a bit hard to stick to this. If we evaluate data in multiple MapReduce jobs, we end up with code that is related less to the business and more to the infrastructure. Most non-trivial business data processing involves quite a few MapReduce tasks. This means longer lead times and solutions that are harder to test.

Google presented a solution to these issues in their FlumeJava paper. The ideas in that paper were adapted to implement Apache Crunch. In a nutshell, Crunch is a Java library that simplifies the development of MapReduce pipelines. It provides a set of lazily evaluated collections that can be used to perform various operations in the form of MapReduce jobs. (more…)
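To make "lazily evaluated collection" a bit more concrete, here is a minimal plain-Java sketch. It is not the Crunch API: the class `LazyCollection` and its methods `of`, `map`, `filter` and `materialize` are hypothetical names invented for illustration, and everything runs in memory rather than as MapReduce jobs. The only point it shows is that chaining operations just records a plan, and no work happens until the result is requested.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;
import java.util.function.Supplier;

// Hypothetical stand-in for a lazily evaluated collection (NOT the Crunch API).
final class LazyCollection<T> {
    // The deferred computation that produces the elements when asked.
    private final Supplier<List<T>> plan;

    private LazyCollection(Supplier<List<T>> plan) {
        this.plan = plan;
    }

    // Wraps an in-memory list; nothing is computed here.
    static <T> LazyCollection<T> of(List<T> source) {
        return new LazyCollection<>(() -> source);
    }

    // Records a transformation; fn is not called until materialize().
    <R> LazyCollection<R> map(Function<? super T, ? extends R> fn) {
        return new LazyCollection<>(() -> {
            List<R> out = new ArrayList<>();
            for (T item : plan.get()) {
                out.add(fn.apply(item));
            }
            return out;
        });
    }

    // Records a filter step; keep is not called until materialize().
    LazyCollection<T> filter(Predicate<? super T> keep) {
        return new LazyCollection<>(() -> {
            List<T> out = new ArrayList<>();
            for (T item : plan.get()) {
                if (keep.test(item)) {
                    out.add(item);
                }
            }
            return out;
        });
    }

    // Runs the whole recorded plan and returns the resulting elements.
    List<T> materialize() {
        return plan.get();
    }
}
```

A caller could chain `LazyCollection.of(words).map(String::length).filter(n -> n > 3)` and only pay for the computation when `materialize()` is called. In a real pipeline library, deferring execution this way is what gives the planner room to decide how the recorded steps map onto jobs.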

June 4, 2010

Testing Hadoop MapReduce Using FitNesse

Filed under: java — Tags: , — Rahul Sharma @ 9:38 pm

Testing is an important aspect of TDD, and we do it by deriving tests for every possible element. Testing components together with the framework in which they are deployed poses its own set of challenges, and most of them come from the framework side: the environment in which the framework operates, how much control it offers, and so on.

Unit testing the MapReduce classes was the easy part, as they can be tested using mock objects. But testing with FitNesse, where we would like to run the whole system on a set of inputs and then retrieve the outputs, was a whole different ball game. We had the following challenges in mind while trying to integrate Hadoop and MapReduce: (more…)

June 1, 2010

MapReduce and The Cloud

Filed under: java — Tags: , , — Rahul Sharma @ 8:50 pm

The other day I was discussing the yet-to-be-launched spring-vmforce-GAE cloud platform with one of my colleagues, and the discussion led to MapReduce as a cloud platform. There seems to be a perception that Hadoop MapReduce is some sort of competing cloud platform. (more…)