The road so far….

June 4, 2010

Testing Hadoop MapReduce Using FitNesse

Filed under: java — Rahul Sharma @ 9:38 pm

Testing is an important aspect of TDD, and we do it by deriving tests for every possible element. Testing components along with the framework in which they are deployed poses its own set of challenges, and most of them come from the framework side: the environment in which the framework operates, how much control it offers, and so on.

Unit testing the MapReduce classes was the easy part, as they can be tested using mock objects. But testing with FitNesse, where we would like to run the whole system on a bunch of inputs and then retrieve the outputs, was a whole different ball game. We had the following challenges in mind while trying to integrate Hadoop MapReduce with FitNesse:

  1. You run a job on Hadoop from the command line with all kinds of arguments, e.g. hadoop jar addition.jar Sum input.txt output. How can we run it from a Java program?
  2. Hadoop runs on Linux; on Windows it requires the Cygwin environment. The integrated system should work in both environments, as most of the developers had Windows machines.
  3. Hadoop writes its output to a file in the specified directory, so we need to somehow retrieve the output from that directory.

For problem one we looked into the Hadoop executable scripts. What we found was that when we run the command, Hadoop internally tries to find the specified Main-Class and invokes it with the given arguments. So if we call the class directly, it should just run, provided Hadoop does not do anything special before initializing it. The Hadoop launch script reveals that most of it is about finding the right class; not much setup goes on under the hood. So we can safely call the class from our test case and watch it work:

@Test
public void tryAddingNos() throws Exception {
    // Call the job's driver class directly, bypassing the hadoop launch script.
    Addition example = new Addition();
    String inp = "input/sum.txt";
    String out = "target/output";
    example.runtask(inp, out);
}
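
The Addition driver itself is not shown in the post; for context, here is a minimal, hypothetical sketch of what its runtask method might look like with the org.apache.hadoop.mapreduce Job API. The SumMapper and SumReducer names are placeholders, not the post's actual classes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Hypothetical sketch: only runtask(input, output) is known from the test
// above; the job setup and mapper/reducer names are assumptions.
public class Addition {
  public void runtask(String input, String output) throws Exception {
    Job job = new Job(new Configuration(), "sum");
    job.setJarByClass(Addition.class);
    job.setMapperClass(SumMapper.class);   // assumed mapper
    job.setReducerClass(SumReducer.class); // assumed reducer
    job.setOutputKeyClass(IntWritable.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(input));
    FileOutputFormat.setOutputPath(job, new Path(output));
    job.waitForCompletion(true);
  }
}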

Next we had one more problem: the setup would run only on Linux machines, and running it on Windows would give all sorts of command-not-found exceptions. Hadoop executes various Linux commands, and these cannot be found on a Windows system, so it errors out; we hit this with the chmod command. We really cannot test on Windows unless we somehow mock the Linux environment, so we left the Windows issue as it was and integrated on a Linux machine.

java.io.IOException: Cannot run program "chmod": CreateProcess error=2, The system cannot find the file specified
 at java.lang.ProcessBuilder.start(ProcessBuilder.java:459)
 at org.apache.hadoop.util.Shell.runCommand(Shell.java:149)
 at org.apache.hadoop.util.Shell.run(Shell.java:134)
 at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:286)

Next, we must specify the configuration in which the program is going to run, i.e. the job tracker and other properties. For the test case it was overkill to run the tests in a clustered environment, so we made a configuration for a local, single-machine run and initialized our program with it for the test case.

<configuration>
  <property>
    <name>fs.default.name</name>
    <value>file:///</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>local</value>
  </property>
</configuration>
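
For a quick check inside a test, the same two properties can also be set programmatically on a Configuration object before it is passed to the job; a minimal sketch (the property names match the XML above):

// In-code equivalent of the local-run configuration above.
Configuration conf = new Configuration();
conf.set("fs.default.name", "file:///");  // use the local file system, not HDFS
conf.set("mapred.job.tracker", "local");  // run map/reduce tasks in-process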

Now, how can we retrieve the processed output of our MapReduce program? Should we try to read the output directory, or something else? The framework writes to the file using a RecordWriter that it creates on the fly, and it never hands back the instance it created. So we decided to mock the RecordWriter. We created a decorator class for RecordWriter that delegates all operations to a real RecordWriter, but on the write operation, where it is about to write to the file, it also stores the data in a LinkedHashMap.

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Decorator that delegates to a real RecordWriter and also captures every
// written pair in a map, so the test can inspect the job's output.
class FitMockRecordwriter extends RecordWriter<IntWritable, IntWritable> {
  Map<Integer, Integer> foundData;
  RecordWriter<IntWritable, IntWritable> recordWriter;

  public FitMockRecordwriter() {
    foundData = new LinkedHashMap<Integer, Integer>();
  }

  @Override
  public void close(TaskAttemptContext context) throws IOException, InterruptedException {
    this.recordWriter.close(context);
  }

  @Override
  public void write(IntWritable key, IntWritable value) throws IOException, InterruptedException {
    // Write through to the real writer, and remember the pair for later.
    this.recordWriter.write(key, value);
    foundData.put(key.get(), value.get());
  }
}

The framework creates the RecordWriter instance through an OutputFormat, so we were also required to mock the OutputFormat: it returns an instance of our MockRecordWriter when asked and stores the same in a static field. We used a static field because the framework uses this class internally and never gives back a handle to the object it used; keeping the reference static makes the data reachable through the class rather than through an object instance.

import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

// OutputFormat that hands the framework our decorating RecordWriter and
// stashes it in a static field so the test can reach the captured data.
class FitMockOutputFormatter extends TextOutputFormat<IntWritable, IntWritable> {
  static FitMockRecordwriter fitMockRecordwriter;

  @Override
  public RecordWriter<IntWritable, IntWritable> getRecordWriter(TaskAttemptContext context)
      throws IOException, InterruptedException {
    FitMockRecordwriter recordwriter = new FitMockRecordwriter();
    recordwriter.recordWriter = super.getRecordWriter(context);
    fitMockRecordwriter = recordwriter;
    return recordwriter;
  }
}

Now we have all the elements to make a FitNesse test; we just need to write a fixture with these components and run it. We also need to modify our driver so that it takes the OutputFormat class from outside rather than hardcoding TextOutputFormat (a sketch of that change follows). After this we can run FitNesse, make a page on the wiki with the required imports (the Hadoop jar, commons-logging.jar) and the fixture with data, and test it.
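
The post does not show the modified driver, but a hypothetical sketch of the overloaded runtask, assuming the same Job-API setup as before (SumMapper and SumReducer remain placeholder names), might look like this:

// Hypothetical overload: identical to runtask(input, output) except that the
// caller chooses the OutputFormat, so tests can pass FitMockOutputFormatter.
public void runtask(String input, String output,
    Class<? extends OutputFormat> outputFormatClass) throws Exception {
  Job job = new Job(new Configuration(), "sum");
  job.setJarByClass(Addition.class);
  job.setMapperClass(SumMapper.class);   // assumed mapper
  job.setReducerClass(SumReducer.class); // assumed reducer
  job.setOutputKeyClass(IntWritable.class);
  job.setOutputValueClass(IntWritable.class);
  job.setOutputFormatClass(outputFormatClass); // the only change
  FileInputFormat.addInputPath(job, new Path(input));
  FileOutputFormat.setOutputPath(job, new Path(output));
  job.waitForCompletion(true);
}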

// FitNesse fixture: writes the table's test data to a temp file, runs the job
// locally with the mock output format, and returns the captured results.
public class MapReduceFixture extends ColumnFixture {
  public String testText;
  public String[] testData;

  public String[] performAddition() throws Exception {
    Addition addition = new Addition();
    String inp = createFileWithTestData();
    String out = "target/output/" + Calendar.getInstance().getTimeInMillis();
    // Reset the static handle so we never read a previous run's writer.
    FitMockOutputFormatter.fitMockRecordwriter = null;
    addition.runtask(inp, out, FitMockOutputFormatter.class);
    return getDataFromMap();
  }

  private String createFileWithTestData() throws Exception {
    File file = File.createTempFile("hadoop-sum", ".in");
    PrintWriter printWriter = new PrintWriter(file);
    for (String data : testData) {
      printWriter.println(data);
    }
    printWriter.close();
    return file.getAbsolutePath();
  }

  private String[] getDataFromMap() {
    // Convert the values captured by the mock writer into strings for Fit.
    FitMockRecordwriter fitMockRecordwriter = FitMockOutputFormatter.fitMockRecordwriter;
    Set<Integer> keySet = fitMockRecordwriter.foundData.keySet();
    String[] data = new String[keySet.size()];
    int count = 0;
    for (Integer key : keySet) {
      data[count++] = fitMockRecordwriter.foundData.get(key).toString();
    }
    return data;
  }
}
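
For completeness, the wiki page would hold a ColumnFixture table roughly like the sketch below. The classpath entries and cell values are placeholders, not the post's actual page; Fit's TypeAdapter parses a comma-separated cell into a String[], and the expected sums assume a job that adds up the numbers on each input line.

!path lib/hadoop-core.jar
!path lib/commons-logging.jar

!|MapReduceFixture|
|testData|performAddition?|
|1 2 3,4 5 6|6,15|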

3 Comments

  1. Sir, I’m new to Hadoop and confused by the configuration part; I am not getting where I have to place the configuration file you mentioned below:

     fs.default.name = file:///
     mapred.job.tracker = local

    Comment by Raj — June 17, 2014 @ 5:00 pm

    • This configuration needs to go into the core-site.xml/mapred-site.xml files available on the classpath. Since you are new, my recommendation is, rather than trying to get Hadoop up with FitNesse, to try out simpler examples first.

      Comment by Rahul Sharma — June 19, 2014 @ 8:50 am

  2. Reblogged this on HadoopEssentials.

    Comment by Nitin Kumar — August 17, 2014 @ 7:19 am

