Monday, October 5, 2015

Hadoop MapReduce jobs in the native Java framework

Building on the common word count example, this blog post demonstrates how to write an inverted index MapReduce program in the native Hadoop Java framework. Run on a dataset of text files, the inverted index lists, for each word, the names of all files it appears in, in the form:
word1      filename1, filename2, filename3...
word2      filename1, filename2, filename3...
...
Environment:
Amazon EMR, AMI version 2.4.6, with Hadoop 1.0.3
Code written and compiled in Eclipse using JRE 1.7
Based on MapReduce 1.0 (Hadoop 1.x), using the org.apache.hadoop.mapreduce library

The basic idea is: tokenize each text line in the mapper and emit a (word, filename) pair for each word, then aggregate the filenames for each word in the reducer.
In the mapper, the Context and FileSplit classes allow you to fetch the name of the file currently being processed:
FileSplit fs = (FileSplit) context.getInputSplit();
String location = fs.getPath().getName();

Different from the word count example, the mapper output types should be <Text, Text> rather than <Text, IntWritable>. Note that a plain Java "String" won't work: Hadoop keys and values must be Writable types, so use Text.
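As a minimal sketch of such a mapper (class and field names are illustrative, not from the original code):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text word = new Text();
    private final Text filename = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Fetch the name of the file this split belongs to.
        FileSplit fs = (FileSplit) context.getInputSplit();
        filename.set(fs.getPath().getName());

        // Tokenize the line and emit a (word, filename) pair per token.
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, filename);
        }
    }
}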
In the reducer, remember to take care of duplicate removal: a word usually appears many times in the same file, but each filename should be listed only once.
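A minimal reducer sketch along the same lines (names again illustrative). One subtlety: Hadoop reuses the Text object in the values iterable, so copy each value out with toString() before collecting it:

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        // A set keeps each filename only once per word.
        Set<String> files = new HashSet<String>();
        for (Text value : values) {
            files.add(value.toString());
        }

        // Join the distinct filenames into a comma-separated list.
        StringBuilder sb = new StringBuilder();
        for (String file : files) {
            if (sb.length() > 0) {
                sb.append(", ");
            }
            sb.append(file);
        }
        context.write(key, new Text(sb.toString()));
    }
}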
Export the code as an executable jar.

In the meantime, upload the input files into HDFS by executing
hadoop dfs -put <filename> /input/directory
In the code, the input and output paths are specified as the first and second command-line arguments:
FileInputFormat.addInputPath(job, new Path(args[0]));
FileOutputFormat.setOutputPath(job, new Path(args[1]));
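For context, a minimal driver sketch showing where these two lines fit (class names match the illustrative mapper and reducer sketches above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InvertedIndex {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "inverted index");
        job.setJarByClass(InvertedIndex.class);
        job.setMapperClass(InvertedIndexMapper.class);
        job.setReducerClass(InvertedIndexReducer.class);

        // Both the map output and the final output are (Text, Text) pairs.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);

        // Input and output paths come from the command line.
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}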


The generated invertedindex.jar file can then be executed with
hadoop jar invertedindex.jar /input/directory /output/directory
Please note that /output/directory must not exist beforehand; otherwise the job fails immediately, since FileOutputFormat refuses to overwrite an existing output directory.
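If a previous run left the output directory behind, remove it before resubmitting; in the Hadoop 1.x shell the recursive remove is
hadoop dfs -rmr /output/directory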
