(I always felt the previous implementation, http://www.cnblogs.com/i80386/p/3444726.html, had a problem: a combiner only pre-aggregates the outputs of the map tasks running on the same machine.) So here is a reimplementation.
Idea: the first MapReduce job only computes the <word:docid, count> statistics, i.e. how many times a given word occurs in a given document (same principle as wordcount, except that word becomes word:docid). The second MapReduce job splits word:docid apart in its map phase and regroups each record as <word, docid:count>; its combine and reduce phases (combine and reduce are the same function) then concatenate these into the <word, docid1:count1,docid2:count2,...> format.
Worked example. Sample input:

    0.txt  MapReduce is simple
    1.txt  MapReduce is powerfull is simple
    2.txt  Hello MapReduce bye MapReduce

Two jobs are used.

Job 1 (identical to wordcount, except that word is replaced by word:docid):

1. map function: context.write(word:docid, 1), i.e. word:docid is the map output key.

    output key          output value
    MapReduce:0.txt     1
    is:0.txt            1
    simple:0.txt        1
    MapReduce:1.txt     1
    is:1.txt            1
    powerfull:1.txt     1
    is:1.txt            1
    simple:1.txt        1
    Hello:2.txt         1
    MapReduce:2.txt     1
    bye:2.txt           1
    MapReduce:2.txt     1

2. Partitioner: HashPartitioner (details omitted), partitioning on the map output key (word:docid).

3. reduce function: sums the input values.

    output key          output value
    MapReduce:0.txt     1
    is:0.txt            1
    simple:0.txt        1
    MapReduce:1.txt     1
    is:1.txt            2
    powerfull:1.txt     1
    simple:1.txt        1
    Hello:2.txt         1
    MapReduce:2.txt     2
    bye:2.txt           1

Job 2:

1. map function: splits the input key on ":" and re-emits the record as (word, docid:count).

    input key        input value        output key   output value
    MapReduce:0.txt  1              =>  MapReduce    0.txt:1
    is:0.txt         1              =>  is           0.txt:1
    simple:0.txt     1              =>  simple       0.txt:1
    MapReduce:1.txt  1              =>  MapReduce    1.txt:1
    is:1.txt         2              =>  is           1.txt:2
    powerfull:1.txt  1              =>  powerfull    1.txt:1
    simple:1.txt     1              =>  simple       1.txt:1
    Hello:2.txt      1              =>  Hello        2.txt:1
    MapReduce:2.txt  2              =>  MapReduce    2.txt:2
    bye:2.txt        1              =>  bye          2.txt:1

2. reduce function (also used as the combiner): concatenates the values.

    output key   output value
    MapReduce    0.txt:1,1.txt:1,2.txt:2
    is           0.txt:1,1.txt:2
    simple       0.txt:1,1.txt:1
    powerfull    1.txt:1
    Hello        2.txt:1
    bye          2.txt:1
import java.io.IOException;
import java.util.Random;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class MyInvertIndex {

    // Job 1 mapper: emit (word:docid, 1) for every token; docid is the name
    // of the file that the current input split belongs to.
    public static class SplitMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private static final IntWritable ONE = new IntWritable(1);

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String docid = ((FileSplit) context.getInputSplit()).getPath().getName();
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                context.write(new Text(itr.nextToken() + ":" + docid), ONE);
            }
        }
    }
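    // ------------------------------------------------------------------
    // The rest of the class, sketched from the walkthrough above: only the
    // imports and SplitMapper survive in the original listing, so the class
    // names (CombineMapper, CombineReducer), the temp path, and the driver
    // details below are assumptions, reconstructed to match the run result
    // printed at the end of the post.
    // ------------------------------------------------------------------

    // Job 2 mapper: split the incoming "word:docid" key and re-emit the
    // record as (word, docid:count).
    public static class CombineMapper
            extends Mapper<Text, IntWritable, Text, Text> {

        @Override
        public void map(Text key, IntWritable value, Context context)
                throws IOException, InterruptedException {
            int pos = key.toString().indexOf(':');
            String word = key.toString().substring(0, pos);
            String docid = key.toString().substring(pos + 1);
            context.write(new Text(word), new Text(docid + ":" + value.get()));
        }
    }

    // Job 2 reducer, registered as the combiner as well: concatenate every
    // docid:count value for a word, appending a comma after each one.
    // Because the same function runs twice (combine pass, then reduce pass),
    // each pass adds its own trailing comma, which is presumably where the
    // ",," in the run result below comes from.
    public static class CombineReducer
            extends Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder();
            for (Text v : values) {
                sb.append(v.toString()).append(",");
            }
            context.write(key, new Text(sb.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Intermediate output goes to a random temp dir (hence java.util.Random),
        // deleted once job 2 finishes.
        Path tmp = new Path("tmp" + new Random().nextInt(Integer.MAX_VALUE));

        // Job 1: wordcount over word:docid keys; IntSumReducer serves as both
        // combiner and reducer, HashPartitioner partitions on the full key.
        Job job1 = Job.getInstance(conf, "MyInvertIndex-pass1");
        job1.setJarByClass(MyInvertIndex.class);
        job1.setMapperClass(SplitMapper.class);
        job1.setCombinerClass(IntSumReducer.class);
        job1.setReducerClass(IntSumReducer.class);
        job1.setPartitionerClass(HashPartitioner.class);
        job1.setOutputKeyClass(Text.class);
        job1.setOutputValueClass(IntWritable.class);
        job1.setInputFormatClass(TextInputFormat.class);
        job1.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileInputFormat.addInputPath(job1, new Path(args[0]));
        FileOutputFormat.setOutputPath(job1, tmp);
        if (!job1.waitForCompletion(true)) System.exit(1);

        // Job 2: regroup word:docid counts into word -> docid:count lists.
        Job job2 = Job.getInstance(conf, "MyInvertIndex-pass2");
        job2.setJarByClass(MyInvertIndex.class);
        job2.setMapperClass(CombineMapper.class);
        job2.setCombinerClass(CombineReducer.class);
        job2.setReducerClass(CombineReducer.class);
        job2.setOutputKeyClass(Text.class);
        job2.setOutputValueClass(Text.class);
        job2.setInputFormatClass(SequenceFileInputFormat.class);
        job2.setOutputFormatClass(TextOutputFormat.class);
        FileInputFormat.addInputPath(job2, tmp);
        FileOutputFormat.setOutputPath(job2, new Path(args[1]));
        boolean ok = job2.waitForCompletion(true);
        FileSystem.get(conf).delete(tmp, true); // drop the intermediate dir
        System.exit(ok ? 0 : 1);
    }
}

Run with two arguments, the input directory and the final output directory; the intermediate word:docid counts are passed between the jobs as a sequence file.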
Run result:

    Hello       2.txt:1,,
    MapReduce   2.txt:2,1.txt:1,0.txt:1,,
    bye         2.txt:1,,
    is          1.txt:2,0.txt:1,,
    powerfull   1.txt:1,,
    simple      1.txt:1,0.txt:1,,
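The doubled trailing commas appear because the same concatenating function runs as both the combiner and the reducer, and each pass appends a comma after every value it sees. If cleaner output is wanted, one option (my sketch, not from the original post) is to join with a separator only between values, which stays correct whether the function runs once or twice:

    // Hypothetical drop-in replacement for CombineReducer that puts commas
    // only between values, so neither pass leaves trailing commas behind.
    public static class CleanCombineReducer
            extends Reducer<Text, Text, Text, Text> {

        @Override
        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            StringBuilder sb = new StringBuilder();
            for (Text v : values) {
                if (sb.length() > 0) sb.append(",");
                sb.append(v.toString());
            }
            context.write(key, new Text(sb.toString()));
        }
    }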