java - Hadoop WordCount combiner

https://hadoop.apache.org/docs/stable/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html#Source_Code

In the word count example, the same reduce function is used as both the combiner and the reducer.

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            // Add up every count that arrived for this key
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            // Emit <key, total>
            context.write(key, new IntWritable(sum));
        }
    }

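I understand how the reducer works, but for the combiner, suppose my input is

  <Java,1> <Virtual,1> <Machine,1> <Java,1>

Does it take the first kv pair and emit the same output, since there is only one value? How can it consider both occurrences of the key together and produce

  <Java,1,1>

if we only look at one kv pair at a time? I know this assumption is wrong; could someone please correct me?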

Best answer

The IntSumReducer class extends the Reducer class, and the Reducer documentation describes what happens here:

"Reduces a set of intermediate values which share a key to a smaller set of values. Reducer implementations can access the Configuration for the job via the JobContext.getConfiguration() method.

Reducer has 3 primary phases:

Shuffle: The Reducer copies the sorted output from each Mapper using HTTP across the network.

Sort: The framework merge sorts Reducer inputs by keys (since different Mappers may have output the same key).

The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged."
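The sort step quoted above is what resolves the question: reduce is never called once per kv pair. The framework first sorts and groups the pairs by key, then calls reduce once per key with all of that key's values, and the same grouping happens before the combiner runs on map output. The following is a minimal sketch of that idea in plain Java with no Hadoop dependency. The class CombinerSketch and the helper groupByKey are hypothetical stand-ins for the framework's behavior, not Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerSketch {

    // Hypothetical stand-in for the framework's sort/group step: collects every
    // value emitted for the same key into one list, with keys in sorted order.
    public static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Same logic as IntSumReducer.reduce: called once per key, sums its values.
    public static int reduce(String key, Iterable<Integer> values) {
        int sum = 0;
        for (int val : values) {
            sum += val;
        }
        return sum;
    }

    public static void main(String[] args) {
        // The map output from the question
        List<Map.Entry<String, Integer>> mapOutput = List.of(
                Map.entry("Java", 1), Map.entry("Virtual", 1),
                Map.entry("Machine", 1), Map.entry("Java", 1));

        Map<String, List<Integer>> grouped = groupByKey(mapOutput);
        System.out.println(grouped); // {Java=[1, 1], Machine=[1], Virtual=[1]}

        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println("<" + e.getKey() + "," + reduce(e.getKey(), e.getValue()) + ">");
        }
    }
}
```

Here the key Java reaches reduce with the value list [1, 1], which is the <Java,1,1> from the question, and the call emits <Java,2>.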

The program uses the same class for both the combine and the reduce steps:

job.setMapperClass(TokenizerMapper.class);
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);

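So I found that if we use only one data node, we do not necessarily need to set a combiner for this wordcount program, because the reducer class itself does the combiner's work.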

job.setMapperClass(TokenizerMapper.class);
job.setReducerClass(IntSumReducer.class);

With only one data node, this configuration has the same effect for the wordcount program.
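To see why the counts come out the same with or without the combiner, here is a rough simulation in plain Java, again with no Hadoop dependency. The class CombinerEquivalence, its methods, and the two sample input splits are all hypothetical; the map method only imitates what a tokenizing mapper such as TokenizerMapper is assumed to do (emit <word,1> per token).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerEquivalence {

    // Sums values per key; plays the role of IntSumReducer in both positions.
    public static Map<String, Integer> sumByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return sums;
    }

    // Emits <word,1> for each whitespace-separated token (assumed mapper behavior).
    public static List<Map.Entry<String, Integer>> map(String split) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : split.trim().split("\\s+")) {
            out.add(Map.entry(word, 1));
        }
        return out;
    }

    // Runs the simulated job and returns the final counts.
    // shuffled[0] receives the number of records handed to the reducer.
    public static Map<String, Integer> run(List<String> splits, boolean useCombiner, int[] shuffled) {
        List<Map.Entry<String, Integer>> reducerInput = new ArrayList<>();
        for (String split : splits) {
            List<Map.Entry<String, Integer>> mapOutput = map(split);
            if (useCombiner) {
                // Pre-aggregate the output of this one map task only
                reducerInput.addAll(sumByKey(mapOutput).entrySet());
            } else {
                reducerInput.addAll(mapOutput);
            }
        }
        shuffled[0] = reducerInput.size();
        return sumByKey(reducerInput);
    }

    public static void main(String[] args) {
        List<String> splits = List.of("Java Virtual Machine Java", "Java Machine");
        int[] without = new int[1];
        int[] with = new int[1];
        System.out.println(run(splits, false, without)); // {Java=3, Machine=2, Virtual=1}
        System.out.println(run(splits, true, with));     // {Java=3, Machine=2, Virtual=1}
        System.out.println(without[0] + " vs " + with[0] + " records shuffled"); // 6 vs 5 records shuffled
    }
}
```

In this sketch the final counts are identical in both runs, and the combiner only changes how many records reach the reducer. That works because summing is associative and commutative. In Hadoop itself the combiner is an optional optimization that the framework may apply zero, one, or several times to map output, so a job's result should not depend on it.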

This question, java - Hadoop WordCount combiner, comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/40036518/
