Hadoop job with a single mapper and two different reducers

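I have a large document corpus as the input to a MapReduce job (old Hadoop API). In the mapper, I can produce two kinds of output: one counting words and one producing minHash signatures. What I need to do is send each kind of output to its own reducer.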

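The input is the same document corpus in both cases, so there is no need to process it twice. I don't think MultipleOutputs is the solution, because I can't find a way to feed the Mapper output to two different Reduce classes.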

In short, what I need is:

               WordCounting Reducer   --> WordCount output
             /
Input --> Mapper
             \
              MinHash Buckets Reducer --> MinHash output

Is there any way to do this with the same Mapper (in the same job), or should I split it into two jobs?

Best answer

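You can do it, but it involves some coding tricks (a Partitioner and a prefix convention). The idea is to have the mapper emit words prefixed with "W:" and minhashes prefixed with "M:". Then use a Partitioner to decide which partition (that is, which reducer) each key goes to.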

Pseudocode, main method:

Set number of reducers to 2 (job.setNumReduceTasks(2)) and register the custom Partitioner

Mapper:

... parse the word ...
... generate minhash ...
// with the old API, use output.collect(key, value) instead of context.write
context.write("W:" + word, 1);
context.write("M:" + minhash, 1);

Partitioner:

IF Key starts with "W:" { return 0; } // reducer 1
IF Key starts with "M:" { return 1; } // reducer 2
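
The partitioning rule above can be sketched in plain Java. This is only an illustration of the decision logic, not a drop-in Hadoop class: in a real job the same body would sit inside the getPartition method of a Partitioner implementation, and the key would be a Text rather than a String. The class and method names here (PrefixPartitionSketch, partitionFor) are hypothetical, and rejecting unprefixed keys is an assumption the answer does not state.

```java
// Plain-Java stand-in for the body of a Hadoop Partitioner's getPartition method.
// Assumes the job runs with exactly 2 reducers, as in the pseudocode above.
final class PrefixPartitionSketch {
    static final String WORD_PREFIX = "W:";
    static final String MINHASH_PREFIX = "M:";

    /** Returns 0 for word-count keys and 1 for minhash keys. */
    static int partitionFor(String key) {
        if (key.startsWith(WORD_PREFIX)) {
            return 0; // first reducer: word counts
        }
        if (key.startsWith(MINHASH_PREFIX)) {
            return 1; // second reducer: minhash
        }
        // Assumption: an unprefixed key is a bug in the mapper, so fail loudly.
        throw new IllegalArgumentException("Key has no known prefix: " + key);
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("W:hadoop"));
        System.out.println(partitionFor("M:42"));
    }
}
```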

Combiner:

IF Key starts with "W:" { iterate over values and sum; context.write(Key, SUM); return;}
Otherwise, iterate and context.write all of the values unchanged

Reducer:

IF Key starts with "W:" { iterate over values and sum; context.write(Key, SUM); return;} 
IF Key starts with "M:" { perform min hash logic }
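
To make the prefix dispatch in the reducer concrete, here is a minimal plain-Java sketch of a single reduce call. It is not Hadoop code: the method returns the line that would be written instead of calling context.write, and the names (PrefixReduceSketch, reduce) are hypothetical. The answer does not say what the minhash logic is, so the "M:" branch simply reports how many values arrived for the key, purely as a placeholder. Stripping the prefix before writing is also an assumption; the pseudocode writes the key as is.

```java
import java.util.Arrays;
import java.util.List;

// Plain-Java stand-in for one reduce(key, values) call in the single Reducer class.
final class PrefixReduceSketch {

    /** Returns the tab-separated output line for this key. */
    static String reduce(String key, List<Integer> values) {
        if (key.startsWith("W:")) {
            // Word-count branch: sum the partial counts.
            int sum = 0;
            for (int v : values) {
                sum += v;
            }
            return key.substring(2) + "\t" + sum;
        }
        if (key.startsWith("M:")) {
            // Placeholder only: the real minhash logic is not given in the answer.
            return key.substring(2) + "\t" + values.size();
        }
        throw new IllegalArgumentException("Key has no known prefix: " + key);
    }

    public static void main(String[] args) {
        System.out.println(reduce("W:cat", Arrays.asList(1, 1, 1)));
        System.out.println(reduce("M:abc", Arrays.asList(1, 1)));
    }
}
```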

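In the output, part-00000 will hold your word counts and part-00001 your minhash results (with the new API the files are named part-r-00000 and part-r-00001).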

Unfortunately, there is no way to supply different Reducer classes, but you can emulate that with IF and the prefix.

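From a performance standpoint, having only 2 reducers may be inefficient. With more reducers, you can have the Partitioner assign the first N partitions to the word count and the remaining ones to the minhash keys.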
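
A sketch of that range-based variant, again in plain Java rather than as a real Hadoop Partitioner. The split into "first N partitions for words, the rest for minhash" follows the suggestion above; spreading keys inside each range by hash code is an assumption, modeled on how Hadoop's default HashPartitioner masks the hash to keep it non-negative. The names RangePartitionSketch, partitionFor and wordPartitions are hypothetical.

```java
// Plain-Java stand-in for a Partitioner that reserves partitions [0, wordPartitions)
// for "W:" keys and [wordPartitions, numPartitions) for "M:" keys.
final class RangePartitionSketch {

    static int partitionFor(String key, int numPartitions, int wordPartitions) {
        if (wordPartitions < 1 || wordPartitions >= numPartitions) {
            throw new IllegalArgumentException("Need at least one partition for each key type");
        }
        // Mask the sign bit so the modulo below never yields a negative index.
        int hash = key.hashCode() & Integer.MAX_VALUE;
        if (key.startsWith("W:")) {
            return hash % wordPartitions;
        }
        if (key.startsWith("M:")) {
            int minhashPartitions = numPartitions - wordPartitions;
            return wordPartitions + hash % minhashPartitions;
        }
        throw new IllegalArgumentException("Key has no known prefix: " + key);
    }

    public static void main(String[] args) {
        // 5 reducers in total, the first 3 reserved for word counts.
        System.out.println(partitionFor("W:hadoop", 5, 3));
        System.out.println(partitionFor("M:42", 5, 3));
    }
}
```

With this layout, the first N output files contain word counts and the remaining files contain minhash results, so the two result sets can still be told apart by file index.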

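If you don't like the prefix idea, you would need to implement a secondary sort with a custom WritableComparable class for the key. That is worth the effort only in more complicated cases.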
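
As a rough idea of what such a key could look like, the sketch below replaces the string prefix with an explicit kind field. It is written against the Java standard library only, so it compiles without Hadoop: the write and readFields methods have the same signatures as Hadoop's Writable interface, and in a real job the class would be declared as implementing WritableComparable<TaggedKey>. The class name TaggedKey, the WORD/MINHASH constants and the field layout are all hypothetical.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

// Composite key: a kind tag plus the actual value, sorted by kind first, then value.
final class TaggedKey implements Comparable<TaggedKey> {
    static final byte WORD = 0;
    static final byte MINHASH = 1;

    private byte kind;
    private String value = "";

    TaggedKey() { }  // Hadoop creates key objects reflectively, so keep a no-arg constructor

    TaggedKey(byte kind, String value) {
        this.kind = kind;
        this.value = value;
    }

    byte kind() { return kind; }
    String value() { return value; }

    // Same signature as Writable.write: serialize the fields in a fixed order.
    public void write(DataOutput out) throws IOException {
        out.writeByte(kind);
        out.writeUTF(value);
    }

    // Same signature as Writable.readFields: read the fields back in the same order.
    public void readFields(DataInput in) throws IOException {
        kind = in.readByte();
        value = in.readUTF();
    }

    @Override
    public int compareTo(TaggedKey other) {
        int byKind = Byte.compare(kind, other.kind);
        return byKind != 0 ? byKind : value.compareTo(other.value);
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof TaggedKey)) return false;
        TaggedKey other = (TaggedKey) o;
        return kind == other.kind && value.equals(other.value);
    }

    @Override
    public int hashCode() {
        return 31 * kind + value.hashCode();
    }
}
```

The Partitioner and Reducer would then branch on kind() instead of on a string prefix.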

For more on a Hadoop job with a single mapper and two different reducers, see the similar question on Stack Overflow: https://stackoverflow.com/questions/24363116/