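java - Selecting distinct records in Hadoop and using a combiner

The book "MapReduce Design Patterns" includes a pattern for finding the distinct records in a data set. This is the algorithm: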

map(key, record):
    emit record, null

reduce(key, records):
    emit key

Page 66 says:

The Combiner can always be utilized in this pattern and can help if there are a large number of duplicates.

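The map phase emits the record and a NullWritable (which writes nothing on the wire). What is the Combiner trying to reduce? There are no values to reduce.

Best answer

It is trying to reduce the duplicates in the map output.

Suppose you have text data with one word per line: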

John
Adam
John
John

If you can combine them after the map phase and send only:

John
Adam

then the output of each mapper is already distinct. If your split contains a fair number of non-distinct records, this saves bandwidth.
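
To make the saving concrete, here is a minimal sketch in plain Java that simulates the pattern without any Hadoop classes. The class name `DistinctCombinerSketch`, its method names, and the second input split (including the name "Eve") are assumptions made up for illustration; only the first split comes from the example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java simulation of the distinct pattern (hypothetical names, no Hadoop API).
class DistinctCombinerSketch {

    // Map step: every record becomes a key; the null value carries no data.
    static List<String> map(List<String> split) {
        return new ArrayList<>(split);
    }

    // Combine step: runs on a single mapper's output and keeps each key once.
    static List<String> combine(List<String> mapOutput) {
        return new ArrayList<>(new LinkedHashSet<>(mapOutput));
    }

    // Reduce step: each distinct key is emitted once across all mappers.
    static Set<String> reduce(List<String> shuffled) {
        return new TreeSet<>(shuffled);
    }

    // Collects the records that would be sent from the mappers to the reducers.
    static List<String> shuffle(List<List<String>> splits, boolean useCombiner) {
        List<String> shuffled = new ArrayList<>();
        for (List<String> split : splits) {
            List<String> mapOutput = map(split);
            shuffled.addAll(useCombiner ? combine(mapOutput) : mapOutput);
        }
        return shuffled;
    }

    public static void main(String[] args) {
        List<List<String>> splits = Arrays.asList(
            Arrays.asList("John", "Adam", "John", "John"), // split from the example
            Arrays.asList("Adam", "Eve", "Adam"));         // hypothetical second split

        System.out.println(shuffle(splits, false).size());  // prints 7
        System.out.println(shuffle(splits, true).size());   // prints 4
        System.out.println(reduce(shuffle(splits, true)));  // prints [Adam, Eve, John]
    }
}
```

The final result is the same either way; only the number of records crossing the network changes. In a real Hadoop job, the reducer class can usually double as the combiner (registered with `Job.setCombinerClass`), because reduce only emits the key. Hadoop may run the combiner zero, one, or several times, so correctness should not depend on it.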

Regarding java - Selecting distinct records in Hadoop and using a combiner, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/24785893/
