java - Map Reduce flow in Hadoop

Tags: java hadoop mapreduce pseudocode

I am learning Hadoop with the book Hadoop in Practice, and while reading Chapter 1 I came across this diagram:

[Figure: diagram from Chapter 1 of Hadoop in Practice; image not available]

From the Hadoop documentation ( http://hadoop.apache.org/docs/current2/api/org/apache/hadoop/mapred/Reducer.html ):

1. Shuffle

Reducer is input the grouped output of a Mapper. In the phase the framework, for each Reducer, fetches the relevant partition of the output of all the Mappers, via HTTP.

2. Sort

The framework groups Reducer inputs by keys (since different Mappers may have output the same key) in this stage. The shuffle and sort phases occur simultaneously i.e. while outputs are being fetched they are merged.

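I understand that shuffle and sort happen simultaneously, but it is not clear to me how the framework decides which reducer receives which mapper output. From the documentation, it seems that each reducer has a way of knowing which map outputs to collect, but I don't understand how.

So my question is: given the mapper outputs above, is the final result for each reducer always the same? If so, what are the steps that lead to that result?

Thanks for any clarification!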

Best answer

It is the Partitioner that decides how the mappers' output is distributed among the different reducers.

Partitioner controls the partitioning of the keys of the intermediate map-outputs. The key (or a subset of the key) is used to derive the partition, typically by a hash function. The total number of partitions is the same as the number of reduce tasks for the job. Hence this controls which of the m reduce tasks the intermediate key (and hence the record) is sent for reduction.
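
To make the quoted description concrete, here is a minimal sketch in plain Java with no Hadoop dependency. The class `PartitionSketch`, its methods, and the sample keys are hypothetical and exist only for illustration. The arithmetic in `partitionFor` mirrors what Hadoop's default `HashPartitioner` does (mask off the sign bit of the key's hash code, then take the remainder by the number of reduce tasks); the `shuffle` method is only a rough, in-memory stand-in for the framework's fetch, merge, and group-by-key steps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: not part of the Hadoop API.
class PartitionSketch {

    // Derive the partition (reducer index) from the key alone.
    // Masking with Integer.MAX_VALUE keeps the result non-negative.
    static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Route each (key, value) pair emitted by any mapper to its reducer,
    // grouping values by key. TreeMap keeps keys sorted, which stands in
    // for the sort phase.
    static List<TreeMap<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> mapOutputs, int numReduceTasks) {
        List<TreeMap<String, List<Integer>>> reducers = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            reducers.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> pair : mapOutputs) {
            int partition = partitionFor(pair.getKey(), numReduceTasks);
            reducers.get(partition)
                    .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                    .add(pair.getValue());
        }
        return reducers;
    }

    public static void main(String[] args) {
        // Hypothetical word-count style map output from several mappers.
        List<Map.Entry<String, Integer>> mapOutputs = List.of(
                Map.entry("cat", 1), Map.entry("bird", 1), Map.entry("cat", 1));
        List<TreeMap<String, List<Integer>>> reducers = shuffle(mapOutputs, 2);
        for (int i = 0; i < reducers.size(); i++) {
            System.out.println("reducer " + i + " <- " + reducers.get(i));
        }
    }
}
```

With two reducers, both `cat` records land on reducer 0 and the `bird` record on reducer 1. Because the partition depends only on the key and the number of reduce tasks, the same key goes to the same reducer on every run, as long as the key's hash code is stable and the reducer count does not change. The order of values within one key is a separate matter, and this sketch makes no claim about it.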

Source: a similar question, "java - Map Reduce flow in Hadoop", was found on Stack Overflow: https://stackoverflow.com/questions/20916258/
