hadoop - Authoritative source on when Hadoop MapReduce runs the combiner

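There are already many questions like this one, with contradictory answers. I have also found contradictory statements in the literature and on blogs. The book Hadoop: The Definitive Guide says: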

Hadoop does not provide a guarantee of how many times it will call [the combiner] for a particular map output record, if at all. In other words, calling the combiner function zero, one or many times should produce the same output from the reducer
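
A minimal sketch of what this contract implies, assuming a simple sum-style job (the class and method names below are hypothetical and are not Hadoop APIs): because summation is associative and commutative, the reducer sees the same total whether the combiner runs zero, one, or two times.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class CombinerContract {
    // Hypothetical sum combiner: collapses a batch of partial counts into one value.
    public static List<Integer> combine(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        List<Integer> out = new ArrayList<>();
        out.add(sum);
        return out;
    }

    // Hypothetical reducer: sums whatever values reach it for a key.
    public static int reduce(List<Integer> values) {
        int sum = 0;
        for (int v : values) {
            sum += v;
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> mapOutput = Arrays.asList(1, 1, 1, 1);
        int zeroTimes = reduce(mapOutput);                   // combiner never called
        int once = reduce(combine(mapOutput));               // combiner called once
        int twice = reduce(combine(combine(mapOutput)));     // combiner called twice
        System.out.println(zeroTimes + " " + once + " " + twice); // prints "4 4 4"
    }
}
```

An operation such as a plain average would not satisfy this property, which is why it cannot be used directly as a combiner.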

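An answer to a similar question here, On what basis mapreduce framework decides whether to launch a combiner or not, suggests that the combiner (if defined) will always be called once, because the MapOutputBuffer needs to be flushed.

There may be an edge case where the mapper emits only once, which would mean the combiner does not run even though it is defined.

My question is: is there a definitive source that answers this? I have of course searched the Hadoop documentation, but could not find anything.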

Best answer

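The Hadoop framework aims to give users and developers a simple interface for writing code that runs in a distributed environment, without having to think about or handle the complexities of distributed systems.

To answer your question, you can read the source code, which contains the logic that invokes the combiner conditionally.

Lines 1950-1955 of https://github.com/apache/hadoop/blob/0b8a7c18ddbe73b356b3c9baf4460659ccaee095/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/MapTask.java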

 if (combinerRunner == null || numSpills < minSpillsForCombine) {
     Merger.writeFile(kvIter, writer, reporter, job);
 } else {
     combineCollector.setWriter(writer);
     combinerRunner.combine(kvIter, combineCollector);
 }

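In this check, which sits in the step of MapTask that merges the spill files, the combiner will not run if:

it is not defined, or
the number of spills is less than minSpillsForCombine. minSpillsForCombine is driven by the property "mapreduce.map.combine.minspills", whose default value is 3.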
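
The quoted condition can be restated as a small standalone sketch. This is not Hadoop code: the class, the method names, and the plain Map used as a stand-in for the job configuration are all assumptions made for illustration. Only the property name and its default of 3 come from the answer above.

```java
import java.util.HashMap;
import java.util.Map;

public class MergeCombinerCheck {
    // Property name and default value as stated in the answer.
    static final String MIN_SPILLS_KEY = "mapreduce.map.combine.minspills";
    static final int DEFAULT_MIN_SPILLS = 3;

    // Hypothetical stand-in for reading the property from a job configuration.
    public static int minSpillsForCombine(Map<String, String> conf) {
        String value = conf.get(MIN_SPILLS_KEY);
        return value == null ? DEFAULT_MIN_SPILLS : Integer.parseInt(value);
    }

    // Negation of the quoted condition:
    // combinerRunner == null || numSpills < minSpillsForCombine
    public static boolean combinerRuns(boolean combinerDefined, int numSpills,
                                       Map<String, String> conf) {
        return combinerDefined && numSpills >= minSpillsForCombine(conf);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(combinerRuns(true, 2, conf));  // false: 2 < default 3
        System.out.println(combinerRuns(true, 3, conf));  // true: 3 >= 3
        conf.put(MIN_SPILLS_KEY, "1");                    // lower the threshold
        System.out.println(combinerRuns(true, 1, conf));  // true: 1 >= 1
        System.out.println(combinerRuns(false, 5, conf)); // false: no combiner defined
    }
}
```

With the threshold lowered to 1, a single spill is enough for this particular check to pass, which illustrates why the observed behavior depends on configuration.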

Since most Hadoop properties are configurable, both behavior and performance depend on how you configure them.

Hope this answers your question.

Regarding "hadoop - Authoritative source on when Hadoop MapReduce runs the combiner", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43393720/
