hadoop - How to output directly from the mapper to HDFS?

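Under certain conditions, we want the mapper to do all the work and write its output to HDFS, without passing the data on to the reducer (which would use extra bandwidth; please correct me if I am wrong).

In pseudocode: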

def mapper(k, v_list):
  for v in v_list:
    if criteria(v):
      write_to_hdfs(k, v)  # finished in the mapper, no reducer needed
    else:
      emit(k, v)           # passed on to the reducer

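I find this hard because the only thing we have to work with is the OutputCollector. One idea I had is to extend OutputCollector, override OutputCollector.collect, and do the work there. Is there a better way?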

Best answer

You can set the number of reduce tasks to 0 with JobConf.setNumReduceTasks(0). The mappers' output then goes directly to HDFS.

From the MapReduce tutorial: http://hadoop.apache.org/docs/current/hadoop-mapreduce-client/hadoop-mapreduce-client-core/MapReduceTutorial.html

Reducer NONE
It is legal to set the number of reduce-tasks to zero if no reduction is desired.

In this case the outputs of the map-tasks go directly to the FileSystem, 
into the output path set by setOutputPath(Path). The framework does not sort 
the map-outputs before writing them out to the FileSystem.
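
For jobs run through Hadoop Streaming, the same map-only idea can be sketched in Python. This is only an illustration, not the answerer's code: the criteria predicate, the run helper, and the tab-separated record layout are all hypothetical. It assumes the job is launched with zero reduce tasks (for example with -D mapreduce.job.reduces=0), in which case the framework writes whatever the mapper prints to the job's output path.

```python
#!/usr/bin/env python3
"""Map-only Hadoop Streaming mapper (illustrative sketch)."""
import io
import sys


def criteria(value):
    # Hypothetical predicate: keep only records with a non-empty value.
    return bool(value.strip())


def map_lines(lines):
    """Yield one output line per tab-separated record that passes criteria."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if criteria(value):
            yield f"{key}\t{value}"


def run(stream_in, stream_out):
    # With zero reduce tasks, each line written here is stored in the job's
    # output directory as-is: no shuffle and no sort.
    for out in map_lines(stream_in):
        stream_out.write(out + "\n")


if __name__ == "__main__":
    # Demo on in-memory input; a real streaming mapper would call
    # run(sys.stdin, sys.stdout) instead.
    run(io.StringIO("a\t1\nb\t\nc\t3\n"), sys.stdout)
```

Because no reducer runs, the output files contain the mapper's lines in the order each map task produced them.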

A similar question to "hadoop - How to output directly from the mapper to HDFS?" can be found on Stack Overflow: https://stackoverflow.com/questions/10289797/
