hadoop - Writing to a single file from multiple reducers in Hadoop

Tags: hadoop file-io mapreduce hadoop2

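I am trying to run K-means using Hadoop. I want to save the cluster centroids computed in the Reducer's cleanup method to a file, say centroids.txt. What happens if the cleanup methods of several reducers start at the same time and all of them try to write to that file simultaneously? Is this handled internally? If not, is there a way to synchronize the task?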

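Note that this is not my reducer output file. It is an additional file that I maintain to keep track of the centroids. I am writing to it with a BufferedWriter in the reducer's cleanup method.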

Best answer

Yes, you are right: you cannot achieve that with the existing framework. Cleanup is called many times (once in each reducer task), and you cannot synchronize those calls. Possible approaches are:

1 Call merge after the job completes successfully.

hadoop fs -getmerge <src> <localdst> [addnl]


2 Clearly specify where your output file(s) should go, and use that folder as the input to your next job.

3 Chain one more MR job in which map and reduce don't change the data and the partitioner assigns all data to a single reducer.
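
As a rough illustration of the idea behind the first approach (every reducer writes its own file, and the pieces are merged only after the job is done), here is a minimal sketch in plain Java. It runs against the local filesystem and does not use the Hadoop API. The class name `CentroidMerge`, the `centroids-<id>.txt` naming scheme, and the sample centroid lines are all assumptions made for the example. In a real job the partial files would live on HDFS and the merge step would be something like the getmerge command above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Local-filesystem sketch only: real reducers would write to HDFS, and the
// file naming scheme ("centroids-<id>.txt") is a hypothetical convention.
class CentroidMerge {

    // Each reducer writes to a file named after its own id, so no two
    // reducers ever write to the same file and no locking is needed.
    static Path writePartial(Path dir, int reducerId, List<String> centroids) throws IOException {
        Path part = dir.resolve(String.format("centroids-%05d.txt", reducerId));
        return Files.write(part, centroids, StandardCharsets.UTF_8);
    }

    // After all reducers have finished, concatenate the partial files in
    // name order into a single target file.
    static Path merge(Path dir, Path target) throws IOException {
        List<Path> parts;
        try (Stream<Path> files = Files.list(dir)) {
            parts = files
                .filter(p -> p.getFileName().toString().startsWith("centroids-"))
                .sorted()
                .collect(Collectors.toList());
        }
        List<String> merged = new ArrayList<>();
        for (Path part : parts) {
            merged.addAll(Files.readAllLines(part, StandardCharsets.UTF_8));
        }
        return Files.write(target, merged, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("kmeans");
        // Simulate two reducers finishing in arbitrary order.
        writePartial(dir, 1, List.of("c1 4.0 5.0"));
        writePartial(dir, 0, List.of("c0 1.0 2.0"));
        Path out = merge(dir, dir.resolve("centroids.txt"));
        Files.readAllLines(out, StandardCharsets.UTF_8).forEach(System.out::println);
    }
}
```

The point of the design is that correctness does not depend on any synchronization between reducers: the only step that touches the final centroids.txt runs once, after every partial file is complete.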

Regarding "hadoop - Writing to a single file from multiple reducers in Hadoop", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23299144/
