java - Changing the mapper size (split size) in MapReduce for faster performance

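Is there a way to improve MapReduce performance by changing the number of map tasks or the split size of each mapper? For example, I have a 100GB text file and 20 nodes. I want to run a WordCount job on the text file. What is the ideal number of mappers, or the ideal split size, to make it finish faster?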

Will it be faster with more mappers? Will it be faster with a smaller split size?

Edit

I am using Hadoop 2.7.1, so YARN is involved.

Best answer

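Using more mappers is not necessarily faster. Each mapper has a startup and setup cost. In the early days of Hadoop, when MapReduce was the de facto standard, the rule of thumb was that a mapper should run for about 10 minutes. Today the documentation recommends at least 1 minute. You can change the number of map tasks with setNumMapTasks(int), which is defined on JobConf; note that this value is only a hint to the framework. The documentation of that method has very good information about the mapper count: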

How many maps?

The number of maps is usually driven by the total size of the inputs i.e. total number of blocks of the input files.

The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 or so for very cpu-light map tasks. Task setup takes awhile, so it is best if the maps take at least a minute to execute.

The default behavior of file-based InputFormats is to split the input into logical InputSplits based on the total size, in bytes, of input files. However, the FileSystem blocksize of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapreduce.input.fileinputformat.split.minsize.

Thus, if you expect 10TB of input data and have a blocksize of 128MB, you'll end up with 82,000 maps, unless setNumMapTasks(int) is used to set it even higher.
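To make the arithmetic in the quoted documentation concrete for the 100GB file and 20 nodes from the question, here is a minimal sketch in plain Java. It assumes a 128MB block size, a 1-byte default minimum split size and no maximum; the class and method names (`SplitEstimate`, `computeSplitSize`, `estimateMaps`) are hypothetical and not part of the Hadoop API. It only approximates how a file-based InputFormat picks a split size, so real jobs may differ by a few splits.

```java
// Sketch only: approximates the "clamp the block size between min and max" rule
// described above. Names and default values are assumptions for illustration.
public class SplitEstimate {

    // Split size: the block size, clamped between the configured minimum and maximum.
    static long computeSplitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    // Approximate number of map tasks: one per split, rounding up.
    static long estimateMaps(long totalBytes, long splitSize) {
        return (totalBytes + splitSize - 1) / splitSize;
    }

    public static void main(String[] args) {
        final long MB = 1024L * 1024L;
        final long GB = 1024L * MB;

        long input = 100 * GB;      // the 100GB text file from the question
        long blockSize = 128 * MB;  // assumed HDFS block size
        int nodes = 20;             // cluster size from the question

        // Assumed defaults: minimum of 1 byte, effectively no maximum.
        long defaultSplit = computeSplitSize(blockSize, 1L, Long.MAX_VALUE);
        long defaultMaps = estimateMaps(input, defaultSplit);
        System.out.println(defaultMaps + " maps, " + (defaultMaps / nodes) + " per node");

        // Raising the minimum split size to 512MB gives fewer, longer-running maps.
        long bigSplit = computeSplitSize(blockSize, 512 * MB, Long.MAX_VALUE);
        long bigMaps = estimateMaps(input, bigSplit);
        System.out.println(bigMaps + " maps, " + (bigMaps / nodes) + " per node");
    }
}
```

Under these assumptions the defaults give about 800 maps (40 per node), which already sits inside the 10-100 maps per node range quoted above. Raising the lower bound through mapreduce.input.fileinputformat.split.minsize would reduce the count, which is probably only worth trying if individual maps finish in well under a minute.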

Your question may be related to this SO question.

Honestly, it is also worth taking a look at modern frameworks such as Apache Spark and Apache Flink.

A similar question about changing the mapper size (split size) in MapReduce for faster performance can be found on Stack Overflow: https://stackoverflow.com/questions/35328462/
