hadoop - Configure Hadoop to process an input file as a single map task

I am running MapReduce on a 200MB file. My goal is to have it processed by exactly 1 map task. This is what I did:

Configuration conf = new Configuration();
conf.set("mapred.min.split.size","999999999999999");

However, the number of records seems to be limiting me. Is that what causes the map task to be split? If so, what can I do to change it?

14/03/20 00:12:04 INFO mapred.MapTask: data buffer = 79691776/99614720
14/03/20 00:12:04 INFO mapred.MapTask: record buffer = 262144/327680
14/03/20 00:12:05 INFO mapred.MapTask: Spilling map output: record full = true

Best answer

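mapred.min.split.size normally sets the lower bound on the size of an input split, while the DFS block size (128MB) acts as the upper bound. In your case the lower bound is greater than the upper bound. Hadoop does not seem to care about that: it appears to take the upper bound and split the input data accordingly.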

Quoted from the wiki:

Actually controlling the number of maps is subtle. The mapred.map.tasks parameter is just a hint to the InputFormat for the number of maps. The default InputFormat behavior is to split the total number of bytes into the right number of fragments. However, in the default case the DFS block size of the input files is treated as an upper bound for input splits. A lower bound on the split size can be set via mapred.min.split.size. Thus, if you expect 10TB of input data and have 128MB DFS blocks, you'll end up with 82k maps, unless your mapred.map.tasks is even larger. Ultimately the InputFormat determines the number of maps.
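
The "82k maps" figure in the quote can be checked with a little arithmetic. The sketch below is only a simplified model, not Hadoop's actual split logic: it assumes one split per DFS block and ignores file boundaries and any minimum or maximum split-size settings. The class and method names (SplitCountSketch, estimateSplits) are made up for illustration.

```java
// Simplified model (an assumption, not Hadoop's actual code): one split per
// DFS block, ignoring file boundaries and any min/max split-size settings.
final class SplitCountSketch {

    /** Number of block-sized splits needed to cover totalBytes (rounded up). */
    static long estimateSplits(long totalBytes, long blockBytes) {
        if (totalBytes < 0 || blockBytes <= 0) {
            throw new IllegalArgumentException("totalBytes must be >= 0 and blockBytes > 0");
        }
        return (totalBytes + blockBytes - 1) / blockBytes; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long tb = mb * 1024L * 1024L;
        // 10 TB of input with 128 MB blocks: the wiki's "82k maps"
        System.out.println(estimateSplits(10 * tb, 128 * mb)); // 81920
        // The 200 MB file from the question with 128 MB blocks
        System.out.println(estimateSplits(200 * mb, 128 * mb)); // 2
    }
}
```

Under this model the 200MB file from the question spans two blocks, which is why it would get more than one map task when the block size is the effective split size.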

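The hint for you is in the last sentence of that quote: if you want to control the number of mappers, you have to override the InputFormat. We generally use FileInputFormat, and its isSplitable() method needs to be overridden to return false. That ensures one mapper per file. Something like the following should be enough: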

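import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Assumes text input: TextInputFormat is a concrete FileInputFormat, so no RecordReader has to be written.
public class NonSplittableFileInputFormat extends TextInputFormat {

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false; // never split: each input file goes to exactly one mapper
    }
}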

A similar question about configuring Hadoop to process an input file as a single map task can be found on Stack Overflow: https://stackoverflow.com/questions/22512037/
