hadoop - MapReduce performance degrades when using a custom input format

Tags: hadoop mapreduce

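I have run into a problem with MapReduce. I have to read multiple CSV files.

Each CSV file produces one line of output.

I cannot split the CSV files in my custom input format, because the lines within a CSV file have different formats. For example:

Line 1 contains A, B, C; line 2 contains D, E, F

My output value should be A, B, D, F

I have 1100 CSV files, so 1100 splits are created, and therefore 1100 mappers. The mappers are very simple and should not take much time to run.

But processing the 1100 input files takes a lot of time.

Can anyone point me to what I should look at, or tell me whether I am doing something wrong with this approach?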

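Best answer

Hadoop performs better with a small number of large files than with a large number of small files. ("Small" here means significantly smaller than a Hadoop Distributed File System (HDFS) block.) The technical reasons for this are explained well in a Cloudera blog post: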

Map tasks usually process a block of input at a time (using the default FileInputFormat). If the file is very small and there are a lot of them, then each map task processes very little input, and there are a lot more map tasks, each of which imposes extra bookkeeping overhead. Compare a 1GB file broken into 16 64MB blocks, and 10,000 or so 100KB files. The 10,000 files use one map each, and the job time can be tens or hundreds of times slower than the equivalent one with a single input file.
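The comparison in the quote can be checked with a little arithmetic. The following is a simplified sketch in Python, not Hadoop's actual split computation: it assumes one map task per block-sized piece of each file and at least one task per file. The function name `map_tasks` and the 50 KB file size used for the question's 1100 files are assumptions made for illustration; the question does not state the file sizes.

```python
import math

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB block, as in the quoted example


def map_tasks(file_sizes, block_size=BLOCK_SIZE):
    """Estimate the number of map tasks under a simplified model.

    Each file is cut into block-sized pieces, and every file yields
    at least one task, however small it is.
    """
    return sum(max(1, math.ceil(size / block_size)) for size in file_sizes)


one_large_file = [1024 * 1024 * 1024]      # a single 1 GB file
many_small_files = [100 * 1024] * 10_000   # 10,000 files of 100 KB each
question_files = [50 * 1024] * 1100        # assumed size; not given in the question

print(map_tasks(one_large_file))    # 16
print(map_tasks(many_small_files))  # 10000
print(map_tasks(question_files))    # 1100
```

The total amount of data is about the same in the first two cases, yet the task count differs by a factor of several hundred, and each task carries its own startup and bookkeeping cost.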

You can refer to this link for ways to solve this problem.
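One common remedy, which may or may not be the one the linked page describes, is to pack many whole files into each input split so that far fewer map tasks are started. Hadoop ships a `CombineFileInputFormat` class built on this idea; the real class also takes data locality into account, which the sketch below ignores. The Python below only illustrates the grouping logic. The name `pack_into_splits` and the 100 KB file size are assumptions for illustration.

```python
def pack_into_splits(file_sizes, max_split_bytes):
    """Greedily group whole files into splits of at most max_split_bytes.

    Files are never cut, so a file whose lines must be read together
    stays intact. A file larger than the limit gets a split of its own.
    Returns a list of splits, each a list of file indexes.
    """
    splits = []
    current, current_bytes = [], 0
    for index, size in enumerate(file_sizes):
        # Close the current split if adding this file would exceed the limit.
        if current and current_bytes + size > max_split_bytes:
            splits.append(current)
            current, current_bytes = [], 0
        current.append(index)
        current_bytes += size
    if current:
        splits.append(current)
    return splits


# 1100 files with an assumed size of 100 KB each, packed into 64 MB splits.
splits = pack_into_splits([100 * 1024] * 1100, 64 * 1024 * 1024)
print([len(s) for s in splits])  # [655, 445]
```

Under these assumptions the job would start 2 map tasks instead of 1100. Because files are kept whole, the mapper (or a per-file record reader) would still need to know where one file ends and the next begins in order to emit one output line per file.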

Regarding "hadoop - MapReduce performance degrades when using a custom input format", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/22854553/
