hadoop - How to tune Hadoop MapReduce parameters on Amazon EMR?

Tags: hadoop memory hadoop2 emr amazon-emr

My MR job fails at map 100% reduce 35% with many errors like: running beyond physical memory limits. Current usage: 3.0 GB of 3 GB physical memory used; 3.7 GB of 15 GB virtual memory used. Killing container.

My input *.bz2 file is about 4 GB; uncompressed, it is around 38 GB. Running this job takes about one hour on Amazon EMR with one master and two slave nodes.

My questions are:
- Why does this job use so much memory?
- Why does this job take about an hour? A 40 GB wordcount job on a small 4-node cluster usually takes about 10 minutes.
- How should I tune the MR parameters to fix this?
- Which Amazon EC2 instance types are best suited to this job?

Please refer to the following log lines:
- Physical memory (bytes) snapshot=43327889408 => 43.3 GB
- Virtual memory (bytes) snapshot=108950675456 => 108.95 GB
- Total committed heap usage (bytes)=34940649472 => 34.94 GB

My proposed solutions are below, but I am not sure whether they are the right ones:
- Use a larger Amazon EC2 instance with at least 8 GB of memory
- Tune the MR parameters with the code below

Version 1:

Configuration conf = new Configuration();
// Don't kill the container if physical/virtual memory exceeds
// "mapreduce.map.memory.mb" / "mapreduce.reduce.memory.mb".
// Set properties before Job.getInstance(), which copies the Configuration;
// changes made to conf afterwards would not reach the job.
conf.setBoolean("yarn.nodemanager.pmem-check-enabled", false);
conf.setBoolean("yarn.nodemanager.vmem-check-enabled", false);
Job job = Job.getInstance(conf, "jobtest1");

Version 2:

Configuration conf = new Configuration();
// Set container sizes and JVM heaps before Job.getInstance(),
// which copies the Configuration.
//conf.set("mapreduce.input.fileinputformat.split.minsize", "3073741824");
conf.set("mapreduce.map.memory.mb", "8192");
conf.set("mapreduce.map.java.opts", "-Xmx6144m");
conf.set("mapreduce.reduce.memory.mb", "8192");
conf.set("mapreduce.reduce.java.opts", "-Xmx6144m");
Job job = Job.getInstance(conf, "jobtest2");

Log:

15/11/08 11:37:27 INFO mapreduce.Job:  map 100% reduce 35%
15/11/08 11:37:27 INFO mapreduce.Job: Task Id : attempt_1446749367313_0006_r_000006_2, Status : FAILED
Container [pid=24745,containerID=container_1446749367313_0006_01_003145] is running beyond physical memory limits. Current usage: 3.0 GB of 3 GB physical memory used; 3.7 GB of 15 GB virtual memory used. Killing container.
Dump of the process-tree for container_1446749367313_0006_01_003145 :
    |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
    |- 24745 24743 24745 24745 (bash) 0 0 9658368 291 /bin/bash -c /usr/lib/jvm/java-openjdk/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN  -Xmx2304m -Djava.io.tmpdir=/mnt1/yarn/usercache/ec2-user/appcache/application_1446749367313_0006/container_1446749367313_0006_01_003145/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1446749367313_0006/container_1446749367313_0006_01_003145 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild **.***.***.*** 32846 attempt_1446749367313_0006_r_000006_2 3145 1>/var/log/hadoop-yarn/containers/application_1446749367313_0006/container_1446749367313_0006_01_003145/stdout 2>/var/log/hadoop-yarn/containers/application_1446749367313_0006/container_1446749367313_0006_01_003145/stderr  
    |- 24749 24745 24745 24745 (java) 14124 1281 3910426624 789477 /usr/lib/jvm/java-openjdk/bin/java -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx2304m -Djava.io.tmpdir=/mnt1/yarn/usercache/ec2-user/appcache/application_1446749367313_0006/container_1446749367313_0006_01_003145/tmp -Dlog4j.configuration=container-log4j.properties -Dyarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1446749367313_0006/container_1446749367313_0006_01_003145 -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA org.apache.hadoop.mapred.YarnChild **.***.***.*** 32846 attempt_1446749367313_0006_r_000006_2 3145 

Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

15/11/08 11:37:28 INFO mapreduce.Job:  map 100% reduce 25%
15/11/08 11:37:30 INFO mapreduce.Job:  map 100% reduce 26%
15/11/08 11:37:37 INFO mapreduce.Job:  map 100% reduce 27%
15/11/08 11:37:42 INFO mapreduce.Job:  map 100% reduce 28%
15/11/08 11:37:53 INFO mapreduce.Job:  map 100% reduce 29%
15/11/08 11:37:57 INFO mapreduce.Job:  map 100% reduce 34%
15/11/08 11:38:02 INFO mapreduce.Job:  map 100% reduce 35%
15/11/08 11:38:13 INFO mapreduce.Job:  map 100% reduce 36%
15/11/08 11:38:22 INFO mapreduce.Job:  map 100% reduce 37%
15/11/08 11:38:35 INFO mapreduce.Job:  map 100% reduce 42%
15/11/08 11:38:36 INFO mapreduce.Job:  map 100% reduce 100%
15/11/08 11:38:36 INFO mapreduce.Job: Job job_1446749367313_0006 failed with state FAILED due to: Task failed task_1446749367313_0006_r_000001
Job failed as tasks failed. failedMaps:0 failedReduces:1

15/11/08 11:38:36 INFO mapreduce.Job: Counters: 43
    File System Counters
        FILE: Number of bytes read=11806418671
        FILE: Number of bytes written=22240791936
        FILE: Number of read operations=0
        FILE: Number of large read operations=0
        FILE: Number of write operations=0
        HDFS: Number of bytes read=16874
        HDFS: Number of bytes written=0
        HDFS: Number of read operations=59
        HDFS: Number of large read operations=0
        HDFS: Number of write operations=0
        S3: Number of bytes read=3942336319
        S3: Number of bytes written=0
        S3: Number of read operations=0
        S3: Number of large read operations=0
        S3: Number of write operations=0
    Job Counters 
        Failed reduce tasks=22
        Killed reduce tasks=5
        Launched map tasks=59
        Launched reduce tasks=27
        Data-local map tasks=59
        Total time spent by all maps in occupied slots (ms)=114327828
        Total time spent by all reduces in occupied slots (ms)=131855700
        Total time spent by all map tasks (ms)=19054638
        Total time spent by all reduce tasks (ms)=10987975
        Total vcore-seconds taken by all map tasks=19054638
        Total vcore-seconds taken by all reduce tasks=10987975
        Total megabyte-seconds taken by all map tasks=27438678720
        Total megabyte-seconds taken by all reduce tasks=31645368000
    Map-Reduce Framework
        Map input records=728795619
        Map output records=728795618
        Map output bytes=50859151614
        Map output materialized bytes=10506705085
        Input split bytes=16874
        Combine input records=0
        Spilled Records=1457591236
        Failed Shuffles=0
        Merged Map outputs=0
        GC time elapsed (ms)=150143
        CPU time spent (ms)=14360870
        Physical memory (bytes) snapshot=43327889408
        Virtual memory (bytes) snapshot=108950675456
        Total committed heap usage (bytes)=34940649472
    File Input Format Counters 
        Bytes Read=0

Best Answer

I'm not sure about Amazon EMR specifically. A few points to consider about MapReduce:

  1. bzip2 is slower, but it compresses better than gzip. bzip2 decompresses faster than it compresses, but is still slower than other formats. So at a high level, you already have a baseline in the 40 GB wordcount job that ran in ten minutes (assuming that 40 GB input was uncompressed). The next question is: how much slower should this job be?

  2. However, your job is still failing after an hour; please confirm that first, since we can only think about performance once the job runs successfully. So let's consider why it fails. You are hitting memory errors. Based on the error, containers are failing in the reducer phase (the mapper phase completed 100%). Most likely, not even one reducer succeeded. Although the 35% may mislead you into thinking some reducers ran, that percentage can come from preparatory work done before the reducers actually run. One way to confirm is to check whether any reducer output files were generated, as sketched below.
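A minimal sketch of that check, assuming a hypothetical output location (replace outputDir with the job's actual output path):

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CheckReducerOutput {
    public static void main(String[] args) throws Exception {
        // Hypothetical output location; substitute your job's output path.
        String outputDir = "s3://my-bucket/wordcount-output";
        FileSystem fs = FileSystem.get(URI.create(outputDir), new Configuration());
        // Reducer output files are named part-r-NNNNN; if none exist,
        // no reducer ever completed successfully.
        FileStatus[] parts = fs.globStatus(new Path(outputDir, "part-r-*"));
        if (parts == null || parts.length == 0) {
            System.out.println("no reducer output found");
        } else {
            for (FileStatus s : parts) {
                System.out.println(s.getPath() + " (" + s.getLen() + " bytes)");
            }
        }
    }
}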

Once you have confirmed that no reducer ran, you can increase the container memory as in your version 2.
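Since the right values may take a few tries, one option (a sketch, not part of the original answer) is to submit the job through ToolRunner so the memory settings can be overridden per run with -D flags instead of recompiling:

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class TunableJob extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // getConf() already carries any -D overrides parsed by ToolRunner.
        Job job = Job.getInstance(getConf(), "jobtest2");
        // ... set mapper, reducer, input/output paths as usual ...
        return job.waitForCompletion(true) ? 0 : 1;
    }

    public static void main(String[] args) throws Exception {
        System.exit(ToolRunner.run(new TunableJob(), args));
    }
}

It could then be launched with, for example:
hadoop jar job.jar TunableJob -D mapreduce.reduce.memory.mb=8192 -D mapreduce.reduce.java.opts=-Xmx6144m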

Your version 1 would help you see whether only a specific container is causing the problem, and it would let the job run to completion.

Regarding hadoop - How to tune Hadoop MapReduce parameters on Amazon EMR?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/33598202/
