apache-spark - AWS EMR Spark application - poor CPU and memory utilization

Tags: apache-spark spark-streaming emr amazon-emr

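I am running 2 copies of a Spark Streaming application (Spark 2.2.1, EMR 5.11, Scala) on AWS EMR (a cluster of 3 m4.4xlarge nodes, each with 16 vCPUs and 64 GB of RAM).

In the built-in EMR cluster monitoring (Ganglia), I see that the cluster's CPU utilization is below 30%, memory usage does not exceed 32 GB out of roughly 200 GB available, and the network is nowhere near 100%. Yet the application can barely finish processing each batch within the batch interval.

Here are the parameters I use to submit each copy of the application to the master in client mode: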

--master yarn
--num-executors 2
--executor-cores 20
--executor-memory 20G
--conf spark.driver.memory=4G
--conf spark.driver.cores=3

How can I achieve better resource utilization (and application performance)?
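The figures above can be put side by side with a little arithmetic. The following is a minimal Python sketch that only tallies what the submit flags request against the raw hardware totals stated in the question. The variable and function names are illustrative, and the sketch makes simplifying assumptions: it counts all three nodes as available to executors and ignores YARN container limits and memory overhead, so it is not a model of how YARN actually schedules the job.

```python
# Figures copied from the question; names are illustrative only.
NODES = 3
VCPUS_PER_NODE = 16
RAM_GB_PER_NODE = 64

APP_COPIES = 2           # two copies of the streaming application
NUM_EXECUTORS = 2        # --num-executors
EXECUTOR_CORES = 20      # --executor-cores
EXECUTOR_MEMORY_GB = 20  # --executor-memory
DRIVER_CORES = 3         # spark.driver.cores
DRIVER_MEMORY_GB = 4     # spark.driver.memory


def requested_totals():
    """Sum what both application copies ask for via their submit flags."""
    executors = APP_COPIES * NUM_EXECUTORS
    return {
        "executors": executors,
        "executor_cores": executors * EXECUTOR_CORES,
        "executor_memory_gb": executors * EXECUTOR_MEMORY_GB,
        "driver_cores": APP_COPIES * DRIVER_CORES,
        "driver_memory_gb": APP_COPIES * DRIVER_MEMORY_GB,
    }


def cluster_totals():
    """Raw hardware totals, assuming every node could host executors."""
    return {
        "vcpus": NODES * VCPUS_PER_NODE,
        "ram_gb": NODES * RAM_GB_PER_NODE,
    }


if __name__ == "__main__":
    requested = requested_totals()
    cluster = cluster_totals()
    print("requested:", requested)
    print("cluster:  ", cluster)
    print("cores per executor exceed vCPUs per node:",
          EXECUTOR_CORES > VCPUS_PER_NODE)
```

Under these assumptions, each executor asks for more cores (20) than a single node has vCPUs (16), while the four executors together request 80 GB of the cluster's 192 GB of RAM.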

Best answer

Use maximizeResourceAllocation, as described in the AWS docs, where all of these things are discussed in detail. Read it through completely.

You can configure your executors to utilize the maximum resources possible on each node in a cluster by using the spark configuration classification to set maximizeResourceAllocation option to true. This EMR-specific option calculates the maximum compute and memory resources available for an executor on an instance in the core instance group. It then sets the corresponding spark-defaults settings based on this information.
[
  {
    "Classification": "spark",
    "Properties": {
      "maximizeResourceAllocation": "true"
    }
  }
]
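If the configuration is kept in a file, it can be generated rather than typed by hand. The sketch below is one possible way to do that in Python; the function names and the file name `spark-config.json` are hypothetical choices for illustration, not something prescribed by EMR.

```python
import json


def spark_max_allocation_config():
    """Return the 'spark' classification shown above as Python data.

    Property values are strings ("true"), matching the JSON above.
    """
    return [
        {
            "Classification": "spark",
            "Properties": {"maximizeResourceAllocation": "true"},
        }
    ]


def write_config(path):
    """Write the configuration list to a JSON file at the given path."""
    with open(path, "w") as fh:
        json.dump(spark_max_allocation_config(), fh, indent=2)


if __name__ == "__main__":
    # Hypothetical output file name.
    write_config("spark-config.json")
```

A file written this way can be supplied when the cluster is created, for example through the `--configurations file://spark-config.json` option of `aws emr create-cluster`.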

Further reading

For this question (apache-spark - AWS EMR Spark application - poor CPU and memory utilization), a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48211775/
