hadoop - Yarn cluster optimization for Spark

Tags: hadoop apache-spark hadoop-yarn

I am trying to configure Yarn and Spark for my 4-node cluster.
Each node has the following specs:

  • 24 cores
  • 23.5 GB RAM
  • swap disabled

So far I have configured Yarn and Spark, and Spark can run the SparkPi example computation, but only with the following yarn-site.xml configuration:

<configuration>
<property>
        <name>yarn.acl.enable</name>
        <value>0</value>
</property>

<property>
        <name>yarn.resourcemanager.hostname</name>
        <value>ds11</value>
</property>

<property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>20480</value>
</property>

<property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>20480</value>
</property>

<property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>1536</value>
</property>

<property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
</property>

<property>
        <name>yarn.log-aggregation-enable</name>
        <value>true</value>
</property>

<property>
        <name>yarn.nodemanager.log-aggregation.roll-monitoring-interval-seconds</name>
        <value>3600</value>
</property>
</configuration>

And with the following spark-defaults.conf:

spark.master                     yarn
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs://ds11:9000/spark-logs
spark.serializer                 org.apache.spark.serializer.KryoSerializer
spark.driver.memory              2048m
spark.executor.memory            1024m
spark.yarn.am.memory             1024m
spark.executor.instances         16
spark.executor.cores             4

spark.history.provider            org.apache.spark.deploy.history.FsHistoryProvider
spark.history.fs.logDirectory     hdfs://ds11:9000/spark-logs
spark.history.fs.update.interval  10s
spark.history.ui.port             18080
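For reference, with these settings the job's nominal footprint works out as follows (this uses Spark's default per-container memory overhead of max(384 MB, 10% of the heap); the figures below are plain arithmetic, not measurements from this cluster):

per-executor container : 1024 MB heap + 384 MB overhead = 1408 MB
all 16 executors       : 16 x 1408 MB                   = 22528 MB
AM container           : 1024 MB + 384 MB overhead      = 1408 MB
total memory requested : 22528 MB + 1408 MB             = 23936 MB  (4 x 20480 MB = 81920 MB offered to YARN)
total executor cores   : 16 x 4                         = 64        (4 x 24 = 96 available)

So the requested resources fit comfortably within what yarn.nodemanager.resource.memory-mb advertises, which makes the failures below all the more telling.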

The key points are:

yarn.scheduler.minimum-allocation-mb

spark.executor.memory

If I set yarn.scheduler.minimum-allocation-mb to 1537 MB or higher, Spark cannot allocate containers for the job.
So when I launch Spark, I get the following diagnostics:

2018-03-01 13:12:25,295 INFO yarn.Client: Will allocate AM container, with 1408 MB memory including 384 MB overhead
2018-03-01 13:12:25,296 INFO yarn.Client: Setting up container launch context for our AM
2018-03-01 13:12:25,299 INFO yarn.Client: Setting up the launch environment for our AM container
2018-03-01 13:12:25,306 INFO yarn.Client: Preparing resources for our AM container
2018-03-01 13:12:26,722 WARN yarn.Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
2018-03-01 13:12:29,899 INFO yarn.Client: Uploading resource file:/tmp/spark-19cf3747-6949-4117-ba92-ccde71d8b473/__spark_libs__7526053733120768643.zip -> hdfs://ds11:9000/user/nw/.sparkStaging/application_1519906323717_0001/__spark_libs__7526053733120768643.zip
2018-03-01 13:12:32,082 INFO yarn.Client: Uploading resource file:/tmp/spark-19cf3747-6949-4117-ba92-ccde71d8b473/__spark_conf__171844339516087904.zip -> hdfs://ds11:9000/user/nw/.sparkStaging/application_1519906323717_0001/__spark_conf__.zip
2018-03-01 13:12:32,167 INFO spark.SecurityManager: Changing view acls to: nw
2018-03-01 13:12:32,167 INFO spark.SecurityManager: Changing modify acls to: nw
2018-03-01 13:12:32,167 INFO spark.SecurityManager: Changing view acls groups to: 
2018-03-01 13:12:32,167 INFO spark.SecurityManager: Changing modify acls groups to: 
2018-03-01 13:12:32,167 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(nw); groups with view permissions: Set(); users  with modify permissions: Set(nw); groups with modify permissions: Set()
2018-03-01 13:12:32,175 INFO yarn.Client: Submitting application application_1519906323717_0001 to ResourceManager
2018-03-01 13:12:32,761 INFO impl.YarnClientImpl: Submitted application application_1519906323717_0001
2018-03-01 13:12:32,766 INFO cluster.SchedulerExtensionServices: Starting Yarn extension services with app application_1519906323717_0001 and attemptId None
2018-03-01 13:12:33,779 INFO yarn.Client: Application report for application_1519906323717_0001 (state: ACCEPTED)
2018-03-01 13:12:33,785 INFO yarn.Client: 
 client token: N/A
 diagnostics: [Thu Mar 01 13:12:32 +0100 2018] Application is added to the scheduler and is not yet activated. Skipping AM assignment as cluster resource is empty.  Details : AM Partition = <DEFAULT_PARTITION>; AM Resource Request = <memory:1537, vCores:1>; Queue Resource Limit for AM = <memory:0, vCores:0>; User AM Resource Limit of the queue = <memory:0, vCores:0>; Queue AM Resource Usage = <memory:0, vCores:0>; 
 ApplicationMaster host: N/A
 ApplicationMaster RPC port: -1
 queue: default
 start time: 1519906352464
 final status: UNDEFINED
 tracking URL: http://ds11:8088/proxy/application_1519906323717_0001/
 user: nw
2018-03-01 13:12:34,789 INFO yarn.Client: Application report for application_1519906323717_0001 (state: ACCEPTED)
2018-03-01 13:12:35,794 INFO yarn.Client: Application report for application_1519906323717_0001 (state: ACCEPTED)
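Two things stand out in these diagnostics. First, the AM request of <memory:1537> follows from YARN rounding every container request up to a multiple of the minimum allocation (standard YARN scheduler normalization, inferred here from the log rather than from this cluster's config):

AM request = spark.yarn.am.memory + overhead = 1024 MB + 384 MB = 1408 MB
granted    = 1408 MB rounded up to a multiple of 1537 MB        = 1537 MB

Second, "Skipping AM assignment as cluster resource is empty" together with "Queue Resource Limit for AM = <memory:0, vCores:0>" means the ResourceManager sees zero registered capacity, i.e. the NodeManagers are apparently not running with the memory settings shown above.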

When I set yarn.scheduler.minimum-allocation-mb to 1536 MB and increase spark.executor.memory to, for example, 2048 MB, I get the following error:

2018-03-01 15:15:47,578 ERROR spark.SparkContext: Error initializing SparkContext.
java.lang.IllegalArgumentException: Required executor memory (2048+384 MB) is above the max threshold (1536 MB) of this cluster! Please check the values of 'yarn.scheduler.maximum-allocation-mb' and/or 'yarn.nodemanager.resource.memory-mb'.
at org.apache.spark.deploy.yarn.Client.verifyClusterResources(Client.scala:319)
at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:167)
at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:56)
at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:173)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:509)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2516)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:918)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:910)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:910)
at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:31)
at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:775)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:205)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:119)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
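The numbers in the exception line up with Spark's default overhead formula (a back-of-the-envelope check, not output from the cluster):

overhead           = max(384 MB, 0.10 x 2048 MB) = 384 MB
executor container = 2048 MB + 384 MB            = 2432 MB  > 1536 MB max threshold

Note that the reported 1536 MB maximum contradicts the yarn.scheduler.maximum-allocation-mb value of 20480 in the yarn-site.xml above, which suggests the ResourceManager is not actually running with that file; the accepted answer below addresses exactly this.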

When I increase both parameters, I still get the first kind of error, i.e. Spark cannot allocate containers.

Does anyone have an idea about this problem?

Best answer

It sounds like you are only editing yarn-site on the Spark client.

If you want to change the actual YARN ResourceManager and NodeManager memory sizes, you need to rsync that file to every node in the cluster and then restart the YARN services.
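A minimal sketch of that workflow, assuming a Hadoop 2.x layout, passwordless SSH, the config under /etc/hadoop/conf, and worker hostnames ds12-ds14 (the paths and all hostnames other than ds11 are assumptions, not taken from the question):

# Push the edited yarn-site.xml from ds11 to the other nodes (hostnames assumed).
for host in ds12 ds13 ds14; do
    rsync -av /etc/hadoop/conf/yarn-site.xml "${host}":/etc/hadoop/conf/
done

# Restart the YARN daemons so the new limits take effect.
# On ds11 (ResourceManager):
$HADOOP_HOME/sbin/yarn-daemon.sh stop resourcemanager
$HADOOP_HOME/sbin/yarn-daemon.sh start resourcemanager
# On every node running a NodeManager:
$HADOOP_HOME/sbin/yarn-daemon.sh stop nodemanager
$HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager

(On Hadoop 3.x the equivalent is yarn --daemon stop/start resourcemanager and nodemanager.)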

P.S. Bear in mind that unless you have an HA ResourceManager set up, restarting it will briefly take the cluster down.

Regarding hadoop - Yarn cluster optimization for Spark, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49051872/
