apache-spark - YARN vCores used: Spark on YARN

Tags: apache-spark hadoop pyspark hadoop-yarn

I am submitting a Spark application on YARN with the following configuration:

conf.set("spark.executor.cores", "3")
conf.set("spark.executor.memory", "14g")
conf.set("spark.executor.instances", "4")
conf.set("spark.driver.cores", "5")
conf.set("spark.driver.memory", "1g")

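However, the YARN ResourceManager UI shows vCores used = 5. I expected vCores used to be 17 ((4x3)+5=17), i.e. 12 for the executors and 5 for the driver. But it always shows a value equal to executors + driver = 5.

Please help me understand this! Thanks in advance.

Best answer

In the Spark configuration docs you will see the following: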
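The arithmetic behind the two numbers can be written out as a small sketch. The values come straight from the configuration above; the variable names are only illustrative, and reading the observed value as "one vCore per container" is an interpretation of the UI figure, not something the source confirms.

```python
# Values taken from the configuration in the question.
executor_cores = 3
executor_instances = 4
driver_cores = 5

# What the asker expected: every requested core counted as a vCore.
expected_vcores = executor_instances * executor_cores + driver_cores

# What the UI reported: a number equal to executors + driver, which is what
# you would see if each container were counted as a single vCore (assumption).
observed_vcores = executor_instances + 1

print(expected_vcores)  # 17
print(observed_vcores)  # 5
```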

Spark properties mainly can be divided into two kinds: one is related to deploy, like “spark.driver.memory”, “spark.executor.instances”, this kind of properties may not be affected when setting programmatically through SparkConf in runtime, or the behavior is depending on which cluster manager and deploy mode you choose, so it would be suggested to set through configuration file or spark-submit command line options; another is mainly related to Spark runtime control, like “spark.task.maxFailures”, this kind of properties can be set in either way.

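Rather than setting them in code, you will want to set most of these settings from the spark-submit command line. This is generally better practice anyway, because you can launch the job with different parameters without recompiling it.

You will want something like this: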

spark-submit --num-executors 4 --executor-cores 3 --executor-memory 14g --driver-memory 1g --driver-cores 5 --class <main_class> <your_jar>
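If the job is launched from a script, the same command can be assembled from parameters, which makes it easy to vary the resources between runs. This is a minimal sketch: `build_spark_submit` is a hypothetical helper, and `com.example.Main` and `app.jar` are placeholder names standing in for `<main_class>` and `<your_jar>`. Only the flags shown in the command above are used.

```python
import shlex


def build_spark_submit(main_class, jar, num_executors=4, executor_cores=3,
                       executor_memory="14g", driver_memory="1g",
                       driver_cores=5):
    """Return the spark-submit argument list for the given resource settings."""
    return [
        "spark-submit",
        "--num-executors", str(num_executors),
        "--executor-cores", str(executor_cores),
        "--executor-memory", executor_memory,
        "--driver-memory", driver_memory,
        "--driver-cores", str(driver_cores),
        "--class", main_class,
        jar,
    ]


# Placeholder class and jar names (hypothetical, not from the source).
cmd = build_spark_submit("com.example.Main", "app.jar")
print(shlex.join(cmd))  # shell-quoted command line, ready to copy or log
```

The list form can be passed directly to `subprocess.run(cmd)`, which avoids shell quoting issues; `shlex.join` (Python 3.8+) is used here only to display the command.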

Source: the Stack Overflow question "YARN vCores used: Spark on YARN": https://stackoverflow.com/questions/54982225/
