hadoop - Running Spark on YARN client

Tags: hadoop apache-spark hadoop-yarn

I recently set up a multi-node Hadoop HA (NameNode and ResourceManager) cluster (3 nodes). Installation is complete and all daemons are running as expected.

Daemons on NN1:

2945 JournalNode
3137 DFSZKFailoverController
6385 Jps
3338 NodeManager
22730 QuorumPeerMain
2747 DataNode
3228 ResourceManager
2636 NameNode

Daemons on NN2:

19620 Jps
3894 QuorumPeerMain
16966 ResourceManager
16808 NodeManager
16475 DataNode
16572 JournalNode
17101 NameNode
16702 DFSZKFailoverController

Daemons on DN1:

12228 QuorumPeerMain
29060 NodeManager
28858 DataNode
29644 Jps
28956 JournalNode

I am interested in running Spark jobs on my YARN setup. I have installed Scala and Spark on NN1, and I can successfully launch Spark by issuing the following command:

$ spark-shell

Now, I have no knowledge about Spark, and I would like to know how I can run it on YARN. I have read that we can run it as either yarn-client or yarn-cluster.

Should I install Spark and Scala on all nodes of the cluster (NN2 and DN1) to run Spark on YARN in client or cluster mode? If not, how can I submit Spark jobs from the NN1 (primary NameNode) host?

As suggested in a blog I read, I have copied the Spark assembly JAR to HDFS:

-rw-r--r--   3 hduser supergroup  187548272 2016-04-04 15:56 /user/spark/share/lib/spark-assembly.jar

I also created a SPARK_JAR variable in my bashrc file. I tried to submit the Spark job as yarn-client, but I end up with the error below; I have no idea whether I am doing everything correctly or whether other settings need to be done first.

[hduser@ptfhadoop01v spark-1.6.0]$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi --master yarn --deploy-mode client --driver-memory 4g --executor-memory 2g --executor-cores 2 --queue thequeue lib/spark-examples*.jar 10
16/04/04 17:27:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/04/04 17:27:51 WARN SparkConf:
SPARK_WORKER_INSTANCES was detected (set to '2').
This is deprecated in Spark 1.0+.

Please instead use:
 - ./spark-submit with --num-executors to specify the number of executors
 - Or set SPARK_EXECUTOR_INSTANCES
 - spark.executor.instances to configure the number of instances in the spark config.

16/04/04 17:27:54 WARN Client: SPARK_JAR detected in the system environment. This variable has been deprecated in favor of the spark.yarn.jar configuration variable.
16/04/04 17:27:54 WARN Client: SPARK_JAR detected in the system environment. This variable has been deprecated in favor of the spark.yarn.jar configuration variable.
16/04/04 17:27:57 ERROR SparkContext: Error initializing SparkContext.
org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:124)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:64)
    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:530)
    at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:29)
    at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
16/04/04 17:27:58 WARN MetricsSystem: Stopping a MetricsSystem that is not running
Exception in thread "main" org.apache.spark.SparkException: Yarn application has already ended! It might have been killed or unable to launch application master.
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.waitForApplication(YarnClientSchedulerBackend.scala:124)
    at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:64)
    at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
    at org.apache.spark.SparkContext.<init>(SparkContext.scala:530)
    at org.apache.spark.examples.SparkPi$.main(SparkPi.scala:29)
    at org.apache.spark.examples.SparkPi.main(SparkPi.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:497)
    at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
    at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
    at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
[hduser@ptfhadoop01v spark-1.6.0]$

Please help me resolve this issue and explain how to run Spark on YARN in client or cluster mode.

Best Answer

Now, I have no knowledge about Spark, and I would like to know how I can run it on YARN. I have read that we can run it as either yarn-client or yarn-cluster.

I strongly recommend reading the official documentation for Spark on YARN at http://spark.apache.org/docs/latest/running-on-yarn.html.

You can use spark-shell --master yarn to connect to YARN. You need the correct configuration files, e.g. yarn-site.xml, on the machine where you execute spark-shell.
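As a minimal sketch, launching the shell against YARN could look like this (the /etc/hadoop/conf path is an assumption; use your cluster's actual configuration directory):

# Point Spark at the directory holding yarn-site.xml, core-site.xml, etc.
# (path below is an assumption; adjust to your installation)
export HADOOP_CONF_DIR=/etc/hadoop/conf

# Launch the shell with YARN as the master (spark-shell only supports client mode)
./bin/spark-shell --master yarn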

Should I install Spark and Scala on all nodes in the cluster (NN2 and DN1) to run Spark on YARN in client or cluster mode?

No. You do not need to install anything on the other nodes, because Spark distributes the necessary files to YARN for you.
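For illustration, submitting the same SparkPi example from NN1 in either mode would look roughly like this (a sketch based on the command you already ran; the jar path and arguments are reused from your attempt above):

# Client mode: the driver runs locally on NN1, executors run in YARN containers
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn --deploy-mode client lib/spark-examples*.jar 10

# Cluster mode: the driver itself runs inside the YARN application master
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn --deploy-mode cluster lib/spark-examples*.jar 10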

If not, then how can I submit Spark jobs from the NN1 (primary NameNode) host?

spark-shell --master yarn开始,看看能不能执行下面的代码:

(0 to 5).toDF.show

If you see table-like output, you are done. Otherwise, please provide the error.
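If it fails again with the "Yarn application has already ended" error from your log, the YARN CLI that ships with Hadoop is usually the quickest way to see why; a sketch (the application ID below is a placeholder, substitute the one YARN actually reports):

# List applications known to the ResourceManager, with their final status
yarn application -list -appStates ALL

# Pull the aggregated container logs of a failed run (replace the placeholder ID)
yarn logs -applicationId application_1459756789000_0001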

Also created a SPARK_JAR variable in my bashrc file. I tried to submit the Spark job as yarn-client, but I end up with the error below; I have no idea whether I am doing everything correctly or whether other settings need to be done first.

Remove the SPARK_JAR variable. Do not use it; it is not needed and can cause trouble. Read the official documentation at http://spark.apache.org/docs/latest/running-on-yarn.html to learn the basics of Spark on YARN and beyond.
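Concretely, since the warnings in your log say SPARK_JAR has been deprecated in favor of spark.yarn.jar, the cleanup could look like this sketch (reusing the assembly path from your HDFS listing; verify it is correct for your cluster):

# Drop the deprecated variable from the current shell
# (also remove the corresponding export line from ~/.bashrc)
unset SPARK_JAR

# Optionally point Spark at the assembly already uploaded to HDFS
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
    --master yarn --deploy-mode client \
    --conf spark.yarn.jar=hdfs:///user/spark/share/lib/spark-assembly.jar \
    lib/spark-examples*.jar 10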

Regarding "hadoop - Running Spark on YARN client", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/36399635/
