hadoop - Understanding Spark: Cluster Manager, Master and Driver nodes

Tags: hadoop apache-spark hadoop-yarn failover apache-spark-standalone

After reading this question, I would like to ask a few more questions:

  1. The Cluster Manager is a long-running service; on which node does it run?
  2. Is it possible that the Master and the Driver nodes are the same machine? I presume there should be a rule somewhere stating that these two nodes should be different?
  3. In the case where the Driver node fails, who is responsible for re-launching the application, and what exactly will happen? i.e. how will the Master node, Cluster Manager and Worker nodes get involved (if they do), and in which order?
  4. Similar to the previous question: in the case where the Master node fails, what exactly will happen and who is responsible for recovering from the failure?

Best Answer

1. The Cluster Manager is a long-running service, on which node is it running?

The Cluster Manager is the Master process in Spark standalone mode. It can be started anywhere by executing ./sbin/start-master.sh; in YARN, the Cluster Manager is the ResourceManager.
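As a minimal sketch of this (the hostname and port below are illustrative, not part of the original answer), a standalone cluster is brought up by starting the Master and then registering Workers against it:

    # Start the Master on the current machine (serves spark://<host>:7077 by default)
    ./sbin/start-master.sh

    # On each worker machine, register a Worker with that Master
    # (the script is named start-slave.sh in Spark releases before 3.0)
    ./sbin/start-worker.sh spark://master-host:7077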

2. Is it possible that the Master and the Driver nodes will be the same machine? I presume that there should be a rule somewhere stating that these two nodes should be different?

The Master is per cluster, while the Driver is per application. For standalone/YARN clusters, Spark currently supports two deploy modes:

  1. In client mode, the driver is launched in the same process as the client that submits the application.
  2. In cluster mode, however, the driver is launched from one of the Worker processes for standalone, while for YARN it is launched inside the ApplicationMaster; the client process exits as soon as it fulfills its responsibility of submitting the application, without waiting for the application to finish.

If the application is submitted with --deploy-mode client on the Master node, the Master and the Driver will be on the same node. See deployment of Spark application over YARN.
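The two modes differ only in the spark-submit flags; a hedged sketch (the master URL, application class and jar are hypothetical placeholders):

    # Client mode: the driver runs inside this submitting process
    ./bin/spark-submit --master spark://master-host:7077 --deploy-mode client \
      --class com.example.MyApp my-app.jar

    # Cluster mode: the driver runs on a Worker (standalone) or inside the
    # ApplicationMaster (YARN); the client exits as soon as submission is done
    ./bin/spark-submit --master yarn --deploy-mode cluster \
      --class com.example.MyApp my-app.jar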

3. In the case where the Driver node fails, who is responsible for re-launching the application? And what will happen exactly? i.e. how the Master node, Cluster Manager and Worker nodes will get involved (if they do), and in which order?

If the driver fails, all executors and their running tasks for the submitted/triggered Spark application are killed.
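One mitigation worth noting (not part of the original answer, but documented Spark behavior): in standalone cluster mode, submitting with the --supervise flag tells the Master to restart the driver automatically if it exits with a non-zero exit code. A sketch with placeholder names:

    # Standalone cluster mode with driver supervision: the Master re-launches
    # the driver on failure (master URL, class and jar are illustrative)
    ./bin/spark-submit --master spark://master-host:7077 --deploy-mode cluster \
      --supervise --class com.example.MyApp my-app.jar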

4. In the case where the Master node fails, what will happen exactly and who is responsible for recovering from the failure?

Master failure is handled in one of two ways; a configuration sketch follows the list below.

  1. Standby Masters with ZooKeeper:

    Utilizing ZooKeeper to provide leader election and some state storage, you can launch multiple Masters in your cluster connected to the same ZooKeeper instance. One will be elected “leader” and the others will remain in standby mode. If the current leader dies, another Master will be elected, recover the old Master’s state, and then resume scheduling. The entire recovery process (from the time the first leader goes down) should take between 1 and 2 minutes. Note that this delay only affects scheduling new applications – applications that were already running during Master failover are unaffected. See the Spark standalone documentation for the configurations.

  2. Single-node recovery with the local file system:

    ZooKeeper is the best way to go for production-level high availability, but if you want to be able to restart the Master if it goes down, FILESYSTEM mode can take care of it. When applications and Workers register, they have enough state written to the provided directory so that they can be recovered upon a restart of the Master process. See the Spark standalone documentation for the configuration and more details.
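Both recovery modes are enabled through SPARK_DAEMON_JAVA_OPTS in conf/spark-env.sh; a minimal sketch, assuming illustrative ZooKeeper hosts and recovery directory:

    # Standby Masters with ZooKeeper
    SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER \
      -Dspark.deploy.zookeeper.url=zk1:2181,zk2:2181,zk3:2181 \
      -Dspark.deploy.zookeeper.dir=/spark"

    # Single-node recovery with the local file system
    SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=FILESYSTEM \
      -Dspark.deploy.recoveryDirectory=/var/spark/recovery"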

Regarding hadoop - Understanding Spark: Cluster Manager, Master and Driver nodes, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/34722415/
