apache-spark - 只能在同一台机器上从 Spark 连接到 Mesos

标签 apache-spark mesos

我正在尝试在 Mesos 集群上运行 Spark。

当我从运行 Mesos master 的机器上运行 ./bin/spark-shell --master mesos://host:5050 时,一切正常。但是,如果我从不同的机器运行相同的命令,进程在尝试连接后最终会挂起:

I0825 07:30:10.184141 27380 sched.cpp:126] Version: 0.19.0
I0825 07:30:10.187476 27385 sched.cpp:222] New master detected at master@192.168.0.241:5050
I0825 07:30:10.187619 27385 sched.cpp:230] No credentials provided. Attempting to register without authentication

在 mesos master 上,我看到以下输出:
[...]
I0825 15:30:23.928402 23214 master.cpp:684] Giving framework 20140825-143817-4043352256-5050-23194-0002 0ns to failover
I0825 15:30:23.929033 23210 master.cpp:2849] Framework failover timeout, removing framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.929095 23210 master.cpp:3344] Removing framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.929687 23210 hierarchical_allocator_process.hpp:636] Recovered mem(*):512 (total allocatable: cpus(*):4; mem(*):6831; disk(*):455983; ports(*):[31000-32000]) on slave 20140822-144404-4043352256-5050-15999-31 from framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.935073 23210 hierarchical_allocator_process.hpp:636] Recovered mem(*):512 (total allocatable: cpus(*):4; mem(*):15001; disk(*):917264; ports(*):[31000-32000]) on slave 20140822-144404-4043352256-5050-15999-29 from framework   20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.938248 23210 hierarchical_allocator_process.hpp:636] Recovered mem(*):512 (total allocatable: mem(*):6823; disk(*):455991; ports(*):[31000-32000]; cpus(*):4) on slave 20140822-144404-4043352256-5050-15999-32 from framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.938356 23210 hierarchical_allocator_process.hpp:636] Recovered mem(*):512 (total allocatable: mem(*):4939; disk(*):457873; ports(*):[31000-32000]; cpus(*):4) on slave 20140822-144404-4043352256-5050-15999-28 from framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:23.938397 23210 hierarchical_allocator_process.hpp:362] Removed framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:27.952940 23215 http.cpp:452] HTTP request for '/master/state.json'
W0825 15:30:29.595441 23208 master.cpp:2718] Ignoring unknown exited executor 20140822-144404-4043352256-5050-15999-32 on slave 20140822-144404-4043352256-5050-15999-32 at slave(1)@192.168.0.233:5051 (cluster2)
W0825 15:30:29.596709 23213 master.cpp:2718] Ignoring unknown exited executor 20140822-144404-4043352256-5050-15999-29 on slave 20140822-144404-4043352256-5050-15999-29 at slave(1)@192.168.0.241:5051 (cluster4)
W0825 15:30:29.615630 23213 master.cpp:2718] Ignoring unknown exited executor 20140822-144404-4043352256-5050-15999-31 on slave 20140822-144404-4043352256-5050-15999-31 at slave(1)@192.168.0.213:5051 (cluster3)
W0825 15:30:29.935130 23214 master.cpp:2718] Ignoring unknown exited executor 20140822-144404-4043352256-5050-15999-28 on slave 20140822-144404-4043352256-5050-15999-28 at slave(1)@192.168.0.212:5051 (cluster1)

作为奴隶输出的地方
[...]
I0825 15:30:08.450343   980 slave.cpp:1337] Asked to shut down framework 20140825-143817-4043352256-5050-23194-0002 by master@192.168.0.241:5050
I0825 15:30:08.455153   980 slave.cpp:1362] Shutting down framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:08.455401   980 slave.cpp:2698] Shutting down executor '20140822-144404-4043352256-5050-15999-31' of framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:13.456045   982 slave.cpp:2768] Killing executor '20140822-144404-4043352256-5050-15999-31' of framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:13.456217   982 mesos_containerizer.cpp:992] Destroying container '37cc2b09-0e6d-4738-a837-7956367bba2b'
I0825 15:30:14.134845   977 mesos_containerizer.cpp:1108] Executor for container '37cc2b09-0e6d-4738-a837-7956367bba2b' has exited
I0825 15:30:14.135220   978 slave.cpp:2413] Executor '20140822-144404-4043352256-5050-15999-31' of framework 20140825-143817-4043352256-5050-23194-0002 has terminated with signal Killed
I0825 15:30:14.135356   978 slave.cpp:2552] Cleaning up executor '20140822-144404-4043352256-5050-15999-31' of framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:14.135499   978 slave.cpp:2627] Cleaning up framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:14.135627   976 status_update_manager.cpp:282] Closing status update streams for framework 20140825-143817-4043352256-5050-23194-0002
I0825 15:30:14.135571   975 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20140822-144404-4043352256-5050-15999-31/frameworks/20140825-143817-4043352256-5050-23194-0002/executors/20140822-144404-4043352256-5050-15999-31/runs/37cc2b09-0e6d-4738-a837-7956367bba2b' for gc 6.99999843242074days in the future
I0825 15:30:14.135910   975 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20140822-144404-4043352256-5050-15999-31/frameworks/20140825-143817-4043352256-5050-23194-0002/executors/20140822-144404-4043352256-5050-15999-31' for gc 6.99999843187556days in the future
I0825 15:30:14.135980   975 gc.cpp:56] Scheduling '/tmp/mesos/slaves/20140822-144404-4043352256-5050-15999-31/frameworks/20140825-143817-4043352256-5050-23194-0002' for gc 6.99999843111111days in the future
I0825 15:31:04.450660   978 slave.cpp:2873] Current usage 60.67%. Max allowed age: 2.053113079446458days

有没有人见过类似的东西?

最佳答案

结果证明该问题不是由网络连接问题引起的,而是由此处概述的 Mesos 从属恢复策略引起的:http://mesos.apache.org/documentation/latest/slave-recovery/

由于一个无关的问题,我最初会将从站连接到主站并断开它们的连接,但是当我后来再次尝试连接从站时,它们被主站删除了。引用上面链接的文档:

A restarted slave should re-register with master within a timeout (currently, 75s). If the slave takes longer than this timeout to re-register, the master shuts down the slave, which in turn shuts down any live executors/tasks. Therefore, it is highly recommended to automate the process of restarting a slave (e.g, using monit).



我通过将奴隶与 --strict 连接来解决了这个问题。选项设置为 false .

关于apache-spark - 只能在同一台机器上从 Spark 连接到 Mesos,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/25481282/

相关文章:

eclipse - 在 eclipse Spark scala 调试 session 中,在哪里可以找到 RDD 中的数据?

docker - Apache Mesos的Docker容器化器

apache-spark - Spark 在使用 Docker Mesos 集群进行身份验证时挂起

docker - docker容器如何在Mesos/Marathon设置中进行通信

kubernetes - 建立一个廉价的集群

python - 为什么我的 Spark SVM 总是预测相同的标签?

hadoop - 批处理模式中的 livy 抛出错误 Error : Only local python files are supported: Parsed arguments

apache-spark - 为什么向 Mesos 提交 Spark 应用程序失败并显示 "Could not parse Master URL: ' mesos ://localhost:505 0'"?

java - 为什么 Docker 容器中的 Spark 应用程序会失败并出现 OutOfMemoryError : Java heap space?

machine-learning - Apache Spark ALS协作过滤结果。他们没有道理