ubuntu-14.04 - Mesosphere - High-availability cluster fails to elect a leader, but the logs show no errors and there seems to be no way to force a leader election

Tags: ubuntu-14.04 apache-zookeeper mesos mesosphere marathon

I have a 6-machine cluster. The machines are:

      HOST        MEM (GB) CPU
mesos-primary-1     8       2
mesos-primary-2     8       2
mesos-primary-3     8       2
mesos-worker-1      1       1
mesos-worker-2      1       1
mesos-worker-3      1       1

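My quorum is set to 2.

The masters' IDs are 1, 2, and 3, respectively.
In the web UI, I visited the individual IP of each of mesos-primary-1, mesos-primary-2, and mesos-primary-3 on port 5050, and I did not get a redirect from any of them.

The lack of redirects makes me believe that each machine thinks it has its own quorum or something, which is why they can't see each other and elect a leader.

Visiting port 8080 on any of the machines gives an error because there is no elected leader, but the address does resolve.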
$ cat /etc/mesos-master/quorum
outputs 2 on every master.
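
For reference, Mesos expects the quorum to be a strict majority of the masters, so 2 is the expected value for three masters. A minimal sketch of that arithmetic follows; the function name `majority_quorum` is made up for illustration and is not part of Mesos.

```shell
#!/bin/sh
# Hypothetical helper: smallest strict majority of N masters,
# i.e. floor(N / 2) + 1 using integer division.
majority_quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Three masters, as in the cluster described above.
majority_quorum 3
```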

I have also stopped and restarted everything. On the masters:
$ sudo service mesos-master stop
$ sudo service marathon stop
$ sudo service zookeeper stop
$ sudo service mesos-master start
$ sudo service marathon start
$ sudo service zookeeper start

On each slave:
$ sudo service mesos-slave stop
$ sudo service mesos-slave start

Still, no slaves are detected and no leader is elected.

My logs are clean on all 3 IPs (I fetched each one separately, since there is no redirect). You can see each of them here:

mesos-primary-1
Log file created at: 2015/10/02 11:00:01
Running on machine: mesos-primary-2
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1002 11:00:01.532337 13722 logging.cpp:172] INFO level logging started!
I1002 11:00:01.532865 13722 main.cpp:229] Build: 2015-09-25 19:13:24 by root
I1002 11:00:01.532894 13722 main.cpp:231] Version: 0.24.1
I1002 11:00:01.532903 13722 main.cpp:234] Git tag: 0.24.1
I1002 11:00:01.532909 13722 main.cpp:238] Git SHA: 44873806c2bb55da37e9adbece938274d8cd7c48
I1002 11:00:01.533020 13722 main.cpp:252] Using 'HierarchicalDRF' allocator
I1002 11:00:01.546877 13722 leveldb.cpp:176] Opened db in 13.691496ms
I1002 11:00:01.550370 13722 leveldb.cpp:183] Compacted db in 2.522303ms
I1002 11:00:01.550559 13722 leveldb.cpp:198] Created db iterator in 118591ns
I1002 11:00:01.550618 13722 leveldb.cpp:204] Seeked to beginning of db in 1151ns
I1002 11:00:01.550642 13722 leveldb.cpp:273] Iterated through 0 keys in the db in 767ns
I1002 11:00:01.551029 13722 replica.cpp:744] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
I1002 11:00:01.553994 13743 log.cpp:238] Attempting to join replica to ZooKeeper group
I1002 11:00:01.556193 13740 recover.cpp:449] Starting replica recovery
I1002 11:00:01.561755 13722 main.cpp:465] Starting Mesos master
I1002 11:00:01.563489 13740 recover.cpp:475] Replica is in EMPTY status
I1002 11:00:01.568989 13722 master.cpp:378] Master 20151002-110001-2874854303-5050-13722 (159.203.90.171) started on 159.203.90.171:5050
I1002 11:00:01.569059 13722 master.cpp:380] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname="159.203.90.171" --initialize_driver_logging="true" --ip="159.203.90.171" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="2" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" --registry_strict="false" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui" --work_dir="/var/lib/mesos" --zk="zk://159.203.90.171:2181,104.131.35.19:2181,104.131.117.124:2181/mesos" --zk_session_timeout="10secs"
I1002 11:00:01.569535 13722 master.cpp:427] Master allowing unauthenticated frameworks to register
I1002 11:00:01.569581 13722 master.cpp:432] Master allowing unauthenticated slaves to register
I1002 11:00:01.569608 13722 master.cpp:469] Using default 'crammd5' authenticator
W1002 11:00:01.569718 13722 authenticator.cpp:505] No credentials provided, authentication requests will be refused.
I1002 11:00:01.570199 13722 authenticator.cpp:512] Initializing server SASL
I1002 11:00:01.582969 13722 master.cpp:1464] Successfully attached file '/var/log/mesos/mesos-master.INFO'
I1002 11:00:01.584786 13743 contender.cpp:149] Joining the ZK group
I1002 11:00:11.573873 13747 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying
I1002 11:01:06.547200 13743 http.cpp:321] HTTP GET for /master/state.json from 173.243.85.102:51963 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/45.0.2454.101 Safari/537.36'

mesos-primary-2
Log file created at: 2015/10/02 11:00:01
Running on machine: mesos-primary-2
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1002 11:00:01.532337 13722 logging.cpp:172] INFO level logging started!
I1002 11:00:01.532865 13722 main.cpp:229] Build: 2015-09-25 19:13:24 by root
I1002 11:00:01.532894 13722 main.cpp:231] Version: 0.24.1
I1002 11:00:01.532903 13722 main.cpp:234] Git tag: 0.24.1
I1002 11:00:01.532909 13722 main.cpp:238] Git SHA: 44873806c2bb55da37e9adbece938274d8cd7c48
I1002 11:00:01.533020 13722 main.cpp:252] Using 'HierarchicalDRF' allocator
I1002 11:00:01.546877 13722 leveldb.cpp:176] Opened db in 13.691496ms
I1002 11:00:01.550370 13722 leveldb.cpp:183] Compacted db in 2.522303ms
I1002 11:00:01.550559 13722 leveldb.cpp:198] Created db iterator in 118591ns
I1002 11:00:01.550618 13722 leveldb.cpp:204] Seeked to beginning of db in 1151ns
I1002 11:00:01.550642 13722 leveldb.cpp:273] Iterated through 0 keys in the db in 767ns
I1002 11:00:01.551029 13722 replica.cpp:744] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
I1002 11:00:01.553994 13743 log.cpp:238] Attempting to join replica to ZooKeeper group
I1002 11:00:01.556193 13740 recover.cpp:449] Starting replica recovery
I1002 11:00:01.561755 13722 main.cpp:465] Starting Mesos master
I1002 11:00:01.563489 13740 recover.cpp:475] Replica is in EMPTY status
I1002 11:00:01.568989 13722 master.cpp:378] Master 20151002-110001-2874854303-5050-13722 (159.203.90.171) started on 159.203.90.171:5050
I1002 11:00:01.569059 13722 master.cpp:380] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname="159.203.90.171" --initialize_driver_logging="true" --ip="159.203.90.171" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="2" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" --registry_strict="false" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui" --work_dir="/var/lib/mesos" --zk="zk://159.203.90.171:2181,104.131.35.19:2181,104.131.117.124:2181/mesos" --zk_session_timeout="10secs"
I1002 11:00:01.569535 13722 master.cpp:427] Master allowing unauthenticated frameworks to register
I1002 11:00:01.569581 13722 master.cpp:432] Master allowing unauthenticated slaves to register
I1002 11:00:01.569608 13722 master.cpp:469] Using default 'crammd5' authenticator
W1002 11:00:01.569718 13722 authenticator.cpp:505] No credentials provided, authentication requests will be refused.
I1002 11:00:01.570199 13722 authenticator.cpp:512] Initializing server SASL
I1002 11:00:01.582969 13722 master.cpp:1464] Successfully attached file '/var/log/mesos/mesos-master.INFO'
I1002 11:00:01.584786 13743 contender.cpp:149] Joining the ZK group
I1002 11:00:11.573873 13747 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying

mesos-primary-3
Log file created at: 2015/10/02 11:00:12
Running on machine: mesos-primary-3
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I1002 11:00:12.609675 17105 logging.cpp:172] INFO level logging started!
I1002 11:00:12.610414 17105 main.cpp:229] Build: 2015-09-25 19:13:24 by root
I1002 11:00:12.610452 17105 main.cpp:231] Version: 0.24.1
I1002 11:00:12.610468 17105 main.cpp:234] Git tag: 0.24.1
I1002 11:00:12.610483 17105 main.cpp:238] Git SHA: 44873806c2bb55da37e9adbece938274d8cd7c48
I1002 11:00:12.610576 17105 main.cpp:252] Using 'HierarchicalDRF' allocator
I1002 11:00:12.618232 17105 leveldb.cpp:176] Opened db in 7.382537ms
I1002 11:00:12.619810 17105 leveldb.cpp:183] Compacted db in 1.512691ms
I1002 11:00:12.619876 17105 leveldb.cpp:198] Created db iterator in 27030ns
I1002 11:00:12.619910 17105 leveldb.cpp:204] Seeked to beginning of db in 1254ns
I1002 11:00:12.619925 17105 leveldb.cpp:273] Iterated through 0 keys in the db in 339ns
I1002 11:00:12.620028 17105 replica.cpp:744] Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
I1002 11:00:12.620930 17125 log.cpp:238] Attempting to join replica to ZooKeeper group
I1002 11:00:12.621615 17128 recover.cpp:449] Starting replica recovery
I1002 11:00:12.626735 17105 main.cpp:465] Starting Mesos master
I1002 11:00:12.627024 17128 recover.cpp:475] Replica is in EMPTY status
I1002 11:00:12.633635 17123 master.cpp:378] Master 20151002-110012-321094504-5050-17105 (104.131.35.19) started on 104.131.35.19:5050
I1002 11:00:12.633828 17123 master.cpp:380] Flags at startup: --allocation_interval="1secs" --allocator="HierarchicalDRF" --authenticate="false" --authenticate_slaves="false" --authenticators="crammd5" --authorizers="local" --framework_sorter="drf" --help="false" --hostname="104.131.35.19" --initialize_driver_logging="true" --ip="104.131.35.19" --log_auto_initialize="true" --log_dir="/var/log/mesos" --logbufsecs="0" --logging_level="INFO" --max_slave_ping_timeouts="5" --port="5050" --quiet="false" --quorum="2" --recovery_slave_removal_limit="100%" --registry="replicated_log" --registry_fetch_timeout="1mins" --registry_store_timeout="5secs" --registry_strict="false" --root_submissions="true" --slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" --user_sorter="drf" --version="false" --webui_dir="/usr/share/mesos/webui" --work_dir="/var/lib/mesos" --zk="zk://159.203.90.171:2181,104.131.35.19:2181,104.131.117.124:2181/mesos" --zk_session_timeout="10secs"
I1002 11:00:12.635736 17123 master.cpp:427] Master allowing unauthenticated frameworks to register
I1002 11:00:12.635771 17123 master.cpp:432] Master allowing unauthenticated slaves to register
I1002 11:00:12.635802 17123 master.cpp:469] Using default 'crammd5' authenticator
W1002 11:00:12.635835 17123 authenticator.cpp:505] No credentials provided, authentication requests will be refused.
I1002 11:00:12.636078 17123 authenticator.cpp:512] Initializing server SASL
I1002 11:00:12.643378 17125 contender.cpp:149] Joining the ZK group
I1002 11:00:12.643826 17123 master.cpp:1464] Successfully attached file '/var/log/mesos/mesos-master.INFO'
I1002 11:00:22.633390 17130 recover.cpp:111] Unable to finish the recover protocol in 10secs, retrying

I set up the machines by following this DigitalOcean guide.

Running
$ MASTER=$(mesos-resolve `cat /etc/mesos/zk`)
$ mesos-execute --master=$MASTER --name="cluster-test" --command="sleep 5"

yields:
2015-10-02 12:30:26,137:14558(0x7f8dbb743700):ZOO_INFO@log_env@712: Client environment:zookeeper.version=zookeeper C client 3.4.5
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@716: Client environment:host.name=mesos-primary-1
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@723: Client environment:os.name=Linux
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@724: Client environment:os.arch=3.13.0-57-generic
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@725: Client environment:os.version=#95-Ubuntu SMP Fri Jun 19 09:28:15 UTC 2015
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@733: Client environment:user.name=root
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@741: Client environment:user.home=/root
2015-10-02 12:30:26,141:14558(0x7f8dbb743700):ZOO_INFO@log_env@753: Client environment:user.dir=/root
2015-10-02 12:30:26,142:14558(0x7f8dbb743700):ZOO_INFO@zookeeper_init@786: Initiating client connection, host=159.203.90.171:2181,104.131.35.19:2181,104.131.117.124:2181 sessionTimeout=10000 watcher=0x7f8dc3625610 sessionId=0 sessionPasswd=<null> context=0x7f8da8003960 flags=0
2015-10-02 12:30:26,142:14558(0x7f8db6eff700):ZOO_INFO@check_events@1703: initiated connection to server [104.131.35.19:2181]
2015-10-02 12:30:26,144:14558(0x7f8db6eff700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [104.131.35.19:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2015-10-02 12:30:26,144:14558(0x7f8db6eff700):ZOO_INFO@check_events@1703: initiated connection to server [104.131.117.124:2181]
2015-10-02 12:30:26,144:14558(0x7f8db6eff700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [104.131.117.124:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2015-10-02 12:30:26,145:14558(0x7f8db6eff700):ZOO_INFO@check_events@1703: initiated connection to server [159.203.90.171:2181]
2015-10-02 12:30:26,147:14558(0x7f8db6eff700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [159.203.90.171:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2015-10-02 12:30:29,484:14558(0x7f8db6eff700):ZOO_INFO@check_events@1703: initiated connection to server [104.131.35.19:2181]
2015-10-02 12:30:29,485:14558(0x7f8db6eff700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [104.131.35.19:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2015-10-02 12:30:29,485:14558(0x7f8db6eff700):ZOO_INFO@check_events@1703: initiated connection to server [104.131.117.124:2181]
2015-10-02 12:30:29,486:14558(0x7f8db6eff700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [104.131.117.124:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
2015-10-02 12:30:29,487:14558(0x7f8db6eff700):ZOO_INFO@check_events@1703: initiated connection to server [159.203.90.171:2181]
2015-10-02 12:30:29,488:14558(0x7f8db6eff700):ZOO_ERROR@handle_socket_error_msg@1721: Socket [159.203.90.171:2181] zk retcode=-4, errno=112(Host is down): failed while receiving a server response
Failed to detect master from 'zk://159.203.90.171:2181,104.131.35.19:2181,104.131.117.124:2181/mesos' within 5secs
root@mesos-primary-1:~# mesos-execute --master=$MASTER --name="cluster-test" --command="sleep 5"

Does anyone have any ideas?

Best answer

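It looks to me as though your machines are either unreachable from one another, or the required ports are blocked on some or all of them. Make sure of the following:

A. The ports are unblocked: 2181 (zookeeper), 2888 and 3888 (for slaves joining and master election, respectively), and 5050 (mesos) / 8080 (if you are using marathon) for the UI from your desktop/laptop. I believe the slaves only need 2888 to be reachable from the master.

B. You can ping all the other hosts from one machine first, i.e. take master 1 and ping masters 2 and 3.

C. Debug getting the masters to form a cluster correctly before worrying about the slaves.

You seem to have a good set of configuration and the correct quorum setting here. Once you are sure the machines can reach each other, you can investigate other potential problems. Let us know how it goes!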
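
The checks in A and B can be sketched as a small script run from one master against the others. This is only a sketch under stated assumptions: the script name `check_ports.sh`, the 2-second timeout, and the helper name `check_port` are made up for illustration, and it relies on bash's `/dev/tcp` redirection plus the `timeout` command from coreutils. The hosts to probe are passed as arguments.

```shell
#!/usr/bin/env bash
# Sketch: probe the ports named in point A on each host given as an argument.
# Usage (hypothetical file name): ./check_ports.sh 104.131.35.19 104.131.117.124

# Ports from the answer: ZooKeeper client, ZooKeeper peer ports, Mesos, Marathon.
PORTS="2181 2888 3888 5050 8080"

# Returns 0 if a TCP connection to host:port succeeds within 2 seconds.
# Uses bash's /dev/tcp pseudo-device, so nc or telnet is not required.
check_port() {
    local host=$1 port=$2
    timeout 2 bash -c "exec 3<>/dev/tcp/$host/$port" 2>/dev/null
}

# With no arguments this loop does nothing.
for host in "$@"; do
    for port in $PORTS; do
        if check_port "$host" "$port"; then
            echo "$host:$port open"
        else
            echo "$host:$port unreachable or blocked"
        fi
    done
done
```

A refused connection and a filtered port both show up as "unreachable or blocked" here, so a failure only narrows the problem down to that host and port. As far as I know, only the current ZooKeeper leader listens on 2888, so a failure on that port for the other ZooKeeper servers is not necessarily a firewall problem.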

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/32910539/
