mysql - percona mysql xtradb集群启动不正常,节点重启不起作用

标签 mysql percona galera

tl;dr

当启动一个由 3 个 kubernetes pod 组成的全新 percona 集群时,grastate.dat seq_no 设置为 -1 并且不会更改.在删除一个 pod 并观察它重新启动时,期望它重新加入集群,它将它的初始位置设置为 00000000-0000-0000-0000-000000000000:-1 并尝试连接到它自己(它是以前的ip),也许是因为它是集群中的第一个 pod?然后它在与自身的错误连接中超时:

2017-03-26T08:38:05.374058Z 0 [Note] WSREP: (b7571ff8, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.52.0.26:4567 timed out, no messages seen in PT3S

集群没有正常启动,我无法成功重启集群中的 pod。

完整

当我从头开始启动集群时。有了空白的数据目录和一个新的 etcd 集群,一切似乎都出现了。但是,我查看了 grastate.dat,发现每个 pod 的 seq_no-1:

root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-0/grastate.dat
# GALERA saved state
version: 2.1
uuid:    a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno:   -1
safe_to_bootstrap: 0
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-1/grastate.dat
# GALERA saved state
version: 2.1
uuid:    a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno:   -1
safe_to_bootstrap: 0
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-2/grastate.dat
# GALERA saved state
version: 2.1
uuid:    a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno:   -1
safe_to_bootstrap: 0

此时我可以执行 mysql -h percona -u wordpress -p 并连接并且 wordpress 也可以工作。

场景: 我有 3 个 percona pod

/ # jonathan@ubuntu:~/Projects/k8wp$ kubectl get pods
NAME                         READY     STATUS    RESTARTS   AGE
etcd-0                       1/1       Running   1          12h
etcd-1                       1/1       Running   0          12h
etcd-2                       1/1       Running   3          12h
etcd-3                       1/1       Running   1          12h
percona-0                    1/1       Running   0          8m
percona-1                    1/1       Running   0          57m
percona-2                    1/1       Running   0          57m

当我尝试重启 percona-0 时,它在重启时被踢出集群,percona-0 的 gvwstate.dat 文件显示

root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-0/gvwstate.dat
my_uuid: b7571ff8-11f8-11e7-bd2d-8b50487e1523
#vwbeg
view_id: 3 b7571ff8-11f8-11e7-bd2d-8b50487e1523 3
bootstrap: 0
member: b7571ff8-11f8-11e7-bd2d-8b50487e1523 0
member: bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 0
member: c33d6a73-11f8-11e7-9e86-fe1cf3d3367a 0
#vwend

集群中的其他 2 个 pod 显示:

root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-1/gvwstate.dat
my_uuid: bd05a643-11f8-11e7-9dab-1b4fc20eaf6a
#vwbeg
view_id: 3 bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 4
bootstrap: 0
member: bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 0
member: c33d6a73-11f8-11e7-9e86-fe1cf3d3367a 0
#vwend
root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-2/gvwstate.dat
my_uuid: c33d6a73-11f8-11e7-9e86-fe1cf3d3367a
#vwbeg
view_id: 3 bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 4
bootstrap: 0
member: bd05a643-11f8-11e7-9dab-1b4fc20eaf6a 0
member: c33d6a73-11f8-11e7-9e86-fe1cf3d3367a 0
#vwend

以下是我认为 percona-0 启动时的相关错误:

2017-03-26T08:37:58.370605Z 0 [Note] WSREP: Setting initial position to 00000000-0000-0000-0000-000000000000:-1
2017-03-26T08:37:58.372537Z 0 [Note] WSREP: gcomm: connecting to group 'wordpress-001', peer '10.52.0.26:'
2017-03-26T08:38:01.373345Z 0 [Note] WSREP: (b7571ff8, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.52.0.26:4567 timed out, no messages seen in PT3S
2017-03-26T08:38:01.373682Z 0 [Warning] WSREP: no nodes coming from prim view, prim not possible
2017-03-26T08:38:01.373750Z 0 [Note] WSREP: view(view_id(NON_PRIM,b7571ff8,5) memb {
    b7571ff8,0
} joined {
} left {
} partitioned {
})
2017-03-26T08:38:01.373838Z 0 [Note] WSREP: gcomm: connected
2017-03-26T08:38:01.373872Z 0 [Note] WSREP: Changing maximum packet size to 64500, resulting msg size: 32636
2017-03-26T08:38:01.373987Z 0 [Note] WSREP: Shifting CLOSED -> OPEN (TO: 0)
2017-03-26T08:38:01.374012Z 0 [Note] WSREP: Opened channel 'wordpress-001'
2017-03-26T08:38:01.374108Z 0 [Note] WSREP: Waiting for SST to complete.
2017-03-26T08:38:01.374417Z 0 [Note] WSREP: New COMPONENT: primary = no, bootstrap = no, my_idx = 0, memb_num = 1
2017-03-26T08:38:01.374469Z 0 [Note] WSREP: Flow-control interval: [16, 16]
2017-03-26T08:38:01.374491Z 0 [Note] WSREP: Received NON-PRIMARY.
2017-03-26T08:38:01.374560Z 1 [Note] WSREP: New cluster view: global state: :-1, view# -1: non-Primary, number of nodes: 1, my index: 0, protocol version -1

它尝试连接到 2017-03-26T08:37:58.372537Z 0 中的 10.52.0.26 的 ip [注意] WSREP: gcomm: connecting to group 'wordpress-001' , peer '10.52.0.26:' 实际上是 pods 之前的 ip,这是我在删除 percona-0 之前在 etcd 中做的 key 列表

/ # etcdctl ls --recursive
/pxc-cluster
/pxc-cluster/wordpress
/pxc-cluster/queue
/pxc-cluster/queue/wordpress
/pxc-cluster/queue/wordpress-001
/pxc-cluster/wordpress-001
/pxc-cluster/wordpress-001/10.52.1.46
/pxc-cluster/wordpress-001/10.52.1.46/ipaddr
/pxc-cluster/wordpress-001/10.52.1.46/hostname
/pxc-cluster/wordpress-001/10.52.2.33
/pxc-cluster/wordpress-001/10.52.2.33/ipaddr
/pxc-cluster/wordpress-001/10.52.2.33/hostname
/pxc-cluster/wordpress-001/10.52.0.26
/pxc-cluster/wordpress-001/10.52.0.26/hostname
/pxc-cluster/wordpress-001/10.52.0.26/ipaddr

kubectl 删除 pods/percona-0 后:

/ # etcdctl ls --recursive
/pxc-cluster
/pxc-cluster/queue
/pxc-cluster/queue/wordpress
/pxc-cluster/queue/wordpress-001
/pxc-cluster/wordpress-001
/pxc-cluster/wordpress-001/10.52.1.46
/pxc-cluster/wordpress-001/10.52.1.46/ipaddr
/pxc-cluster/wordpress-001/10.52.1.46/hostname
/pxc-cluster/wordpress-001/10.52.2.33
/pxc-cluster/wordpress-001/10.52.2.33/ipaddr
/pxc-cluster/wordpress-001/10.52.2.33/hostname
/pxc-cluster/wordpress

同样在重启期间 percona-0 尝试注册到 etcd:

{"action":"create","node":{"key":"/pxc-cluster/queue/wordpress-001/00000000000000009886","value":"10.52.0.27","expiration":"2017-03-26T08:38:57.980325718Z","ttl":60,"modifiedIndex":9886,"createdIndex":9886}}
{"action":"set","node":{"key":"/pxc-cluster/wordpress-001/10.52.0.27/ipaddr","value":"10.52.0.27","expiration":"2017-03-26T08:38:28.01814818Z","ttl":30,"modifiedIndex":9887,"createdIndex":9887}}
{"action":"set","node":{"key":"/pxc-cluster/wordpress-001/10.52.0.27/hostname","value":"percona-0","expiration":"2017-03-26T08:38:28.037188157Z","ttl":30,"modifiedIndex":9888,"createdIndex":9888}}
{"action":"update","node":{"key":"/pxc-cluster/wordpress-001/10.52.0.27","dir":true,"expiration":"2017-03-26T08:38:28.054726795Z","ttl":30,"modifiedIndex":9889,"createdIndex":9887},"prevNode":{"key":"/pxc-cluster/wordpress-001/10.52.0.27","dir":true,"modifiedIndex":9887,"createdIndex":9887}}

这是行不通的。

来自集群 percona-1 的第二个成员:

2017-03-26T08:37:44.069583Z 0 [Note] WSREP: (bd05a643, 'tcp://0.0.0.0:4567') turning message relay requesting on, nonlive peers: tcp://10.52.0.26:4567 
2017-03-26T08:37:45.069756Z 0 [Note] WSREP: (bd05a643, 'tcp://0.0.0.0:4567') reconnecting to b7571ff8 (tcp://10.52.0.26:4567), attempt 0
2017-03-26T08:37:48.570332Z 0 [Note] WSREP: (bd05a643, 'tcp://0.0.0.0:4567') connection to peer 00000000 with addr tcp://10.52.0.26:4567 timed out, no messages seen in PT3S
2017-03-26T08:37:49.605089Z 0 [Note] WSREP: evs::proto(bd05a643, GATHER, view_id(REG,b7571ff8,3)) suspecting node: b7571ff8
2017-03-26T08:37:49.605276Z 0 [Note] WSREP: evs::proto(bd05a643, GATHER, view_id(REG,b7571ff8,3)) suspected node without join message, declaring inactive
2017-03-26T08:37:50.104676Z 0 [Note] WSREP: declaring c33d6a73 at tcp://10.52.2.33:4567 stable

新信息: 我再次重新启动 percona-0,这次它不知何故出现了!几次尝试后,我意识到 pod 需要重新启动两次才能出现,即在第一次删除它之后,它出现了上述错误,在第二次删除它之后它出现正常并与其他成员同步。会不会是因为它是集群中的第一个 pod?

我测试过删除其他 pod,但它们都恢复正常。

问题只在于 percona-0。

还有; 一次把所有的 Pod 都拿下来,如果我的节点崩溃了,那就是 Pod 根本不会恢复的情况!我怀疑这是因为没有状态被保存到 grastate.dat ,即 seq_no 保持 -1,即使全局 id 可能改变,pod 退出并关闭 mysqld,并出现以下错误:

jonathan@ubuntu:~/Projects/k8wp$ kubectl logs percona-2 | grep ERROR
2017-03-26T11:20:25.795085Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
2017-03-26T11:20:25.795276Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2017-03-26T11:20:25.795544Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1437: Failed to open channel 'wordpress-001' at 'gcomm://10.52.2.36': -110 (Connection timed out)
2017-03-26T11:20:25.795618Z 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2017-03-26T11:20:25.795645Z 0 [ERROR] WSREP: wsrep::connect(gcomm://10.52.2.36) failed: 7
2017-03-26T11:20:25.795693Z 0 [ERROR] Aborting
jonathan@ubuntu:~/Projects/k8wp$ kubectl logs percona-1 | grep ERROR
2017-03-26T11:20:27.093780Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
2017-03-26T11:20:27.093977Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2017-03-26T11:20:27.094145Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1437: Failed to open channel 'wordpress-001' at 'gcomm://10.52.1.49': -110 (Connection timed out)
2017-03-26T11:20:27.094200Z 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2017-03-26T11:20:27.094227Z 0 [ERROR] WSREP: wsrep::connect(gcomm://10.52.1.49) failed: 7
2017-03-26T11:20:27.094247Z 0 [ERROR] Aborting
jonathan@ubuntu:~/Projects/k8wp$ kubectl logs percona-0 | grep ERROR
2017-03-26T11:20:52.040214Z 0 [ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)
2017-03-26T11:20:52.040279Z 0 [ERROR] WSREP: gcs/src/gcs_core.cpp:gcs_core_open():208: Failed to open backend connection: -110 (Connection timed out)
2017-03-26T11:20:52.040385Z 0 [ERROR] WSREP: gcs/src/gcs.cpp:gcs_open():1437: Failed to open channel 'wordpress-001' at 'gcomm://10.52.2.36': -110 (Connection timed out)
2017-03-26T11:20:52.040437Z 0 [ERROR] WSREP: gcs connect failed: Connection timed out
2017-03-26T11:20:52.040471Z 0 [ERROR] WSREP: wsrep::connect(gcomm://10.52.2.36) failed: 7
2017-03-26T11:20:52.040508Z 0 [ERROR] Aborting

grastate.dat 删除所有 pod:

root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-0/grastate.dat
# GALERA saved state
version: 2.1
uuid:    a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno:   -1
safe_to_bootstrap: 0
 root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-1/grastate.dat
# GALERA saved state
version: 2.1
uuid:    a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno:   -1
safe_to_bootstrap: 0
 root@gluster-3:/mnt/gfs/gluster_vol-1/mysql# cat percona-2/grastate.dat
# GALERA saved state
version: 2.1
uuid:    a91f70f2-11f8-11e7-8f3d-86c2e58790ac
seqno:   -1
safe_to_bootstrap: 0

不,gvwstate.dat

最佳答案

通过将容器中的入口点更改为以下脚本来修复它:

#!/bin/bash
sed -i \"s|safe_to_bootstrap.*:.*|safe_to_bootstrap:1|1\" /var/lib/mysql/grastate.dat; 
/entrypoint.sh --wsrep-new-cluster;

感谢https://www.claudiokuenzler.com/blog/494/galera-cluster-mysql-not-starting-failed-to-open-channel-reach-primary#.WNesDiF97Qo

问题是,当 3 个 pod 从崩溃中重启时,它们都遇到了以下错误:

[ERROR] WSREP: failed to open gcomm backend connection: 110: failed to reach primary view: 110 (Connection timed out)

这意味着(从链接中总结),因为所有 pod 都关闭了,所以第一个 pod(pod 由 statefulset 管理)出现并尝试重新连接到集群,但没有找到任何其他 pod它可以连接到的 pod,所以它下降,下一个 pod 出现尝试同样的事情,遇到同样的错误然后下降到等等等等

解决方案是让第一个 pod 在出现时启动一个新集群,然后所有后续的 pod 都会出现并找到要连接的节点。它仍然会提供所有数据。

因此,使用 percona xtradb 时,docker 容器的入口点如下所示:

exec mysqld --user=mysql --wsrep_cluster_name=$CLUSTER_NAME --wsrep_cluster_address="gcomm://$cluster_join" --wsrep_sst_method=xtrabackup-v2 --wsrep_sst_auth="xtrabackup:$XTRABACKUP_PASSWORD" --log-error=${DATADIR}error.log $CMDARG

因此,要让设置运行,我要做的就是将前面的参数 --wsrep-new-cluster 传递给/entrypoint.sh 文件,如下所示:

/entrypoint.sh --wsrep-new-cluster

附言// 我首先单独尝试了上面的方法,但我遇到了一个错误,指出要强制一个新的集群并使用该节点进行引导,我必须在 /var/lib/中将 safe_to_bootstrap 从 0 设置为 1 mysql/grastate.dat

关于mysql - percona mysql xtradb集群启动不正常,节点重启不起作用,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/43027043/

相关文章:

php - Mysqli 连接来自 2 个不同数据库的表

mariadb - MariaDB 和 HAProxy(集群)的连接问题

mysql表关系应该是一对多还是一对一

mysql - 如何在 mysql 查询中解析 csv 文件

mysql - MySQL 5.0 上的 pt-online-schema-change Percona 命令问题

mysql - 如何为 Percona 监控和管理配置 MySQL

mysql - 有效地选择相关表中存在行的行

mysql - Galera Mysql wsrep_cluster_status 保持断开连接

mysql - 相互复制两个集群

mysql - 由于错误 150(外键约束),删除后无法重新创建 mysql 表!