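I am trying to run an application using docker swarm. The app is designed to run completely locally on a single computer using docker swarm.
If I SSH into the server and run docker stack deploy, everything works, as seen here from running docker service ls: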
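When this deployment works, the services generally come online in this order:
- Registry (a private registry)
- Main (an Nginx service) and Postgres
- All other services in random order (all Node apps)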
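The problem I am having is on reboot. When I reboot the server, I pretty consistently see the services fail with this result:
I get some errors that could be helpful.
In Postgres: docker service logs APP_NAME_postgres -f:
In the Docker logs: sudo journalctl -fu docker.service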
Update: June 5, 2019
Also, as requested in a GitHub issue, the docker version output:
Client:
 Version: 18.09.5
 API version: 1.39
 Go version: go1.10.8
 Git commit: e8ff056
 Built: Thu Apr 11 04:43:57 2019
 OS/Arch: linux/amd64
 Experimental: false

Server: Docker Engine - Community
 Engine:
  Version: 18.09.5
  API version: 1.39 (minimum version 1.12)
  Go version: go1.10.8
  Git commit: e8ff056
  Built: Thu Apr 11 04:10:53 2019
  OS/Arch: linux/amd64
  Experimental: false
And the docker info output:
Containers: 28
 Running: 9
 Paused: 0
 Stopped: 19
Images: 14
Server Version: 18.09.5
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
Swarm: active
 NodeID: pbouae9n1qnezcq2y09m7yn43
 Is Manager: true
 ClusterID: nq9095ldyeq5ydbsqvwpgdw1z
 Managers: 1
 Nodes: 1
 Default Address Pool: 10.0.0.0/8
 SubnetSize: 24
 Orchestration:
  Task History Retention Limit: 5
 Raft:
  Snapshot Interval: 10000
  Number of Old Snapshots to Retain: 0
  Heartbeat Tick: 1
  Election Tick: 10
 Dispatcher:
  Heartbeat Period: 5 seconds
 CA Configuration:
  Expiry Duration: 3 months
  Force Rotate: 1
 Autolock Managers: false
 Root Rotation In Progress: false
 Node Address: 192.168.0.47
 Manager Addresses:
  192.168.0.47:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: bb71b10fd8f58240ca47fbb579b9d1028eea7c84
runc version: 2b18fe1d885ee5083ef9f0838fee39b62d653e30
init version: fec3683
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.15.0-50-generic
Operating System: Ubuntu 18.04.2 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 3.68GiB
Name: oeemaster
ID: 76LH:BH65:CFLT:FJOZ:NCZT:VJBM:2T57:UMAL:3PVC:OOXO:EBSZ:OIVH
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false
Product License: Community Engine
WARNING: No swap limit support
And finally, my docker swarm stack/compose file:
secrets:
  jwt-secret:
    external: true
  pg-db:
    external: true
  pg-host:
    external: true
  pg-pass:
    external: true
  pg-user:
    external: true
  ssl_dhparam:
    external: true
services:
  accounts:
    depends_on:
    - postgres
    - registry
    deploy:
      restart_policy:
        condition: on-failure
    environment:
      JWT_SECRET_FILE: /run/secrets/jwt-secret
      PG_DB_FILE: /run/secrets/pg-db
      PG_HOST_FILE: /run/secrets/pg-host
      PG_PASS_FILE: /run/secrets/pg-pass
      PG_USER_FILE: /run/secrets/pg-user
    image: 127.0.0.1:5000/local-oee-master-accounts:v0.8.0
    secrets:
    - source: jwt-secret
    - source: pg-db
    - source: pg-host
    - source: pg-pass
    - source: pg-user
  graphs:
    depends_on:
    - postgres
    - registry
    deploy:
      restart_policy:
        condition: on-failure
    environment:
      PG_DB_FILE: /run/secrets/pg-db
      PG_HOST_FILE: /run/secrets/pg-host
      PG_PASS_FILE: /run/secrets/pg-pass
      PG_USER_FILE: /run/secrets/pg-user
    image: 127.0.0.1:5000/local-oee-master-graphs:v0.8.0
    secrets:
    - source: pg-db
    - source: pg-host
    - source: pg-pass
    - source: pg-user
  health:
    depends_on:
    - postgres
    - registry
    deploy:
      restart_policy:
        condition: on-failure
    environment:
      PG_DB_FILE: /run/secrets/pg-db
      PG_HOST_FILE: /run/secrets/pg-host
      PG_PASS_FILE: /run/secrets/pg-pass
      PG_USER_FILE: /run/secrets/pg-user
    image: 127.0.0.1:5000/local-oee-master-health:v0.8.0
    secrets:
    - source: pg-db
    - source: pg-host
    - source: pg-pass
    - source: pg-user
  live-data:
    depends_on:
    - postgres
    - registry
    deploy:
      restart_policy:
        condition: on-failure
    image: 127.0.0.1:5000/local-oee-master-live-data:v0.8.0
    ports:
    - published: 32000
      target: 80
  main:
    depends_on:
    - accounts
    - graphs
    - health
    - live-data
    - point-logs
    - registry
    deploy:
      restart_policy:
        condition: on-failure
    environment:
      MAIN_CONFIG_FILE: nginx.local.conf
    image: 127.0.0.1:5000/local-oee-master-nginx:v0.8.0
    ports:
    - published: 80
      target: 80
    - published: 443
      target: 443
  modbus-logger:
    depends_on:
    - point-logs
    - registry
    deploy:
      restart_policy:
        condition: on-failure
    environment:
      CONTROLLER_ADDRESS: 192.168.2.100
      SERVER_ADDRESS: http://point-logs
    image: 127.0.0.1:5000/local-oee-master-modbus-logger:v0.8.0
  point-logs:
    depends_on:
    - postgres
    - registry
    deploy:
      restart_policy:
        condition: on-failure
    environment:
      ENV_TYPE: local
      PG_DB_FILE: /run/secrets/pg-db
      PG_HOST_FILE: /run/secrets/pg-host
      PG_PASS_FILE: /run/secrets/pg-pass
      PG_USER_FILE: /run/secrets/pg-user
    image: 127.0.0.1:5000/local-oee-master-point-logs:v0.8.0
    secrets:
    - source: pg-db
    - source: pg-host
    - source: pg-pass
    - source: pg-user
  postgres:
    depends_on:
    - registry
    deploy:
      restart_policy:
        condition: on-failure
        window: 120s
    environment:
      POSTGRES_PASSWORD: password
    image: 127.0.0.1:5000/local-oee-master-postgres:v0.8.0
    ports:
    - published: 5432
      target: 5432
    volumes:
    - /media/db_main/postgres_oee_master:/var/lib/postgresql/data:rw
  registry:
    deploy:
      restart_policy:
        condition: on-failure
    image: registry:2
    ports:
    - mode: host
      published: 5000
      target: 5000
    volumes:
    - /mnt/registry:/var/lib/registry:rw
version: '3.2'
Things I've tried
- Action: Added restart_policy > window: 120s
- Result: No effect
- Action: Postgres restart_policy > condition: none & crontab @reboot redeploy
- Result: No effect
- Action: Set stop_grace_period: 2m on all containers
- Result: No effect
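Current workaround
For now, I have hacked together a solution that works, just so I can move on to the next thing. I wrote a shell script called recreate.sh that kills the failed first-boot version of the stack, waits for it to shut down, and then "manually" runs docker stack deploy again. I then set the script to run at boot with crontab @reboot. This works for shutdowns and reboots, but I don't accept it as the right answer, so I won't add it as one.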
Best answer
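It looks to me like you need to check who or what kills the postgres service. From the logs you posted, it seems that postgres receives a smart shutdown signal and then stops gracefully. Your stack file sets the restart policy to "on-failure", and since the postgres process stops gracefully (exit code 0), docker does not consider this a failure and, as instructed, does not restart it.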
In conclusion, I'd recommend changing the restart policy from "on-failure" to "any".
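As a sketch of that change, the postgres entry in the stack file from the question could look like the fragment below. Only the condition value differs from what was posted; the rest of the service definition is assumed to stay unchanged.

```yaml
services:
  postgres:
    deploy:
      restart_policy:
        # "any" restarts the task even after a clean exit (code 0);
        # "on-failure" only reacts to non-zero exit codes.
        condition: any
        window: 120s
```

After editing the file, running docker stack deploy again with the same stack name should apply the new policy to the existing service.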
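Also, keep in mind that the "depends_on" settings you use are ignored in swarm. Your services/images need to ensure the proper startup order in their own way, or be able to work when the services they depend on are not up yet.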
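One way to tolerate a missing dependency without depends_on is to let a dependent task exit when it cannot reach postgres and have swarm retry it after a pause. Below is a minimal sketch for the accounts service. It assumes the Node app exits when its database connection fails, which the question does not confirm, and the 10s delay is an arbitrary illustrative value.

```yaml
services:
  accounts:
    deploy:
      restart_policy:
        # Restart regardless of exit code, so a clean exit is retried too.
        condition: any
        # Pause between restart attempts to give postgres time to come up.
        delay: 10s
```

The same fragment would apply to the other services that list postgres under depends_on (graphs, health, live-data, point-logs). An app that retries its own database connection internally would not need this.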
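There is one more thing you could look at: healthchecks. Perhaps your postgres base image defines a healthcheck, and it terminates the container by sending it a kill signal. As written earlier, postgres then shuts down gracefully with no error exit code, so the restart policy does not trigger. Try disabling the healthcheck in the yaml, or look at the HEALTHCHECK directive in the Dockerfiles and find out why it triggers.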
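Disabling the healthcheck in the stack file could look like the fragment below. This is only a sketch: whether the local-oee-master-postgres image defines a healthcheck at all is not known from the question, so the setting may turn out to be a no-op.

```yaml
services:
  postgres:
    healthcheck:
      # Turns off any HEALTHCHECK inherited from the image.
      disable: true
```

If the reboot failure disappears with the healthcheck disabled, that points at the healthcheck command or its timing as the thing stopping postgres.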
Regarding postgresql - Postgres fails to start on swarm server reboot, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/56368213/