python - How to use Ray with Docker Swarm

Tags: python docker dockerfile docker-swarm ray

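I am trying to set up a cluster with one Ray head and two Ray workers using Docker Swarm. I have three machines for this: one runs the ray-head and the other two run the ray-workers. The cluster comes up fine, but whenever I exec into a container and run: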

import ray
ray.init(redis_address='ray-head:6379')

I get:
WARNING worker.py:1274 -- Some processes that the driver needs to connect to have not registered with Redis, so retrying. Have you run 'ray start' on this node?

The container logs show no problems.

I also tried the IP address of the machine that hosts the ray-head container:
ray.init(redis_address='192.168.30.193:6379')

When I run:
telnet 192.168.30.193 6379

I get a response.

The Dockerfile for the containers:
FROM python:2.7-slim

RUN apt-get -y update
RUN apt-get install -y --fix-missing \
    libxml2 \
    gcc \
    vim \
    iputils-ping \
    telnet \
    procps \
    && apt-get clean && rm -rf /tmp/* /var/tmp/*

RUN pip install ray

CMD ["echo", "Base Image Ready"]

docker-compose.yml:
version: "3.5"

services:
  ray-head:
    image: simpled:0.1
    shm_size: '2gb'
    entrypoint: [ '/usr/local/bin/ray']
    command: ['start', '--head', '--redis-port', '6379', '--redis-shard-ports','6380,6381', '--object-manager-port','12345', '--node-manager-port','12346', '--node-ip-address', 'ray-head', '--block']
    ports:
      - target: 6379
        published: 6379
        protocol: tcp
        mode: host
      - target: 6380
        published: 6380
        protocol: tcp
        mode: host
      - target: 6381
        published: 6381
        protocol: tcp
        mode: host
      - target: 12345
        published: 12345
        protocol: tcp
        mode: host
      - target: 12346
        published: 12346
        protocol: tcp
        mode: host
    deploy:
      replicas: 1
      placement:
        constraints: [node.labels.Head == true ]
  ray-worker:
    image: simpled:0.1
    shm_size: '2gb'
    entrypoint: [ '/usr/local/bin/ray']
    command: ['start', '--node-ip-address', 'ray-worker', '--redis-address', 'ray-head:6379', '--object-manager-port', '12345', '--node-manager-port', '12346', '--block']
    ports:
      - target: 12345
        published: 12345
        protocol: tcp
        mode: host
      - target: 12346
        published: 12346
        protocol: tcp
        mode: host
    depends_on:
      - "ray-head"
    deploy:
      mode: global
      placement:
        constraints: [node.labels.Head != true]

Am I doing something wrong? Has anyone gotten this to work in swarm mode?

Edit 2019-04-14

Head logs:
[root@ray-node-1 bd-migratie-core]# docker service logs qaudt0j3clfv
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se    | 2019-04-14 17:49:34,187  INFO scripts.py:288 -- Using IP address 10.0.30.2 for this node.
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se    | 2019-04-14 17:49:34,190  INFO node.py:423 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-14_17-49-34_1/logs.
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se    | 2019-04-14 17:49:34,323  INFO services.py:363 -- Waiting for redis server at 127.0.0.1:6379 to respond...
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se    | 2019-04-14 17:49:34,529  INFO services.py:363 -- Waiting for redis server at 127.0.0.1:6380 to respond...
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se    | 2019-04-14 17:49:34,538  INFO services.py:760 -- Starting Redis shard with 0.74 GB max memory.
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se    | 2019-04-14 17:49:34,704  INFO services.py:363 -- Waiting for redis server at 127.0.0.1:6381 to respond...
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se    | 2019-04-14 17:49:34,714  INFO services.py:760 -- Starting Redis shard with 0.74 GB max memory.
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se    | 2019-04-14 17:49:34,859  WARNING services.py:1261 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se    | 2019-04-14 17:49:34,862  INFO services.py:1384 -- Starting the Plasma object store with 1.11 GB memory using /tmp.
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se    | 2019-04-14 17:49:34,997  INFO scripts.py:319 -- 
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se    | Started Ray on this node. You can add additional nodes to the cluster by calling
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se    | 
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se    |     ray start --redis-address 10.0.30.2:6379
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se    | 
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se    | from the node you wish to add. You can connect a driver to the cluster from Python by running
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se    | 
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se    |     import ray
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se    |     ray.init(redis_address="10.0.30.2:6379")
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se    | 
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se    | If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se    | 
simpled_ray-head.1.n5ajmtfoftmw@ray-node-1.bidirection.se    |     ray stop

ps aux inside the head container:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.2  1.9 289800 70860 ?        Ss   17:49   0:01 /usr/local/bin/python /usr/local/bin/ray start --head --redis-port 6379 --redis-shard-ports 6380,6381 --object-manager-port 12345 --node-manager-port 12346 --node-ip-addres
root         9  0.9  1.4 182352 50920 ?        Rl   17:49   0:05 /usr/local/lib/python2.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:6379
root        14  0.8  1.3 182352 48828 ?        Rl   17:49   0:04 /usr/local/lib/python2.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:6380
root        18  0.5  1.4 188496 52320 ?        Sl   17:49   0:03 /usr/local/lib/python2.7/site-packages/ray/core/src/ray/thirdparty/redis/src/redis-server *:6381
root        22  3.1  1.9 283144 70132 ?        S    17:49   0:17 /usr/local/bin/python -u /usr/local/lib/python2.7/site-packages/ray/monitor.py --redis-address=10.0.30.2:6379
root        23  0.7  0.0  15736  1852 ?        S    17:49   0:04 /usr/local/lib/python2.7/site-packages/ray/core/src/ray/raylet/raylet_monitor 10.0.30.2 6379
root        25  0.0  0.0 1098804 1528 ?        S    17:49   0:00 /usr/local/lib/python2.7/site-packages/ray/core/src/plasma/plasma_store_server -s /tmp/ray/session_2019-04-14_17-49-34_1/sockets/plasma_store -m 1111605657 -d /tmp
root        26  0.5  0.0  32944  2524 ?        Sl   17:49   0:03 /usr/local/lib/python2.7/site-packages/ray/core/src/ray/raylet/raylet /tmp/ray/session_2019-04-14_17-49-34_1/sockets/raylet /tmp/ray/session_2019-04-14_17-49-34_1/sockets/p
root        27  1.1  0.9 246340 35192 ?        S    17:49   0:06 /usr/local/bin/python -u /usr/local/lib/python2.7/site-packages/ray/log_monitor.py --redis-address=10.0.30.2:6379 --logs-dir=/tmp/ray/session_2019-04-14_17-49-34_1/logs
root        31  2.7  0.9 385800 35368 ?        Sl   17:49   0:15 /usr/local/bin/python /usr/local/lib/python2.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.0.30.2 --object-store-name=/tmp/ray/session_2019-04-14_17-49
root        32  2.7  0.9 385800 35364 ?        Sl   17:49   0:15 /usr/local/bin/python /usr/local/lib/python2.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.0.30.2 --object-store-name=/tmp/ray/session_2019-04-14_17-49
root        48  2.2  0.0  19944  2232 pts/0    Ss   17:59   0:00 bash
root        53  0.0  0.0  38376  1644 pts/0    R+   17:59   0:00 ps aux

Worker logs:
simpled_ray-worker.0.n0t4roion9h2@ray-node-2.bidirection.se    | 2019-04-14 17:49:35,716        INFO services.py:363 -- Waiting for redis server at 10.0.30.2:6379 to respond...
simpled_ray-worker.0.n0t4roion9h2@ray-node-2.bidirection.se    | 2019-04-14 17:49:35,733        INFO scripts.py:363 -- Using IP address 10.0.30.5 for this node.
simpled_ray-worker.0.n0t4roion9h2@ray-node-2.bidirection.se    | 2019-04-14 17:49:35,748        INFO node.py:423 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-14_17-49-35_1/logs.
simpled_ray-worker.0.n0t4roion9h2@ray-node-2.bidirection.se    | 2019-04-14 17:49:35,794        WARNING services.py:1261 -- WARNING: The object store is using /tmp instead of /dev/shm because /dev/shm has only 67108864 bytes available. This may slow down performance! You may be able to free up space by deleting files in /dev/shm or terminating any running plasma_store_server processes. If you are inside a Docker container, you may need to pass an argument with the flag '--shm-size' to 'docker run'.
simpled_ray-worker.0.n0t4roion9h2@ray-node-2.bidirection.se    | 2019-04-14 17:49:35,796        INFO services.py:1384 -- Starting the Plasma object store with 1.11 GB memory using /tmp.
simpled_ray-worker.0.n0t4roion9h2@ray-node-2.bidirection.se    | 2019-04-14 17:49:35,894        INFO scripts.py:371 -- 
simpled_ray-worker.0.n0t4roion9h2@ray-node-2.bidirection.se    | Started Ray on this node. If you wish to terminate the processes that have been started, run
simpled_ray-worker.0.n0t4roion9h2@ray-node-2.bidirection.se    | 
simpled_ray-worker.0.n0t4roion9h2@ray-node-2.bidirection.se    |     ray stop

ps aux inside the worker container:
USER       PID %CPU %MEM    VSZ   RSS TTY      STAT START   TIME COMMAND
root         1  0.1  1.9 292524 70900 ?        Ss   17:49   0:01 /usr/local/bin/python /usr/local/bin/ray start --node-ip-address ray-worker --redis-address ray-head:6379 --object-manager-port 12345 --node-manager-port 12346 --block
root        10  0.0  0.0 1098804 1532 ?        S    17:49   0:00 /usr/local/lib/python2.7/site-packages/ray/core/src/plasma/plasma_store_server -s /tmp/ray/session_2019-04-14_17-49-35_1/sockets/plasma_store -m 1111605657 -d /tmp
root        11  0.5  0.0  32944  2520 ?        Sl   17:49   0:04 /usr/local/lib/python2.7/site-packages/ray/core/src/ray/raylet/raylet /tmp/ray/session_2019-04-14_17-49-35_1/sockets/raylet /tmp/ray/session_2019-04-14_17-49-35_1/sockets/p
root        12  0.8  0.9 246320 35192 ?        S    17:49   0:06 /usr/local/bin/python -u /usr/local/lib/python2.7/site-packages/ray/log_monitor.py --redis-address=10.0.30.2:6379 --logs-dir=/tmp/ray/session_2019-04-14_17-49-35_1/logs
root        15  2.7  0.9 385800 35368 ?        Sl   17:49   0:19 /usr/local/bin/python /usr/local/lib/python2.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.0.30.5 --object-store-name=/tmp/ray/session_2019-04-14_17-49
root        16  2.7  0.9 385800 35360 ?        Sl   17:49   0:19 /usr/local/bin/python /usr/local/lib/python2.7/site-packages/ray/workers/default_worker.py --node-ip-address=10.0.30.5 --object-store-name=/tmp/ray/session_2019-04-14_17-49
root        39  4.5  0.0  19944  2236 pts/0    Ss   18:01   0:00 bash
root        44  0.0  0.0  38376  1648 pts/0    R+   18:01   0:00 ps aux

Edit 2019-04-17

I now know why it does not work, but I do not know how to fix it.

If I log in to the head container and check which IP the Ray processes are running with:
ray/monitor.py --redis-address=10.0.30.5:6379

This matches:
/# ping ray-head
PING ray-head (10.0.30.5) 56(84) bytes of data.
64 bytes from 10.0.30.5 (10.0.30.5): icmp_seq=1 ttl=64 time=0.105 ms

But this does not match:
/# hostname -i
10.0.30.6

If I start the Ray processes with --redis-address=10.0.30.6:6379, it works.
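
The mismatch above is consistent with how Swarm service discovery behaves by default: the bare service name usually resolves to a virtual IP for the service, while the container itself has a different address. A minimal sketch of the same check in Python follows, using only the standard library. The function names (resolve_all, own_addresses, name_points_at_me) are made up for this illustration and are not part of Ray or Docker.

```python
import socket


def resolve_all(hostname):
    """Return the sorted IPv4 addresses that `hostname` resolves to ([] on failure)."""
    try:
        infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    except socket.gaierror:
        return []
    # Each entry is (family, type, proto, canonname, (ip, port)).
    return sorted(set(info[4][0] for info in infos))


def own_addresses():
    """IPv4 addresses of this container's own hostname, similar to `hostname -i`."""
    return resolve_all(socket.gethostname())


def name_points_at_me(name):
    """True if `name` resolves to at least one of this container's own addresses."""
    return bool(set(resolve_all(name)) & set(own_addresses()))
```

Run inside the head container, name_points_at_me("ray-head") would be expected to return False in the situation described above, because ping reports 10.0.30.5 while hostname -i reports 10.0.30.6. Any name for which it returns True is a candidate for --node-ip-address and --redis-address.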

Best answer

I found the solution:

The hostname of the ray-head container is not "ray-head" but "tasks.ray-head".

To make it work, I had to change the hostnames in the docker-compose file like this:

For the ray head:

command: ['start', '--head', '--redis-port', '6379', '--redis-shard-ports','6380,6381', '--object-manager-port','12345', '--node-manager-port','12346', '--node-ip-address', 'tasks.ray-head', '--block']

For the ray worker:
command: ['start', '--redis-address', 'tasks.ray-head:6379', '--object-manager-port', '12345', '--node-manager-port', '12346', '--block']

Now I can run this from any host:
ray.init('tasks.ray-head:6379')

I hope this helps others in the same situation.
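
As a rough sketch of how a driver script might use the tasks.ray-head name, the helper below builds the address and waits for the head's Redis port to accept connections before calling ray.init. The helper names (swarm_task_address, wait_for_port, connect_driver) and the retry settings are assumptions for illustration. The ray.init(redis_address=...) call mirrors the one printed in the head log above and applies to the Ray version used in the question; other versions may expect different arguments.

```python
import socket
import time


def swarm_task_address(service, port):
    """Build 'tasks.<service>:<port>', the per-task DNS name used in the answer."""
    return "tasks.{0}:{1}".format(service, port)


def wait_for_port(host, port, attempts=10, delay=1.0, timeout=2.0):
    """Try to open a TCP connection up to `attempts` times; True once one succeeds."""
    for _ in range(attempts):
        try:
            socket.create_connection((host, port), timeout=timeout).close()
            return True
        except (socket.error, socket.timeout):
            # Name not resolvable yet, or port not accepting connections.
            time.sleep(delay)
    return False


def connect_driver(service="ray-head", port=6379):
    """Hypothetical driver entry point: wait for the head, then attach to it."""
    import ray  # imported lazily so the helpers above work without Ray installed

    if not wait_for_port("tasks." + service, port):
        raise RuntimeError("Ray head is not reachable at " + swarm_task_address(service, port))
    ray.init(redis_address=swarm_task_address(service, port))
```

An open port only shows that Redis is reachable. As the question shows with telnet, it does not by itself prove that the Ray processes registered under the address the driver is using.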

Regarding python - How to use Ray with Docker Swarm, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55653729/
