amazon-ec2 - Ray does not start workers on EC2

Tags: amazon-ec2 parallel-processing cluster-computing hpc ray

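I am using the Ray module to launch an Ubuntu (16.04) cluster on AWS EC2. In the config I set min_workers, max_workers, and initial_workers to 2, because I do not need any autoscaling. I also want a t2.micro head node and c4.8xlarge workers. The cluster starts, but only the head node comes up (the terminal output below begins at the Ray install, with some details elided):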

2019-04-18 14:52:48,462 INFO updater.py:268 -- NodeUpdater: Running pip3 install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl on 54.226.178.23...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
Collecting ray==0.7.0.dev2 from https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl
Downloading https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl (56.2MB)
.....
.....
Successfully built pyyaml
Installing collected packages: click, colorama, six, redis, typing, filelock, flatbuffers, numpy, pyyaml, more-itertools, setuptools, attrs, atomicwrites, pluggy, py, pathlib2, pytest, funcsigs, ray
Successfully installed atomicwrites attrs click colorama filelock flatbuffers funcsigs more-itertools numpy pathlib2 pluggy py pytest pyyaml-3.11 ray redis setuptools-20.7.0 six-1.10.0 typing
You are using pip version 8.1.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
2019-04-18 14:53:32,656 INFO updater.py:268 -- NodeUpdater: Running pip3    install boto3==1.4.8 on 54.226.178.23...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
Collecting boto3==1.4.8
Downloading https://files.pythonhosted.org/packages/7d/09/66fef826fb13a2cee74a1df56c269d2794a90ece49c3b77113b733e4b91d/boto3-1.4.8-
....
....
Installing collected packages: docutils, jmespath, six, python-dateutil, botocore, s3transfer, boto3
Successfully installed boto3-1.4.8 botocore-1.8.50 docutils-0.14 jmespath-0.9.4 python-dateutil-2.8.0 s3transfer-0.1.13 six-1.12.0
You are using pip version 8.1.1, however version 19.0.3 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.
2019-04-18 14:53:37,805 INFO updater.py:268 -- NodeUpdater: Running ray stop on 54.226.178.23...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
WARNING: Not monitoring node memory since `psutil` is not installed.  Install this with `pip install psutil` (or ray[debug]) to enable debugging of memory-related crashes.
2019-04-18 14:53:39,775 INFO updater.py:268 -- NodeUpdater: Running ulimit -n 65536; ray start --head --redis-port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml on 54.226.178.23...
bash: cannot set terminal process group (-1): Inappropriate ioctl for device
bash: no job control in this shell
2019-04-18 18:53:40,167 INFO scripts.py:288 -- Using IP address 172.31.7.117 for this node.
2019-04-18 18:53:40,167 INFO node.py:469 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-18_18-53-40_7981/logs.
2019-04-18 18:53:40,271 INFO services.py:407 -- Waiting for redis server at 127.0.0.1:6379 to respond...
2019-04-18 18:53:40,389 INFO services.py:407 -- Waiting for redis server at 127.0.0.1:60491 to respond...
2019-04-18 18:53:40,390 INFO services.py:804 -- Starting Redis shard with 0.21 GB max memory.
2019-04-18 18:53:40,400 INFO node.py:483 -- Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-04-18_18-53-40_7981/logs.
2019-04-18 18:53:40,410 INFO services.py:1439 -- Starting the Plasma object store with 0.31 GB memory using /dev/shm.
2019-04-18 18:53:40,421 WARNING services.py:907 -- Failed to start the reporter. The reporter requires 'pip install psutil'.
WARNING: Not monitoring node memory since `psutil` is not installed. Install this with `pip install psutil` (or ray[debug]) to enable debugging of memory-related crashes.
2019-04-18 18:53:40,425 INFO scripts.py:319 -- 
Started Ray on this node. You can add additional nodes to the cluster by calling

    ray start --redis-address 172.31.7.117:6379

from the node you wish to add. You can connect a driver to the cluster from Python by running

import ray
ray.init(redis_address="172.31.7.117:6379")

If you have trouble connecting from a different machine, check that your firewall is configured properly. If you wish to terminate the processes that have been started, run

ray stop
2019-04-18 14:53:40,593 INFO log_timer.py:21 -- NodeUpdater: i-064f62badf69f8cee: Setup commands completed [LogTimer=115941ms]
2019-04-18 14:53:40,593 INFO log_timer.py:21 -- NodeUpdater: i-064f62badf69f8cee: Applied config 248f16e493ac5bcd753a673eb7202fa2b49e0f9f  [LogTimer=173814ms]
2019-04-18 14:53:40,973 INFO log_timer.py:21 -- AWSNodeProvider: Set tag ray-node-status=up-to-date on ['i-064f62badf69f8cee'] [LogTimer=374ms]
2019-04-18 14:53:41,069 INFO commands.py:264 -- get_or_create_head_node:  Head node up-to-date, IP address is: 54.226.178.23
To monitor auto-scaling activity, you can run:

  ray exec ray_config.yaml  'tail -n 100 -f /tmp/ray/session_*/logs/monitor*'

To open a console on the cluster:

  ray attach ray_config.yaml

To ssh manually to the cluster, run:

  ssh -i /home/haines/.ssh/ray-autoscaler_us-east-1.pem ubuntu@54.226.178.23

2019-04-18 14:53:41,181 INFO log_timer.py:21 -- AWSNodeProvider: Set tag ray-runtime-config=248f16e493ac5bcd753a673eb7202fa2b49e0f9f on ['i-064f62badf69f8cee'] 

I am using the standard config (example-full.yaml) with the following changes:

min_workers: 2

initial_workers: 2

provider:
    type: aws
    region: us-east-1
    availability_zone: us-east1a,us-east-1b


head_node:
    InstanceType: t2.micro
    ImageId: ami-0565af6e282977273 # ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-20190212

worker_nodes:
    InstanceType: c4.8xlarge
    ImageId: ami-0f9cf087c1f27d9b1 # ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-20181114  

        #MarketType: spot

setup_commands:
    - echo 'export PATH="$HOME/anaconda3/envs/tensorflow_p36/bin:$PATH"' >> ~/.bashrc
    - sudo apt-get update
    - sudo apt-get install python3-pip
    - pip3 install -U https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl
    - pip3 install boto3==1.4.8  # 1.4.8 adds InstanceMarketOptions

The most recent setup that failed:

setup_commands:
- sudo apt-get update
- wget https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh || true 1>/dev/null
- bash Anaconda3-5.0.1-Linux-x86_64.sh -b -p $HOME/anaconda3 || true 1>/dev/null
- echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.bashrc
- sudo pkill -9 apt-get || true
- sudo pkill -9 dpkg || true
- sudo dpkg --configure -a
- sudo apt-get install python3-pip || true
- pip3 install --upgrade pip
- pip3 install --user psutil
- pip3 install --user proctitle
- pip3 install --user ray
- pip3 install --user boto3==1.4.8
- pip3 install --user https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-0.7.0.dev2-cp35-cp35m-manylinux1_x86_64.whl
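A note on the `--user` installs in this list: pip places console scripts such as `ray` in the per-user base `bin` directory (typically `~/.local/bin` on Linux), which is a different location from the `$HOME/anaconda3/bin` entry that the list adds to PATH. The following is a minimal sketch, not part of the original post, for checking on a node whether that directory is on PATH. `user_scripts_dir_on_path` is a hypothetical helper name, and the `bin` subdirectory is an assumption that holds for Linux.

```python
import os
import site


def user_scripts_dir_on_path(path=None):
    """Return (scripts_dir, found) for the per-user scripts directory.

    Assumes a Linux layout, where `pip install --user` writes console
    scripts to <user base>/bin.
    """
    scripts_dir = os.path.normpath(os.path.join(site.getuserbase(), "bin"))
    # Use the given PATH string if provided, otherwise the current environment.
    raw_path = os.environ.get("PATH", "") if path is None else path
    entries = [
        os.path.normpath(os.path.expanduser(entry))
        for entry in raw_path.split(os.pathsep)
        if entry
    ]
    return scripts_dir, scripts_dir in entries


if __name__ == "__main__":
    scripts_dir, found = user_scripts_dir_on_path()
    print("user scripts dir:", scripts_dir)
    print("on PATH:", found)
```

If the directory is not on PATH, a command such as `ray start` may not be found even though the package itself was installed.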

Best answer

I ran a slightly modified version of the config you posted, and it works for me:

cluster_name: test

min_workers: 2

initial_workers: 2

provider:
    type: aws
    region: us-east-1
    availability_zone: us-east1a,us-east-1b

head_node:
    InstanceType: t2.micro
    ImageId: ami-0565af6e282977273 # ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-20190212

worker_nodes:
    InstanceType: c4.8xlarge
    ImageId: ami-0f9cf087c1f27d9b1 # ubuntu/images/hvm-ssd/ubuntu-xenial-16.04-amd64-server-20181114
        #MarketType: spot

setup_commands:
    - sudo apt-get update
    # Install Anaconda.
    - wget https://repo.continuum.io/archive/Anaconda3-5.0.1-Linux-x86_64.sh || true
    - bash Anaconda3-5.0.1-Linux-x86_64.sh -b -p $HOME/anaconda3 || true
    - echo 'export PATH="$HOME/anaconda3/bin:$PATH"' >> ~/.bashrc
    # Install Ray.
    - pip install ray
    - pip install boto3==1.4.8  # 1.4.8 adds InstanceMarketOptions

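I think the only real difference is installing Anaconda Python and putting it on the PATH so that pip is found correctly. I suspect the problem is related to the right Python version not being found.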
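That suspicion can be checked directly on the head node or on a worker. Below is a minimal sketch, assuming it is run with the same interpreter that the setup commands use. `describe_python_environment` is a hypothetical helper for illustration, not part of Ray, and the code avoids f-strings so that it also runs on the Python 3.5 that ships with Ubuntu 16.04.

```python
import importlib.util
import shutil
import sys


def describe_python_environment(package="ray"):
    """Report the running interpreter, the pip executables on PATH,
    and where `package` would be imported from (None if not importable)."""
    spec = importlib.util.find_spec(package)
    return {
        "interpreter": sys.executable,
        "pip_on_path": shutil.which("pip"),
        "pip3_on_path": shutil.which("pip3"),
        "package_location": spec.origin if spec is not None else None,
    }


if __name__ == "__main__":
    for key, value in sorted(describe_python_environment().items()):
        print("{}: {}".format(key, value))
```

If `package_location` is None, or the pip found on PATH belongs to a different installation than `interpreter`, the package was probably installed for a different Python than the one being run.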

Regarding amazon-ec2 - Ray does not start workers on EC2, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55660635/
