Azure ML Workbench Kubernetes deployment failed

Tags: azure azure-machine-learning-service

I'm trying to deploy a prediction web service to Azure with the ML Workbench process in cluster mode, following this tutorial ( https://learn.microsoft.com/en-us/azure/machine-learning/preview/tutorial-classifying-iris-part-3#prepare-to-operationalize-locally ).

The model, the scoring script, and the schema are sent to the manifest, and then service creation fails:

Creating service..........................................................Error occurred: {'Error': {'Code': 'KubernetesDeploymentFailed', 'Details': [{'Message': 'Back-off 40s restarting failed container=...pod=...', 'Code': 'CrashLoopBackOff'}], 'StatusCode': 400, 'Message': 'Kubernetes Deployment failed'}, 'OperationType': 'Service', 'State':'Failed', 'Id': '...', 'ResourceLocation': '/api/subscriptions/...', 'CreatedTime': '2017-10-26T20:30:49.77362Z','EndTime': '2017-10-26T20:36:40.186369Z'}

Here is the result of checking the ML service's realtime logs:

C:\Users\userguy\Documents\azure_ml_workbench\projecto>az ml service logs realtime -i projecto
2017-10-26 20:47:16,118 CRIT Supervisor running as root (no user in config file)
2017-10-26 20:47:16,120 INFO supervisord started with pid 1
2017-10-26 20:47:17,123 INFO spawned: 'rsyslog' with pid 9
2017-10-26 20:47:17,124 INFO spawned: 'program_exit' with pid 10
2017-10-26 20:47:17,124 INFO spawned: 'nginx' with pid 11
2017-10-26 20:47:17,125 INFO spawned: 'gunicorn' with pid 12
2017-10-26 20:47:18,160 INFO success: rsyslog entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-10-26 20:47:18,160 INFO success: program_exit entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2017-10-26 20:47:22,164 INFO success: nginx entered RUNNING state, process has stayed up for > than 5 seconds (startsecs)
2017-10-26T20:47:22.519159Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting gunicorn 19.6.0
2017-10-26T20:47:22.520097Z, INFO, 00000000-0000-0000-0000-000000000000, , Listening at: http://127.0.0.1:9090 (12)
2017-10-26T20:47:22.520375Z, INFO, 00000000-0000-0000-0000-000000000000, , Using worker: sync
2017-10-26T20:47:22.521757Z, INFO, 00000000-0000-0000-0000-000000000000, , worker timeout is set to 300
2017-10-26T20:47:22.522646Z, INFO, 00000000-0000-0000-0000-000000000000, , Booting worker with pid: 22
2017-10-26 20:47:27,669 WARN received SIGTERM indicating exit request
2017-10-26 20:47:27,669 INFO waiting for nginx, gunicorn, rsyslog, program_exit to die
2017-10-26T20:47:27.669556Z, INFO, 00000000-0000-0000-0000-000000000000, , Handling signal: term
2017-10-26 20:47:30,673 INFO waiting for nginx, gunicorn, rsyslog, program_exit to die
2017-10-26 20:47:33,675 INFO waiting for nginx, gunicorn, rsyslog, program_exit to die
Initializing logger
2017-10-26T20:47:36.564469Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insights client
2017-10-26T20:47:36.564991Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up request id generator
2017-10-26T20:47:36.565316Z, INFO, 00000000-0000-0000-0000-000000000000, , Starting up app insight hooks
2017-10-26T20:47:36.565642Z, INFO, 00000000-0000-0000-0000-000000000000, , Invoking user's init function
2017-10-26 20:47:36.715933: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 20:47:36,716 INFO waiting for nginx, gunicorn, rsyslog, program_exit to die
2017-10-26 20:47:36.716376: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 20:47:36.716542: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 20:47:36.716703: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-10-26 20:47:36.716860: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
this is the init
2017-10-26T20:47:37.551940Z, INFO, 00000000-0000-0000-0000-000000000000, , Users's init has completed successfully
Using TensorFlow backend.
2017-10-26T20:47:37.553751Z, INFO, 00000000-0000-0000-0000-000000000000, , Worker exiting (pid: 22)
2017-10-26T20:47:37.885303Z, INFO, 00000000-0000-0000-0000-000000000000, , Shutting down: Master
2017-10-26 20:47:37,885 WARN killing 'gunicorn' (12) with SIGKILL
2017-10-26 20:47:37,886 INFO stopped: gunicorn (terminated by SIGKILL)
2017-10-26 20:47:37,889 INFO stopped: nginx (exit status 0)
2017-10-26 20:47:37,890 INFO stopped: program_exit (terminated by SIGTERM)
2017-10-26 20:47:37,891 INFO stopped: rsyslog (exit status 0)

Received 41 lines of log

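My best guess is that something is failing silently and triggering the "WARN received SIGTERM indicating exit request" line. The rest of the scoring.py script does appear to start: see the TensorFlow startup messages and the "this is the init" print statement.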
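
Lining up the timestamps in the log above suggests the SIGTERM arrived roughly five seconds after the gunicorn worker booted and roughly ten seconds before the init function reported completion. Below is a minimal Python sketch for checking that ordering. The helper names (`parse_timestamp`, `first_time`) are hypothetical, and the four log lines are copied from the output above. It only shows the order of events, not what sent the signal.

```python
import re
from datetime import datetime

# A few lines copied from the log above (supervisord and gunicorn formats).
LOG = """\
2017-10-26T20:47:22.522646Z, INFO, 00000000-0000-0000-0000-000000000000, , Booting worker with pid: 22
2017-10-26 20:47:27,669 WARN received SIGTERM indicating exit request
2017-10-26T20:47:37.551940Z, INFO, 00000000-0000-0000-0000-000000000000, , Users's init has completed successfully
2017-10-26 20:47:37,885 WARN killing 'gunicorn' (12) with SIGKILL
"""

# Matches both "2017-10-26 20:47:27,669" and "2017-10-26T20:47:22.522646Z".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2})[ T](\d{2}:\d{2}:\d{2})[.,](\d+)")


def parse_timestamp(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    match = TIMESTAMP.match(line)
    if match is None:
        return None
    date, time, fraction = match.groups()
    # Pad or trim the fractional part to microseconds.
    micros = int(fraction[:6].ljust(6, "0"))
    stamp = datetime.strptime(f"{date} {time}", "%Y-%m-%d %H:%M:%S")
    return stamp.replace(microsecond=micros)


def first_time(lines, needle):
    """Timestamp of the first line containing `needle`, or None."""
    for line in lines:
        if needle in line:
            return parse_timestamp(line)
    return None


lines = LOG.splitlines()
booted = first_time(lines, "Booting worker")
sigterm = first_time(lines, "received SIGTERM")
init_done = first_time(lines, "init has completed")

print("worker boot -> SIGTERM:", (sigterm - booted).total_seconds(), "s")
print("SIGTERM -> init done:  ", (init_done - sigterm).total_seconds(), "s")
```

If that reading is right, the container was told to exit while the init function was still loading TensorFlow, which would fit an external shutdown request rather than a crash inside scoring.py; this is an inference from the timestamps only.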

http://127.0.0.1:63437 is reachable from my local machine, but the UI endpoint is blank.

Any ideas on how to get this up and running in the Azure cluster? I'm not very familiar with how Kubernetes works, so any basic debugging guidance would be appreciated.

Best answer

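We found a bug in our system that could have caused this issue. The fix was deployed last night. Could you try again and let us know if you still run into the problem?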

Source: Stack Overflow, https://stackoverflow.com/questions/46963846/
