I built, trained, and saved a simple tf.keras model. I then set up a basic task-based API with Flask, redis, and rq.

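It basically works like this:

The api is called with an input
A task (which evaluates the input with the model) is queued
The task status is checked until the task completes.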
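
The polling step can be sketched as follows. This is a minimal, library-agnostic sketch rather than the code from the project: wait_for_job and its get_status argument are hypothetical names, and the status strings "finished" and "failed" are assumed to match what the queue reports (with rq, get_status might wrap job.get_status()).

```python
import time


def wait_for_job(get_status, timeout=30.0, interval=0.5):
    """Poll get_status() until the job finishes, fails, or the timeout expires.

    get_status is any zero-argument callable that returns a status string.
    Returns the final status, or raises TimeoutError if the job is still
    pending once the timeout has passed.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in ("finished", "failed"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout}s")
        time.sleep(interval)
```

With a helper like this, a task that stalls the way described below would surface as a TimeoutError instead of being polled forever.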

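Outside of docker, this works fine.

I am using docker-compose to start the redis server, the worker service, and the Flask api.

The flask and worker services (the worker evaluates the model) are built from Dockerfiles that start with either FROM tensorflow/tensorflow:1.15.0rc2-gpu-py3-jupyter or FROM debian:buster-slim.

Although no GPU is detected in either case, the problem seems to be that the loaded model does not want to run on the CPU (it does work outside of docker). This is interesting because part of the task calls several tensorflow operations (e.g. converting the input for use with tf.data). If I comment out only the model evaluation but let the other tf functions run, everything works as expected.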

Some of the logs I see when starting up via docker-compose are:

worker | 2019-10-20 12:58:48.602164: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
worker | 2019-10-20 12:58:48.633121: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2400000000 Hz
worker | 2019-10-20 12:58:48.636200: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3a82e10 executing computations on platform Host. Devices:
worker | 2019-10-20 12:58:48.636236: I tensorflow/compiler/xla/service/service.cc:175]   StreamExecutor device (0): <undefined>, <undefined>

When the task is queued, the worker logs:

worker | 2019-10-20 12:59:47.726320: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set.  If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU.  To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.

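Looking at other SO questions, I understand that I could silence this warning if I had a GPU (which docker does not seem to find, even with tensorflow/tensorflow:1.15.0rc2-gpu-py3-jupyter).

I tried adding the specified environment variable, e.g.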

os.environ['TF_XLA_FLAGS'] = '--tf_xla_cpu_global_jit'

but then it prompts about XLA, and so on.
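
For reference, a minimal sketch of setting these variables early. It assumes the ordering matters, i.e. that the variables should be in place before tensorflow is first imported. The helper name configure_tf_env is hypothetical. TF_CPP_MIN_LOG_LEVEL only hides log messages, so it would silence the warning but is not expected to fix the stall.

```python
import os


def configure_tf_env(enable_xla_cpu=False, min_log_level="2"):
    """Set TensorFlow-related environment variables.

    Call this before the first `import tensorflow` so the values are
    already present when TensorFlow reads them.
    """
    # "2" hides INFO and WARNING messages from TensorFlow's C++ logger.
    os.environ["TF_CPP_MIN_LOG_LEVEL"] = min_log_level
    if enable_xla_cpu:
        # The flag named in the warning; opts in to XLA:CPU.
        os.environ["TF_XLA_FLAGS"] = "--tf_xla_cpu_global_jit"


configure_tf_env()
# import tensorflow as tf  # import only after the environment is configured
```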

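Interestingly, if the model is called outside the task framework (e.g. printing the result in the Flask views.py file), docker logs the result almost immediately.

Update: it is actually stranger than that.

Consider: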

# views.py
# ...
print('toggle')

print('about to call')
results = evaluate_model(dummy_input) # model.predict(...)
print('called')
print(results)

When I call docker-compose up I get

[flask] toggle
[flask] about to call
[flask] called

Then, if I comment out print('toggle') I get

[flask] toggle #<--- should not see this, it is commented out
[flask] about to call
[flask] called
[flask] [[...], [...], ..., [...]] #<--- matrix

If I uncomment print('toggle') I get

[flask] about to call # toggle should be printed but it isn't.
[flask] called
[flask] [[...], [...], ..., [...]] #<--- matrix

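It is as if the running process gets lost and is found again a while later? Note that this does not affect the other endpoints of the api, i.e. when Flask loads it does not get stuck waiting for the model to return.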
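
One thing that may be worth ruling out here is output buffering. This is an assumption, not something the logs above confirm. When stdout is not a terminal, as is typical inside a container, Python buffers print output, so lines can show up late or out of step with the code that is currently running. A minimal sketch, with log as a hypothetical helper:

```python
import sys


def log(*args):
    """Print and flush immediately so the line reaches the container log at once."""
    print(*args, file=sys.stdout, flush=True)


log('about to call')
```

Setting the environment variable PYTHONUNBUFFERED=1 for the container has a similar effect without changing the code.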

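An MWE of the docker setup can be found here (it does not include TF, only how flask, redis, rq, and the front end are wired together)

Any ideas?

Update

MWE setup.

If you clone this repository and run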

docker-compose -f docker-compose.ai.development.yml build
docker-compose -f docker-compose.ai.development.yml up

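you can see a very simple notebook that generates a toy tf.keras model (straight from the TF docs).

The notebook saves and then loads this model, to make sure the problem is not the export/import of the model.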

docker-compose -f docker-compose.web.development.yml build
docker-compose -f docker-compose.web.development.yml up

starts the front end (nuxt + ngx) and the back end (flask + rq + redis).

The file of interest is /backend/app/api/utils.py, which holds a very simple task:

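import os

import numpy as np
import tensorflow as tf

# The saved Keras model is loaded once, when this module is imported.
model_file = os.path.join('/app/models/', 'model.h5')
model = tf.keras.models.load_model(model_file)

def predict_model():
    dummy_input = np.zeros((28, 28)).reshape((-1, 28, 28))  # batch of one all-zero 28x28 input
    predicted   = model.predict(dummy_input) # line 2 of predict_model function
    predicted = None  # discard the prediction so the task returns a trivial value
    return {
        'results': predicted
    }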

Go to localhost:9061/model-task to submit the task (predict_model). With the second line commented out it works fine; with it left in, the task is queued, starts, and never finishes.
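
A variation that might be worth trying, offered as an assumption rather than a confirmed fix, is to load the model lazily inside the task instead of at import time, so that nothing TensorFlow-related is created as a side effect of importing the module. The sketch below shows only the caching pattern. get_model and load_fn are hypothetical names; in utils.py, load_fn would be something like lambda: tf.keras.models.load_model(model_file).

```python
_model = None  # module-level cache, filled on first use


def get_model(load_fn):
    """Return the cached model, calling load_fn() only the first time."""
    global _model
    if _model is None:
        _model = load_fn()
    return _model
```

Inside predict_model, the call would then read get_model(load_fn).predict(dummy_input) rather than using the module-level model.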

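Best answer

If I understand the problem correctly, this may be a multithreading issue. Try setting threads = 1 in the ini file. If you want to scale, you can always do so by increasing the number of processes, but not the number of threads.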
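
As an illustration of that setting, the sketch below writes an ini file with a single thread and several processes. It assumes the ini file in question is a uWSGI config with a [uwsgi] section, which the answer does not state. The helper name write_server_ini and the process count of 4 are placeholders.

```python
import configparser


def write_server_ini(path, processes=4, threads=1):
    """Write an ini file that scales with processes instead of threads."""
    config = configparser.ConfigParser()
    config["uwsgi"] = {
        "processes": str(processes),  # scale out by adding processes
        "threads": str(threads),      # keep a single thread per process
    }
    with open(path, "w") as f:
        config.write(f)
```

Calling write_server_ini("app.ini") would produce a file containing processes = 4 and threads = 1 under [uwsgi]; in practice the same two lines can simply be edited into the existing ini file by hand.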

Source: python - tensorflow 1.14+ : dockerized task based flask api doesn't run and or stalls? on Stack Overflow: https://stackoverflow.com/questions/58473432/
