python - Google Dataflow - Unable to import custom Python modules

Tags: python google-cloud-dataflow apache-beam

My Apache Beam pipeline implements custom Transforms and ParDo's in Python modules, which in turn import other modules I wrote. On the local runner this works fine, since all the files are available on the same path. With the Dataflow runner, the pipeline fails with a module import error.

How do I make my custom modules available to all the Dataflow workers? Please advise.

Here is an example:

ImportError: No module named DataAggregation

    at find_class (/usr/lib/python2.7/pickle.py:1130)
    at find_class (/usr/local/lib/python2.7/dist-packages/dill/dill.py:423)
    at load_global (/usr/lib/python2.7/pickle.py:1096)
    at load (/usr/lib/python2.7/pickle.py:864)
    at load (/usr/local/lib/python2.7/dist-packages/dill/dill.py:266)
    at loads (/usr/local/lib/python2.7/dist-packages/dill/dill.py:277)
    at loads (/usr/local/lib/python2.7/dist-packages/apache_beam/internal/pickler.py:232)
    at apache_beam.runners.worker.operations.PGBKCVOperation.__init__ (operations.py:508)
    at apache_beam.runners.worker.operations.create_pgbk_op (operations.py:452)
    at apache_beam.runners.worker.operations.create_operation (operations.py:613)
    at create_operation (/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py:104)
    at execute (/usr/local/lib/python2.7/dist-packages/dataflow_worker/executor.py:130)
    at do_work (/usr/local/lib/python2.7/dist-packages/dataflow_worker/batchworker.py:642)

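The traceback above comes from the worker deserializing the pipeline: pickle stores a function only as a (module, name) reference, so a worker that cannot import that module fails exactly like this. Here is a minimal sketch of the mechanism; the module name `DataAggregation` is taken from the error above, but its contents are invented for illustration:

```python
import importlib
import os
import pickle
import sys
import tempfile

# Create a throwaway module, standing in for one of the custom files.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "DataAggregation.py"), "w") as f:
    f.write("def aggregate(values):\n    return sum(values)\n")

sys.path.insert(0, tmp)
importlib.invalidate_caches()
import DataAggregation

# Pickling keeps only a reference: ("DataAggregation", "aggregate").
payload = pickle.dumps(DataAggregation.aggregate)

# Simulate a Dataflow worker that never received the module.
sys.path.remove(tmp)
del sys.modules["DataAggregation"]
importlib.invalidate_caches()

try:
    pickle.loads(payload)  # worker-side deserialization
except ImportError as exc:
    print("worker fails with:", exc)
```

Shipping the files as an installable package, as the accepted answer below describes, makes the module importable on every worker so the reference can be resolved.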
Best Answer

The issue is probably that you haven't grouped your files as a package. The Beam documentation has a section on this.

Multiple File Dependencies

Often, your pipeline code spans multiple files. To run your project remotely, you must group these files as a Python package and specify the package when you run your pipeline. When the remote workers start, they will install your package. To group your files as a Python package and make it available remotely, perform the following steps:

  1. Create a setup.py file for your project. The following is a very basic setup.py file.

    import setuptools

    setuptools.setup(
        name='PACKAGE-NAME',
        version='PACKAGE-VERSION',
        install_requires=[],
        packages=setuptools.find_packages(),
    )
    
  2. Structure your project so that the root directory contains the setup.py file, the main workflow file, and a directory with the rest of the files. Note that the directory must contain an __init__.py file, since setuptools.find_packages() only discovers directories that are Python packages.

    root_dir/
        setup.py
        main.py
        other_files_dir/
            __init__.py
    

See Juliaset for an example that follows this required project structure.

  3. Run your pipeline with the following command-line option:

    --setup_file /path/to/setup.py
    
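In context, a full Dataflow invocation might look like the following sketch; the project, region, and bucket values are placeholders you would replace with your own:

```shell
python main.py \
  --runner DataflowRunner \
  --project YOUR-PROJECT \
  --region YOUR-REGION \
  --temp_location gs://YOUR-BUCKET/tmp \
  --setup_file ./setup.py
```

Beam uses the setup.py file to build a source distribution of your package, which is staged and installed on each worker before it starts processing.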

Note: If you created a requirements.txt file and your project spans multiple files, you can get rid of the requirements.txt file and instead, add all packages contained in requirements.txt to the install_requires field of the setup call (in step 1).
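The note above can be sketched in code: parse the lines of requirements.txt and pass them to install_requires instead of shipping a separate requirements file. The package names here are only illustrative; in practice you would read the string from your actual requirements.txt:

```python
# Illustrative contents of a requirements.txt file
# (in practice: requirements = open("requirements.txt").read()).
requirements = """
# pinned pipeline dependencies
google-cloud-bigquery==1.24.0
pandas>=0.25
"""

# Keep non-empty, non-comment lines.
install_requires = [
    line.strip()
    for line in requirements.splitlines()
    if line.strip() and not line.strip().startswith("#")
]
print(install_requires)  # ['google-cloud-bigquery==1.24.0', 'pandas>=0.25']

# In setup.py, replace install_requires=[] with the parsed list:
#     setuptools.setup(..., install_requires=install_requires, ...)
```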

Regarding python - Google Dataflow - unable to import custom Python modules, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/51262031/
