python - Cloud Dataflow write to BigQuery Python error

Tags: python google-bigquery google-cloud-dataflow apache-beam

I am writing a simple Beam job to copy data from a GCS bucket into BigQuery. The code looks like this:

import sys

import apache_beam as beam
from apache_beam.options.pipeline_options import GoogleCloudOptions

pipeline_options = GoogleCloudOptions(flags=sys.argv[1:])
pipeline_options.project = PROJECT_ID
pipeline_options.region = 'us-west1'
pipeline_options.job_name = JOB_NAME
pipeline_options.staging_location = BUCKET + '/binaries'
pipeline_options.temp_location = BUCKET + '/temp'

schema = 'id:INTEGER,region:STRING,population:INTEGER,sex:STRING,age:INTEGER,education:STRING,income:FLOAT,statusquo:FLOAT,vote:STRING'
p = (beam.Pipeline(options = pipeline_options)
     | 'ReadFromGCS' >> beam.io.textio.ReadFromText('Chile.csv')
     | 'WriteToBigQuery' >> beam.io.WriteToBigQuery('project:tmp.dummy', schema = schema))

where we write to the table tmp.dummy in the project project. This produces the following stack trace:

Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 151, in _run_module_as_main
    mod_name, loader, code, fname = _get_module_details(mod_name)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/runpy.py", line 101, in _get_module_details
    loader = get_loader(mod_name)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pkgutil.py", line 464, in get_loader
    return find_loader(fullname)
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pkgutil.py", line 474, in find_loader
    for importer in iter_importers(fullname):
  File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/pkgutil.py", line 430, in iter_importers
    __import__(pkg)
  File "WriteToBigQuery.py", line 49, in <module>
    | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(str(PROJECT_ID + ':' + pipeline_options.write_file), schema = schema))
  File "/Users/mayansalama/Documents/GCP/gcloud_env/lib/python2.7/site-packages/apache_beam/io/gcp/bigquery.py", line 1337, in __init__
    self.table_reference = _parse_table_reference(table, dataset, project)
  File "/Users/mayansalama/Documents/GCP/gcloud_env/lib/python2.7/site-packages/apache_beam/io/gcp/bigquery.py", line 309, in _parse_table_reference
    if isinstance(table, bigquery.TableReference):
AttributeError: 'module' object has no attribute 'TableReference'

It looks like some import went wrong somewhere; could this be caused by using the GoogleCloudOptions pipeline options?
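For reference, the table argument passed to WriteToBigQuery above follows Beam's `'project:dataset.table'` spec (the project part is optional). A minimal sketch of how such a spec splits into its components; this is only an illustration, not Beam's own `_parse_table_reference`:

```python
import re

def parse_table_spec(spec):
    """Split a BigQuery table spec of the form 'project:dataset.table'
    (or just 'dataset.table') into (project, dataset, table).
    Illustrative only; Beam's internal parser is more permissive."""
    m = re.match(
        r"^(?:(?P<project>[^:.]+):)?(?P<dataset>[^:.]+)\.(?P<table>[^:.]+)$",
        spec,
    )
    if m is None:
        raise ValueError("not a valid table spec: %r" % spec)
    return m.group("project"), m.group("dataset"), m.group("table")

print(parse_table_spec("project:tmp.dummy"))  # ('project', 'tmp', 'dummy')
print(parse_table_spec("tmp.dummy"))          # (None, 'tmp', 'dummy')
```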

Best Answer

I ran into the same error. I realized I had installed the wrong Apache Beam package: you need to add [gcp] to the package name when installing Apache Beam.

sudo pip install apache_beam[gcp]

A few more optional installs to fix dependency errors, and you are good to go:

sudo pip install oauth2client==3.0.0
sudo pip install httplib2==0.9.2
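As a quick sanity check after installing (a sketch based on the import path shown in the traceback, `apache_beam.io.gcp.internal.clients.bigquery`): with the [gcp] extras present, that generated client module exposes TableReference; without them, the import or the attribute is missing, which is exactly the AttributeError above.

```python
# Check whether the apache-beam [gcp] extras are usable in this environment.
try:
    from apache_beam.io.gcp.internal.clients import bigquery
    if hasattr(bigquery, "TableReference"):
        status = "ok: [gcp] extras installed"
    else:
        status = "missing: bigquery client module has no TableReference"
except ImportError:
    status = "missing: apache_beam [gcp] extras not installed"

print(status)
```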

Regarding "python - Cloud Dataflow write to BigQuery Python error", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/49912659/
