google-bigquery - Apache Beam BigQueryIO writes are slow

Tags: google-bigquery apache-beam

My Beam pipeline writes to an unpartitioned BigQuery target table. The PCollection consists of millions of TableRows. When I run it with the DirectRunner, BigQueryIO apparently first creates one temporary file per record in a BigQueryWriteTemp temp folder, which performs terribly. Am I doing something wrong here? This is a regular batch job, not streaming. (The same job run with the DataflowRunner does not seem to do this.)

myrows.apply("WriteToBigQuery",
        BigQueryIO.writeTableRows()
                .to(BQ_TARGET_TABLE)
                .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
                .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
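
For context, the runner is selected through PipelineOptions rather than the write transform itself. A minimal sketch (the MyPipeline class name is illustrative) of wiring the runner from command-line flags, so the same code can run under either the DirectRunner or the DataflowRunner:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class MyPipeline {
  public static void main(String[] args) {
    // Pass e.g. --runner=DataflowRunner --project=my-project --tempLocation=gs://my-bucket/tmp
    // on the command line; with no --runner flag, Beam falls back to the DirectRunner.
    PipelineOptions options = PipelineOptionsFactory.fromArgs(args).withValidation().create();
    Pipeline p = Pipeline.create(options);
    // ... build myrows and apply the BigQueryIO.writeTableRows() step shown above ...
    p.run().waitUntilFinish();
  }
}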

Here is the log output we are seeing. Each of these files contains only a single TableRow. On the DataflowRunner, it seems only about 3 files get created.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/59668b03-a1e8-4288-a049-3472e7cb6333.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/feeb454b-799e-4d77-bd12-dec313cdadc2.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/3c63db33-787f-4215-a425-3446d92157ed.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/87d55556-e012-4bef-8856-69efd4c5ab26.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/5e6bfe94-b1c9-49cb-b0c7-a768d78d85f3.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/b236948b-bdf0-4bfe-9d26-4e67c8904320.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/451abb93-e02a-4210-aa46-5afa0c82547d.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/60fd5ecc-8dbe-46e4-884d-3767694b009f.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/f3a5b4e0-e956-4a41-a78d-c7694950b6f1.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/a4e4c74f-d12c-495d-bf28-eb20ee25f086.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/eb3b29e1-cc0c-4a6d-82f4-8527d0c5a51e.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/916ac41b-4ece-42bb-bf24-c5ca17060d1d.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/5b76128f-3c66-4701-92ce-2d3ba2e91f65.
2017-08-14 11:43:49 INFO  TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/3a0ae709-756e-452c-9b0f-6efa9c0864ca.

Best Answer

The DirectRunner is intended for testing and development, and it includes additional checks to ensure the pipeline will run correctly on other runners. This comes with the side effect of lower performance.

The additional checks are:

  • enforcing immutability of elements
  • enforcing encodability of elements
  • processing elements in an arbitrary order at all points
  • serialization of user functions (DoFn, CombineFn, etc.)

If local run speed matters, some of these checks can be relaxed; see the sketch after this list.
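
A minimal sketch of relaxing those checks for faster local testing, assuming a Beam Java SDK that ships org.apache.beam.runners.direct.DirectOptions with these setters (verify against your SDK version; the FasterLocalRun class name is illustrative):

import org.apache.beam.runners.direct.DirectOptions;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class FasterLocalRun {
  public static void main(String[] args) {
    DirectOptions options =
        PipelineOptionsFactory.fromArgs(args).withValidation().as(DirectOptions.class);
    // Skip the immutability and encodability enforcement listed above
    // (faster, but you lose those safety checks during local testing).
    options.setEnforceImmutability(false);
    options.setEnforceEncodability(false);
    // Use more local worker threads if the machine allows it.
    options.setTargetParallelism(Math.max(Runtime.getRuntime().availableProcessors(), 1));

    Pipeline p = Pipeline.create(options);
    // ... apply the same BigQueryIO.writeTableRows() transform as in the question ...
    p.run().waitUntilFinish();
  }
}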
Source: https://stackoverflow.com/questions/45617243/
