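My Beam pipeline writes to an unpartitioned BigQuery target table. The PCollection consists of millions of TableRows. When I run it with the DirectRunner, BigQueryIO apparently first creates a temporary file for every single record in the BigQueryWriteTemp temp folder. This obviously performs poorly. Am I doing something wrong here? This is a normal batch job, not a streaming one. (The same job run with the DataflowRunner does not seem to do this.)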
myrows.apply("WriteToBigQuery",
    BigQueryIO.writeTableRows().to(BQ_TARGET_TABLE)
        .withWriteDisposition(BigQueryIO.Write.WriteDisposition.WRITE_TRUNCATE)
        .withCreateDisposition(BigQueryIO.Write.CreateDisposition.CREATE_NEVER));
Here is the log we are seeing. Each of these files contains only a single TableRow. On the DataflowRunner, only about 3 files seem to be created.
2017-08-14 11:43:49 INFO TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/59668b03-a1e8-4288-a049-3472e7cb6333.
2017-08-14 11:43:49 INFO TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/feeb454b-799e-4d77-bd12-dec313cdadc2.
2017-08-14 11:43:49 INFO TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/3c63db33-787f-4215-a425-3446d92157ed.
2017-08-14 11:43:49 INFO TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/87d55556-e012-4bef-8856-69efd4c5ab26.
2017-08-14 11:43:49 INFO TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/5e6bfe94-b1c9-49cb-b0c7-a768d78d85f3.
2017-08-14 11:43:49 INFO TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/b236948b-bdf0-4bfe-9d26-4e67c8904320.
2017-08-14 11:43:49 INFO TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/451abb93-e02a-4210-aa46-5afa0c82547d.
2017-08-14 11:43:49 INFO TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/60fd5ecc-8dbe-46e4-884d-3767694b009f.
2017-08-14 11:43:49 INFO TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/f3a5b4e0-e956-4a41-a78d-c7694950b6f1.
2017-08-14 11:43:49 INFO TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/a4e4c74f-d12c-495d-bf28-eb20ee25f086.
2017-08-14 11:43:49 INFO TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/eb3b29e1-cc0c-4a6d-82f4-8527d0c5a51e.
2017-08-14 11:43:49 INFO TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/916ac41b-4ece-42bb-bf24-c5ca17060d1d.
2017-08-14 11:43:49 INFO TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/5b76128f-3c66-4701-92ce-2d3ba2e91f65.
2017-08-14 11:43:49 INFO TableRowWriter:63 - Opening TableRowWriter to gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/3a0ae709-756e-452c-9b0f-6efa9c0864ca.
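To compare the two runners with numbers rather than by eyeballing the log, one option is to count the "Opening TableRowWriter" messages per temp directory. The following is only a rough sketch: the class name `TempFileCounter` and the method `countByTempDir` are made up for illustration, and the pattern assumes the log line format shown above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class TempFileCounter {
  // Assumes messages of the form "... Opening TableRowWriter to <tempDir>/<fileName>."
  // Group 1 captures the temp directory; the file name is everything after the last slash.
  private static final Pattern OPEN_LINE =
      Pattern.compile("Opening TableRowWriter to (\\S+)/[^/\\s]+$");

  /** Returns, for each temp directory, how many writer-open messages were logged. */
  public static Map<String, Integer> countByTempDir(List<String> logLines) {
    Map<String, Integer> counts = new LinkedHashMap<>();
    for (String line : logLines) {
      Matcher m = OPEN_LINE.matcher(line.trim());
      if (m.find()) {
        counts.merge(m.group(1), 1, Integer::sum);
      }
    }
    return counts;
  }

  public static void main(String[] args) {
    // Two lines copied from the log above, used as sample input.
    List<String> sample = List.of(
        "2017-08-14 11:43:49 INFO TableRowWriter:63 - Opening TableRowWriter to "
            + "gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/"
            + "59668b03-a1e8-4288-a049-3472e7cb6333.",
        "2017-08-14 11:43:49 INFO TableRowWriter:63 - Opening TableRowWriter to "
            + "gs://my-bucket/tmp/BigQueryWriteTemp/4836c162e29d43f58c4f5cc55b1b3bb3/"
            + "feeb454b-799e-4d77-bd12-dec313cdadc2.");
    countByTempDir(sample).forEach((dir, n) -> System.out.println(n + " file(s) in " + dir));
  }
}
```

Running this over the full DirectRunner log and comparing the count with the number of input rows would confirm (or refute) the one-file-per-record observation.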
Best answer
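The Direct runner is meant for testing and development, and it includes additional checks to ensure that the pipeline will run correctly on other runners. This has the side effect of reduced performance.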
The additional checks are described in the Direct Runner documentation.
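For local runs where the extra validation matters less than speed, some of these checks can be relaxed through `DirectOptions`. The sketch below assumes the Beam Java SDK with the direct runner artifact on the classpath. The class name `RelaxedDirectRunnerExample` is hypothetical, and whether these settings change the number of temp files BigQueryIO creates is not established in this thread, so treat it as something to try rather than a confirmed fix.

```java
import org.apache.beam.runners.direct.DirectOptions;
import org.apache.beam.runners.direct.DirectRunner;
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class RelaxedDirectRunnerExample {
  public static void main(String[] args) {
    DirectOptions options = PipelineOptionsFactory.fromArgs(args).as(DirectOptions.class);
    options.setRunner(DirectRunner.class);

    // Skip the check that elements are not mutated after being output.
    options.setEnforceImmutability(false);
    // Skip the check that every element can be encoded and decoded with its coder.
    options.setEnforceEncodability(false);

    Pipeline pipeline = Pipeline.create(options);
    // Attach the transforms here (for example the WriteToBigQuery step from the
    // question), then call pipeline.run().waitUntilFinish() as usual.
  }
}
```

The same settings can also be passed on the command line as `--enforceImmutability=false --enforceEncodability=false`. With the checks off, bugs they would have caught may only surface on another runner, so it is worth keeping them enabled in tests.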
About google-bigquery - Apache Beam BigQueryIO writes are slow: a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45617243/