java - Downloading a Blob with Java produces a huge file

Tags: java google-bigquery google-cloud-storage

Here is the blob info:

Blob{bucket=some_bucket, name=somefile-000000000001.json.gz, Generation=1539720839099466, size=42455994, content-type=application/octet-stream, metadata=null}

somefile-...json.gz is a dump from BigQuery (roughly 4 GB in total once all the files are added up).

As you can see, the size is about 42 MB. But when I call blob.downloadTo(...file), it just runs and runs, easily passing 300 GB on disk, and seems to go on forever... This is strange to me, because the code is almost identical to Google's example.

A fun fact, for what it's worth:

  • Our BigQuery table is 19.08 GB in size

Does anyone have any ideas?

Here is the code that dumps the table to our bucket:

    String bucketUrl = "gs://" + BUCKET_NAME + "/"+table.getDataset()+"/"+filename+"-*." + EXPORT_EXTENSION;

    log.info("Exporting table " + table.getTable() + " to " + bucketUrl);
    ExtractJobConfiguration extractConfiguration = ExtractJobConfiguration.newBuilder(table, bucketUrl)
        .setCompression(EXPORT_COMPRESSION)
        .setFormat(EXPORT_FORMAT)
        .build();

    Job job = bigquery.create(JobInfo.of(extractConfiguration));
    try {
        // Wait for the job to complete
        Job completedJob = job.waitFor(RetryOption.initialRetryDelay(Duration.ofSeconds(1)),
                RetryOption.totalTimeout(Duration.ofMinutes(3)));
        if (completedJob != null && completedJob.getStatus().getError() == null) {
            return true;
        } else if (completedJob == null) {
            // waitFor returns null when totalTimeout elapses before the job finishes
            throw new BigQueryException(1, "Export job did not complete within the timeout");
        } else {
            log.error(completedJob.getStatus().getError());
            throw new BigQueryException(1, "Unable to complete the export", completedJob.getStatus().getError());
        }
    }
    catch (InterruptedException e) {
        Thread.currentThread().interrupt();
    }
    return false;
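One detail worth noting in the snippet above: `waitFor` returns `null` when `totalTimeout` elapses before the job finishes, and a 3-minute cap may be tight for a ~4 GB export. The delay schedule itself is easy to reason about; here is a minimal sketch (a hypothetical helper, not part of the BigQuery client) of how many polls fit into that budget with a doubling backoff:

```java
import java.time.Duration;
import java.util.ArrayList;
import java.util.List;

public class BackoffSketch {
    // Hypothetical helper: compute capped exponential retry delays,
    // loosely mirroring RetryOption.initialRetryDelay/totalTimeout semantics.
    static List<Duration> delays(Duration initial, Duration total) {
        List<Duration> out = new ArrayList<>();
        Duration elapsed = Duration.ZERO;
        Duration next = initial;
        while (elapsed.plus(next).compareTo(total) <= 0) {
            out.add(next);
            elapsed = elapsed.plus(next);
            next = next.multipliedBy(2); // double the delay each attempt
        }
        return out;
    }

    public static void main(String[] args) {
        // With a 1 s initial delay and a 3 min budget, only 7 polls fit
        // (1 + 2 + 4 + 8 + 16 + 32 + 64 = 127 s; the next would overshoot).
        List<Duration> d = delays(Duration.ofSeconds(1), Duration.ofMinutes(3));
        System.out.println(d.size() + " attempts, delays: " + d);
    }
}
```

If the export genuinely needs longer than this, raising `totalTimeout` is cheaper than letting `waitFor` time out and return `null`.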

And the code that downloads it (where blob = Blob{bucket=some_bucket, name=somefile-000000000001.json.gz, Generation=1539720839099466, size=42455994, content-type=application/octet-stream, metadata=null}):

    Blob blob = storage.get(BlobId.of(bucketName, srcFilename));
    blob.downloadTo(destFilePath); 
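One way to diagnose the runaway download (my own suggestion, not from the original post) is to replace `downloadTo` with an explicit copy from `blob.reader()` (a `ReadChannel`), aborting as soon as more bytes arrive than `blob.getSize()` promises. The copy-with-limit loop is plain `java.nio` and can be sketched with generic channels:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.ReadableByteChannel;
import java.nio.channels.WritableByteChannel;

public class LimitedCopy {
    // Copy at most maxBytes from in to out; throw if the source keeps
    // producing data past the expected size (e.g. blob.getSize()).
    static long copyWithLimit(ReadableByteChannel in, WritableByteChannel out,
                              long maxBytes) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(64 * 1024);
        long written = 0;
        while (in.read(buf) >= 0 || buf.position() != 0) {
            buf.flip();
            written += out.write(buf);
            buf.compact();
            if (written > maxBytes) {
                throw new IOException("Source exceeded expected size " + maxBytes);
            }
        }
        return written;
    }

    public static void main(String[] args) throws IOException {
        // Demo with in-memory channels; with GCS you would pass
        // blob.reader() and a FileChannel instead.
        byte[] data = new byte[150_000];
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        long n = copyWithLimit(
            Channels.newChannel(new ByteArrayInputStream(data)),
            Channels.newChannel(sink),
            data.length);
        System.out.println("copied " + n + " bytes");
    }
}
```

This won't fix the underlying cause, but it turns a 300 GB surprise into an immediate exception and tells you exactly when the stream overruns the advertised size.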

Best Answer

I used the following code, which exported successfully and let me download the compressed files:

    import com.google.api.gax.paging.Page;
    import com.google.cloud.storage.Blob;
    import com.google.cloud.storage.Bucket;
    import com.google.cloud.storage.Storage;
    import com.google.cloud.storage.StorageOptions;
    import com.google.cloud.bigquery.BigQuery;
    import com.google.cloud.bigquery.BigQueryOptions;
    import com.google.cloud.bigquery.ExtractJobConfiguration;
    import com.google.cloud.bigquery.Job;
    import com.google.cloud.bigquery.JobInfo;
    import com.google.cloud.bigquery.TableId;

    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class QuickstartSample {
      public static void main(String... args) throws Exception {
        // Instantiate clients
        Storage storage = StorageOptions.getDefaultInstance().getService();
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();
        TableId table = TableId.of("dataset", "table");
        // The name of the destination bucket
        String bucketName = "bucket";
        ExtractJobConfiguration extractConfiguration = ExtractJobConfiguration
            .newBuilder(table, "gs://" + bucketName + "/somefile-*.json.gz")
            .setCompression("GZIP")
            .setFormat("NEWLINE_DELIMITED_JSON")
            .build();
        Job startedJob = bigquery.create(JobInfo.of(extractConfiguration));
        // Wait for the job to complete
        while (!startedJob.isDone()) {
          System.out.println("Waiting for job " + startedJob.getJobId().getJob() + " to complete");
          Thread.sleep(1000L);
        }
        if (startedJob.getStatus().getError() == null) {
          System.out.println("Job " + startedJob.getJobId().getJob() + " succeeded");
        } else {
          System.out.println("Job " + startedJob.getJobId().getJob() + " failed");
          System.out.println("Error: " + startedJob.getStatus().getError());
        }
        Bucket bucket = storage.get(bucketName);
        Page<Blob> blobs = bucket.list();
        System.out.println("Downloading");
        for (Blob blob : blobs.iterateAll()) {
          System.out.println("Name: " + blob.getName());
          System.out.println("Size: " + blob.getSize());
          Path destFilePath = Paths.get(blob.getName());
          blob.downloadTo(destFilePath);
        }
      }
    }
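One more thing worth ruling out (my assumption; the answer above fixes it by pinning recent client versions): if an object is uploaded with `Content-Encoding: gzip`, Cloud Storage can transcode it on download, so the bytes written to disk can be far larger than `blob.getSize()`, which reports the stored (compressed) size. A local round trip shows how large that gap can be for repetitive, JSON-like data:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipRatio {
    public static void main(String[] args) throws IOException {
        // Highly repetitive input, similar to newline-delimited JSON dumps.
        byte[] original = new byte[1_000_000];
        Arrays.fill(original, (byte) '{');

        // Compress: the stored (blob) size is the compressed size.
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
            gz.write(original);
        }

        // Decompress: the downloaded size is the restored size.
        ByteArrayOutputStream restored = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(compressed.toByteArray()))) {
            byte[] buf = new byte[8192];
            int n;
            while ((n = in.read(buf)) > 0) restored.write(buf, 0, n);
        }

        System.out.println("original=" + original.length
            + " compressed=" + compressed.size()
            + " restored=" + restored.size());
    }
}
```

Checking `blob.getContentEncoding()` before downloading tells you whether this transcoding can apply to your object.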

The pom.xml dependencies I used were:

    <dependency>
     <groupId>com.google.cloud</groupId>
     <artifactId>google-cloud-storage</artifactId>
     <version>1.38.0</version>
    </dependency>
    <dependency>
     <groupId>com.google.cloud</groupId>
     <artifactId>google-cloud-bigquery</artifactId>
     <version>1.48.0</version>
    </dependency>

Hope this helps.

This question, "java - Downloading a Blob with Java produces a huge file", is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/52845690/
