Java file upload to S3 - should multipart speed it up?

Tags: java amazon-s3 java-8 aws-sdk aws-java-sdk

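We are on Java 8 and use the AWS SDK to upload files to AWS S3 programmatically. For large files (>100 MB), we read that multipart upload is the preferred method. We tried it, but it does not seem to speed anything up: the upload time is almost the same as without multipart upload. Worse, we even ran into out-of-memory errors reporting insufficient heap space.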

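Questions:

  1. Does multipart upload really speed up the upload? If not, why use it?
  2. Why does multipart upload exhaust memory faster than a non-multipart upload? Does it upload all the parts at the same time?

Here is the code we use: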

private static void uploadFileToS3UsingBase64(String bucketName, String region, String accessKey, String secretKey,
        String fileBase64String, String s3ObjectKeyName) {
    
    byte[] bI = org.apache.commons.codec.binary.Base64.decodeBase64((fileBase64String.substring(fileBase64String.indexOf(",")+1)).getBytes());
    InputStream fis = new ByteArrayInputStream(bI);
    
    long start = System.currentTimeMillis();
    AmazonS3 s3Client = null;
    TransferManager tm = null;

    try {

        s3Client = AmazonS3ClientBuilder.standard().withRegion(region)
                .withCredentials(new AWSStaticCredentialsProvider(new BasicAWSCredentials(accessKey, secretKey)))
                .build();
        
        tm = TransferManagerBuilder.standard()
                  .withS3Client(s3Client)
                  .withMultipartUploadThreshold((long) (50* 1024 * 1025))
                  .build();

        ObjectMetadata metadata = new ObjectMetadata();
        metadata.setHeader(Headers.STORAGE_CLASS, StorageClass.Standard);
        PutObjectRequest putObjectRequest = new PutObjectRequest(bucketName, s3ObjectKeyName,
                fis, metadata).withSSEAwsKeyManagementParams(new SSEAwsKeyManagementParams());
        
        Upload upload = tm.upload(putObjectRequest);

        // Optionally, wait for the upload to finish before continuing.
        upload.waitForCompletion();

        long end = System.currentTimeMillis();
        long duration = (end - start)/1000;
        
        // Log status
        System.out.println("Successful upload in S3 multipart. Duration = " + duration);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        if (s3Client != null)
            s3Client.shutdown();
        if (tm != null)
            tm.shutdownNow();
    }

}

Best answer

Multipart speeds up the upload only if multiple parts are uploaded concurrently.

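In your code you set withMultipartUploadThreshold. If the upload is larger than that threshold, you should see the parts being uploaded concurrently; if it is not, only a single upload connection should be used. You say your files are >100 MB, and your code sets the multipart upload threshold to 50 * 1024 * 1025 = 52,480,000 bytes, so the parts of such a file should be uploaded concurrently.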
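The threshold arithmetic above can be checked with a few lines of plain Java. This is only a sketch: the class and method names are made up for illustration, and the strict "greater than" comparison is an assumption about how the threshold is applied, not something confirmed here.

```java
public class ThresholdCheck {
    // Threshold exactly as written in the question's code (note 1025, not 1024).
    public static final long QUESTION_THRESHOLD = 50L * 1024 * 1025;
    // For comparison: 50 MiB, which is what 50 * 1024 * 1024 would give.
    public static final long FIFTY_MIB = 50L * 1024 * 1024;

    // Assumption: multipart is chosen when the size strictly exceeds the threshold.
    public static boolean exceedsThreshold(long sizeBytes, long thresholdBytes) {
        return sizeBytes > thresholdBytes;
    }

    public static void main(String[] args) {
        long hundredMb = 100L * 1000 * 1000; // a "100 MB" file, decimal megabytes
        System.out.println(QUESTION_THRESHOLD);                          // 52480000
        System.out.println(FIFTY_MIB);                                   // 52428800
        System.out.println(exceedsThreshold(hundredMb, QUESTION_THRESHOLD)); // true
    }
}
```

The 1025 is probably a typo for 1024, but the difference is about 51 KB, so a file over 100 MB exceeds the threshold either way.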

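However, if your upload throughput is already limited by your network speed, concurrency will not increase it. That may be why you do not observe any speed-up.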
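A rough way to see this is a toy model in which each connection has its own throughput cap and all connections share one link. The model and every number below are assumptions for illustration, not measurements of S3 or of any real network.

```java
public class UploadTimeModel {
    /**
     * Simplified model (an assumption): each connection can push at most
     * perConnMBps, and all connections together cannot exceed linkMBps.
     */
    public static double effectiveMBps(int connections, double perConnMBps, double linkMBps) {
        return Math.min(linkMBps, connections * perConnMBps);
    }

    /** Estimated upload time in seconds under the model above. */
    public static double uploadSeconds(double sizeMB, int connections,
                                       double perConnMBps, double linkMBps) {
        return sizeMB / effectiveMBps(connections, perConnMBps, linkMBps);
    }

    public static void main(String[] args) {
        double sizeMB = 100.0;

        // Case A: a single connection already saturates the 10 MB/s link,
        // so four parallel connections take just as long.
        System.out.println(uploadSeconds(sizeMB, 1, 10.0, 10.0)); // 10.0
        System.out.println(uploadSeconds(sizeMB, 4, 10.0, 10.0)); // 10.0

        // Case B: each connection is capped at 2.5 MB/s, so four parallel
        // connections are needed to fill the same link.
        System.out.println(uploadSeconds(sizeMB, 1, 2.5, 10.0));  // 40.0
        System.out.println(uploadSeconds(sizeMB, 4, 2.5, 10.0));  // 10.0
    }
}
```

In case A the link itself is the bottleneck, which matches the situation described above: adding parallel parts changes nothing. Only in case B does concurrency help.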

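There are other reasons to use multipart as well: it is also recommended for fault tolerance, and it supports a larger maximum object size than a single upload.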

For details, see the documentation:

Multipart upload allows you to upload a single object as a set of parts. Each part is a contiguous portion of the object's data. You can upload these object parts independently and in any order. If transmission of any part fails, you can retransmit that part without affecting other parts. After all parts of your object are uploaded, Amazon S3 assembles these parts and creates the object. In general, when your object size reaches 100 MB, you should consider using multipart uploads instead of uploading the object in a single operation.

Using multipart upload provides the following advantages:

  • Improved throughput - You can upload parts in parallel to improve throughput.

  • Quick recovery from any network issues - Smaller part size minimizes the impact of restarting a failed upload due to a network error.

  • Pause and resume object uploads - You can upload object parts over time. After you initiate a multipart upload, there is no expiry; you must explicitly complete or stop the multipart upload.

  • Begin an upload before you know the final object size - You can upload an object as you are creating it.

We recommend that you use multipart upload in the following ways:

  • If you're uploading large objects over a stable high-bandwidth network, use multipart upload to maximize the use of your available bandwidth by uploading object parts in parallel for multi-threaded performance.

  • If you're uploading over a spotty network, use multipart upload to increase resiliency to network errors by avoiding upload restarts. When using multipart upload, you need to retry uploading only parts that are interrupted during the upload. You don't need to restart uploading your object from the beginning.
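The "retry only the interrupted part" idea can be sketched without the AWS SDK. In the plain-Java simulation below, PartSender is a hypothetical stand-in for sending one part over the network, and the part size, thread count, and simulated failure are all made up for illustration. It is not the S3 API.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

public class PartRetrySketch {

    /** Hypothetical stand-in for sending one part over the network. */
    public interface PartSender {
        void send(int partNumber, byte[] part) throws Exception;
    }

    /** Splits data into contiguous parts of at most partSize bytes. */
    public static List<byte[]> split(byte[] data, int partSize) {
        List<byte[]> parts = new ArrayList<>();
        for (int off = 0; off < data.length; off += partSize) {
            parts.add(Arrays.copyOfRange(data, off, Math.min(data.length, off + partSize)));
        }
        return parts;
    }

    /**
     * Sends all parts concurrently. A failed part is retried on its own,
     * up to maxAttempts times. Returns the total number of send attempts.
     */
    public static int uploadAll(List<byte[]> parts, PartSender sender,
                                int threads, int maxAttempts) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicInteger attempts = new AtomicInteger();
        try {
            List<Future<Object>> futures = new ArrayList<>();
            for (int i = 0; i < parts.size(); i++) {
                final int partNumber = i + 1; // part numbers start at 1
                final byte[] part = parts.get(i);
                futures.add(pool.submit(() -> {
                    for (int attempt = 1; ; attempt++) {
                        attempts.incrementAndGet();
                        try {
                            sender.send(partNumber, part);
                            return null; // this part is done
                        } catch (Exception e) {
                            if (attempt == maxAttempts) {
                                throw e; // give up on this part
                            }
                        }
                    }
                }));
            }
            for (Future<Object> f : futures) {
                f.get(); // wait for every part to finish
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("upload failed", e);
        } finally {
            pool.shutdown();
        }
        return attempts.get();
    }

    /** Demo: 10 bytes in parts of 4 gives 3 parts; part 2 fails once. */
    public static int demoAttempts() {
        byte[] data = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
        Set<Integer> failedOnce = ConcurrentHashMap.newKeySet();
        PartSender flaky = (partNumber, part) -> {
            if (partNumber == 2 && failedOnce.add(partNumber)) {
                throw new IOException("simulated network error on part 2");
            }
        };
        return uploadAll(split(data, 4), flaky, 3, 3);
    }

    public static void main(String[] args) {
        // 3 parts plus 1 retry of part 2 = 4 attempts in total.
        System.out.println("send attempts: " + demoAttempts());
    }
}
```

Parts 1 and 3 are sent once and never resent; only the failed part is attempted again. A single-request upload would have to start over from the beginning.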

This Q&A is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/68255312/
