python-3.x - Iterating over files in an S3 bucket using the folder structure

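I have an S3 bucket. Inside the bucket there is a folder for the year 2018, and under it some files we collected for each month and each day. So, for example, 2018\3\24, 2018\3\25, and so on.

We did not put the date inside the files in each day's folder.

Basically, I want to iterate through the bucket and use the folder structure to classify each file by its "date", because we need to load the files into different databases and need a way to identify them.

I have read a lot of posts about using boto3, more than once, but there seem to be conflicting details about whether what I need can actually be done.

If there is an easier way to do this, please suggest it.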

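I got it working:
import boto3

s3client = boto3.client('s3')
bucket = 'bucketname'
startAfter = '2018'

# StartAfter returns only keys that sort after this string.
# A single call returns at most 1000 keys.
s3objects = s3client.list_objects_v2(Bucket=bucket, StartAfter=startAfter)
for obj in s3objects['Contents']:
    print(obj['Key'])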

Best answer

A single list request returns at most 1000 objects, so to get every object in the bucket you can use the S3 paginator.
client.get_paginator('list_objects_v2') is what you need.

Something like this:

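import boto3

client = boto3.client('s3')
paginator = client.get_paginator('list_objects_v2')
# Each page holds up to 1000 keys; the paginator requests the next page for you.
result = paginator.paginate(Bucket='bucketname', StartAfter='2018')
for page in result:
    # A page has no "Contents" entry when it holds no objects.
    if "Contents" in page:
        for key in page["Contents"]:
            keyString = key["Key"]
            print(keyString)  # print() is a function in Python 3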
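
To connect the listing back to the original question (classifying each file by the date encoded in its folder path), here is a minimal sketch. It rests on assumptions that are not in the question: that keys look like `2018/3/24/filename` with `/` as the separator, and that limiting the listing with `Prefix='2018/'` is acceptable in place of `StartAfter`. The helper names `date_from_key`, `group_keys_by_date`, and `iter_keys` are hypothetical.

```python
from collections import defaultdict
from datetime import date


def date_from_key(key):
    """Parse a date from a key shaped like 'YYYY/M/D/filename' (assumed layout).

    Returns None for folder placeholders and for keys that do not match.
    """
    if key.endswith("/"):  # "folder" placeholder object, not a file
        return None
    parts = key.split("/")
    if len(parts) < 4:  # need year, month, day and a file name
        return None
    try:
        return date(int(parts[0]), int(parts[1]), int(parts[2]))
    except ValueError:  # non-numeric or out-of-range components
        return None


def group_keys_by_date(keys):
    """Group keys into {date: [key, ...]}, skipping keys without a date."""
    grouped = defaultdict(list)
    for key in keys:
        day = date_from_key(key)
        if day is not None:
            grouped[day].append(key)
    return dict(grouped)


def iter_keys(client, bucket, prefix):
    """Yield every key under `prefix`, one page (up to 1000 keys) at a time."""
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]


# Possible usage (needs boto3 and AWS credentials; bucket name is a placeholder):
#   import boto3
#   by_day = group_keys_by_date(iter_keys(boto3.client("s3"), "bucketname", "2018/"))
```

The parsing is kept separate from the listing so that it can be checked without touching S3. If the real keys use a different layout, only `date_from_key` should need to change.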

From the documentation:

list_objects:

Returns some or all (up to 1000) of the objects in a bucket. You can use the request parameters as selection criteria to return a subset of the objects in a bucket.

list_objects_v2:

Returns some or all (up to 1000) of the objects in a bucket. You can use the request parameters as selection criteria to return a subset of the objects in a bucket. Note: ListObjectsV2 is the revised List Objects API and we recommend you use this revised API for new application development.

From this answer:

list_objects_v2 has added features. Due to the 1000 keys per page listing limits, using marker to list multiple pages can be an headache. Logically, you need to keep track the last key you successfully processed. With ContinuationToken, you don't need to know the last key, you just check existence of NextContinuationToken in the response. You can spawn parallel process to deal with multiple of 1000 keys without dealing with the last key to fetch next page.
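
For comparison with the paginator, the token-based loop described in the quote might look roughly like this. It is a sketch, not the quoted author's code: the function name `list_all_keys` is hypothetical, and the optional `prefix` argument is an assumption added for convenience.

```python
def list_all_keys(client, bucket, prefix=None):
    """Collect every key by following NextContinuationToken by hand."""
    kwargs = {"Bucket": bucket}
    if prefix:
        kwargs["Prefix"] = prefix

    keys = []
    while True:
        response = client.list_objects_v2(**kwargs)
        keys.extend(obj["Key"] for obj in response.get("Contents", []))

        # The token is only present when more results remain.
        token = response.get("NextContinuationToken")
        if not token:
            break
        kwargs["ContinuationToken"] = token
    return keys


# Possible usage (needs boto3 and AWS credentials; bucket name is a placeholder):
#   import boto3
#   keys = list_all_keys(boto3.client("s3"), "bucketname", prefix="2018/")
```

The paginator from the answer performs essentially this loop internally, so the manual version is mainly useful when you want to store the token and resume later.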

This question, "python-3.x - Iterating over files in an S3 bucket using the folder structure", comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/49482274/
