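python - Reading a ZIP file from S3 without downloading the entire file

We have ZIP files that are 5-10 GB in size. A typical ZIP file contains 5-10 internal files, each 1-5 GB uncompressed.

I have a nice set of Python tools for reading these files. Basically, I can open a filename, and if there is a ZIP file, the tools search inside the ZIP file and then open the compressed file. It is all rather transparent.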

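I want to store these files in Amazon S3 as compressed files. I can fetch ranges of S3 files, so it should be possible to fetch the ZIP central directory (it is at the end of the file, so I can just read the last 64 KiB), find the component I want, download that, and stream it directly to the calling process.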
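
To make the "read the last 64 KiB" idea concrete, here is a minimal sketch of locating the central directory from the tail of an archive. The names `find_central_directory` and `tail` are made up for illustration, and the sketch assumes a non-ZIP64 archive whose comment does not itself contain the end-of-central-directory signature.

```python
import io
import struct
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"
EOCD_MIN_SIZE = 22  # fixed part of the record, without the trailing comment


def find_central_directory(tail):
    """Return (cd_offset, cd_size) parsed from the trailing bytes of a ZIP file.

    `tail` holds the last N bytes of the archive, for example the last 64 KiB.
    """
    pos = tail.rfind(EOCD_SIGNATURE)
    if pos < 0 or len(tail) - pos < EOCD_MIN_SIZE:
        raise ValueError("end-of-central-directory record not found")
    # After the 4-byte signature: two disk numbers and two entry counts (2 bytes
    # each), central directory size (4), central directory offset (4),
    # comment length (2).
    fields = struct.unpack("<4H2LH", tail[pos + 4:pos + EOCD_MIN_SIZE])
    cd_size, cd_offset = fields[4], fields[5]
    if cd_size == 0xFFFFFFFF or cd_offset == 0xFFFFFFFF:
        raise ValueError("ZIP64 archive: the real values are in the ZIP64 records")
    return cd_offset, cd_size


# Demo on a small in-memory archive that carries a comment.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("a.txt", b"hello")
    zf.comment = b"demo comment"
data = buf.getvalue()
cd_offset, cd_size = find_central_directory(data[-65536:])
```

With S3, `tail` would come from a ranged GET of the last 64 KiB, and `cd_offset` would then be an absolute offset into the stored object.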

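So my question is, how do I do that through the standard Python ZipFile API? It is not documented how to replace the filesystem transport with an arbitrary object that supports POSIX semantics. Is this possible without rewriting the module?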
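
One possible way to do this, sketched under assumptions: `zipfile.ZipFile` accepts any seekable, readable file object, so a small class that turns `seek`/`read` into range requests can stand in for a local file. `RangeReader` and the `fetch(start, length)` callable are hypothetical names; the demo uses an in-memory blob in place of S3 so it can run anywhere.

```python
import io
import zipfile


class RangeReader(io.RawIOBase):
    """Read-only, seekable file object backed by a range-fetch callable.

    `fetch(start, length)` is a hypothetical callable returning the bytes in
    [start, start + length); `size` is the total size of the remote object.
    """

    def __init__(self, fetch, size):
        super().__init__()
        self._fetch = fetch
        self._size = size
        self._pos = 0

    def readable(self):
        return True

    def seekable(self):
        return True

    def tell(self):
        return self._pos

    def seek(self, offset, whence=io.SEEK_SET):
        if whence == io.SEEK_SET:
            pos = offset
        elif whence == io.SEEK_CUR:
            pos = self._pos + offset
        elif whence == io.SEEK_END:
            pos = self._size + offset
        else:
            raise ValueError("invalid whence: %r" % (whence,))
        if pos < 0:
            raise OSError("negative seek position")
        self._pos = pos
        return self._pos

    def readinto(self, b):
        # Never request bytes past the end of the object.
        length = min(len(b), self._size - self._pos)
        if length <= 0:
            return 0
        chunk = self._fetch(self._pos, length)
        b[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


# Demo: an in-memory "remote" archive, recording how many bytes get requested.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:  # ZIP_STORED, so the archive stays large
    zf.writestr("big.bin", b"\0" * 200000)
    zf.writestr("small.txt", b"hello")
blob = buf.getvalue()
requested = []


def fetch_from_memory(start, length):
    requested.append(length)
    return blob[start:start + length]


# BufferedReader batches zipfile's many small reads into fewer fetches.
reader = io.BufferedReader(RangeReader(fetch_from_memory, len(blob)))
with zipfile.ZipFile(reader) as zf:
    small = zf.read("small.txt")
```

In the demo only the end-of-archive records, the central directory, and the requested member are fetched, so `sum(requested)` stays far below `len(blob)`. For S3, `fetch` would wrap a ranged GET and `size` would come from the object's metadata.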

Best answer

Here is an approach that does not need to fetch the entire file (a full version is linked from the original answer).

It does require boto (or boto3), though, unless you can mimic the ranged GETs via the AWS CLI, which I guess is quite possible as well.

import sys
import zlib
import zipfile
import io

import boto
from boto.s3.connection import OrdinaryCallingFormat


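# range-fetches an S3 key: returns `length` bytes starting at byte offset `start`
def fetch(key, start, length):
    end = start + length - 1  # HTTP byte ranges are inclusive on both ends
    return key.get_contents_as_string(headers={"Range": "bytes=%d-%d" % (start, end)})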


# parses 2 or 4 little-endian bytes into their corresponding integer value
def parse_int(data):
    data = bytearray(data)  # indexing yields ints on both Python 2 and Python 3
    val = data[0] + (data[1] << 8)
    if len(data) > 3:
        val += (data[2] << 16) + (data[3] << 24)
    return val


# The code below expects these names to be defined:
#   bucket: name of the bucket
#   key:    path to zipfile inside bucket
#   entry:  pathname of zip entry to be retrieved (path/to/subdir/file.name)

# OrdinaryCallingFormat prevents certificate errors on bucket names with dots
# https://stackoverflow.com/questions/51604689/read-zip-files-from-amazon-s3-using-boto3-and-python#51605244
_bucket = boto.connect_s3(calling_format=OrdinaryCallingFormat()).get_bucket(bucket)
_key = _bucket.get_key(key)

# fetch the last 22 bytes (end-of-central-directory record; assuming the comment field is empty
# and the archive is not ZIP64, where overflowing 32-bit fields hold 0xFFFFFFFF placeholders)
size = _key.size
eocd = fetch(_key, size - 22, 22)

# start offset and size of the central directory
cd_start = parse_int(eocd[16:20])
cd_size = parse_int(eocd[12:16])

# fetch central directory, append EOCD, and open as zipfile!
cd = fetch(_key, cd_start, cd_size)
zip = zipfile.ZipFile(io.BytesIO(cd + eocd))


for zi in zip.filelist:
    if zi.filename == entry:
        # local file header starting at file name length + file content
        # (so we can reliably skip file name and extra fields)

        # in our "mock" zipfile, `header_offset`s are negative: zipfile finds the central directory
        # at position 0 instead of `cd_start` and shifts every offset by the difference,
        # so we have to add the CD start offset (`cd_start`) back to get the actual offset

        file_head = fetch(_key, cd_start + zi.header_offset + 26, 4)
        name_len = parse_int(file_head[0:2])
        extra_len = parse_int(file_head[2:4])

        content = fetch(_key, cd_start + zi.header_offset + 30 + name_len + extra_len, zi.compress_size)

        # now `content` has the file entry you were looking for!
        # you should probably decompress it in context before passing it to some other program

        if zi.compress_type == zipfile.ZIP_DEFLATED:
            # -15 selects raw DEFLATE (no zlib header), which is what ZIP entries contain
            print(zlib.decompressobj(-15).decompress(content))
        else:
            print(content)
        break
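
The `fetch` helper above uses the legacy boto API. With boto3, the same ranged GET could be sketched as follows; `fetch_range` and `object_size` are hypothetical helper names, and `s3_client` is assumed to be the object returned by `boto3.client("s3")`.

```python
def fetch_range(s3_client, bucket, key, start, length):
    """Return `length` bytes of s3://bucket/key starting at byte `start`.

    `s3_client` is assumed to behave like `boto3.client("s3")`.
    """
    end = start + length - 1  # HTTP byte ranges are inclusive on both ends
    response = s3_client.get_object(
        Bucket=bucket, Key=key, Range="bytes=%d-%d" % (start, end)
    )
    return response["Body"].read()


def object_size(s3_client, bucket, key):
    """Return the size of the object in bytes without downloading it."""
    return s3_client.head_object(Bucket=bucket, Key=key)["ContentLength"]
```

These two calls would take the place of `fetch(_key, ...)` and `_key.size` in the code above; the rest of the logic does not depend on which client library is used.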

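In your case you will probably need to write the fetched content to a local file (because of its large size), unless memory usage is not a concern.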
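
A minimal sketch of that idea, assuming a DEFLATE-compressed entry: fetch the compressed data in fixed-size ranges and inflate each piece straight into a local file, so that only one chunk is held in memory at a time. `iter_ranges`, `inflate_to_file`, and the `fetch(offset, length)` callable are hypothetical names.

```python
import zlib


def iter_ranges(fetch, start, total, chunk_size=8 * 1024 * 1024):
    """Yield `total` bytes beginning at `start` by calling `fetch(offset, length)`."""
    offset, end = start, start + total
    while offset < end:
        length = min(chunk_size, end - offset)
        yield fetch(offset, length)
        offset += length


def inflate_to_file(chunks, out_path):
    """Inflate an iterable of raw-DEFLATE chunks into `out_path`.

    Returns the number of decompressed bytes written.
    """
    inflater = zlib.decompressobj(-15)  # -15: raw DEFLATE, as stored in ZIP entries
    written = 0
    with open(out_path, "wb") as out:
        for chunk in chunks:
            data = inflater.decompress(chunk)
            out.write(data)
            written += len(data)
        tail = inflater.flush()  # emit anything still buffered
        out.write(tail)
        written += len(tail)
    return written
```

Wired into the code above, this might look like `inflate_to_file(iter_ranges(lambda o, n: fetch(_key, o, n), data_start, zi.compress_size), "out.bin")`, where `data_start` stands for the content offset computed in the loop. Highly compressible data can still expand a lot per chunk; a smaller `chunk_size`, or the `max_length` argument of `decompress`, would bound that.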

This Q&A is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/51351000/
