python - s3fs gzip compression on a pandas dataframe

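I am trying to write a dataframe to S3 as a CSV file using the s3fs library and pandas. Despite what the documentation says, the gzip compression parameter does not seem to work with s3fs.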

def DfTos3Csv(df, file):
    with fs.open(file, 'wb') as f:
        df.to_csv(f, compression='gzip', index=False)

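This code saves the dataframe as a new object in S3, but as a plain CSV rather than in gzip format. The read function, on the other hand, works fine with the same compression parameter.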

def s3CsvToDf(file):
    with fs.open(file) as f:
        df = pd.read_csv(f, compression='gzip')
    return df
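
To confirm whether an object really was written as gzip, you can check its first two bytes: every gzip stream starts with the magic number 0x1f 0x8b. The sketch below uses only the standard library; the helper name looks_gzipped is made up for illustration and is not part of pandas or s3fs.

```python
import gzip

# Every gzip stream begins with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def looks_gzipped(data: bytes) -> bool:
    """Return True if data starts with the gzip magic number (hypothetical helper)."""
    return data[:2] == GZIP_MAGIC


plain = b"a,b\n1,2\n"
print(looks_gzipped(plain))                 # False
print(looks_gzipped(gzip.compress(plain)))  # True
```

Against S3, the same check would presumably be run on the first two bytes read from the object opened in binary mode, for example f.read(2) inside with fs.open(file, 'rb') as f.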

Any suggestions or alternatives for the write problem? Thanks in advance!

Best answer

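The compression parameter of to_csv() has no effect when it is given an already-open stream, such as the s3fs file object here, rather than a path. You have to do the compression and the upload as separate steps.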

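import gzip
from io import BytesIO, TextIOWrapper

import boto3

# Compress the CSV into an in-memory buffer first.
buffer = BytesIO()

with gzip.GzipFile(mode='w', fileobj=buffer) as zipped_file:
    # to_csv() writes text, so wrap the binary gzip stream in a text layer.
    # Leaving the inner block flushes the buffered text into the gzip stream;
    # newline='' stops the wrapper from translating line endings a second time.
    with TextIOWrapper(zipped_file, encoding='utf8', newline='') as text_file:
        df.to_csv(text_file, index=False)

# Upload the compressed bytes as a single S3 object.
s3_resource = boto3.resource('s3')
s3_object = s3_resource.Object('bucket_name', 'key')
s3_object.put(Body=buffer.getvalue())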
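
If you would rather keep using s3fs for the upload, the same "compress first, then write the bytes" idea can be sketched as follows. This is a minimal variant under stated assumptions, not the method from the answer above: the helper names to_gzipped_csv_bytes and write_gzipped_csv are made up for illustration, fs is assumed to be an s3fs.S3FileSystem (or any object with a compatible open()), and the sketch relies on DataFrame.to_csv() returning the CSV as a string when it is given no target.

```python
import gzip


def to_gzipped_csv_bytes(df) -> bytes:
    """Serialize df to CSV text, then gzip the UTF-8 encoded bytes."""
    # With no path or buffer argument, DataFrame.to_csv() returns a str.
    csv_text = df.to_csv(index=False)
    return gzip.compress(csv_text.encode("utf-8"))


def write_gzipped_csv(df, path, fs) -> None:
    """Write df as a gzip-compressed CSV through fs.open(path, 'wb').

    fs is assumed to be an s3fs.S3FileSystem, or any object whose
    open(path, mode) returns a writable binary file object.
    """
    with fs.open(path, "wb") as f:
        f.write(to_gzipped_csv_bytes(df))
```

A call would look like write_gzipped_csv(df, 'bucket_name/key.csv.gz', fs), where the path is a placeholder. Like the BytesIO version, this holds the whole compressed file in memory before uploading, which is fine for moderately sized dataframes but may not suit very large ones.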

A similar question about s3fs gzip compression on a pandas dataframe can be found on Stack Overflow: https://stackoverflow.com/questions/50350844/
