amazon-web-services - Renaming PySpark output files in S3

Tags: amazon-web-services amazon-s3 pyspark

I save a PySpark DataFrame to S3 with the following command:

(df.coalesce(1)
   .write
   .partitionBy('DATE')
   .format("com.databricks.spark.csv")
   .mode('overwrite')
   .option("header", "true")
   .save(output_path))

which gives me:

file_path/FLORIDA/DATE=2019-04-29/part-00000-1691d1c6-2c49-4cbe-b454-d0165a0d7bde.c000.csv
file_path/FLORIDA/DATE=2019-04-30/part-00000-1691d1c6-2c49-4cbe-b454-d0165a0d7bde.c000.csv
file_path/FLORIDA/DATE=2019-05-01/part-00000-1691d1c6-2c49-4cbe-b454-d0165a0d7bde.c000.csv
file_path/FLORIDA/DATE=2019-05-02/part-00000-1691d1c6-2c49-4cbe-b454-d0165a0d7bde.c000.csv

Is there a simple way to rename these paths in S3 so that they follow this structure?

file_path/FLORIDA/allocation_FLORIDA_20190429.csv
file_path/FLORIDA/allocation_FLORIDA_20190430.csv
file_path/FLORIDA/allocation_FLORIDA_20190501.csv
file_path/FLORIDA/allocation_FLORIDA_20190502.csv

I have several thousand of these files, so a programmatic way to do this would be much appreciated!

Best answer

I worked out a decent way to solve this:

import datetime
import boto3

s3 = boto3.resource('s3')
start = datetime.datetime(2019, 4, 29)

# Rename one file per day, starting at 2019-04-29.
for i in range(5):
    date = (start + datetime.timedelta(days=i)).strftime("%Y-%m-%d")
    print(date)
    # Key that Spark wrote for this DATE partition.
    old_key = 'file_path/FLORIDA/DATE={}/part-00000-1691d1c6-2c49-4cbe-b454-d0165a0d7bde.c000.csv'.format(date)
    print(old_key)
    # The target key uses the date without dashes.
    new_key = 'file_path/FLORIDA/allocation_FLORIDA_{}.csv'.format(date.replace('-', ''))
    print(new_key)

    # S3 has no rename operation: copy to the new key, then delete the original.
    s3.Object('my_bucket', new_key).copy_from(CopySource='my_bucket/' + old_key)
    s3.Object('my_bucket', old_key).delete()
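
The loop above hard-codes the start date, the number of days, and the part-file name. With several thousand files, it may be easier to list whatever keys exist under a prefix and derive each new name from the key itself. The following is only a sketch of that idea, not part of the original answer: `target_key` and `rename_part_files` are hypothetical helper names, and it assumes each `DATE=` folder contains exactly one part file (as produced by `coalesce(1)`), otherwise several files would map to the same target key.

```python
import re

# Matches keys such as
#   file_path/FLORIDA/DATE=2019-04-29/part-00000-<uuid>.c000.csv
# and captures the leading prefix, the folder name before DATE=, and the date.
PART_KEY = re.compile(
    r"^(?P<parent>.*/)?(?P<state>[^/]+)/"
    r"DATE=(?P<y>\d{4})-(?P<m>\d{2})-(?P<d>\d{2})/part-[^/]+\.csv$"
)


def target_key(key):
    """Return the renamed key for a Spark part file, or None if the key
    does not look like one (for example a _SUCCESS marker)."""
    match = PART_KEY.match(key)
    if match is None:
        return None
    parent = match.group("parent") or ""
    state = match.group("state")
    stamp = match.group("y") + match.group("m") + match.group("d")
    return f"{parent}{state}/allocation_{state}_{stamp}.csv"


def rename_part_files(bucket_name, prefix):
    """Copy every matching part file under `prefix` to its new key and
    delete the original. Hypothetical helper; bucket and prefix are
    supplied by the caller."""
    # Imported here so target_key() can be used without boto3 installed.
    import boto3

    bucket = boto3.resource("s3").Bucket(bucket_name)
    # Take a snapshot of the listing before creating any new objects.
    for obj in list(bucket.objects.filter(Prefix=prefix)):
        new_key = target_key(obj.key)
        if new_key is None:
            continue  # not a part file, leave it alone
        bucket.Object(new_key).copy_from(
            CopySource={"Bucket": bucket_name, "Key": obj.key}
        )
        obj.delete()
```

With the names used in the question, this would be called as `rename_part_files('my_bucket', 'file_path/FLORIDA/')`. Keeping the key mapping in a separate pure function makes it possible to print or check the planned renames before anything in the bucket is touched.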

Regarding amazon-web-services - Renaming PySpark output files in S3, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/62347959/
