python - Converting a file from CSV to Parquet on S3 using aws boto

Tags: python amazon-web-services boto3

I have written a script that executes a query on Athena with aws boto and loads the result file into a specified S3 location.

import boto3
import datetime  # needed for datetime.datetime.now() used below
def run_query(query, database, s3_output):
    client = boto3.client('athena', region_name='my-region')
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={
            'Database': database
            },
        ResultConfiguration={
            'OutputLocation': s3_output,
            }
        )
    print('Execution ID: ' + response['QueryExecutionId'])
    return response

query = """select ..."""

database = 'db_name'

path_template = 's3://bucket_name/path/version={}'

current_time = str(datetime.datetime.now())

result = run_query(query, database, path_template.format(current_time))

It does work, but the problem is that I end up with a CSV file at the specified location. I don't want a CSV file, though; I want a Parquet file.

The only way I have managed to get what I want is to download the file, convert it to Parquet with pandas, and re-upload it. The annoying part is that I cannot convert it directly without fetching the file first.
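For reference, the download-convert-re-upload workaround described above looks roughly like this (a minimal sketch; the bucket name and object keys are placeholders, and pandas needs pyarrow or fastparquet installed for to_parquet):

import boto3
import pandas as pd

bucket = 'bucket_name'                             # placeholder bucket
csv_key = 'path/version=.../results.csv'           # placeholder key of the CSV Athena wrote
parquet_key = 'path/version=.../results.parquet'   # placeholder destination key

s3 = boto3.client('s3')

# Download the CSV result, convert it locally, and upload the Parquet version.
s3.download_file(bucket, csv_key, '/tmp/results.csv')
df = pd.read_csv('/tmp/results.csv')
df.to_parquet('/tmp/results.parquet')
s3.upload_file('/tmp/results.parquet', bucket, parquet_key)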

Does anyone have another suggestion? I don't want to use CTAS.

Best Answer

You need to use CTAS:

CREATE TABLE db.table_name
WITH (
    external_location = 's3://yourbucket/path/table_name',
    format = 'PARQUET',
    parquet_compression = 'GZIP',
    partitioned_by = ARRAY['dt']
)
AS
SELECT
    ...
;

This way, the result of the SELECT is saved as Parquet.
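Since the question already has a run_query helper built on start_query_execution, the CTAS statement can be submitted with it directly. A minimal sketch, reusing the table and bucket names from the example above (note that the output location passed to run_query only receives Athena's metadata for the CTAS query, while the Parquet data itself is written to external_location):

ctas_query = """
CREATE TABLE db.table_name
WITH (
    external_location = 's3://yourbucket/path/table_name',
    format = 'PARQUET',
    parquet_compression = 'GZIP',
    partitioned_by = ARRAY['dt']
)
AS
SELECT
    ...
"""

# The s3_output argument holds only the query metadata/manifest;
# the actual Parquet files land in external_location above.
run_query(ctas_query, 'db', 's3://bucket_name/athena-query-results/')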

https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html

https://docs.aws.amazon.com/athena/latest/ug/ctas.html

Use CTAS queries to:

  • Create tables from query results in one step, without repeatedly querying raw data sets. This makes it easier to work with raw data sets.
  • Transform query results into other storage formats, such as Parquet and ORC. This improves query performance and reduces query costs in Athena. For information, see Columnar Storage Formats.
  • Create copies of existing tables that contain only the data you need.

Update (2019-10):

AWS has just released INSERT INTO for Athena.

https://docs.aws.amazon.com/en_pv/athena/latest/ug/insert-into.html

Inserts new rows into a destination table based on a SELECT query statement that runs on a source table, or based on a set of VALUES provided as part of the statement. When the source table is based on underlying data in one format, such as CSV or JSON, and the destination table is based on another format, such as Parquet or ORC, you can use INSERT INTO queries to transform selected data into the destination table's format.
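In practice this means the existing CSV-backed table can act as the source and a Parquet-backed table as the destination. A hedged sketch using the run_query helper from the question (csv_table and parquet_table are hypothetical table names, and the Parquet destination table is assumed to already exist in the catalog):

insert_query = """
INSERT INTO db_name.parquet_table
SELECT *
FROM db_name.csv_table
"""

run_query(insert_query, 'db_name', 's3://bucket_name/athena-query-results/')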

There are some limitations:

  • INSERT INTO is not supported on bucketed tables. For more information, see Bucketing vs Partitioning.
  • When running an INSERT query on a table with underlying data that is encrypted in Amazon S3, the output files that the INSERT query writes are not encrypted by default. We recommend that you encrypt INSERT query results if you are inserting into tables with encrypted data. For more information about encrypting query results using the console, see Encrypting Query Results Stored in Amazon S3. To enable encryption using the AWS CLI or Athena API, use the EncryptionConfiguration properties of the StartQueryExecution action to specify Amazon S3 encryption options according to your requirements.
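On the second point, enabling encryption from boto3 means passing an EncryptionConfiguration alongside the OutputLocation in start_query_execution. A minimal sketch (SSE_S3 is just one of the supported options; the table and bucket names are the hypothetical ones used above):

import boto3

client = boto3.client('athena', region_name='my-region')
response = client.start_query_execution(
    QueryString='INSERT INTO db_name.parquet_table SELECT * FROM db_name.csv_table',
    QueryExecutionContext={'Database': 'db_name'},
    ResultConfiguration={
        'OutputLocation': 's3://bucket_name/athena-query-results/',
        # Without this block, the files written by the INSERT query are unencrypted by default.
        'EncryptionConfiguration': {
            'EncryptionOption': 'SSE_S3',  # or 'SSE_KMS' / 'CSE_KMS' together with 'KmsKey'
        },
    },
)
print(response['QueryExecutionId'])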

Original question on Stack Overflow: https://stackoverflow.com/questions/58390984/
