python - Converting a file from CSV to Parquet on S3 using aws boto

Tags: python amazon-web-services boto3

I have written a script that executes a query on Athena with aws boto and loads the result file into a specified S3 location.

import boto3
import datetime  # needed for datetime.datetime.now() used below
def run_query(query, database, s3_output):
    client = boto3.client('athena', region_name='my-region')
    response = client.start_query_execution(
        QueryString=query,
        QueryExecutionContext={
            'Database': database
            },
        ResultConfiguration={
            'OutputLocation': s3_output,
            }
        )
    print('Execution ID: ' + response['QueryExecutionId'])
    return response

query = """select ..."""

database = 'db_name'

path_template = 's3://bucket_name/path/version={}'

current_time = str(datetime.datetime.now())

result = run_query(query, database, path_template.format(current_time))

It does work, but the problem is that I end up with a CSV file at the specified location. I don't want a CSV file, though; I want a Parquet file.

The only way I have managed to get what I want is to download the file, convert it to Parquet with pandas, and re-upload it. The annoying part is that I cannot convert it directly without fetching the file first.
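For reference, the download-convert-re-upload workaround described above looks roughly like this (a minimal sketch; the bucket name and object keys are placeholders, and pandas needs pyarrow or fastparquet installed for to_parquet):

import boto3
import pandas as pd

bucket = 'bucket_name'                             # placeholder bucket
csv_key = 'path/version=.../results.csv'           # placeholder key of the CSV Athena wrote
parquet_key = 'path/version=.../results.parquet'   # placeholder destination key

s3 = boto3.client('s3')

# Download the CSV result, convert it locally, and upload the Parquet version.
s3.download_file(bucket, csv_key, '/tmp/results.csv')
df = pd.read_csv('/tmp/results.csv')
df.to_parquet('/tmp/results.parquet')
s3.upload_file('/tmp/results.parquet', bucket, parquet_key)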

Does anyone have another suggestion? I don't want to use CTAS.

Best Answer

You need to use CTAS:

CREATE TABLE db.table_name
WITH (
    external_location = 's3://yourbucket/path/table_name',
    format = 'PARQUET',
    parquet_compression = 'GZIP',
    partitioned_by = ARRAY['dt']
)
AS
SELECT
    ...
;

This way, the result of the SELECT is saved as Parquet.
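Since the question already has a run_query helper built on start_query_execution, the CTAS statement can be submitted with it directly. A minimal sketch, reusing the table and bucket names from the example above (note that the output location passed to run_query only receives Athena's metadata for the CTAS query, while the Parquet data itself is written to external_location):

ctas_query = """
CREATE TABLE db.table_name
WITH (
    external_location = 's3://yourbucket/path/table_name',
    format = 'PARQUET',
    parquet_compression = 'GZIP',
    partitioned_by = ARRAY['dt']
)
AS
SELECT
    ...
"""

# The s3_output argument holds only the query metadata/manifest;
# the actual Parquet files land in external_location above.
run_query(ctas_query, 'db', 's3://bucket_name/athena-query-results/')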

https://docs.aws.amazon.com/athena/latest/ug/ctas-examples.html

https://docs.aws.amazon.com/athena/latest/ug/ctas.html

Use CTAS queries to:

  • Create tables from query results in one step, without repeatedly querying raw data sets. This makes it easier to work with raw data sets.
  • Transform query results into other storage formats, such as Parquet and ORC. This improves query performance and reduces query costs in Athena. For information, see Columnar Storage Formats.
  • Create copies of existing tables that contain only the data you need.

Update (2019-10):

AWS has just released INSERT INTO for Athena.

https://docs.aws.amazon.com/en_pv/athena/latest/ug/insert-into.html

Inserts new rows into a destination table based on a SELECT query statement that runs on a source table, or based on a set of VALUES provided as part of the statement. When the source table is based on underlying data in one format, such as CSV or JSON, and the destination table is based on another format, such as Parquet or ORC, you can use INSERT INTO queries to transform selected data into the destination table's format.
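In practice this means the existing CSV-backed table can act as the source and a Parquet-backed table as the destination. A hedged sketch using the run_query helper from the question (csv_table and parquet_table are hypothetical table names, and the Parquet destination table is assumed to already exist in the catalog):

insert_query = """
INSERT INTO db_name.parquet_table
SELECT *
FROM db_name.csv_table
"""

run_query(insert_query, 'db_name', 's3://bucket_name/athena-query-results/')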

There are some limitations:

  • INSERT INTO is not supported on bucketed tables. For more information, see Bucketing vs Partitioning.
  • When running an INSERT query on a table with underlying data that is encrypted in Amazon S3, the output files that the INSERT query writes are not encrypted by default. We recommend that you encrypt INSERT query results if you are inserting into tables with encrypted data. For more information about encrypting query results using the console, see Encrypting Query Results Stored in Amazon S3. To enable encryption using the AWS CLI or Athena API, use the EncryptionConfiguration properties of the StartQueryExecution action to specify Amazon S3 encryption options according to your requirements.
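On the second point, enabling encryption from boto3 means passing an EncryptionConfiguration alongside the OutputLocation in start_query_execution. A minimal sketch (SSE_S3 is just one of the supported options; the table and bucket names are the hypothetical ones used above):

import boto3

client = boto3.client('athena', region_name='my-region')
response = client.start_query_execution(
    QueryString='INSERT INTO db_name.parquet_table SELECT * FROM db_name.csv_table',
    QueryExecutionContext={'Database': 'db_name'},
    ResultConfiguration={
        'OutputLocation': 's3://bucket_name/athena-query-results/',
        # Without this block, the files written by the INSERT query are unencrypted by default.
        'EncryptionConfiguration': {
            'EncryptionOption': 'SSE_S3',  # or 'SSE_KMS' / 'CSE_KMS' together with 'KmsKey'
        },
    },
)
print(response['QueryExecutionId'])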

Original question on Stack Overflow: https://stackoverflow.com/questions/58390984/
