amazon-athena - Redshift Spectrum performance vs. Athena

I have a bucket in S3 that contains Parquet files and is partitioned by date.

With the following query:

select
    count(1)
from logs.logs_prod
where partition_1 = '2019' and partition_2 = '03'

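Running this query directly in Athena takes less than 10 seconds. When I run the same query in Redshift, it takes more than 3 minutes. Both return the same, correct value; in this case the partition holds fewer than 80,000 rows.

I am using AWS Glue as the metadata store for both Athena and Redshift.

The Redshift query plan is as follows: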

QUERY PLAN
XN Limit  (cost=250000037.51..250000037.51 rows=1 width=8)
  ->  XN Aggregate  (cost=250000037.51..250000037.51 rows=1 width=8)
        ->  XN Partition Loop  (cost=250000000.00..250000035.00 rows=1000 width=8)
              ->  XN Seq Scan PartitionInfo of logs.logs_prod  (cost=0.00..15.00 rows=1 width=0)
                    Filter: (((partition_1)::text = '2019'::text) AND ((partition_2)::text = '03'::text))
              ->  XN S3 Query Scan logs_prod  (cost=125000000.00..125000010.00 rows=1000 width=8)
                    ->  S3 Aggregate  (cost=125000000.00..125000000.00 rows=1000 width=0)
                          ->  S3 Seq Scan logs.logs_prod location:"s3://logs-prod/" format:PARQUET  (cost=0.00..100000000.00 rows=10000000000 width=0)

Is this a Redshift Spectrum configuration issue? Or is it possible that queries in Redshift simply cannot get close to Athena?

Best answer

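I don't think you should read too much into this test. Judging from the plan, Redshift Spectrum does not seem to take advantage of the fact that Parquet files carry metadata about the number of rows in each file, which is something I believe Athena can do with Parquet.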
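
One way to see what Spectrum actually did for the slow query is to look at the SVL_S3QUERY_SUMMARY system view right afterwards. This is only a sketch, and it assumes the count query was the last thing run in the same Redshift session, since that is what pg_last_query_id() picks up:

```sql
-- Run in the same Redshift session, right after the slow count query.
-- Assumption: pg_last_query_id() still points at that count query.
select
    query,
    segment,
    elapsed,                 -- elapsed time in microseconds
    s3_scanned_rows,         -- rows read from S3 by the Spectrum layer
    s3_scanned_bytes,
    s3query_returned_rows,   -- rows handed back to the cluster
    files                    -- number of files processed
from svl_s3query_summary
where query = pg_last_query_id()
order by query, segment;
```

If files or s3_scanned_rows come back far larger than you would expect for a single month, more data is probably being touched than the partition filter implies. If they look reasonable, the time is likely going elsewhere, and the comparison says less about raw scan speed.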

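The real-world performance of Athena versus Redshift Spectrum is hard to measure. With Athena you don't know how much capacity you get (but it is a lot), whereas with Redshift Spectrum the dedicated capacity you get depends on the size of your cluster. With a Redshift cluster of about 20 CPUs, I found that Athena performed better for most queries, but a larger Redshift cluster might well do better.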

Original question on Stack Overflow: https://stackoverflow.com/questions/55654493/
