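snowflake-cloud-data-platform - Full table scan in Snowflake when using ORDER BY and LIMIT on the cluster key

I have a Snowflake table with about 450M rows. It contains only two fields: _date, of type DATE, and Data, of type VARIANT. The cluster key is the date, and events are distributed roughly evenly across days.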

name | cluster_by | rows | bytes | automatic_clustering
DATEDEVENTS | LINEAR(_DATE) | 444,087,723 | 129228379136 | ON

I am trying to run the following simple query:

select *
from datedevents 
order by _DATE
limit 200

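Snowflake performs a full table scan. I can't simply query the first day, then the second day, and so on, because the real use case is more complex. But why can't Snowflake use its cluster key to execute this efficiently, without scanning all the data? I would expect it to go through the first date, then the second date, and so on, until it reaches the limit of 200.

Best answer

Updated with a significant fix

OK, there is a way to get good pruning with a single query.

Setup: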

create or replace transient table test_prune
cluster by (creation_date)
as
select creation_date, body
from temp.public.stackoverflow_posts

Slow query:

select *
from test_prune
order by creation_date
limit 10
-- 10s on a S-WH

Fast query:


select *
from test_prune
where creation_date in (select creation_date from test_prune order by 1 limit 10) 
order by creation_date
limit 10

-- 0.2s on a S-WH
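
Applied to the table from the question, the same pattern might look like the sketch below. It assumes the table and column names given in the question (datedevents, _date) and that the IN subquery prunes there the way it did on test_prune; it is untested on that data.

```sql
-- Sketch only: the same IN-subquery hint, using the names from the question.
select *
from datedevents
where _date in (
    -- up to 200 of the smallest _date values (duplicates are harmless for IN)
    select _date from datedevents order by 1 limit 200
)
order by _date
limit 200;
```

The 200 earliest rows can only carry dates that appear in the inner result, so the extra filter does not change which rows qualify.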

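What is the difference, and why is this IN hint faster, without needing a separate query here?

Well, I created a transient table instead of a temp table. The optimizer prunes better on more "permanent" tables.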
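
One way to check how much pruning a given query actually gets is to compare partitions scanned against partitions total for its table scan. A minimal sketch using GET_QUERY_OPERATOR_STATS, assuming it is run in the same session immediately after the query being inspected; the paths into operator_statistics are an assumption about the layout of that column and may need adjusting:

```sql
-- Run right after the query under test, in the same session.
select operator_type,
       operator_statistics:pruning:partitions_scanned as partitions_scanned,
       operator_statistics:pruning:partitions_total   as partitions_total
from table(get_query_operator_stats(last_query_id()))
where operator_type = 'TableScan';
```

If partitions_scanned is close to partitions_total, the query is effectively doing a full scan. The same numbers appear in the query profile in the web UI.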

Previous answer

We need to help the optimizer here. I created a similar table for my experiments:

create or replace temp table test_prune
cluster by (creation_date)
as
select creation_date, body
from temp.public.stackoverflow_posts
order by creation_date

Now let's run the query on it:

select *
from test_prune
order by creation_date
limit 10

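As you said, this needs optimizing.

I got the best results by splitting the query into two parts.

First, create a table containing the dates you are looking for: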

create or replace temp table top_dates
as 
select distinct creation_date
from (
    select creation_date 
    from test_prune
    order by creation_date
    limit 10
);  --687ms

Then all other queries can use these results:

select *
from test_prune
where creation_date in (select creation_date from top_dates)
order by creation_date
limit 10
;  --308ms

With this separation, the original query went from 7.9 s down to 0.5 s (0.3 + 0.25).
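
A variation on the two-step idea is to keep only the cutoff date in a session variable instead of a temp table, so the second query filters on a constant. This is a sketch, not part of the measurements above: the variable name cutoff_date is hypothetical, and whether it prunes as well as the IN version has not been verified here.

```sql
-- Step 1: find the latest date among the first 10 rows (hypothetical variable name).
set cutoff_date = (
    select max(creation_date)
    from (
        select creation_date
        from test_prune
        order by creation_date
        limit 10
    )
);

-- Step 2: the first 10 rows cannot be later than the cutoff,
-- so a simple range filter on the cluster key is enough.
select *
from test_prune
where creation_date <= $cutoff_date
order by creation_date
limit 10;
```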

Regarding "snowflake-cloud-data-platform - Full table scan in Snowflake when using ORDER BY and LIMIT on the cluster key", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/71486852/
