sql - PostgreSQL query is slow if the table in the IN clause is empty

I have the following SQL:

WITH filtered_users_pre as (
  SELECT value as username,row_number() OVER (partition by value) AS rk
    FROM "user-stats".tag_table
    WHERE _at_timestamp = 1626955200
       AND tag in ('commercial','marketing')
  ),

  filtered_users as (
    SELECT username
    FROM filtered_users_pre
    WHERE rk = 2
  ),

  valid_users as (
    SELECT aa.username, aa.rank, aa.points, aa.version
    FROM "users-results".ai_algo aa
    WHERE aa._at_timestamp = 1626955200
          AND aa.rank_timeframe = '7d'
          AND aa.username IN (SELECT * FROM filtered_users)
    ORDER BY aa.rank ASC
    LIMIT 15
    OFFSET 0
  )
select * from valid_users;

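"user-stats".tag_table is a table with about 60 million rows and proper indexes. "users-results".ai_algo is a table with about 10 million rows and proper indexes.

By "proper indexes" I mean indexes on all the fields that appear in the WHERE clauses above.

If filtered_users is empty, the query takes 4 seconds to run. If filtered_users has at least one row, it takes 400 ms.

Can anyone explain why? Is there a way to make the query run with the same performance (400 ms) when filtered_users is empty? I expected better performance as the number of rows in filtered_users decreases. That is what happens down to 1 row. With 0 rows, it takes 10 times longer.

Of course, the same thing happens if I use an INNER JOIN between ai_algo and filtered_users instead of the IN clause in the WHERE.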

Update: this is the EXPLAIN (ANALYZE, BUFFERS) output for the query when filtered_users has 0 rows (4 seconds of execution):

Limit  (cost=14592.13..15870.39 rows=15 width=35) (actual time=3953.945..3953.949 rows=0 loops=1)
  Buffers: shared hit=7456641
  ->  Nested Loop Semi Join  (cost=14592.13..1795382.62 rows=20897 width=35) (actual time=3953.944..3953.947 rows=0 loops=1)
        Join Filter: (aa.username = filtered_users_pre.username)
        Buffers: shared hit=7456641
        ->  Index Scan using ai_algo_202107_rank_timeframe_rank_idx on ai_algo_202107 aa  (cost=0.56..1718018.61 rows=321495 width=35) (actual time=0.085..3885.547 rows=313611 loops=1)
              Index Cond: (rank_timeframe = '7d'::"valid-users-timeframe")
              Filter: (_at_timestamp = 1626955200)
              Rows Removed by Filter: 7793096
              Buffers: shared hit=7456533
        ->  Materialize  (cost=14591.56..14672.51 rows=13 width=21) (actual time=0.000..0.000 rows=0 loops=313611)
              Buffers: shared hit=108
              ->  Subquery Scan on filtered_users_pre  (cost=14591.56..14672.44 rows=13 width=21) (actual time=3.543..3.545 rows=0 loops=1)
                    Filter: (filtered_users_pre.rk = 2)
                    Rows Removed by Filter: 2415
                    Buffers: shared hit=108
                    ->  WindowAgg  (cost=14591.56..14638.74 rows=2696 width=29) (actual time=1.996..3.356 rows=2415 loops=1)
                          Buffers: shared hit=108
                          ->  Sort  (cost=14591.56..14598.30 rows=2696 width=21) (actual time=1.990..2.189 rows=2415 loops=1)
                                Sort Key: tag_table_20210722.value
                                Sort Method: quicksort  Memory: 285kB
                                Buffers: shared hit=108
                                ->  Bitmap Heap Scan on tag_table_20210722  (cost=146.24..14437.94 rows=2696 width=21) (actual time=0.612..1.080 rows=2415 loops=1)
                                      Recheck Cond: ((tag)::text = ANY ('{commercial,marketing}'::text[]))
                                      Filter: (_at_timestamp = 1626955200)
                                      Rows Removed by Filter: 2415
                                      Heap Blocks: exact=72
                                      Buffers: shared hit=105
                                      ->  Bitmap Index Scan on tag_table_20210722_tag_idx  (cost=0.00..145.57 rows=5428 width=0) (actual time=0.292..0.292 rows=4830 loops=1)
                                            Index Cond: ((tag)::text = ANY ('{commercial,marketing}'::text[]))
                                            Buffers: shared hit=33
Planning Time: 0.914 ms
Execution Time: 3954.035 ms

And this is the output when filtered_users has at least 1 row (300 ms):

Limit  (cost=14592.13..15870.39 rows=15 width=35) (actual time=15.958..300.759 rows=15 loops=1)
  Buffers: shared hit=11042
  ->  Nested Loop Semi Join  (cost=14592.13..1795382.62 rows=20897 width=35) (actual time=15.957..300.752 rows=15 loops=1)
        Join Filter: (aa.username = filtered_users_pre.username)
        Rows Removed by Join Filter: 1544611
        Buffers: shared hit=11042
        ->  Index Scan using ai_algo_202107_rank_timeframe_rank_idx on ai_algo_202107 aa (cost=0.56..1718018.61 rows=321495 width=35) (actual time=0.075..10.455 rows=645 loops=1)
              Index Cond: (rank_timeframe = '7d'::"valid-users-timeframe")
              Filter: (_at_timestamp = 1626955200)
              Rows Removed by Filter: 16124
              Buffers: shared hit=10937
        ->  Materialize  (cost=14591.56..14672.51 rows=13 width=21) (actual time=0.003..0.174 rows=2395 loops=645)
              Buffers: shared hit=105
              ->  Subquery Scan on filtered_users_pre  (cost=14591.56..14672.44 rows=13 width=21) (actual time=1.895..3.680 rows=2415 loops=1)
                    Filter: (filtered_users_pre.rk = 1)
                    Buffers: shared hit=105
                    ->  WindowAgg  (cost=14591.56..14638.74 rows=2696 width=29) (actual time=1.894..3.334 rows=2415 loops=1)
                          Buffers: shared hit=105
                          ->  Sort  (cost=14591.56..14598.30 rows=2696 width=21) (actual time=1.889..2.102 rows=2415 loops=1)
                                Sort Key: tag_table_20210722.value
                                Sort Method: quicksort  Memory: 285kB
                                Buffers: shared hit=105
                                ->  Bitmap Heap Scan on tag_table_20210722  (cost=146.24..14437.94 rows=2696 width=21) (actual time=0.604..1.046 rows=2415 loops=1)
                                      Recheck Cond: ((tag)::text = ANY ('{commercial,marketing}'::text[]))
                                      Filter: (_at_timestamp = 1626955200)
                                      Rows Removed by Filter: 2415
                                      Heap Blocks: exact=72
                                      Buffers: shared hit=105
                                      ->  Bitmap Index Scan on tag_table_20210722_tag_idx  (cost=0.00..145.57 rows=5428 width=0) (actual time=0.287..0.287 rows=4830 loops=1)
                                            Index Cond: ((tag)::text = ANY ('{commercial,marketing}'::text[]))
                                            Buffers: shared hit=33
Planning Time: 0.310 ms
Execution Time: 300.954 ms

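Best answer

The problem is that if there are no matching filtered_users, PostgreSQL has to scan all of "users-results".ai_algo without ever finding 15 result rows. If the subquery contains elements, it quickly finds 15 matching "users-results".ai_algo rows and can stop processing. The two plans show this: the index scan on ai_algo_202107 returns 313611 rows in the slow case but only 645 rows in the fast case.

There is nothing you can do about that, but you can speed up the scan of "users-results".ai_algo. Currently, you have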

->  Index Scan using ai_algo_202107_rank_timeframe_rank_idx on ai_algo_202107 aa
                              ... (actual time=0.085..3885.547 rows=313611 loops=1)
      Index Cond: (rank_timeframe = '7d'::"valid-users-timeframe")
      Filter: (_at_timestamp = 1626955200)
      Rows Removed by Filter: 7793096
      Buffers: shared hit=7456533

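You can see that the index scan is not as efficient as it could be: it reads 313611 + 7793096 = 8106707 rows from the table and discards all but the 313611 that match the filter condition.

You could do better by creating an index that finds only the result rows directly: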
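The arithmetic can be checked directly in psql. This is a minimal sketch that uses only the two row counts from the plan node above; the column aliases rows_read and pct_kept are made up for illustration.

```sql
-- Rows returned by the index scan plus rows removed by the filter
-- give the total number of rows the scan had to visit.
SELECT 313611 + 7793096 AS rows_read,
       round(100.0 * 313611 / (313611 + 7793096), 1) AS pct_kept;
```

Only about 4 percent of the visited rows survive the _at_timestamp filter, which is why moving that column into the index condition should pay off.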

CREATE INDEX ON "users-results".ai_algo (rank_timeframe, _at_timestamp);

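Then you can drop the index ai_algo_rank_timeframe_rank_idx, because the new index can do everything the old one can (and more).

This answer to "sql - PostgreSQL query is slow if the table in the IN clause is empty" comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/68489129/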
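One possible variation, which is not part of the answer itself, is to append rank as a third column. With equality conditions on the first two columns, a B-tree index can then return the rows already ordered by rank, whereas a two-column index on (rank_timeframe, _at_timestamp) does not keep them in rank order. The sketch below assumes the table and column names from the question; the index name ai_algo_tf_ts_rank_idx is hypothetical, and whether the planner actually chooses the index depends on your data and statistics.

```sql
-- Hypothetical three-column variant of the index suggested above.
-- The index name is an assumption made for this sketch.
CREATE INDEX ai_algo_tf_ts_rank_idx
    ON "users-results".ai_algo (rank_timeframe, _at_timestamp, rank);

-- Re-check the scan of ai_algo on its own. Ideally both conditions now
-- appear under "Index Cond" and "Rows Removed by Filter" disappears.
EXPLAIN (ANALYZE, BUFFERS)
SELECT aa.username, aa.rank, aa.points, aa.version
FROM "users-results".ai_algo aa
WHERE aa._at_timestamp = 1626955200
  AND aa.rank_timeframe = '7d'
ORDER BY aa.rank ASC
LIMIT 15;
```

Before dropping the old index, it is worth checking whether other queries order by rank within a rank_timeframe without filtering on _at_timestamp, since those would still benefit from it.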
