I have a very simple table:
CREATE TABLE approved_posts (
project_id INTEGER,
feed_id INTEGER,
post_id INTEGER,
approved_time TIMESTAMP NOT NULL,
post_time TIMESTAMP NOT NULL,
PRIMARY KEY (project_id, feed_id, post_id)
)
I'm trying to optimize this query:
SELECT *
FROM approved_posts
WHERE feed_id IN (?, ?, ?)
AND project_id = ?
ORDER BY approved_time DESC, post_time DESC
LIMIT 1;
The query planner is fetching every approved_post that matches the predicates, sorting all 100k results, and returning the single top row it finds.
I do have an index on (project_id, feed_id, approved_time, post_time), and the planner will use it if I do either of the following:
A. drop the sort on post_time, or
B. replace IN (?, ?, ?) with a single = ?.
Then it simply does a backward index scan to get the first result, and it's very fast.
Option A:
Limit (cost=0.43..6.57 rows=1 width=24) (actual time=0.101..0.101 rows=1 loops=1)
-> Index Scan Backward using approved_posts_approved_time_idx on approved_posts p (cost=0.43..840483.02 rows=136940 width=24) (actual time=0.100..0.100 rows=1 loops=1)
Filter: (feed_id = ANY ('{73321,73771,73772,73773,73774}'::integer[]))
Rows Removed by Filter: 37
Total runtime: 0.129 ms
Option B:
Limit (cost=0.43..3.31 rows=1 width=24) (actual time=0.065..0.065 rows=1 loops=1)
-> Index Scan Backward using approved_posts_full_pagination_index on approved_posts p (cost=0.43..126884.70 rows=44049 width=24) (actual time=0.063..0.063 rows=1 loops=1)
Index Cond: ((project_id = 148772) AND (feed_id = 73321))
Total runtime: 0.092 ms
But without either of those tweaks, it doesn't perform nearly as well...
Limit (cost=169792.16..169792.17 rows=1 width=24) (actual time=510.225..510.225 rows=1 loops=1)
-> Sort (cost=169792.16..170118.06 rows=130357 width=24) (actual time=510.224..510.224 rows=1 loops=1)
Sort Key: approved_time, post_time
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on approved_posts p (cost=12324.41..169140.38 rows=130357 width=24) (actual time=362.210..469.387 rows=126260 loops=1)
Recheck Cond: (feed_id = ANY ('{73321,73771,73772,73773,73774}'::integer[]))
-> Bitmap Index Scan on approved_posts_feed_id_idx (cost=0.00..12291.82 rows=130357 width=0) (actual time=354.496..354.496 rows=126260 loops=1)
Index Cond: (feed_id = ANY ('{73321,73771,73772,73773,73774}'::integer[]))
Total runtime: 510.265 ms
I can even add a partial index on those 5 feed IDs and it does the right thing again.
My current best workaround is to put each feed_id in its own query and do a massive UNION between them all. But that doesn't scale well: I might want to select the top 500 across 30 feeds, pulling in 15k rows and sorting them for no reason. Managing offsets with that strategy is also somewhat complicated.
Does anyone know how I can keep this IN clause together with both sort keys on my well-indexed data and get Postgres to do the right thing?
I'm on Postgres 9.3.3. Here are my indexes:
"approved_posts_project_id_feed_id_post_id_key" UNIQUE CONSTRAINT, btree (project_id, feed_id, post_id)
"approved_posts_approved_time_idx" btree (approved_time)
"approved_posts_feed_id_idx" btree (feed_id)
"approved_posts_full_pagination_index" btree (project_id, feed_id, approved_time, post_time)
"approved_posts_post_id_idx" btree (post_id)
"approved_posts_post_time_idx" btree (post_time)
"approved_posts_project_id_idx" btree (project_id)
None of the columns are nullable.
The table has 2 million rows, spread across 200 feed IDs and 19 project IDs.
These are the most common feed IDs:
feed_id | count
---------+--------
73607 | 558860
73837 | 354018
73832 | 220285
73836 | 172664
73321 | 118695
73819 | 95999
73821 | 75871
73056 | 65779
73070 | 54655
73827 | 43710
73079 | 36700
73574 | 36111
73055 | 25682
73072 | 22596
73589 | 19856
73953 | 15286
73159 | 13059
73839 | 8925
In terms of min/max/avg cardinality per feed_id/project_id pairing, we have:
 min |  max   |          avg
-----+--------+-----------------------
   1 | 559021 | 9427.9140271493212670
Best answer
With a list of possible values for feed_id, Postgres has a hard time finding the best query plan. Each feed_id can be associated with anywhere from 1 to 559021 rows (per your numbers). Postgres is not currently smart enough to see the potential optimization for the LIMIT 1 special case on its own. A UNION ALL (not just UNION) of multiple queries, one per feed_id, each with its own LIMIT 1, plus another outer LIMIT 1 (like what you seem to have tried) shows potential, but requires complicated query concatenation for a variable number of input values.
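Such a UNION ALL query might look like the following sketch for two feed IDs (the placeholders are illustrative; each leg needs its own parentheses so it can carry its own ORDER BY and LIMIT):

```sql
-- One LIMIT 1 subquery per feed_id, combined with UNION ALL,
-- plus an outer LIMIT 1. Has to be rebuilt for each set of inputs.
(SELECT * FROM approved_posts
 WHERE  project_id = ? AND feed_id = ?
 ORDER  BY approved_time DESC, post_time DESC
 LIMIT  1)
UNION ALL
(SELECT * FROM approved_posts
 WHERE  project_id = ? AND feed_id = ?
 ORDER  BY approved_time DESC, post_time DESC
 LIMIT  1)
ORDER BY approved_time DESC, post_time DESC
LIMIT 1;
```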
There is another way to convince the query planner that it can use index scans to pick the first row for each feed_id from the index: rewrite the query with a LATERAL join:
SELECT a.*
FROM (VALUES (?), (?), (?)) AS t(feed_id)
, LATERAL (
SELECT *
FROM approved_posts
WHERE project_id = ?
AND feed_id = t.feed_id
ORDER BY approved_time DESC, post_time DESC
LIMIT 1
) a
ORDER BY approved_time DESC, post_time DESC
LIMIT 1;
Or, more conveniently for a variable number of feed_id values:
SELECT a.*
FROM unnest(?) AS t(feed_id) -- provide int[] var
, LATERAL ( ...
Pass an integer array for the variable, like '{123, 234, 345}'::int[]. This can also be implemented elegantly with a function taking a VARIADIC parameter, which lets you pass a plain list of integer values.
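A sketch of such a VARIADIC function, assuming the function name latest_approved_post and a leading project-id parameter (both illustrative, not from the original answer):

```sql
CREATE OR REPLACE FUNCTION latest_approved_post(_project_id int,
                                                VARIADIC _feed_ids int[])
  RETURNS SETOF approved_posts
  LANGUAGE sql STABLE AS
$func$
SELECT a.*
FROM   unnest(_feed_ids) AS t(feed_id)
     , LATERAL (
   SELECT *
   FROM   approved_posts
   WHERE  project_id = _project_id
   AND    feed_id = t.feed_id
   ORDER  BY approved_time DESC, post_time DESC
   LIMIT  1
   ) a
ORDER  BY approved_time DESC, post_time DESC
LIMIT  1
$func$;

-- Call with a plain list of integer values:
SELECT * FROM latest_approved_post(148772, 73321, 73771, 73772);
```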
The index on (project_id, feed_id, approved_time, post_time) works for this, because Postgres can scan indexes backward almost as fast as forward, but (project_id, feed_id, approved_time DESC, post_time DESC) would be even better.
If you don't need all columns of the table returned, even index-only scans are an option.
Your columns approved_time and post_time are defined NOT NULL, so this works as written; otherwise you would have to account for NULLs in the sort order (e.g. with NULLS LAST).
Why does your option A work?
A closer look reveals two things:
-> Index Scan Backward using approved_posts_approved_time_idx on approved_posts p (cost=0.43..840483.02 rows=136940 width=24) (actual time=0.100..0.100 rows=1 loops=1)
      Filter: (feed_id = ANY ('{73321,73771,73772,73773,73774}'::integer[]))
- A different, smaller index on just (approved_time) is used.
- There is no index condition on feed_id (which would not be possible in this case), but a Filter.
Postgres chooses a completely different strategy: it reads rows from this index bottom-up (Index Scan Backward
) until it finds a row matching one of your given values for feed_id
. Since you only have very few projects and feeds (200 feed IDs and 19 project IDs
), chances are it won't have to discard too many rows before the first match - which is the result. This actually gets faster with more values for feed_id
, because the "latest" row is found earlier - unlike my first approach which is faster for fewer values.
A promising alternative strategy! Depending on data distribution and the feeds in your query it may be faster than my first solution - enable it with this index:
"approved_posts_foo_idx" btree (project_id, approved_time DESC, post_time DESC)
It may pay to selectively increase the statistics targets for the columns project_id and feed_id, so the tipping point between the two strategies can be estimated more accurately.
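For example (the value 1000 is just an illustration; the default statistics target in 9.3 is 100):

```sql
ALTER TABLE approved_posts ALTER COLUMN project_id SET STATISTICS 1000;
ALTER TABLE approved_posts ALTER COLUMN feed_id    SET STATISTICS 1000;
ANALYZE approved_posts;  -- rebuild statistics with the new targets
```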
Since some of your projects have only old rows (as per comment), you can help the query by hinting at a maximum approved_time (and post_time, though that probably doesn't add much), if you know the maximum per project (and/or per feed_id), or at least an upper bound:
SELECT ...
WHERE  ...
AND    approved_time <= $upper_bound
This question, "sql - PostgreSQL not using index on a filtered multi-sort query", was found as a similar question on Stack Overflow: https://stackoverflow.com/questions/30987716/