I have a very simple table:
CREATE TABLE approved_posts (
project_id INTEGER,
feed_id INTEGER,
post_id INTEGER,
approved_time TIMESTAMP NOT NULL,
post_time TIMESTAMP NOT NULL,
PRIMARY KEY (project_id, feed_id, post_id)
)
I'm trying to optimize this query:
SELECT *
FROM approved_posts
WHERE feed_id IN (?, ?, ?)
AND project_id = ?
ORDER BY approved_time DESC, post_time DESC
LIMIT 1;
The query planner is fetching every approved_post that matches the predicates, sorting all 100k results, and returning the single top row it finds.
I do have an index on (project_id, feed_id, approved_time, post_time), and the planner will use it if I do either of the following:
A. drop the sort on post_time, or
B. replace IN (?, ?, ?) with a single = ?.
Then it simply does a backward index scan to get the first result, and it's very fast.
Option A:
Limit (cost=0.43..6.57 rows=1 width=24) (actual time=0.101..0.101 rows=1 loops=1)
-> Index Scan Backward using approved_posts_approved_time_idx on approved_posts p (cost=0.43..840483.02 rows=136940 width=24) (actual time=0.100..0.100 rows=1 loops=1)
Filter: (feed_id = ANY ('{73321,73771,73772,73773,73774}'::integer[]))
Rows Removed by Filter: 37
Total runtime: 0.129 ms
Option B:
Limit (cost=0.43..3.31 rows=1 width=24) (actual time=0.065..0.065 rows=1 loops=1)
-> Index Scan Backward using approved_posts_full_pagination_index on approved_posts p (cost=0.43..126884.70 rows=44049 width=24) (actual time=0.063..0.063 rows=1 loops=1)
Index Cond: ((project_id = 148772) AND (feed_id = 73321))
Total runtime: 0.092 ms
But without either of those tweaks, it doesn't perform nearly as well...
Limit (cost=169792.16..169792.17 rows=1 width=24) (actual time=510.225..510.225 rows=1 loops=1)
-> Sort (cost=169792.16..170118.06 rows=130357 width=24) (actual time=510.224..510.224 rows=1 loops=1)
Sort Key: approved_time, post_time
Sort Method: top-N heapsort Memory: 25kB
-> Bitmap Heap Scan on approved_posts p (cost=12324.41..169140.38 rows=130357 width=24) (actual time=362.210..469.387 rows=126260 loops=1)
Recheck Cond: (feed_id = ANY ('{73321,73771,73772,73773,73774}'::integer[]))
-> Bitmap Index Scan on approved_posts_feed_id_idx (cost=0.00..12291.82 rows=130357 width=0) (actual time=354.496..354.496 rows=126260 loops=1)
Index Cond: (feed_id = ANY ('{73321,73771,73772,73773,73774}'::integer[]))
Total runtime: 510.265 ms
I can even add a partial index on those 5 feed IDs and it does the right thing again.
My current best workaround is to put each feed_id in its own query and do a massive UNION between them all. But that doesn't scale well: I might want to select the top 500 across 30 feeds, pulling in 15k rows and sorting them for no reason. Managing offsets with that strategy is also somewhat complicated.
Does anyone know how I can keep this IN clause together with both sort keys on my well-indexed data and get Postgres to do the right thing?
I'm on Postgres 9.3.3. Here are my indexes:
"approved_posts_project_id_feed_id_post_id_key" UNIQUE CONSTRAINT, btree (project_id, feed_id, post_id)
"approved_posts_approved_time_idx" btree (approved_time)
"approved_posts_feed_id_idx" btree (feed_id)
"approved_posts_full_pagination_index" btree (project_id, feed_id, approved_time, post_time)
"approved_posts_post_id_idx" btree (post_id)
"approved_posts_post_time_idx" btree (post_time)
"approved_posts_project_id_idx" btree (project_id)
None of the columns are nullable.
The table has 2 million rows, spread across 200 feed IDs and 19 project IDs.
These are the most common feed IDs:
feed_id | count
---------+--------
73607 | 558860
73837 | 354018
73832 | 220285
73836 | 172664
73321 | 118695
73819 | 95999
73821 | 75871
73056 | 65779
73070 | 54655
73827 | 43710
73079 | 36700
73574 | 36111
73055 | 25682
73072 | 22596
73589 | 19856
73953 | 15286
73159 | 13059
73839 | 8925
In terms of min/max/avg cardinality per feed_id/project_id pairing, we have:
 min |  max   |          avg
-----+--------+-----------------------
   1 | 559021 | 9427.9140271493212670
Best answer
With a list of possible values for feed_id, Postgres has a hard time finding the best query plan. Each feed_id can be associated with anywhere from 1 to 559021 rows (per your numbers). Postgres is not currently smart enough to see the potential optimization for the LIMIT 1 special case on its own. A UNION ALL (not just UNION) of multiple queries, one per feed_id, each with its own LIMIT 1, plus another outer LIMIT 1 (like what you seem to have tried) shows potential, but requires complicated query concatenation for a variable number of input values.
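Such a UNION ALL query might look like the following sketch for two feed IDs (the placeholders are illustrative; each leg needs its own parentheses so it can carry its own ORDER BY and LIMIT):

```sql
-- One LIMIT 1 subquery per feed_id, combined with UNION ALL,
-- plus an outer LIMIT 1. Has to be rebuilt for each set of inputs.
(SELECT * FROM approved_posts
 WHERE  project_id = ? AND feed_id = ?
 ORDER  BY approved_time DESC, post_time DESC
 LIMIT  1)
UNION ALL
(SELECT * FROM approved_posts
 WHERE  project_id = ? AND feed_id = ?
 ORDER  BY approved_time DESC, post_time DESC
 LIMIT  1)
ORDER BY approved_time DESC, post_time DESC
LIMIT 1;
```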
There is another way to convince the query planner that it can use index scans to pick the first row for each feed_id from the index: rewrite the query with a LATERAL join:
SELECT a.*
FROM (VALUES (?), (?), (?)) AS t(feed_id)
, LATERAL (
SELECT *
FROM approved_posts
WHERE project_id = ?
AND feed_id = t.feed_id
ORDER BY approved_time DESC, post_time DESC
LIMIT 1
) a
ORDER BY approved_time DESC, post_time DESC
LIMIT 1;
Or, more conveniently for a variable number of feed_id values:
SELECT a.*
FROM unnest(?) AS t(feed_id) -- provide int[] var
, LATERAL ( ...
Pass an integer array for the variable, like '{123, 234, 345}'::int[]. This can also be implemented elegantly with a function taking a VARIADIC parameter, which lets you pass a plain list of integer values.
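A sketch of such a VARIADIC function, assuming the function name latest_approved_post and a leading project-id parameter (both illustrative, not from the original answer):

```sql
CREATE OR REPLACE FUNCTION latest_approved_post(_project_id int,
                                                VARIADIC _feed_ids int[])
  RETURNS SETOF approved_posts
  LANGUAGE sql STABLE AS
$func$
SELECT a.*
FROM   unnest(_feed_ids) AS t(feed_id)
     , LATERAL (
   SELECT *
   FROM   approved_posts
   WHERE  project_id = _project_id
   AND    feed_id = t.feed_id
   ORDER  BY approved_time DESC, post_time DESC
   LIMIT  1
   ) a
ORDER  BY approved_time DESC, post_time DESC
LIMIT  1
$func$;

-- Call with a plain list of integer values:
SELECT * FROM latest_approved_post(148772, 73321, 73771, 73772);
```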
The index on (project_id, feed_id, approved_time, post_time) works for this, because Postgres can scan indexes backward almost as fast as forward, but (project_id, feed_id, approved_time DESC, post_time DESC) would be even better.
If you don't need all columns of the table returned, even index-only scans are an option.
Your columns approved_time and post_time are defined NOT NULL, so this works as written; otherwise you would have to account for NULLs in the sort order (e.g. with NULLS LAST).
Why does your option A work?
A closer look reveals two things:
-> Index Scan Backward using approved_posts_approved_time_idx on approved_posts p (cost=0.43..840483.02 rows=136940 width=24) (actual time=0.100..0.100 rows=1 loops=1)
      Filter: (feed_id = ANY ('{73321,73771,73772,73773,73774}'::integer[]))
- A different, smaller index on just (approved_time) is used.
- There is no index condition on feed_id (which would not be possible in this case), but a Filter.
Postgres chooses a completely different strategy: it reads rows from this index bottom-up (Index Scan Backward
) until it finds a row matching one of your given values for feed_id
. Since you only have very few projects and feeds (200 feed IDs and 19 project IDs
), chances are it won't have to discard too many rows before the first match - which is the result. This actually gets faster with more values for feed_id
, because the "latest" row is found earlier - unlike my first approach which is faster for fewer values.
A promising alternative strategy! Depending on data distribution and the feeds in your query it may be faster than my first solution - enable it with this index:
"approved_posts_foo_idx" btree (project_id, approved_time DESC, post_time DESC)
It may pay to selectively increase the statistics targets for the columns project_id and feed_id, so the tipping point between the two strategies can be estimated more accurately.
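For example (the value 1000 is just an illustration; the default statistics target in 9.3 is 100):

```sql
ALTER TABLE approved_posts ALTER COLUMN project_id SET STATISTICS 1000;
ALTER TABLE approved_posts ALTER COLUMN feed_id    SET STATISTICS 1000;
ANALYZE approved_posts;  -- rebuild statistics with the new targets
```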
Since some of your projects have only old rows (as per comment), you can help the query by hinting at a maximum approved_time (and post_time, though that probably doesn't add much), if you know the maximum per project (and/or per feed_id), or at least an upper bound:
SELECT ...
WHERE  ...
AND    approved_time <= $upper_bound
This question, "sql - PostgreSQL not using index on a filtered multi-sort query", was found as a similar question on Stack Overflow: https://stackoverflow.com/questions/30987716/