postgresql - Postgres choosing a suboptimal query plan in production

Tags: postgresql indexing

I have a query with ORDER BY and LIMIT to support a paginated interface:

SELECT segment_members.id AS t0_r0,
       segment_members.segment_id AS t0_r1,
       segment_members.account_id AS t0_r2,
       segment_members.score AS t0_r3,
       segment_members.created_at AS t0_r4,
       segment_members.updated_at AS t0_r5,
       segment_members.posts_count AS t0_r6,
       accounts.id AS t1_r0,
       accounts.platform AS t1_r1,
       accounts.username AS t1_r2,
       accounts.created_at AS t1_r3,
       accounts.updated_at AS t1_r4,
       accounts.remote_id AS t1_r5,
       accounts.name AS t1_r6,
       accounts.language AS t1_r7,
       accounts.description AS t1_r8,
       accounts.timezone AS t1_r9,
       accounts.profile_image_url AS t1_r10,
       accounts.post_count AS t1_r11,
       accounts.follower_count AS t1_r12,
       accounts.following_count AS t1_r13,
       accounts.uri AS t1_r14,
       accounts.location AS t1_r15,
       accounts.favorite_count AS t1_r16,
       accounts.raw AS t1_r17,
       accounts.followers_completed_at AS t1_r18,
       accounts.followings_completed_at AS t1_r19,
       accounts.followers_started_at AS t1_r20,
       accounts.followings_started_at AS t1_r21,
       accounts.profile_fetched_at AS t1_r22,
       accounts.managed_source_id AS t1_r23
FROM segment_members
INNER JOIN accounts ON accounts.id = segment_members.account_id
WHERE segment_members.segment_id = 1
ORDER BY accounts.follower_count ASC
LIMIT 20 OFFSET 0;

Here are the indexes on the tables:

accounts
"accounts_pkey" PRIMARY KEY, btree (id)
"index_accounts_on_remote_id_and_platform" UNIQUE, btree (remote_id, platform)
"index_accounts_on_description" btree (description)
"index_accounts_on_favorite_count" btree (favorite_count)
"index_accounts_on_follower_count" btree (follower_count)
"index_accounts_on_following_count" btree (following_count)
"index_accounts_on_lower_username_and_platform" btree (lower(username::text), platform)
"index_accounts_on_post_count" btree (post_count)
"index_accounts_on_profile_fetched_at_and_platform" btree (profile_fetched_at, platform)
"index_accounts_on_username" btree (username)

segment_members
"segment_members_pkey" PRIMARY KEY, btree (id)
"index_segment_members_on_segment_id_and_account_id" UNIQUE, btree (segment_id, account_id)
"index_segment_members_on_account_id" btree (account_id)
"index_segment_members_on_segment_id" btree (segment_id)

In my development and staging databases, the query plan looks like this, and the query executes very quickly.

 Limit  (cost=4802.15..4802.20 rows=20 width=2086)
   ->  Sort  (cost=4802.15..4803.20 rows=421 width=2086)
         Sort Key: accounts.follower_count
         ->  Nested Loop  (cost=20.12..4790.95 rows=421 width=2086)
               ->  Bitmap Heap Scan on segment_members  (cost=19.69..1244.24 rows=421 width=38)
                     Recheck Cond: (segment_id = 1)
                     ->  Bitmap Index Scan on index_segment_members_on_segment_id_and_account_id  (cost=0.00..19.58 rows=421 width=0)
                           Index Cond: (segment_id = 1)
               ->  Index Scan using accounts_pkey on accounts  (cost=0.43..8.41 rows=1 width=2048)
                     Index Cond: (id = segment_members.account_id)

In production, however, the query plan looks like this, and the query runs for minutes until it hits the statement timeout.

 Limit  (cost=0.86..25120.72 rows=20 width=2130)
   ->  Nested Loop  (cost=0.86..4614518.64 rows=3674 width=2130)
         ->  Index Scan using index_accounts_on_follower_count on accounts  (cost=0.43..2779897.53 rows=3434917 width=2092)
         ->  Index Scan using index_segment_members_on_segment_id_and_account_id on segment_members  (cost=0.43..0.52 rows=1 width=38)
               Index Cond: ((segment_id = 1) AND (account_id = accounts.id))

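accounts has about 6 million rows in staging and 3 million in production. segment_members has about 300k rows in staging and 4 million in production. Is the difference in table sizes causing the difference in query plan selection? Is there any way to get Postgres to use the faster query plan in production?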

Update: Here is the EXPLAIN ANALYZE output from the slow production server:

 Limit  (cost=0.86..22525.66 rows=20 width=2127) (actual time=173.148..187568.247 rows=20 loops=1)
   ->  Nested Loop  (cost=0.86..4654749.92 rows=4133 width=2127) (actual time=173.141..187568.193 rows=20 loops=1)
         ->  Index Scan using index_accounts_on_follower_count on accounts  (cost=0.43..2839731.81 rows=3390197 width=2089) (actual time=0.110..180374.279 rows=1401278 loops=1)
         ->  Index Scan using index_segment_members_on_segment_id_and_account_id on segment_members  (cost=0.43..0.53 rows=1 width=38) (actual time=0.003..0.003 rows=0 loops=1401278)
               Index Cond: ((segment_id = 1) AND (account_id = accounts.id))
 Total runtime: 187568.318 ms
(6 rows)

Best answer

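Either your table statistics are not up to date, or the two queries you present are very different. The second plan estimates that 3.5 million rows will be retrieved (rows=3434917). ORDER BY / LIMIT 20 is forced to sort all 3.5 million rows to find the top 20, which is going to be extremely expensive, unless you have a matching index.
The first query plan expects to sort 421 rows. Not even close. Different query plans are no surprise.
It would be interesting to see the output of EXPLAIN ANALYZE, not just EXPLAIN. (Expensive for the second query!)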
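
One way to check whether the planner's estimate is off is to compare it with an actual count. This is a minimal sketch, assuming segment 1 is the segment in question. The join estimates in the plans above (421 rows in staging, 3674 in production) are roughly the planner's guess at the number of members in that segment.

```sql
-- Actual number of members in the segment.
SELECT count(*) AS actual_members
FROM   segment_members
WHERE  segment_id = 1;

-- Planner's estimate for the same predicate (see "rows=" in the output).
-- Plain EXPLAIN does not execute the query.
EXPLAIN
SELECT *
FROM   segment_members
WHERE  segment_id = 1;
```

If the actual count differs from the estimate by orders of magnitude, stale or too-coarse statistics are a likely cause.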

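A lot depends on how many account_id values there are per segment_id. If segment_id is not selective, the query cannot be fast. Your only other option would be a MATERIALIZED VIEW holding the top n rows per segment_id, with an appropriate regime to keep it up to date.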
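
The materialized view idea could look roughly like the following. This is only a sketch: the view name segment_top_accounts, the index name, and the cutoff of 100 rows per segment are assumptions for illustration, and materialized views require Postgres 9.3 or later.

```sql
-- Hypothetical materialized view: the 100 accounts with the lowest
-- follower_count per segment (name and cutoff are assumptions).
CREATE MATERIALIZED VIEW segment_top_accounts AS
SELECT segment_id, account_id, follower_count
FROM  (
   SELECT sm.segment_id,
          sm.account_id,
          a.follower_count,
          row_number() OVER (PARTITION BY sm.segment_id
                             ORDER BY a.follower_count) AS rn
   FROM   segment_members sm
   JOIN   accounts a ON a.id = sm.account_id
   ) ranked
WHERE  rn <= 100;

-- Index so that one segment's rows can be read in sorted order.
CREATE INDEX segment_top_accounts_idx
ON segment_top_accounts (segment_id, follower_count);

-- Re-run periodically (for example from a scheduled job) to keep it current.
REFRESH MATERIALIZED VIEW segment_top_accounts;

-- The first pages can then be served from the small precomputed set.
SELECT account_id, follower_count
FROM   segment_top_accounts
WHERE  segment_id = 1
ORDER  BY follower_count
LIMIT  20;
```

Pages beyond the stored top n rows would still have to fall back to the original query, so the cutoff should match how deep users actually paginate.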

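If your statistics are not up to date, just run ANALYZE on both tables and retry.
It may also help to increase the statistics target for selected columns: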

ALTER TABLE segment_members ALTER segment_id SET STATISTICS 1000;
ALTER TABLE segment_members ALTER account_id SET STATISTICS 1000;

ALTER TABLE accounts ALTER id             SET STATISTICS 1000;
ALTER TABLE accounts ALTER follower_count SET STATISTICS 1000;

ANALYZE segment_members(segment_id, account_id);
ANALYZE accounts (id, follower_count);
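
To verify that the statistics are actually current and to see what the planner believes about segment_id, the system views can be queried. A minimal sketch using the standard pg_stat_user_tables and pg_stats views:

```sql
-- When were the tables last analyzed, manually or by autovacuum?
SELECT relname, last_analyze, last_autoanalyze, n_live_tup
FROM   pg_stat_user_tables
WHERE  relname IN ('segment_members', 'accounts');

-- What the planner has recorded about the distribution of segment_id.
SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM   pg_stats
WHERE  tablename = 'segment_members'
AND    attname   = 'segment_id';
```

If segment 1 appears in most_common_vals, its frequency multiplied by the table's row count gives approximately the row estimate the planner uses for segment_id = 1.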

Better indexes

In addition to your existing UNIQUE constraint index_segment_members_on_segment_id_and_account_id on segment_members, I suggest a multicolumn index on accounts:

CREATE INDEX index_accounts_on_id_and_follower_count ON accounts (id, follower_count);

Again, run ANALYZE after creating the index.

Some indexes useless?

All the other indexes in your question are irrelevant to this query. They may or may not be useful for other purposes.

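This index is 100% dead freight, so drop it. The multicolumn index on (segment_id, account_id) already serves lookups on segment_id alone, because segment_id is its leading column.

"index_segment_members_on_segment_id" btree (segment_id)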

This one may be useless:

"index_accounts_on_description" btree (description)

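A "description" is typically free text that is hardly ever used to sort rows or in a WHERE condition with a suitable operator. But that is just an educated guess.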
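
Before dropping anything, index usage can be checked in the statistics views. A minimal sketch, assuming the usage counters cover a representative period of production traffic:

```sql
-- Scans per index since the statistics were last reset, with on-disk size.
-- idx_scan = 0 over a representative period suggests the index is unused.
SELECT indexrelname,
       idx_scan,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM   pg_stat_user_indexes
WHERE  relname IN ('segment_members', 'accounts')
ORDER  BY idx_scan;

-- Drop the redundant single-column index.
DROP INDEX index_segment_members_on_segment_id;
```

Note that the single-column index on segment_id may still show scans, since the planner can pick it while it exists. It is redundant all the same, because the multicolumn index can answer the same lookups.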

Regarding "postgresql - Postgres choosing a suboptimal query plan in production", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/25899827/
