I have a fairly large table (about a million records) with the following columns:
- account: character varying(36) not null
- group: character varying(255) not null
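- classification: character varying(255) not null
- size: integer not null
The account is actually a UUID, but I don't think that matters here. If I run the following simple query, it takes about 16 seconds on my machine: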
select account, "group", classification, max(size)
from mytable
group by account, "group", classification
So far so good. Suppose I add an index:
CREATE INDEX CONCURRENTLY ON mytable (account, "group", classification);
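If I run the same query again, it now returns in under half a second. EXPLAIN on the query also clearly shows that the index is used.
However, if I rewrite the query as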
select account, "group", classification, max(size)
from mytable
group by account, classification, "group"
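it takes 16 seconds again and the index is no longer used. It seems to me that the order of the grouping criteria should not matter, but I am not an expert. Any idea why PostgreSQL cannot (or does not) optimize the latter query? I tried this on PostgreSQL 9.4.
Edit: as requested, here is the EXPLAIN output. For the call that uses the index: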
Group  (cost=0.55..133878.11 rows=95152 width=76) (actual time=0.090..660.739 rows=807 loops=1)
  Group Key: group_id, classification_id, account_id
  ->  Index Only Scan using mytable_group_id_classification_id_account_id_idx on mytable  (cost=0.55..126741.72 rows=951518 width=76) (actual time=0.088..534.645 rows=951518 loops=1)
        Heap Fetches: 951518
Planning time: 0.106 ms
Execution time: 660.852 ms
For the call with the changed order of the GROUP BY criteria:
Group  (cost=162327.31..171842.49 rows=95152 width=76) (actual time=11114.130..13938.487 rows=807 loops=1)
  Group Key: group_id, account_id, classification_id
  ->  Sort  (cost=162327.31..164706.10 rows=951518 width=76) (actual time=11114.127..13775.235 rows=951518 loops=1)
        Sort Key: group_id, account_id, classification_id
        Sort Method: external merge  Disk: 81136kB
        ->  Seq Scan on mytable  (cost=0.00..25562.18 rows=951518 width=76) (actual time=0.009..192.259 rows=951518 loops=1)
Planning time: 0.111 ms
Execution time: 13948.380 ms
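Best answer
Actually, the order of the columns in the GROUP BY clause does affect the result: not which groups come back, but the order in which they come back. With a sort-based grouping plan, the rows come out sorted by the GROUP BY columns in the order written (SQL does not guarantee this without an ORDER BY, but it is what the plans below do). If you set your own ORDER BY, the results and the index usage will be the same.
Demo: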
CREATE TABLE coconuts (
mass int,
volume int,
loveliness int
);
INSERT INTO coconuts (mass, volume, loveliness)
SELECT (random() * 5)::int
, (random() * 5)::int
, (random() * 1000 + 9000)::int
FROM GENERATE_SERIES(1,10000000);
Note how the order of the columns in GROUP BY affects the order of the rows:
SELECT mass, volume, max(loveliness)
FROM coconuts
GROUP BY mass, volume;
 mass | volume |  max
------+--------+-------
    0 |      0 | 10000
    0 |      1 | 10000
    0 |      2 | 10000
...
SELECT mass, volume, max(loveliness)
FROM coconuts
GROUP BY volume, mass;
 mass | volume |  max
------+--------+-------
    0 |      0 | 10000
    1 |      0 | 10000
    2 |      0 | 10000
...
And how it affects the query plan:
CREATE INDEX ON coconuts (mass, volume);
SET enable_seqscan=false; --To force the index if possible
EXPLAIN
SELECT mass, volume, max(loveliness)
FROM coconuts
GROUP BY (mass, volume);
                                   QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=1000.46..460459.11 rows=40000 width=12)
   Group Key: mass, volume
   ->  Gather Merge  (cost=1000.46..459459.11 rows=80000 width=12)
         Workers Planned: 2
         ->  Partial GroupAggregate  (cost=0.43..449225.10 rows=40000 width=12)
               Group Key: mass, volume
               ->  Parallel Index Scan using coconuts_mass_volume_idx on coconuts  (cost=0.43..417575.10 rows=4166667 width=12)
(7 rows)
EXPLAIN
SELECT mass, volume, max(loveliness)
FROM coconuts
GROUP BY (volume, mass);
                                   QUERY PLAN
------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=10001658532.83..10001758932.83 rows=40000 width=12)
   Group Key: volume, mass
   ->  Sort  (cost=10001658532.83..10001683532.83 rows=10000000 width=12)
         Sort Key: volume, mass
         ->  Seq Scan on coconuts  (cost=10000000000.00..10000154055.00 rows=10000000 width=12)
(5 rows)
However, if you add an ORDER BY that matches the original GROUP BY, the original query plan comes back, at least in Postgres 11.5.
EXPLAIN
SELECT mass, volume, max(loveliness)
FROM coconuts
GROUP BY volume, mass
ORDER BY mass, volume;
                                   QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=1000.46..460459.11 rows=40000 width=12)
   Group Key: mass, volume
   ->  Gather Merge  (cost=1000.46..459459.11 rows=80000 width=12)
         Workers Planned: 2
         ->  Partial GroupAggregate  (cost=0.43..449225.10 rows=40000 width=12)
               Group Key: mass, volume
               ->  Parallel Index Scan using coconuts_mass_volume_idx on coconuts  (cost=0.43..417575.10 rows=4166667 width=12)
(7 rows)
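Applied to the table from the question, the same idea might look like the sketch below. It assumes the simplified names from the question (table mytable, columns account, group, classification, size) and an index on (account, "group", classification). Since group is a reserved word, it is double-quoted here. Whether the planner actually goes back to the index depends on the PostgreSQL version and the table statistics, so treat this as something to check with EXPLAIN rather than a guaranteed outcome.

```sql
-- Sketch only: table and column names are the simplified ones from the question.
EXPLAIN
SELECT account, "group", classification, max(size)
FROM mytable
GROUP BY account, classification, "group"    -- same permuted order as the slow query
ORDER BY account, "group", classification;   -- matches the column order of the index
```

If the plan shows an index scan feeding the grouping node instead of a Sort over a Seq Scan, the ORDER BY has let the planner line the grouping keys up with the index again.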
Source: the Stack Overflow question "Why does the order of columns in an index matter for GROUP BY in PostgreSQL?": https://stackoverflow.com/questions/41304688/