postgresql - Why does the order of columns in an index matter for grouping in PostgreSQL?

Tags: postgresql indexing group-by

I have a fairly large table (about one million records) with the following columns:

  • account: character varying(36) not null
  • group: character varying(255) not null
  • classification: character varying(255) not null
  • size: integer not null

The account is actually a UUID, but I don't think that matters here. If I run the following simple query, it takes about 16 seconds on my machine:

-- "group" is a reserved word in PostgreSQL, so it is double-quoted here
select account, "group", classification, max(size)
from mytable
group by account, "group", classification;

So far, so good. Suppose I add an index:

CREATE INDEX CONCURRENTLY ON mytable (account, "group", classification);

If I run the same query again, it now returns in under half a second. Running EXPLAIN on the query also clearly shows that the index is used.

However, if I rewrite the query as

select account, "group", classification, max(size)
from mytable
group by account, classification, "group";

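it takes 16 seconds again and the index is no longer used. To my mind the order of the grouping criteria shouldn't matter, but I'm not an expert. Any idea why PostgreSQL cannot (or does not) optimize the latter query? I tried this on PostgreSQL 9.4.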

Edit: As requested, here is the EXPLAIN output. For the call that uses the index:

Group  (cost=0.55..133878.11 rows=95152 width=76) (actual time=0.090..660.739 rows=807 loops=1)
  Group Key: group_id, classification_id, account_id
  ->  Index Only Scan using mytable_group_id_classification_id_account_id_idx on mytable  (cost=0.55..126741.72 rows=951518 width=76) (actual time=0.088..534.645 rows=951518 loops=1)
        Heap Fetches: 951518
Planning time: 0.106 ms
Execution time: 660.852 ms

For the call with the order of the GROUP BY criteria changed:

Group  (cost=162327.31..171842.49 rows=95152 width=76) (actual time=11114.130..13938.487 rows=807 loops=1)
  Group Key: group_id, account_id, classification_id
  ->  Sort  (cost=162327.31..164706.10 rows=951518 width=76) (actual time=11114.127..13775.235 rows=951518 loops=1)
        Sort Key: group_id, account_id, classification_id
        Sort Method: external merge  Disk: 81136kB
        ->  Seq Scan on mytable  (cost=0.00..25562.18 rows=951518 width=76) (actual time=0.009..192.259 rows=951518 loops=1)
Planning time: 0.111 ms
Execution time: 13948.380 ms
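
Output in this shape, with both cost estimates and actual times, is what EXPLAIN ANALYZE prints. A minimal sketch of how the two plans could be reproduced follows. It assumes the simplified column names used in the queries (the plans themselves show the real ones, such as group_id), and note that EXPLAIN ANALYZE actually executes the query, so the slow variant really does take its full running time.

```sql
-- Sketch only: column names are the simplified ones from the question text.
-- "group" is a reserved word, hence the double quotes.

-- GROUP BY in the same column order as the index
EXPLAIN (ANALYZE)
SELECT account, "group", classification, max(size)
FROM mytable
GROUP BY account, "group", classification;

-- GROUP BY in a different column order
EXPLAIN (ANALYZE)
SELECT account, "group", classification, max(size)
FROM mytable
GROUP BY account, classification, "group";
```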

Best answer

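Actually, the order of the columns in the GROUP BY clause can affect the result. When the planner groups by sorting, the rows come back ordered by the GROUP BY columns in the order they are listed (although SQL itself guarantees no particular order without an ORDER BY). In the versions shown here, the planner looks for input sorted in that same listed order, so an index only helps when its column order matches. If you set your own ORDER BY, the results and the index usage will be the same however the GROUP BY is written.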

Demo:

CREATE TABLE coconuts (
  mass int,
  volume int,
  loveliness int
);

INSERT INTO coconuts (mass, volume, loveliness)
  SELECT (random() * 5)::int
       , (random() * 5)::int
       , (random() * 1000 + 9000)::int
  FROM GENERATE_SERIES(1,10000000);

Notice how the order of the columns in GROUP BY affects the order of the rows:

SELECT mass, volume, max(loveliness)
FROM coconuts
GROUP BY mass, volume;

 mass | volume |  max  
------+--------+-------
    0 |      0 | 10000
    0 |      1 | 10000
    0 |      2 | 10000
...

SELECT mass, volume, max(loveliness)
FROM coconuts
GROUP BY volume, mass;

 mass | volume |  max  
------+--------+-------
    0 |      0 | 10000
    1 |      0 | 10000
    2 |      0 | 10000
...

And how it affects the query plan:

CREATE INDEX ON coconuts (mass, volume);
SET enable_seqscan=false; --To force the index if possible

EXPLAIN
  SELECT mass, volume, max(loveliness)
  FROM coconuts
  GROUP BY (mass, volume);
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=1000.46..460459.11 rows=40000 width=12)
   Group Key: mass, volume
   ->  Gather Merge  (cost=1000.46..459459.11 rows=80000 width=12)
         Workers Planned: 2
         ->  Partial GroupAggregate  (cost=0.43..449225.10 rows=40000 width=12)
               Group Key: mass, volume
               ->  Parallel Index Scan using coconuts_mass_volume_idx on coconuts  (cost=0.43..417575.10 rows=4166667 width=12)
(7 rows)


EXPLAIN
  SELECT mass, volume, max(loveliness)
  FROM coconuts
  GROUP BY (volume, mass);
                                            QUERY PLAN                                           
------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=10001658532.83..10001758932.83 rows=40000 width=12)
   Group Key: volume, mass
   ->  Sort  (cost=10001658532.83..10001683532.83 rows=10000000 width=12)
         Sort Key: volume, mass
         ->  Seq Scan on coconuts  (cost=10000000000.00..10000154055.00 rows=10000000 width=12)
(5 rows)

However, if you add an ORDER BY that matches the original GROUP BY (and therefore the index), the original query plan comes back, at least in PostgreSQL 11.5.

EXPLAIN
  SELECT mass, volume, max(loveliness)
  FROM coconuts
  GROUP BY volume, mass
  ORDER BY mass, volume;
                                                           QUERY PLAN                                                           
--------------------------------------------------------------------------------------------------------------------------------
 Finalize GroupAggregate  (cost=1000.46..460459.11 rows=40000 width=12)
   Group Key: mass, volume
   ->  Gather Merge  (cost=1000.46..459459.11 rows=80000 width=12)
         Workers Planned: 2
         ->  Partial GroupAggregate  (cost=0.43..449225.10 rows=40000 width=12)
               Group Key: mass, volume
               ->  Parallel Index Scan using coconuts_mass_volume_idx on coconuts  (cost=0.43..417575.10 rows=4166667 width=12)
(7 rows)
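
Applied to the table from the question, the same idea might look like the sketch below. It assumes the simplified names from the question (mytable, account, "group", classification, size) and the index on (account, "group", classification). The ORDER BY variant was only demonstrated above on 11.5, so whether 9.4 picks the index for it is something to confirm with EXPLAIN rather than take for granted.

```sql
-- Sketch only: names are the simplified ones from the question.
-- "group" is a reserved word, hence the double quotes.

-- Option 1: list the GROUP BY columns in the same order as the index
SELECT account, "group", classification, max(size)
FROM mytable
GROUP BY account, "group", classification;

-- Option 2: keep the GROUP BY as written and add an ORDER BY
-- that matches the index column order
SELECT account, "group", classification, max(size)
FROM mytable
GROUP BY account, classification, "group"
ORDER BY account, "group", classification;
```

Either form returns the same groups. The second is mainly useful when the GROUP BY text is generated by a tool and cannot easily be reordered.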

Source: a similar question on Stack Overflow, "Why does the order of columns in an index matter for grouping in PostgreSQL?": https://stackoverflow.com/questions/41304688/
