postgresql - Selecting primary key: Why Postgres prefers a sequential scan over an index scan

I have the following table:

create table log
(
    id bigint default nextval('log_id_seq'::regclass) not null
        constraint log_pkey
            primary key,
    level integer,
    category varchar(255),
    log_time timestamp,
    prefix text,
    message text
);

It contains about 3 million rows.

I am comparing the following queries:

EXPLAIN SELECT id
        FROM log
        WHERE log_time < now() - INTERVAL '3 month'
        LIMIT 100000

which produces the following plan:

Limit  (cost=0.00..19498.87 rows=100000 width=8)
  ->  Seq Scan on log  (cost=0.00..422740.48 rows=2168025 width=8)
        Filter: (log_time < (now() - '3 mons'::interval))

and the same query with an ORDER BY id clause added:

EXPLAIN SELECT id
        FROM log
        WHERE log_time < now() - INTERVAL '3 month'
        ORDER BY id ASC
        LIMIT 100000

which produces:

Limit  (cost=0.43..25694.15 rows=100000 width=8)
  ->  Index Scan using log_pkey on log  (cost=0.43..557048.28 rows=2168031 width=8)
        Filter: (log_time < (now() - '3 mons'::interval))

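I have the following questions:

- The absence of ORDER BY instruction allows Postgres not to care about the order of rows. They may be as well delivered sorted. Why it does not use index without ORDER BY?
- How does Postgres use the index in the first place in such a query? The WHERE clause of both queries references a non-indexed column, and fetching that column requires a sequential scan of the table, yet the plan for the ORDER BY query does not show one.
- The Postgres manual page says: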

For a query that requires scanning a large fraction of the table, an explicit sort is likely to be faster than using an index because it requires less disk I/O due to following a sequential access pattern

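Can you please clarify this statement for me? Index is always ordered. And reading an ordered structure is always faster, it is always a sequential access (at least in terms of page scanning) than reading non-ordered data and then ordering it manually.

Best answer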

Can you please clarify this statement for me? Index is always ordered. And reading an ordered structure is always faster, it is always a sequential access (at least in terms of page scanning) than reading non-ordered data and then ordering it manually.

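The index is read sequentially, yes, but Postgres then has to follow up by reading the rows from the table. That is, in most cases, if an index identifies 100 rows, Postgres will need to perform up to 100 random reads against the table.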
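One way to observe those table reads is to let Postgres execute both queries and report buffer usage. This is only a sketch: it assumes the log table from the question, and EXPLAIN with ANALYZE really runs the statement, so it takes as long as the query itself. No output is shown here because it depends on the data and on what is already cached.

```sql
-- Sequential scan variant: reports the shared buffers touched while scanning the table.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM log
WHERE log_time < now() - INTERVAL '3 month'
LIMIT 100000;

-- Index scan variant: the buffer counts cover both the index pages
-- and the table pages visited to check log_time for each index entry.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM log
WHERE log_time < now() - INTERVAL '3 month'
ORDER BY id ASC
LIMIT 100000;
```

Comparing the "Buffers:" lines of the two plans gives a rough idea of how much more page traffic the index-driven plan generates.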

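Internally, the Postgres planner weights sequential reads and random reads differently, with random reads usually being considerably more expensive. The settings seq_page_cost and random_page_cost determine those weights. There are other settings you can view and tinker with if you want, though I recommend being very conservative when changing them.

Let's go back to your earlier question: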
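To look at these weights on a given server, something like the following should work. The default values in the comments are the stock Postgres defaults; a tuned installation may report different numbers.

```sql
-- Cost charged for a page fetched as part of a sequential read.
SHOW seq_page_cost;     -- stock default: 1

-- Cost charged for a page fetched non-sequentially.
SHOW random_page_cost;  -- stock default: 4

-- The same two settings plus the CPU cost parameters, with their descriptions.
SELECT name, setting, short_desc
FROM pg_settings
WHERE name IN ('seq_page_cost', 'random_page_cost',
               'cpu_tuple_cost', 'cpu_index_tuple_cost', 'cpu_operator_cost');
```

With the defaults, the planner treats a random page fetch as four times as expensive as a sequential one.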

The absence of ORDER BY instruction allows Postgres not to care about the order of rows. They may be as well delivered sorted. Why it does not use index without ORDER BY?

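The reason is the sort. As you note later, the index does not include the constraint column, so without an ordering requirement there is nothing to gain from using it. Instead, the planner is basically saying "read the table sequentially, check which rows fit the constraint, and return the first 100000 of them, in whatever order we find them".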

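The sort changes everything. In that case, the planner says "we need to sort by this field, and we have an index that is already sorted, so read rows from the table in index order, checking the constraint, until we have 100000 of them, and return that set".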
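To see what the planner would do for the ORDER BY query without the index, index scans can be discouraged for a single transaction. This is a sketch assuming the log table from the question; enable_indexscan does not forbid index scans outright, it only makes them look very expensive to the planner. The resulting plan is not shown because it depends on the installation, but it will most likely be a sequential scan feeding a sort.

```sql
BEGIN;

-- Applies to this transaction only; reverted by the ROLLBACK below.
SET LOCAL enable_indexscan = off;

EXPLAIN SELECT id
FROM log
WHERE log_time < now() - INTERVAL '3 month'
ORDER BY id ASC
LIMIT 100000;

ROLLBACK;
```

If the cost of that plan is higher than the 0.43..25694.15 of the index scan plan, that explains why the planner preferred the index once ordering was required.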

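You will notice that the cost estimate for the second query is higher (the Limit node costs 0.43..25694.15, versus 0.00..19498.87 for the first query): the planner thinks that doing that many random reads from an index scan will cost more than simply reading the table sequentially with no sort.

Hope that helps, and let me know if you have further questions.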
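The Limit costs in the two plans can be reproduced from the scan costs beneath them. As far as I understand the planner's arithmetic, a Limit node is charged the child's startup cost plus the share of the child's run cost that corresponds to the fraction of rows it expects to fetch. All numbers below are copied from the plans in the question; the formula itself is my reading of the planner, not something stated in the question.

```sql
SELECT
    -- Seq Scan: startup 0.00, total 422740.48, 2168025 rows estimated
    round(0.00 + (422740.48 - 0.00) * 100000 / 2168025, 2) AS limit_cost_seq_scan,
    -- Index Scan: startup 0.43, total 557048.28, 2168031 rows estimated
    round(0.43 + (557048.28 - 0.43) * 100000 / 2168031, 2) AS limit_cost_index_scan;
```

This returns 19498.87 and 25694.15, matching the Limit lines of the two plans. In both cases the planner expects to stop after roughly 4.6% of the matching rows, which is why neither Limit cost is anywhere near the cost of the full scan.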

Regarding "postgresql - Selecting primary key: Why Postgres prefers a sequential scan over an index scan", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43639877/
