postgresql - Query optimization for a large join


I have the following table structure:

CREATE TABLE site_tally
(
    id               serial,
    dt_created       timestamp WITHOUT TIME ZONE DEFAULT now() NOT NULL,
    dt_updated       timestamp WITHOUT TIME ZONE,
    geo              text                                      NOT NULL,
    dt_tally         date                                      NOT NULL,
    parent_site      text                                      NOT NULL,
    site_id          integer                                   NOT NULL,
    tracked          boolean                                   NOT NULL,
    utm_parameter_id integer                                   NOT NULL,
    device           text                                      NOT NULL,
    layout_id        integer                                   NOT NULL,
    views            integer                     DEFAULT 0,
    PRIMARY KEY (id, geo)
) PARTITION BY LIST (geo);

CREATE UNIQUE INDEX site_tally_uindex
    ON site_tally (geo, dt_tally, parent_site, site_id, tracked, utm_parameter_id, device, layout_id);

CREATE TABLE site_tally_uk PARTITION OF site_tally FOR VALUES IN ('UK');
CREATE TABLE site_tally_us PARTITION OF site_tally FOR VALUES IN ('US');
CREATE TABLE site_tally_au PARTITION OF site_tally FOR VALUES IN ('AU');


CREATE TABLE utm_parameters
(
    id         serial                            NOT NULL PRIMARY KEY,
    dt_created timestamp DEFAULT now()           NOT NULL,
    source     text      DEFAULT 'default'::text NOT NULL,
    medium     text      DEFAULT 'default'::text NOT NULL,
    campaign   text      DEFAULT 'default'::text NOT NULL,
    term       text      DEFAULT 'default'::text NOT NULL,
    content    text      DEFAULT 'default'::text NOT NULL
);

CREATE UNIQUE INDEX utm_parameters_source_medium_campaign_term_content_uindex
    ON utm_parameters (source, medium, campaign, term, content);

site_tally is partitioned purely for performance reasons, since we never need to query more than one geo.

I have hit an edge case where one of our queries takes a very long time to run:

SELECT SUM(views) AS views,
       term       AS utm
FROM site_tally
         INNER JOIN utm_parameters ON (utm_parameters.id = utm_parameter_id)
WHERE geo = 'UK'
    AND dt_tally >= '2019-08-01'
    AND dt_tally <= '2019-08-31'
    AND parent_site = 'site1'
    AND source = 'source1'
    AND medium = 'medium1'
    AND campaign = 'campaign1'
    AND tracked = FALSE
GROUP BY source,
         medium,
         campaign,
         term;

EXPLAIN ANALYZE:

GroupAggregate  (cost=1.11..12152.56 rows=1 width=74) (actual time=88.064..163032.380 rows=351 loops=1)
"  Group Key: utm_parameters.source, utm_parameters.medium, utm_parameters.campaign, utm_parameters.term"
  ->  Nested Loop  (cost=1.11..12152.53 rows=1 width=70) (actual time=59.993..163025.340 rows=15823 loops=1)
        ->  Index Scan using utm_parameters_source_medium_campaign_term_content_uindex on utm_parameters  (cost=0.55..8.57 rows=1 width=70) (actual time=0.024..39.883 rows=5994 loops=1)
              Index Cond: ((source = 'source1'::text) AND (medium = 'medium1'::text) AND (campaign = 'campaign1'::text))
        ->  Append  (cost=0.56..12143.95 rows=1 width=8) (actual time=26.022..27.188 rows=3 loops=5994)
              ->  Index Scan using site_tally_uk_geo_dt_tally_parent_site_site_id_tracked_utm__idx on site_tally_uk  (cost=0.56..12143.95 rows=1 width=8) (actual time=26.020..27.185 rows=3 loops=5994)
                    Index Cond: ((geo = 'UK'::text) AND (dt_tally >= '2019-08-01'::date) AND (dt_tally <= '2019-08-31'::date) AND (parent_site = 'site1'::text) AND (tracked = false) AND (utm_parameter_id = utm_parameters.id))
                    Filter: (NOT tracked)
Planning Time: 0.693 ms
Execution Time: 163032.762 ms

In this particular case there are many terms to group by. The same query without term behaves very differently:

SELECT SUM(views) AS views,
       campaign   AS utm
FROM site_tally
         INNER JOIN utm_parameters ON (utm_parameters.id = utm_parameter_id)
WHERE geo = 'UK'
    AND dt_tally >= '2019-08-01'
    AND dt_tally <= '2019-08-31'
    AND parent_site = 'site1'
    AND source = 'source1'
    AND medium = 'medium1'
    AND tracked = FALSE
GROUP BY source,
         medium,
         campaign;

EXPLAIN ANALYZE:

GroupAggregate  (cost=87129.06..87129.13 rows=3 width=48) (actual time=54.451..54.451 rows=1 loops=1)
"  Group Key: utm_parameters.source, utm_parameters.medium, utm_parameters.campaign"
  ->  Sort  (cost=87129.06..87129.07 rows=3 width=44) (actual time=50.572..51.398 rows=15823 loops=1)
        Sort Key: utm_parameters.campaign
        Sort Method: quicksort  Memory: 2610kB
        ->  Hash Join  (cost=1583.46..87129.04 rows=3 width=44) (actual time=11.359..46.521 rows=15823 loops=1)
              Hash Cond: (site_tally_uk.utm_parameter_id = utm_parameters.id)
              ->  Append  (cost=1322.54..86645.61 rows=84764 width=8) (actual time=4.268..31.765 rows=53612 loops=1)
                    ->  Bitmap Heap Scan on site_tally_uk  (cost=1322.54..86221.79 rows=84764 width=8) (actual time=4.267..28.157 rows=53612 loops=1)
                          Recheck Cond: ((dt_tally <= '2019-08-31'::date) AND (geo = 'UK'::text) AND (dt_tally >= '2019-08-01'::date) AND (parent_site = 'site1'::text) AND (NOT tracked))
                          Heap Blocks: exact=5237
                          ->  Bitmap Index Scan on site_tally_uk_geo_dt_tally_parent_site_tracked_idx  (cost=0.00..1301.35 rows=84764 width=0) (actual time=3.519..3.519 rows=53612 loops=1)
                                Index Cond: (dt_tally <= '2019-08-31'::date)
              ->  Hash  (cost=260.09..260.09 rows=66 width=44) (actual time=7.083..7.084 rows=5994 loops=1)
                    Buckets: 8192 (originally 1024)  Batches: 1 (originally 1)  Memory Usage: 556kB
                    ->  Bitmap Heap Scan on utm_parameters  (cost=5.23..260.09 rows=66 width=44) (actual time=1.346..5.862 rows=5994 loops=1)
                          Recheck Cond: ((source = 'source1'::text) AND (medium = 'medium1'::text))
                          Heap Blocks: exact=2655
                          ->  Bitmap Index Scan on utm_parameters_source_medium_campaign_term_content_uindex  (cost=0.00..5.21 rows=66 width=0) (actual time=0.991..0.992 rows=5994 loops=1)
                                Index Cond: ((source = 'source1'::text) AND (medium = 'medium1'::text))
Planning Time: 0.571 ms
Execution Time: 54.773 ms

Note: site_tally has more integer columns (after the views column) that are also summed in the SELECT. I left them out of the question because it is already long!

So, hoping to speed this query up, I tried a different indexing strategy:

CREATE INDEX testing ON site_tally (geo, dt_tally, parent_site, tracked)
WHERE geo='UK' and dt_tally >= '2019-08-01' and parent_site='site1' and tracked=FALSE;

Even when I try to make the query match the index predicate with something like dt_tally > '2019-07-31', the query planner does not pick this index.

At this point I cannot change the unique index on site_tally (other queries depend on that specific column order).

I would like to understand what exactly is going on in this query (I am not very familiar with EXPLAIN output).

Best answer

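It looks like the estimate for the index scan on utm_parameters is quite far off: the slow plan expects rows=1 from that scan but actually gets rows=5994.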

First, try a simple
```
ANALYZE utm_parameters;
```
and see if that does the trick.
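One way to tell whether the estimate has improved, without running the full join, is to look at the filter on utm_parameters in isolation. This is only a sketch; the literal values are the placeholders used in the question.

```sql
-- Compare the planner's estimated "rows=" with the "actual ... rows=" figure
-- for the same three-column filter that the slow query applies.
-- In the slow plan from the question the estimate was 1 and the actual count was 5994.
EXPLAIN (ANALYZE)
SELECT id
FROM utm_parameters
WHERE source = 'source1'
  AND medium = 'medium1'
  AND campaign = 'campaign1';
```

If the two numbers are now within the same order of magnitude, the planner has a much better chance of picking a sensible join strategy.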

If that does not improve things, try collecting more detailed statistics:

ALTER TABLE utm_parameters
   ALTER source SET STATISTICS 1000,
   ALTER medium SET STATISTICS 1000,
   ALTER campaign SET STATISTICS 1000;

ANALYZE utm_parameters;

If that does not improve the estimate either, the problem is probably correlation between the columns. Try creating extended statistics:

CREATE STATISTICS utm_parameters_stats (dependencies)
   ON source, medium, campaign FROM utm_parameters;

ANALYZE utm_parameters;
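To confirm that the statistics object was actually populated by ANALYZE, the catalog can be inspected. The sketch below assumes PostgreSQL 12 or later, where the pg_stats_ext view exists; the question does not state the server version, and on versions 10 and 11 the same information lives in the stxdependencies column of pg_statistic_ext instead.

```sql
-- Assumes PostgreSQL 12+. Lists extended statistics defined on utm_parameters.
-- A non-NULL "dependencies" value means functional-dependency statistics
-- were collected for the listed columns.
SELECT statistics_name,
       attnames,
       dependencies
FROM pg_stats_ext
WHERE tablename = 'utm_parameters';
```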

It looks like the last option worked for you. So what was going on?

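PostgreSQL has quite good statistics for estimating the selectivity of conditions of the form column = value.
Suppose each of the three conditions has a selectivity of 0.1, that is, each one filters out 90% of the rows. Not knowing any better, PostgreSQL assumes that the conditions are statistically independent, so it estimates the combined selectivity of all three as 0.1 * 0.1 * 0.1 = 0.001.
Now, the conditions are not independent: for example, if two rows have the same campaign, the medium is quite likely to be the same for them as well. So PostgreSQL's estimate ends up much lower than reality.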
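The gap between the independence assumption and reality can be measured directly. The following is a rough sketch, not part of the original answer: it counts the rows matching each condition separately and together, then computes how many rows the three conditions would match if they really were independent. The column aliases are made up for illustration and the literal values are the placeholders from the question.

```sql
WITH counts AS (
    SELECT count(*)                                       AS total,
           count(*) FILTER (WHERE source = 'source1')     AS n_source,
           count(*) FILTER (WHERE medium = 'medium1')     AS n_medium,
           count(*) FILTER (WHERE campaign = 'campaign1') AS n_campaign,
           count(*) FILTER (WHERE source = 'source1'
                              AND medium = 'medium1'
                              AND campaign = 'campaign1') AS n_all_three
    FROM utm_parameters
)
SELECT total,
       n_source,
       n_medium,
       n_campaign,
       n_all_three,
       -- total * (n_source/total) * (n_medium/total) * (n_campaign/total);
       -- NULLIF guards against division by zero on an empty table
       round(n_source::numeric * n_medium * n_campaign
             / NULLIF(total::numeric * total, 0), 1) AS expected_if_independent
FROM counts;
```

When n_all_three is far larger than expected_if_independent, the columns are correlated, which is the situation extended statistics are meant to handle.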
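That low estimate leads PostgreSQL to choose a nested loop join, which is the best access path for a small outer table. But if the outer table is large, a nested loop join performs badly. So fixing the estimate improves performance.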
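The cost of that choice can be read straight off the slow plan in the question. In EXPLAIN ANALYZE output the "actual time" of a node is an average per loop, so the total time spent in the inner index scan is roughly the per-loop time multiplied by loops. A quick check with the figures from that plan:

```sql
-- Inner index scan on site_tally_uk: about 27.188 ms per loop, loops=5994
SELECT 5994 * 27.188 AS inner_scan_total_ms;
-- 162964.872, which accounts for nearly all of the 163032.762 ms execution time
```

The planner expected a single outer row and therefore a single probe of the index; with 5994 outer rows the same probe was repeated 5994 times.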

Regarding postgresql - Query optimization for a large join, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/57757206/
