postgresql - Why is an equivalent, more complex query 10 times faster?

Tags: postgresql group-by subquery postgresql-9.5

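I have a flat table that I need to sanitize. The table essentially represents a tree structure:

channel -> (n0) partners -> (n1) campaign groups -> (n2) campaigns -> ... (ni) other levels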

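CREATE TABLE campaign_tree (
    channel_id int,
    channel_name text,
    partner_name text,
    campaign_group_name text,
    campaign_id int,  -- assumed: missing from the posted definition, but used by every query below; the type is a guess
    campaign_name text,
    ad_name text
);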

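To sanitize the data (make the names case-insensitive and get rid of redundant IDs), I first need to find the rows that have to be updated. I have two approaches to this: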

Approach 1
First get the upper levels of the tree, then collapse the different IDs that share the same name:

SELECT
    count(1),
    min(campaign_id) AS new_campaign_id,
    campaign_name,
    channel_name,
    partner_name,
    campaign_group_name
FROM
(SELECT DISTINCT
    campaign_id,
    upper(channel_name) AS channel_name,
    upper(partner_name) AS partner_name,
    upper(campaign_group_name) AS campaign_group_name,
    upper(campaign_name) AS campaign_name
FROM
    campaign_tree
) tmp
GROUP BY channel_name, partner_name, campaign_group_name, campaign_name
HAVING count(1)>1 --only need to get those that we need to sanitize

This query takes about 350 ms to execute. The query plan is as follows:

HashAggregate  (cost=18008.63..18081.98 rows=5868 width=136) (actual time=391.868..404.130 rows=33 loops=1)
  Output: count(1), min(campaign_tree.campaign_id), (upper(campaign_tree.campaign_name)), (upper(campaign_tree.channel_name)), (upper(campaign_tree_campaign_code.partner_name)), (upper(campaign_tree.campaign_group_name))
  Group Key: (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
  Filter: (count(1) > 1)
  Rows Removed by Filter: 64855
  ->  Unique  (cost=15324.20..16394.93 rows=58680 width=83) (actual time=282.253..338.041 rows=64998 loops=1)
        Output: campaign_tree_campaign_code.campaign_id, (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
        ->  Sort  (cost=15324.20..15502.65 rows=71382 width=83) (actual time=282.251..305.340 rows=71382 loops=1)
              Output: campaign_tree_campaign_code.campaign_id, (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
              Sort Key: campaign_tree.campaign_id, (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
              Sort Method: external merge  Disk: 6608kB
              ->  Seq Scan on campaign_tree  (cost=0.00..6153.64 rows=71382 width=83) (actual time=0.015..146.611 rows=71382 loops=1)
                    Output: campaign_tree.campaign_id, upper(campaign_tree.channel_name), upper(campaign_tree.partner_name), upper(campaign_tree.campaign_group_name), upper(campaign_tree.campaign_name)
Planning time: 0.085 ms
Execution time: 407.383 ms

Approach 2
A more direct approach: count the distinct IDs of the items that share the same name, and also determine the minimum of those IDs.

SELECT
    count(distinct campaign_id) AS cnt,
    min(campaign_id) AS new_campaign_id,
    upper(campaign_name) AS campaign_name,
    upper(channel_name) AS channel_name,
    upper(partner_name) AS partner_name,
    upper(campaign_group_name) AS campaign_group_name
FROM campaign_tree
GROUP BY upper(channel_name), upper(partner_name), upper(campaign_group_name), upper(campaign_name)
HAVING count(distinct campaign_id)>1

The results are the same, only in a different order. Each execution takes about 4 seconds. The query plan is as follows:

GroupAggregate  (cost=15324.20..17912.57 rows=51588 width=83) (actual time=3723.908..4004.447 rows=33 loops=1)
  Output: count(DISTINCT campaign_id), min(campaign_id), (upper(campaign_name)), (upper(channel_name)), (upper(partner_name)), (upper(campaign_group_name))
  Group Key: (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
  Filter: (count(DISTINCT campaign_tree.campaign_id) > 1)
  Rows Removed by Filter: 64855
  ->  Sort  (cost=15324.20..15502.65 rows=71382 width=83) (actual time=3718.016..3934.400 rows=71382 loops=1)
        Output: (upper(campaign_name)), (upper(channel_name)), (upper(partner_name)), (upper(campaign_group_name)), campaign_id
        Sort Key: (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name)), (upper(campaign_tree.campaign_name))
        Sort Method: external merge  Disk: 6880kB
        ->  Seq Scan on campaign_tree (cost=0.00..6153.64 rows=71382 width=83) (actual time=0.014..150.634 rows=71382 loops=1)
              Output: upper(campaign_name), upper(channel_name), upper(partner_name), upper(campaign_group_name), campaign_id
Planning time: 0.066 ms
Execution time: 4006.323 ms
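
The claim that Approach 1 and Approach 2 return the same rows can be checked on a small example. The following is a minimal sketch, not part of the original post: the table campaign_tree_demo and its four rows are hypothetical, and only two of the name columns are kept for brevity.

```sql
-- Hypothetical toy data (not from the original post).
CREATE TEMP TABLE campaign_tree_demo (
    campaign_id   int,
    channel_name  text,
    campaign_name text
);

INSERT INTO campaign_tree_demo VALUES
    (1, 'Web', 'Spring Sale'),
    (1, 'Web', 'Spring Sale'),  -- same ID repeated, as happens with several ads per campaign
    (2, 'WEB', 'spring sale'),  -- same name in a different case, different ID
    (3, 'Web', 'Summer Sale');

-- Approach 1 shape: deduplicate (id, names) first, then a plain count per group.
SELECT count(1) AS cnt, min(campaign_id) AS new_campaign_id, channel_name, campaign_name
FROM (
    SELECT DISTINCT
        campaign_id,
        upper(channel_name)  AS channel_name,
        upper(campaign_name) AS campaign_name
    FROM campaign_tree_demo
) tmp
GROUP BY channel_name, campaign_name
HAVING count(1) > 1;

-- Approach 2 shape: group the raw rows and count distinct IDs inside each group.
SELECT count(DISTINCT campaign_id) AS cnt, min(campaign_id) AS new_campaign_id,
       upper(channel_name) AS channel_name, upper(campaign_name) AS campaign_name
FROM campaign_tree_demo
GROUP BY upper(channel_name), upper(campaign_name)
HAVING count(DISTINCT campaign_id) > 1;
```

On this toy data both statements return a single row: cnt = 2, new_campaign_id = 1, channel_name = 'WEB', campaign_name = 'SPRING SALE'.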

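Approach 3
After some discussion, I decided to try a variation of the second approach, referencing the expressions by position in the GROUP BY clause instead of writing them out explicitly: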

SELECT
    count(distinct campaign_id) AS cnt,
    min(campaign_id) AS new_campaign_id,
    upper(campaign_name) AS campaign_name,
    upper(channel_name) AS channel_name,
    upper(partner_name) AS partner_name,
    upper(campaign_group_name) AS campaign_group_name
FROM campaign_tree
GROUP BY 3, 4, 5, 6
HAVING count(distinct campaign_id)>1

Query plan:

GroupAggregate  (cost=15324.20..17912.57 rows=51588 width=83) (actual time=1148.957..1316.564 rows=33 loops=1)
  Output: count(DISTINCT campaign_id), min(campaign_id), (upper(campaign_name)), (upper(channel_name)), (upper(partner_name)), (upper(campaign_group_name))
  Group Key: (upper(campaign_tree.campaign_name)), (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name))
  Filter: (count(DISTINCT campaign_tree.campaign_id) > 1)
  Rows Removed by Filter: 64855
  ->  Sort  (cost=15324.20..15502.65 rows=71382 width=83) (actual time=1148.849..1240.184 rows=71382 loops=1)
        Output: (upper(campaign_name)), (upper(channel_name)), (upper(partner_name)), (upper(campaign_group_name)), campaign_id
        Sort Key: (upper(campaign_tree.campaign_name)), (upper(campaign_tree.channel_name)), (upper(campaign_tree.partner_name)), (upper(campaign_tree.campaign_group_name))
        Sort Method: external merge  Disk: 6880kB
        ->  Seq Scan on campaign_tree  (cost=0.00..6153.64 rows=71382 width=83) (actual time=0.014..148.835 rows=71382 loops=1)
              Output: upper(campaign_name), upper(channel_name), upper(partner_name), upper(campaign_group_name), campaign_id
Planning time: 0.067 ms
Execution time: 1318.397 ms
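
Comparing the plans for Approach 2 and Approach 3, the visible difference is the order of the Group Key and Sort Key: GROUP BY 3, 4, 5, 6 puts upper(campaign_name) first, while the explicit GROUP BY puts upper(channel_name) first. A sketch of how one might check whether the key order, rather than the positional syntax, accounts for the timing gap is shown below. The EXPLAIN options are an assumption inferred from the shape of the plans shown, and the post does not report the result of this experiment.

```sql
-- Approach 2 with the GROUP BY expressions written out explicitly, but in the
-- order that GROUP BY 3, 4, 5, 6 resolves to (campaign_name first).
EXPLAIN (ANALYZE, VERBOSE)
SELECT
    count(distinct campaign_id) AS cnt,
    min(campaign_id) AS new_campaign_id,
    upper(campaign_name) AS campaign_name,
    upper(channel_name) AS channel_name,
    upper(partner_name) AS partner_name,
    upper(campaign_group_name) AS campaign_group_name
FROM campaign_tree
GROUP BY upper(campaign_name), upper(channel_name), upper(partner_name), upper(campaign_group_name)
HAVING count(distinct campaign_id) > 1;
```

If this runs in roughly the same time as Approach 3, the difference between Approaches 2 and 3 would come down to the sort key order, not to how the GROUP BY is spelled.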

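No, there are no indexes on this table. I know they would improve things. That is not the point of this question.

The question is: why is there such a large difference in execution time? The query plans don't tell me anything.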

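Best answer

Reading the plans, they seem to differ in whether you deduplicate with a Unique step before grouping or count the distinct campaign_id values within the grouping.

This suggests to me that the issue is that a GROUP BY with count(*) > 1 (which is effectively what your first query does) is much cheaper than a GROUP BY with count(distinct campaign_id) > 1.

That makes sense: in the former the rows are already deduplicated before they are grouped, whereas in the latter a secondary computation over the distinct IDs has to be done inside each group.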
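
One way to test this hypothesis on the real table is a timing probe that keeps Approach 2's grouping but swaps the DISTINCT aggregate for a plain count(*). This is only a sketch: the probe deliberately returns different rows from the real query, so it is useful for timing and plan comparison only, and the EXPLAIN options are an assumption.

```sql
-- Timing probe only: count(*) counts rows, not distinct IDs, so the result
-- set is NOT equivalent to Approach 2. Compare the plan and timings, not the rows.
EXPLAIN (ANALYZE, VERBOSE)
SELECT
    count(*) AS cnt,
    min(campaign_id) AS new_campaign_id,
    upper(campaign_name) AS campaign_name,
    upper(channel_name) AS channel_name,
    upper(partner_name) AS partner_name,
    upper(campaign_group_name) AS campaign_group_name
FROM campaign_tree
GROUP BY upper(channel_name), upper(partner_name), upper(campaign_group_name), upper(campaign_name)
HAVING count(*) > 1;
```

Without the DISTINCT aggregate the planner may pick a different plan shape (for example a HashAggregate instead of Sort plus GroupAggregate), so it is worth comparing the plan nodes as well as the total execution time before drawing conclusions.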

For more on "postgresql - Why is an equivalent, more complex query 10 times faster?", see the similar question on Stack Overflow: https://stackoverflow.com/questions/40149367/
