sql - How to optimize a GROUP BY query

Tags: sql, postgresql, query-optimization

I was given the task of optimizing the following query (I didn't write it):

SELECT
  "u"."email" as email,
  r.url as "domain",
  "u"."id" as "requesterId",
  s.total * 100 / count("r"."id") as "rate",
  count(("r"."url", "u"."email", "u"."id", s."total")) OVER () as total
FROM
  (
    SELECT
      url,
      id,
      "requesterId",
      created_at
    FROM
      (
        SELECT
          url,
          id,
          "requesterId",
          created_at,
          row_number() over (partition by main_request_uuid) as row_number
        FROM
          "requests" "request"
        GROUP BY
          main_request_uuid,
          retry_number,
          url,
          id,
          "requesterId",
          created_at
        ORDER BY
          main_request_uuid ASC,
          retry_number DESC
      ) "request_"
    WHERE
      request_.row_number = 1
  ) "r"
  INNER JOIN (
    SELECT
      "requesterId",
      url,
      count(created_at) AS "total"
    FROM
      (
        SELECT
          url,
          status,
          created_at,
          "requesterId"
        FROM
          (
            SELECT
              url,
              status,
              created_at,
              "requesterId",
              row_number() over (partition by main_request_uuid) as row_number
            FROM
              "requests" "request"
            GROUP BY
              main_request_uuid,
              retry_number,
              url,
              status,
              created_at,
              "requesterId"
            ORDER BY
              main_request_uuid ASC,
              retry_number DESC
          ) "request_"
        WHERE
          request_.row_number = 1
      ) "s"
    WHERE
      status IN ('success')
      AND s."created_at" :: date >= '2022-01-07' :: date
      AND s."created_at" :: date <= '2022-02-07' :: date
    GROUP BY
      s.url,
      s."requesterId"
  ) "s" ON s."requesterId" = r."requesterId"
  AND s.url = r.url
  INNER JOIN "users" "u" ON "u"."id" = r."requesterId"
WHERE
  r."created_at" :: date >= '2022-01-07' :: date
  AND r."created_at" :: date <= '2022-02-07' :: date
GROUP BY
  r.url,
  "u"."email",
  "u"."id",
  s.total
LIMIT
  10

So there is a requests table that stores API requests, and there is a mechanism that retries a request when it fails: it retries up to 5 times, keeping a separate row for each retry. If the request still fails after 5 retries, it gives up. That is what the partition by subquery is for: it selects only the main requests.
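As an aside, in PostgreSQL the "keep one row per main_request_uuid, preferring the highest retry_number" selection can be written with DISTINCT ON instead of a row_number() window plus an outer filter. A sketch, assuming the requests table from the question:

-- One row per main_request_uuid, keeping the highest retry_number,
-- i.e. the same rows the row_number() = 1 filter keeps.
SELECT DISTINCT ON (main_request_uuid)
       url, id, status, "requesterId", created_at
FROM   "requests"
ORDER BY main_request_uuid ASC, retry_number DESC;

This also drops the GROUP BY that the original subquery appears to use only for deduplication.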

What the query should return is the total number of requests and the success rate, grouped by url and requesterId. The query I was given is not only wrong, it also takes a very long time to execute, so I came up with the optimized version below:

WITH a AS (SELECT url,
                  id,
                  status,
                  "requesterId",
                  created_at
           FROM (
                    SELECT url,
                           id,
                           status,
                           "requesterId",
                           created_at,
                           row_number() over (partition by main_request_uuid) as row_number
                    FROM "requests" "request"
                    WHERE
                    created_at:: date >= '2022-01-07' :: date
                    AND created_at :: date <= '2022-02-07' :: date
                    GROUP BY main_request_uuid,
                             retry_number,
                             url,
                             id,
                             status,
                             "requesterId",
                             created_at
                    ORDER BY
                             main_request_uuid ASC,
                             retry_number DESC
                ) "request_"
           WHERE request_.row_number = 1),
     b AS (SELECT count(*) total, a2.url as url, a2."requesterId" FROM a a2 GROUP BY a2.url, a2."requesterId"),
     c AS (SELECT count(*) success, a3.url as url, a3."requesterId"
           FROM a a3
           WHERE status IN ('success')
           GROUP BY a3.url, a3."requesterId")
SELECT success * 100 / total as rate, b.url, b."requesterId", total, email
FROM b
         JOIN c ON b.url = c.url AND b."requesterId" = c."requesterId" JOIN users u ON b."requesterId" = u.id
LIMIT 10;

What the new version basically does is select all the main requests and count the successful ones and the total. The new version still takes a long time to execute (about 60 seconds on a table with 4 million requests).
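Note that the b and c CTEs scan a twice and then join the two counts back together; in PostgreSQL both counts can be computed in a single aggregation pass with the FILTER clause. A sketch of that rewrite, assuming the same tables (the rate expression mirrors the one in the query above, and the date bounds are rewritten as a half-open range so an index on created_at could be used):

WITH a AS (
    SELECT DISTINCT ON (main_request_uuid)
           url, status, "requesterId"
    FROM   "requests"
    WHERE  created_at >= '2022-01-07' :: date
      AND  created_at <  '2022-02-08' :: date
    ORDER BY main_request_uuid ASC, retry_number DESC
)
SELECT count(*) FILTER (WHERE status = 'success') * 100 / count(*) AS rate,
       a.url,
       a."requesterId",
       count(*) AS total,
       u.email
FROM   a
JOIN   users u ON u.id = a."requesterId"
GROUP BY a.url, a."requesterId", u.email
ORDER BY a.url, a."requesterId"
LIMIT  10;

The single pass avoids materializing two separate aggregates and the join that reconciles them.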

Is there any way to optimize this further?

You can see the table structure below. The table has no relevant indexes, and adding one on (url, requesterId) had no effect.

Column name         Data type
id                  bigint
requesterId         bigint
proxyId             bigint
url                 character varying
status              USER-DEFINED
time_spent          integer
created_at          timestamp with time zone
request_info        jsonb
retry_number        smallint
main_request_uuid   character varying

Here is the execution plan on a backup table with 100k rows. For 100k rows it takes 1.1 seconds, but in this case it would be preferable to get that down to at least 200 ms.

Limit  (cost=15196.40..15204.56 rows=1 width=77) (actual time=749.664..1095.476 rows=10 loops=1)
  CTE a
    ->  Subquery Scan on request_  (cost=15107.66..15195.96 rows=3 width=159) (actual time=226.805..591.188 rows=49474 loops=1)
          Filter: (request_.row_number = 1)
          Rows Removed by Filter: 70962
          ->  WindowAgg  (cost=15107.66..15188.44 rows=602 width=206) (actual time=226.802..571.185 rows=120436 loops=1)
                ->  Group  (cost=15107.66..15179.41 rows=602 width=198) (actual time=226.797..435.340 rows=120436 loops=1)
"                      Group Key: request.main_request_uuid, request.retry_number, request.url, request.id, request.status, request.""requesterId"", request.created_at"
                      ->  Gather Merge  (cost=15107.66..15170.62 rows=502 width=198) (actual time=226.795..386.198 rows=120436 loops=1)
                            Workers Planned: 2
                            Workers Launched: 2
                            ->  Group  (cost=14107.64..14112.66 rows=251 width=198) (actual time=212.749..269.504 rows=40145 loops=3)
"                                  Group Key: request.main_request_uuid, request.retry_number, request.url, request.id, request.status, request.""requesterId"", request.created_at"
                                  ->  Sort  (cost=14107.64..14108.27 rows=251 width=198) (actual time=212.744..250.031 rows=40145 loops=3)
"                                        Sort Key: request.main_request_uuid, request.retry_number DESC, request.url, request.id, request.status, request.""requesterId"", request.created_at"
                                        Sort Method: external merge  Disk: 7952kB
                                        Worker 0:  Sort Method: external merge  Disk: 8568kB
                                        Worker 1:  Sort Method: external merge  Disk: 9072kB
                                        ->  Parallel Seq Scan on requests request  (cost=0.00..14097.63 rows=251 width=198) (actual time=0.024..44.013 rows=40145 loops=3)
                                              Filter: (((created_at)::date >= '2022-01-07'::date) AND ((created_at)::date <= '2022-02-07'::date))
  ->  Nested Loop  (cost=0.43..8.59 rows=1 width=77) (actual time=749.662..1095.364 rows=10 loops=1)
"        Join Filter: (a2.""requesterId"" = u.id)"
        ->  Nested Loop  (cost=0.16..0.28 rows=1 width=64) (actual time=749.630..1095.163 rows=10 loops=1)
"              Join Filter: (((a2.url)::text = (a3.url)::text) AND (a2.""requesterId"" = a3.""requesterId""))"
              Rows Removed by Join Filter: 69
              ->  HashAggregate  (cost=0.08..0.09 rows=1 width=48) (actual time=703.128..703.139 rows=10 loops=1)
"                    Group Key: a3.url, a3.""requesterId"""
                    Batches: 5  Memory Usage: 4297kB  Disk Usage: 7040kB
                    ->  CTE Scan on a a3  (cost=0.00..0.07 rows=1 width=40) (actual time=226.808..648.251 rows=41278 loops=1)
                          Filter: (status = 'success'::requests_status_enum)
                          Rows Removed by Filter: 8196
              ->  HashAggregate  (cost=0.08..0.11 rows=3 width=48) (actual time=38.103..38.105 rows=8 loops=10)
"                    Group Key: a2.url, a2.""requesterId"""
                    Batches: 41  Memory Usage: 4297kB  Disk Usage: 7328kB
                    ->  CTE Scan on a a2  (cost=0.00..0.06 rows=3 width=40) (actual time=0.005..7.419 rows=49474 loops=10)
"        ->  Index Scan using ""PK_a3ffb1c0c8416b9fc6f907b7433"" on users u  (cost=0.28..8.29 rows=1 width=29) (actual time=0.015..0.015 rows=1 loops=10)"
"              Index Cond: (id = a3.""requesterId"")"
Planning Time: 1.494 ms
Execution Time: 1102.488 ms

Best answer

These items in your plan point to a possible optimization.

->  Parallel Seq Scan on requests request  (cost=0.00..14097.63 rows=251 width=198) (actual time=0.024..44.013 rows=40145 loops=3)
    Filter: (((created_at)::date >= '2022-01-07'::date) AND ((created_at)::date <= '2022-02-07'::date))

Sequential scans, whether parallel or not, are somewhat expensive.

So, try changing these WHERE conditions to make them sargable, i.e. usable for a range scan.

    created_at:: date >= '2022-01-07' :: date 
AND created_at :: date <= '2022-02-07' :: date

Change them to

    created_at >= '2022-01-07' :: date
AND created_at < '2022-02-07' :: date + INTERVAL '1' DAY

And put a BTREE index on the created_at column.

CREATE INDEX ON requests (created_at);

Your query is complex, so I'm not completely sure this will work. But give it a try. The index should pull out only the rows for the dates you need.

Also, a LIMIT clause without an accompanying ORDER BY clause gives PostgreSQL license to return any 10 rows it likes from the result set. Don't use LIMIT without ORDER BY. And don't use it at all unless you need it.
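For example, to make the ten returned rows deterministic, something like this (the sort columns are just one reasonable choice, and results here is a placeholder for the final SELECT of the rewritten query):

-- results is a placeholder; without ORDER BY, any 10 matching rows
-- may come back and can differ from run to run. With it, the output
-- is stable.
SELECT rate, url, "requesterId", total, email
FROM   results
ORDER BY total DESC, url, "requesterId"
LIMIT  10;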

About sql - How to optimize a GROUP BY query, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/71034838/
