I was given the task of optimizing the following query (which I didn't write):
SELECT
"u"."email" as email,
r.url as "domain",
"u"."id" as "requesterId",
s.total * 100 / count("r"."id") as "rate",
count(("r"."url", "u"."email", "u"."id", s."total")) OVER () as total
FROM
(
SELECT
url,
id,
"requesterId",
created_at
FROM
(
SELECT
url,
id,
"requesterId",
created_at,
row_number() over (partition by main_request_uuid) as row_number
FROM
"requests" "request"
GROUP BY
main_request_uuid,
retry_number,
url,
id,
"requesterId",
created_at
ORDER BY
main_request_uuid ASC,
retry_number DESC
) "request_"
WHERE
request_.row_number = 1
) "r"
INNER JOIN (
SELECT
"requesterId",
url,
count(created_at) AS "total"
FROM
(
SELECT
url,
status,
created_at,
"requesterId"
FROM
(
SELECT
url,
status,
created_at,
"requesterId",
row_number() over (partition by main_request_uuid) as row_number
FROM
"requests" "request"
GROUP BY
main_request_uuid,
retry_number,
url,
status,
created_at,
"requesterId"
ORDER BY
main_request_uuid ASC,
retry_number DESC
) "request_"
WHERE
request_.row_number = 1
) "s"
WHERE
status IN ('success')
AND s."created_at" :: date >= '2022-01-07' :: date
AND s."created_at" :: date <= '2022-02-07' :: date
GROUP BY
s.url,
s."requesterId"
) "s" ON s."requesterId" = r."requesterId"
AND s.url = r.url
INNER JOIN "users" "u" ON "u"."id" = r."requesterId"
WHERE
r."created_at" :: date >= '2022-01-07' :: date
AND r."created_at" :: date <= '2022-02-07' :: date
GROUP BY
r.url,
"u"."email",
"u"."id",
s.total
LIMIT
10
So there is a requests table that stores API requests, and there is a mechanism that retries a request when it fails, up to 5 times, keeping a separate row for each retry. If it still fails after 5 attempts, it stops retrying. That is the reason for the partition by subquery, which selects only the main requests.
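As an aside, in PostgreSQL the "keep only the final retry per main request" step can also be written with DISTINCT ON, which is usually simpler than the row_number() pattern. A sketch against the columns shown in the post (note that the original window call has no ORDER BY inside OVER (...), so which row receives row_number = 1 is not actually guaranteed; DISTINCT ON with an explicit ORDER BY avoids that ambiguity):

```sql
-- Keep one row per main_request_uuid: the one with the highest
-- retry_number, i.e. the final outcome of that request.
SELECT DISTINCT ON (main_request_uuid)
       url, id, status, "requesterId", created_at
FROM   requests
ORDER  BY main_request_uuid, retry_number DESC;
```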
What the query should return is the total number of requests and the success rate, grouped by url and requesterId. The query I was given is not only wrong but also takes a long time to execute, so I came up with the optimized version below:
WITH a AS (SELECT url,
id,
status,
"requesterId",
created_at
FROM (
SELECT url,
id,
status,
"requesterId",
created_at,
row_number() over (partition by main_request_uuid) as row_number
FROM "requests" "request"
WHERE
created_at:: date >= '2022-01-07' :: date
AND created_at :: date <= '2022-02-07' :: date
GROUP BY main_request_uuid,
retry_number,
url,
id,
status,
"requesterId",
created_at
ORDER BY
main_request_uuid ASC,
retry_number DESC
) "request_"
WHERE request_.row_number = 1),
b AS (SELECT count(*) total, a2.url as url, a2."requesterId" FROM a a2 GROUP BY a2.url, a2."requesterId"),
c AS (SELECT count(*) success, a3.url as url, a3."requesterId"
FROM a a3
WHERE status IN ('success')
GROUP BY a3.url, a3."requesterId")
SELECT success * 100 / total as rate, b.url, b."requesterId", total, email
FROM b
JOIN c ON b.url = c.url AND b."requesterId" = c."requesterId" JOIN users u ON b."requesterId" = u.id
LIMIT 10;
All the new version basically does is select all the main requests and count the successful ones and the total. It still takes a long time to execute (around 60 seconds on a table with 4 million requests).
Is there any way to optimize it further?
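One further idea (a sketch, untested against the real schema and data): CTEs b and c scan a twice and then join the two counts back together; a conditional aggregate with a FILTER clause computes both counts in a single pass, which also removes the b/c join:

```sql
-- Sketch: one pass over the deduplicated requests, counting total and
-- successful rows together via FILTER instead of two separate CTEs.
WITH a AS (
    SELECT DISTINCT ON (main_request_uuid)
           url, status, "requesterId"
    FROM   requests
    WHERE  created_at >= '2022-01-07'
      AND  created_at <  '2022-02-08'
    ORDER  BY main_request_uuid, retry_number DESC
)
SELECT count(*) FILTER (WHERE status = 'success') * 100 / count(*) AS rate,
       a.url,
       a."requesterId",
       count(*) AS total,
       u.email
FROM   a
JOIN   users u ON u.id = a."requesterId"
GROUP  BY a.url, a."requesterId", u.email
ORDER  BY total DESC
LIMIT  10;
```

The date bounds and the ordering column are illustrative; the point is the single aggregation pass.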
You can see the table structure below. The table has no relevant indexes, but adding one on (url, requesterId) had no effect.
Here is the execution plan on a backup table with 100k rows. For 100k rows it takes 1.1 seconds, but in this case I'd like to get it down to at least 200 ms:
Limit (cost=15196.40..15204.56 rows=1 width=77) (actual time=749.664..1095.476 rows=10 loops=1)
CTE a
-> Subquery Scan on request_ (cost=15107.66..15195.96 rows=3 width=159) (actual time=226.805..591.188 rows=49474 loops=1)
Filter: (request_.row_number = 1)
Rows Removed by Filter: 70962
-> WindowAgg (cost=15107.66..15188.44 rows=602 width=206) (actual time=226.802..571.185 rows=120436 loops=1)
-> Group (cost=15107.66..15179.41 rows=602 width=198) (actual time=226.797..435.340 rows=120436 loops=1)
Group Key: request.main_request_uuid, request.retry_number, request.url, request.id, request.status, request."requesterId", request.created_at
-> Gather Merge (cost=15107.66..15170.62 rows=502 width=198) (actual time=226.795..386.198 rows=120436 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Group (cost=14107.64..14112.66 rows=251 width=198) (actual time=212.749..269.504 rows=40145 loops=3)
Group Key: request.main_request_uuid, request.retry_number, request.url, request.id, request.status, request."requesterId", request.created_at
-> Sort (cost=14107.64..14108.27 rows=251 width=198) (actual time=212.744..250.031 rows=40145 loops=3)
Sort Key: request.main_request_uuid, request.retry_number DESC, request.url, request.id, request.status, request."requesterId", request.created_at
Sort Method: external merge Disk: 7952kB
Worker 0: Sort Method: external merge Disk: 8568kB
Worker 1: Sort Method: external merge Disk: 9072kB
-> Parallel Seq Scan on requests request (cost=0.00..14097.63 rows=251 width=198) (actual time=0.024..44.013 rows=40145 loops=3)
Filter: (((created_at)::date >= '2022-01-07'::date) AND ((created_at)::date <= '2022-02-07'::date))
-> Nested Loop (cost=0.43..8.59 rows=1 width=77) (actual time=749.662..1095.364 rows=10 loops=1)
Join Filter: (a2."requesterId" = u.id)
-> Nested Loop (cost=0.16..0.28 rows=1 width=64) (actual time=749.630..1095.163 rows=10 loops=1)
Join Filter: (((a2.url)::text = (a3.url)::text) AND (a2."requesterId" = a3."requesterId"))
Rows Removed by Join Filter: 69
-> HashAggregate (cost=0.08..0.09 rows=1 width=48) (actual time=703.128..703.139 rows=10 loops=1)
Group Key: a3.url, a3."requesterId"
Batches: 5 Memory Usage: 4297kB Disk Usage: 7040kB
-> CTE Scan on a a3 (cost=0.00..0.07 rows=1 width=40) (actual time=226.808..648.251 rows=41278 loops=1)
Filter: (status = 'success'::requests_status_enum)
Rows Removed by Filter: 8196
-> HashAggregate (cost=0.08..0.11 rows=3 width=48) (actual time=38.103..38.105 rows=8 loops=10)
Group Key: a2.url, a2."requesterId"
Batches: 41 Memory Usage: 4297kB Disk Usage: 7328kB
-> CTE Scan on a a2 (cost=0.00..0.06 rows=3 width=40) (actual time=0.005..7.419 rows=49474 loops=10)
-> Index Scan using "PK_a3ffb1c0c8416b9fc6f907b7433" on users u (cost=0.28..8.29 rows=1 width=29) (actual time=0.015..0.015 rows=1 loops=10)
Index Cond: (id = a3."requesterId")
Planning Time: 1.494 ms
Execution Time: 1102.488 ms
Best Answer
These lines in your plan point to a possible optimization:
-> Parallel Seq Scan on requests request (cost=0.00..14097.63 rows=251 width=198) (actual time=0.024..44.013 rows=40145 loops=3)
Filter: (((created_at)::date >= '2022-01-07'::date) AND ((created_at)::date <= '2022-02-07'::date))
A sequential scan, parallel or not, is somewhat costly.
So, try changing these WHERE conditions to make them sargable, i.e. usable for a range scan:
created_at:: date >= '2022-01-07' :: date
AND created_at :: date <= '2022-02-07' :: date
Change them to:
created_at >= '2022-01-07' :: date
AND created_at < '2022-02-07' :: date + INTERVAL '1' DAY
And put a BTREE index on the created_at column:
CREATE INDEX ON requests (created_at);
Your query is complex, so I'm not entirely sure this will work. But give it a try. The index should pull out only the rows for the dates you need.
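If the sort feeding the window function remains the bottleneck (the plan shows external merge sorts spilling several MB to disk), an index matching the partitioning and ordering columns may also be worth testing. This is an assumption, not something verified against the schema:

```sql
-- Hypothetical: an index already ordered the way the window step wants
-- its rows (per main_request_uuid, latest retry first) may let the
-- planner avoid the external-merge sorts seen in the plan.
CREATE INDEX ON requests (main_request_uuid, retry_number DESC);
```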
Also, a LIMIT clause without an accompanying ORDER BY clause gives PostgreSQL license to return whichever 10 rows of the result set it likes. Don't use LIMIT without ORDER BY. In fact, don't use LIMIT at all unless you need it.
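For example, appended to the rewritten query, an explicit ordering makes the ten returned rows deterministic (the sort columns here are just one reasonable choice, not something the question specifies):

```sql
-- Deterministic "top 10": busiest url/requester pairs first,
-- with a stable tiebreak on the grouping columns.
ORDER BY total DESC, url, "requesterId"
LIMIT  10;
```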
Regarding sql - How to optimize a GROUP BY query, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/71034838/