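I'm using Postgres and have a non-trivial query. I have two solutions, but neither of them is fast.
There is a table user_status_changes, which holds the history of user status changes: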
user_id | created_at | from_status | to_status
---------+---------------------+-------------+-----------
3 | 2016-03-24 04:00:00 | active | pending
3 | 2016-03-27 19:59:21 | pending | banned
6 | 2016-03-16 10:00:00 | pending | active
6 | 2016-03-21 15:00:00 | active | banned
6 | 2016-03-25 19:52:46 | banned | pending
6 | 2016-03-25 20:53:22 | pending | canceled
and a table users:
id | created_at
----+----------------------------
3 | 2016-03-21 19:54:09.831252
6 | 2016-03-14 13:04:09.134358
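What I want is a list of each user's status for every day from users.created_at until today, showing the date, the status on that day, and the status on the previous day.
Example result (assuming today is 2016-03-27):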
user_id | date | status_at | previous_status
-----------+-------------+-----------+-----------------
3 | 2016-03-21 | |
3 | 2016-03-22 | |
3 | 2016-03-23 | |
3 | 2016-03-24 | pending |
3 | 2016-03-25 | pending | pending
3 | 2016-03-26 | pending | pending
3 | 2016-03-27 | banned | pending
6 | 2016-03-14 | |
6 | 2016-03-15 | |
6 | 2016-03-16 | active |
6 | 2016-03-17 | active | active
6 | 2016-03-18 | active | active
6 | 2016-03-19 | active | active
6 | 2016-03-20 | active | active
6 | 2016-03-21 | banned | active
6 | 2016-03-22 | banned | banned
6 | 2016-03-23 | banned | banned
6 | 2016-03-24 | banned | banned
6 | 2016-03-25 | canceled | banned
6 | 2016-03-26 | canceled | canceled
6 | 2016-03-27 | canceled | canceled
I came up with two solutions. One uses subqueries (rather slow):
WITH possible_dates AS (
    -- one row per calendar day, from the earliest user creation to "today"
    SELECT date(generate_series) AS "date"
    FROM generate_series(
        (SELECT min(created_at) FROM users)::date,
        '2016-03-27'::date,
        '1 day'
    )
)
SELECT
    users.id AS user_id,
    possible_dates.date,
    (
        -- latest status change on or before this date
        SELECT to_status
        FROM user_status_changes
        WHERE user_status_changes.user_id = users.id
          AND date(user_status_changes.created_at) <= possible_dates.date
        ORDER BY user_status_changes.created_at DESC
        LIMIT 1
    ) AS status_at,
    -- a scalar subquery used as a function argument needs its own parentheses
    LAG((
        SELECT to_status
        FROM user_status_changes
        WHERE user_status_changes.user_id = users.id
          AND date(user_status_changes.created_at) <= possible_dates.date
        ORDER BY user_status_changes.created_at DESC
        LIMIT 1
    )) OVER (PARTITION BY users.id ORDER BY possible_dates.date ASC) AS previous_status
FROM users
CROSS JOIN possible_dates
WHERE date(users.created_at) <= possible_dates.date;
The other uses joins (seems faster):
WITH status_changes AS (
    -- keep only the last status change per user and day
    SELECT
        DISTINCT ON (user_id, date)
        user_id,
        created_at::date AS date,
        to_status,
        from_status
    FROM user_status_changes
    ORDER BY user_id, date, created_at DESC
),
possible_dates AS (
    SELECT date(generate_series) AS "date"
    FROM generate_series(
        (SELECT min(created_at) FROM users)::date,
        '2016-03-27'::date,
        '1 day'
    )
)
SELECT
    DISTINCT ON (users.id, possible_dates.date)
    users.id AS user_id,
    possible_dates.date AS date,
    s1.to_status AS status_at,
    s2.to_status AS previous_status
FROM users
CROSS JOIN possible_dates
-- every status change up to and including this date
LEFT OUTER JOIN status_changes s1
    ON s1.date <= possible_dates.date
    AND s1.user_id = users.id
-- every status change strictly before this date
LEFT JOIN LATERAL (
    SELECT
        status_changes.to_status,
        status_changes.date
    FROM status_changes
    WHERE
        status_changes.date < possible_dates.date AND
        status_changes.user_id = users.id
) s2 ON true
WHERE date(users.created_at) <= possible_dates.date
-- DISTINCT ON keeps the most recent s1 and s2 row for each user and date
ORDER BY users.id, possible_dates.date DESC, s1.date DESC, s2.date DESC;
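Currently we have about 20,000 users, and each user has about 10 payments and 2 status changes per month. The first user was created a year ago.
I think the problem with the join approach is that we join all previous status changes and only afterwards remove the redundant rows with DISTINCT ON.
Any better solution would be much appreciated, and index suggestions are welcome too.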
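Best answer
My query doesn't use LATERAL, which has to be evaluated for every row as in your query or @Mike's, so this should be considerably faster.
Explanation
First, generate the series of dates as you already did. CTE: generate_dates.
Then restrict the output to dates on or after each user's creation date, and pick up the statuses that were set on those dates. CTE: basic_status.
In the inner select, use a LEFT JOIN and COALESCE() to fill the NULLs between status changes with the status that was in effect at the time, and use DISTINCT ON to discard every earlier status except the one closest to each date.
The outer select only computes the previous status, using the LAG() window function.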
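The LAG() step can be looked at in isolation. The following is only a sketch: the inline rows are user 3's already gap-filled statuses from the question's expected result, and the names t and day are made up for the example.

```sql
-- Illustrative only: inline rows instead of the real tables.
SELECT
    day,
    status_at,
    -- status of the previous row in date order; NULL for the first row
    LAG(status_at) OVER (ORDER BY day) AS previous_status
FROM (VALUES
    (DATE '2016-03-24', 'pending'),
    (DATE '2016-03-25', 'pending'),
    (DATE '2016-03-26', 'pending'),
    (DATE '2016-03-27', 'banned')
) AS t(day, status_at)
ORDER BY day;
```

The first row has no predecessor, so its previous_status is NULL; the 2016-03-27 row gets 'pending', which matches the expected output in the question.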
Query
WITH generate_dates AS (
    SELECT date(generate_series) AS date
    FROM generate_series(
        (SELECT min(created_at) FROM users)::date,
        '2016-03-27'::date,
        '1 day'
    )
)
, basic_status AS (
    SELECT
        u.id AS user_id,
        g.date,
        s.to_status AS status_at,
        -- ordering by s.created_at too gives a later change on the same day the higher rownum
        row_number() OVER (PARTITION BY u.id ORDER BY g.date, s.created_at) AS rownum
    FROM users u
    JOIN generate_dates g ON
        g.date > u.created_at - interval '1 day'
    LEFT JOIN user_status_changes s ON
        u.id = s.user_id
        -- half-open range, so a change at exactly midnight belongs to one day only
        AND s.created_at >= g.date
        AND s.created_at < g.date + interval '1 day'
)
SELECT
    *,
    LAG(status_at) OVER (PARTITION BY user_id ORDER BY date) AS previous_status
FROM (
    SELECT
        DISTINCT ON ( b1.user_id, b1.date )
        b1.user_id,
        b1.date,
        -- days without a change take the closest earlier status
        COALESCE(b1.status_at, b2.status_at) AS status_at
    FROM basic_status b1
    LEFT JOIN basic_status b2 ON
        b1.user_id = b2.user_id
        AND b1.status_at IS NULL
        AND b2.status_at IS NOT NULL
        AND b1.rownum > b2.rownum
    -- b1.rownum DESC keeps the last change when a day has several
    ORDER BY b1.user_id, b1.date DESC, b1.rownum DESC, b2.rownum DESC
) foo;
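To try the query above against the sample rows from the question, a setup along the following lines should be enough. The column types and constraints are assumptions, since the question only shows the data and not the table definitions.

```sql
-- Assumed schema: types and constraints are guesses based on the sample rows.
CREATE TABLE users (
    id         integer PRIMARY KEY,
    created_at timestamp NOT NULL
);

CREATE TABLE user_status_changes (
    user_id     integer NOT NULL REFERENCES users (id),
    created_at  timestamp NOT NULL,
    from_status text,
    to_status   text NOT NULL
);

-- Sample rows copied from the question.
INSERT INTO users (id, created_at) VALUES
    (3, '2016-03-21 19:54:09.831252'),
    (6, '2016-03-14 13:04:09.134358');

INSERT INTO user_status_changes (user_id, created_at, from_status, to_status) VALUES
    (3, '2016-03-24 04:00:00', 'active',  'pending'),
    (3, '2016-03-27 19:59:21', 'pending', 'banned'),
    (6, '2016-03-16 10:00:00', 'pending', 'active'),
    (6, '2016-03-21 15:00:00', 'active',  'banned'),
    (6, '2016-03-25 19:52:46', 'banned',  'pending'),
    (6, '2016-03-25 20:53:22', 'pending', 'canceled');
```

User 6 has two changes on 2016-03-25, which makes it a useful case for checking that the last change of a day is the one reported.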
Indexes
You can create the following indexes to speed this up:
users (id)
user_status_changes (user_id, created_at)
users (created_at) - this one is probably less important
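As statements, the suggested indexes might look like the sketch below. The index names are made up for the example, and the first statement is redundant if users.id is already the primary key, which is an assumption about your schema.

```sql
-- Hypothetical index names; skip the first one if users.id is the primary key,
-- because a primary key already has a unique index.
CREATE INDEX users_id_idx
    ON users (id);

-- Supports lookups of one user's changes within a time range.
CREATE INDEX user_status_changes_user_id_created_at_idx
    ON user_status_changes (user_id, created_at);

-- Probably the least important of the three.
CREATE INDEX users_created_at_idx
    ON users (created_at);
```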
Notes
Remember to run ANALYZE on your tables to update the statistics, so that the planner's cost estimates are more accurate.
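A minimal sketch of that step, using the table names from the question. The query passed to EXPLAIN is only a placeholder and should be replaced with the one being tuned.

```sql
-- Refresh planner statistics for both tables.
ANALYZE users;
ANALYZE user_status_changes;

-- Then look at the actual plan and timings of a candidate query.
-- The SELECT below is a stand-in for the real query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM user_status_changes
WHERE user_id = 6;
```

Note that EXPLAIN with the ANALYZE option actually executes the statement, so run it with care on slow queries.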
Regarding sql - getting each day's user status from the status change history, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/39512688/