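sql - Get the daily user status from a history of status changes

Tags: sql postgresql greatest-n-per-group

I'm using Postgres and have a non-trivial query. I have two solutions, but neither of them is fast.

There is a table user_status_changes, which holds the history of user status changes: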

 user_id |         created_at  | from_status | to_status
---------+---------------------+-------------+-----------
       3 | 2016-03-24 04:00:00 | active      | pending
       3 | 2016-03-27 19:59:21 | pending     | banned
       6 | 2016-03-16 10:00:00 | pending     | active
       6 | 2016-03-21 15:00:00 | active      | banned
       6 | 2016-03-25 19:52:46 | banned      | pending
       6 | 2016-03-25 20:53:22 | pending     | canceled

And a table users:

id |         created_at
----+----------------------------
  3 | 2016-03-21 19:54:09.831252
  6 | 2016-03-14 13:04:09.134358

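What I want is a list of user statuses for every day from users.created_at until today, containing the date, the status on that day, and the status on the previous day.

Example result (assuming today is 2016-03-27):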

 user_id   | date        | status_at | previous_status
-----------+-------------+-----------+-----------------
         3 | 2016-03-21  |           |
         3 | 2016-03-22  |           |
         3 | 2016-03-23  |           |
         3 | 2016-03-24  | pending   |
         3 | 2016-03-25  | pending   | pending
         3 | 2016-03-26  | pending   | pending
         3 | 2016-03-27  | banned    | pending
         6 | 2016-03-14  |           | 
         6 | 2016-03-15  |           | 
         6 | 2016-03-16  | active    | 
         6 | 2016-03-17  | active    | active
         6 | 2016-03-18  | active    | active
         6 | 2016-03-19  | active    | active
         6 | 2016-03-20  | active    | active
         6 | 2016-03-21  | banned    | active
         6 | 2016-03-22  | banned    | banned
         6 | 2016-03-23  | banned    | banned
         6 | 2016-03-24  | banned    | banned
         6 | 2016-03-25  | canceled  | banned
         6 | 2016-03-26  | canceled  | canceled
         6 | 2016-03-27  | canceled  | canceled

I came up with two solutions. One with subqueries (quite slow):

WITH possible_dates AS (
  SELECT date(generate_series) AS "date"
    FROM generate_series(
      (SELECT min(created_at) FROM users)::date,
      '2016-03-27'::date,
      '1 day'
    )
)
SELECT
  users.id AS user_id,
  possible_dates.date,
  (
    SELECT to_status
    FROM user_status_changes
    WHERE user_status_changes.user_id = users.id
      AND date(user_status_changes.created_at) <= possible_dates.date
    ORDER BY user_status_changes.created_at DESC
    LIMIT 1
  ) AS status_at,
  -- a scalar subquery used as a function argument needs its own parentheses
  LAG((
      SELECT to_status
      FROM user_status_changes
      WHERE user_status_changes.user_id = users.id
        AND date(user_status_changes.created_at) <= possible_dates.date
      ORDER BY user_status_changes.created_at DESC
      LIMIT 1
    )) OVER (PARTITION BY users.id ORDER BY possible_dates.date ASC) AS previous_status
FROM users
CROSS JOIN possible_dates
WHERE date(users.created_at) <= possible_dates.date;

The other one uses joins (seems to be faster):

WITH status_changes AS (
  SELECT
    DISTINCT ON(user_id, date)
    user_id,
    created_at::date AS date,
    to_status,
    from_status
  FROM user_status_changes
  ORDER BY user_id, date, created_at DESC
),
possible_dates AS (
  SELECT date(generate_series) AS "date"
        FROM generate_series(
          (SELECT min(created_at) FROM users)::date,
          '2016-03-27'::date,
          '1 day'
        )
)
SELECT
  DISTINCT ON (users.id, possible_dates.date)
  users.id AS user_id,
  possible_dates.date AS date,
  s1.to_status AS status_at,
  s2.to_status AS previous_status
FROM users
CROSS JOIN possible_dates
LEFT OUTER JOIN status_changes s1
   ON s1.date <= possible_dates.date
  AND s1.user_id = users.id
LEFT JOIN LATERAL (
  SELECT
    status_changes.to_status,
    status_changes.date
  FROM status_changes
  WHERE
    status_changes.date < possible_dates.date AND
    status_changes.user_id = users.id
) s2 ON true
WHERE date(users.created_at) <= possible_dates.date
ORDER BY users.id, possible_dates.date DESC, s1.date DESC, s2.date DESC;

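Currently we have about 20,000 users, and each user has about 10 payments and 2 status changes per month. The first user was created a year ago.

I think the problem with the join approach is that we join every earlier status change and only afterwards remove the redundant rows with DISTINCT ON.

Any better solution would be much appreciated, and index suggestions are welcome too.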

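Best answer

My query does not use LATERAL, which has to be evaluated for every row as in your query or @Mike's, so it should be considerably faster.

Explanation

First generate the series of dates, as you already do. CTE: generate_dates

Then limit the output to the dates on or after each user's creation date, and fetch the statuses that were set on those dates. CTE: basic_status

In the inner select, use a LEFT JOIN with COALESCE() to fill the nulls between status changes with the status that was in effect at the time, and use DISTINCT ON to keep only the closest earlier status for each date.

The outer select only computes the previous status with the LAG() window function.

Query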

WITH generate_dates AS (
SELECT date(generate_series) AS date
    FROM generate_series(
      (SELECT min(created_at) FROM users)::date,
      '2016-03-27'::date,
      '1 day'
    )
)
, basic_status AS (
SELECT 
  u.id AS user_id, 
  g.date,
  s.to_status AS status_at,
  -- s.created_at breaks ties when a user has several changes on one day
  row_number() OVER (PARTITION BY u.id ORDER BY g.date, s.created_at) AS rownum
FROM users u
JOIN generate_dates g ON
  g.date > u.created_at - interval '1 day'
LEFT JOIN user_status_changes s ON
  u.id = s.user_id
  -- half-open range, so a change at exactly midnight matches only one day
  AND s.created_at >= g.date
  AND s.created_at < g.date + interval '1 day'
)
SELECT 
  *,
  LAG(status_at) OVER (PARTITION BY user_id ORDER BY date) AS previous_status
FROM (
  SELECT 
    DISTINCT ON ( b1.user_id, b1.date )
    b1.user_id,
    b1.date,
    COALESCE(b1.status_at, b2.status_at) AS status_at
  FROM basic_status b1
  LEFT JOIN basic_status b2 ON
    b1.user_id = b2.user_id
    AND b1.status_at IS NULL
    AND b2.status_at IS NOT NULL
    AND b1.rownum > b2.rownum
  -- b1.rownum DESC keeps the last change of a day, b2.rownum DESC the closest earlier status
  ORDER BY b1.user_id, b1.date DESC, b1.rownum DESC, b2.rownum DESC
  ) foo;
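
For comparison, the same fill-forward can probably also be written without the self-join, using a running count() to group each day with the most recent change. This is a different technique from the query above, not part of the original answer, and only a sketch: it assumes users.created_at is a timestamp without time zone and to_status is a text column, and the CTE names (days, last_change, grouped, filled) are made up for illustration.

```sql
WITH days AS (
  -- one row per user and calendar day, from the creation date to "today"
  SELECT u.id AS user_id,
         generate_series(date_trunc('day', u.created_at),
                         timestamp '2016-03-27',
                         interval '1 day')::date AS date
  FROM users u
),
last_change AS (
  -- the last status change of each user on each calendar day
  SELECT DISTINCT ON (user_id, created_at::date)
         user_id,
         created_at::date AS date,
         to_status
  FROM user_status_changes
  ORDER BY user_id, created_at::date, created_at DESC
),
grouped AS (
  SELECT d.user_id,
         d.date,
         c.to_status,
         -- count() skips NULLs, so the counter only advances on days with a change;
         -- every day therefore shares a group number with the latest change before it
         count(c.to_status) OVER (PARTITION BY d.user_id ORDER BY d.date) AS grp
  FROM days d
  LEFT JOIN last_change c
         ON c.user_id = d.user_id
        AND c.date = d.date
),
filled AS (
  SELECT user_id,
         date,
         -- each group holds at most one non-NULL status, so max() just picks it
         max(to_status) OVER (PARTITION BY user_id, grp) AS status_at
  FROM grouped
)
SELECT user_id,
       date,
       status_at,
       lag(status_at) OVER (PARTITION BY user_id ORDER BY date) AS previous_status
FROM filled
ORDER BY user_id, date;
```

Each table is read once here, so the work grows with the number of user-days rather than with the number of earlier status rows per day. Whether it is actually faster on your data is something only EXPLAIN ANALYZE can tell.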

Indexes

You could create the following indexes to speed things up:

  • users(id)
  • user_status_changes(user_id, created_at)
  • users(created_at) - this one is probably less important
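
As statements, the list above might look like the following. The index names are invented for illustration, and the sketch assumes users.id is already the primary key (the question does not show the constraints), in which case it has a unique index and needs no extra one.

```sql
-- users(id): skipped on the assumption that id is the primary key,
-- which already comes with a unique index

-- lets the planner find one user's status changes within a date range
CREATE INDEX IF NOT EXISTS user_status_changes_user_id_created_at_idx
    ON user_status_changes (user_id, created_at);

-- probably the least important of the three
CREATE INDEX IF NOT EXISTS users_created_at_idx
    ON users (created_at);
```

IF NOT EXISTS on CREATE INDEX requires PostgreSQL 9.5 or later; on older versions drop that clause.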

Notes

Remember to run ANALYZE on the tables to update your statistics, so that the planner's cost estimates are more accurate.
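
A minimal sketch of that step, using the two tables from the question. The EXPLAIN statement at the end is only a small stand-in query to show the syntax; in practice you would put the full query after the EXPLAIN clause.

```sql
-- refresh planner statistics for both tables
ANALYZE users;
ANALYZE user_status_changes;

-- run a query and show the actual plan, timings and buffer usage
EXPLAIN (ANALYZE, BUFFERS)
SELECT to_status
FROM user_status_changes
WHERE user_id = 6
ORDER BY created_at DESC
LIMIT 1;
```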

Regarding "sql - Get the daily user status from a history of status changes", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39512688/
