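I need to find an efficient way to write a query that reports the deltas of an aggregate, along with the start and end dates of each value.
Requirements
- The source table includes a start date, an end date, a category ID, a sub-category ID, and an indicator of whether the sub-category is active.
- The aggregate is over is_active per cat_id: as long as any sub-category has is_active = 1, the result of the function should be 1.
- If consecutive date ranges have the same aggregate result, the date ranges should be merged to reduce the result set.
- A given category/sub-category combination will never have overlapping dates, but different sub-categories may cross each other's boundaries.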
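To make the merge requirement concrete, here is a minimal sketch that uses only the four cat_id = 1 rows from the sample data. It flags the start of each run with lag() and groups on a running sum of those flags. The CTE names (ranges, flagged, numbered) and the columns starts_run, run_id, run_start and run_end are hypothetical names chosen for illustration, and the sketch assumes the input ranges do not overlap.

```sql
WITH ranges (start_dt, end_dt, is_active) AS (
    VALUES ('2018-01-01'::date, '2018-01-31'::date, 1),
           ('2018-02-01'::date, '2018-02-14'::date, 0),
           ('2018-02-15'::date, '2018-02-28'::date, 0),
           ('2018-03-01'::date, '2018-03-30'::date, 1)
), flagged AS (
    SELECT *,
           -- 0 when this row continues the previous run (same value, no gap),
           -- 1 when it starts a new run (including the very first row).
           CASE WHEN lag(is_active) OVER (ORDER BY start_dt) = is_active
                 AND lag(end_dt) OVER (ORDER BY start_dt) + 1 = start_dt
                THEN 0 ELSE 1 END AS starts_run
    FROM ranges
), numbered AS (
    -- The running sum of the flags gives every run its own id.
    SELECT *, sum(starts_run) OVER (ORDER BY start_dt) AS run_id
    FROM flagged
)
SELECT min(start_dt) AS run_start,
       max(end_dt)   AS run_end,
       min(is_active) AS is_active
FROM numbered
GROUP BY run_id
ORDER BY run_start;
-- run_start  | run_end    | is_active
-- 2018-01-01 | 2018-01-31 | 1
-- 2018-02-01 | 2018-02-28 | 0
-- 2018-03-01 | 2018-03-30 | 1
```

The two February rows collapse into one range because they carry the same value and the second starts the day after the first ends.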
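What I've tried
I tried creating a CTE that generates every possible range for a category and then joining it back to the main query, in order to break up a sub-category that spans multiple ranges. I then grouped by range and took MAX(is_active).
While this was a good start (at that point all I needed to do was combine consecutive ranges with the same value), the query was very slow. I'm not as familiar with Postgres as I am with other flavors of SQL, so I decided my time was better spent reaching out to someone more experienced and asking for help.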
Source data
+----+------------+------------+--------+------------+-----------+-----------------------------------------------------+
| id | start_dt   | end_dt     | cat_id | sub_cat_id | is_active | comment                                             |
+----+------------+------------+--------+------------+-----------+-----------------------------------------------------+
| 1  | 2018-01-01 | 2018-01-31 | 1      | 1001       | 1         | (null)                                              |
| 2  | 2018-02-01 | 2018-02-14 | 1      | 1001       | 0         | (null)                                              |
| 3  | 2018-02-15 | 2018-02-28 | 1      | 1001       | 0         | cat 1 is_active is unchanged despite new record.    |
| 4  | 2018-03-01 | 2018-03-30 | 1      | 1001       | 1         | (null)                                              |
| 5  | 2018-01-01 | 2018-01-15 | 2      | 2001       | 1         | (null)                                              |
| 6  | 2018-01-01 | 2018-01-31 | 2      | 2002       | 1         | (null)                                              |
| 7  | 2018-01-15 | 2018-02-10 | 2      | 2001       | 0         | cat 2 should still be active until 2002 is inactive |
| 8  | 2018-02-01 | 2018-02-14 | 2      | 2002       | 0         | cat 2 is inactive                                   |
| 9  | 2018-02-10 | 2018-03-15 | 2      | 2001       | 0         | this record will cause trouble                      |
| 10 | 2018-02-15 | 2018-03-30 | 2      | 2002       | 1         | cat 2 should be active again                        |
| 11 | 2018-03-15 | 2018-03-30 | 2      | 2001       | 1         | cat 2 is_active is unchanged despite new record.    |
| 12 | 2018-04-01 | 2018-04-30 | 2      | 2001       | 0         | cat 2 ends in a zero                                |
+----+------------+------------+--------+------------+-----------+-----------------------------------------------------+
Expected results
+------------+------------+--------+-----------+
| start_dt   | end_dt     | cat_id | is_active |
+------------+------------+--------+-----------+
| 2018-01-01 | 2018-01-31 | 1      | 1         |
| 2018-02-01 | 2018-02-28 | 1      | 0         |
| 2018-03-01 | 2018-03-30 | 1      | 1         |
| 2018-01-01 | 2018-01-31 | 2      | 1         |
| 2018-02-01 | 2018-02-14 | 2      | 0         |
| 2018-02-15 | 2018-03-30 | 2      | 1         |
| 2018-04-01 | 2018-04-30 | 2      | 0         |
+------------+------------+--------+-----------+
Here is a SELECT statement to help you write your own tests.
SELECT id,start_dt::date start_date,end_dt::date end_date,cat_id,sub_cat_id,is_active::int is_active,comment
FROM (VALUES
(1, '2018-01-01', '2018-01-31', 1, 1001, '1', null),
(2, '2018-02-01', '2018-02-14', 1, 1001, '0', null),
(3, '2018-02-15', '2018-02-28', 1, 1001, '0', 'cat 1 is_active is unchanged despite new record.'),
(4, '2018-03-01', '2018-03-30', 1, 1001, '1', null),
(5, '2018-01-01', '2018-01-15', 2, 2001, '1', null),
(6, '2018-01-01', '2018-01-31', 2, 2002, '1', null),
(7, '2018-01-15', '2018-02-10', 2, 2001, '0', 'cat 2 should still be active until 2002 is inactive'),
(8, '2018-02-01', '2018-02-14', 2, 2002, '0', 'cat 2 is inactive'),
(9, '2018-02-10', '2018-03-15', 2, 2001, '0', 'cat 2 is_active is unchanged despite new record.'),
(10, '2018-02-15', '2018-03-30', 2, 2002, '1', 'cat 2 should be active again'),
(11, '2018-03-15', '2018-03-30', 2, 2001, '1', 'cat 2 is_active is unchanged despite new record.'),
(12, '2018-04-01', '2018-04-30', 2, 2001, '0', 'cat 2 ends in 0.')
) src ( "id","start_dt","end_dt","cat_id","sub_cat_id","is_active","comment" )
Best answer
WITH test AS (
SELECT id, start_dt::date, end_dt::date, cat_id, sub_cat_id, is_active::int, comment FROM ( VALUES
(1, '2018-01-01', '2018-01-31', 1, 1001, '1', null),
(2, '2018-02-01', '2018-02-14', 1, 1001, '0', null),
(3, '2018-02-15', '2018-02-28', 1, 1001, '0', 'cat 1 is_active is unchanged despite new record.'),
(4, '2018-03-01', '2018-03-30', 1, 1001, '1', null),
(5, '2018-01-01', '2018-01-15', 2, 2001, '1', null),
(6, '2018-01-01', '2018-01-31', 2, 2002, '1', null),
(7, '2018-01-15', '2018-02-10', 2, 2001, '0', 'cat 2 should still be active until 2002 is inactive'),
(8, '2018-02-01', '2018-02-14', 2, 2002, '0', 'cat 2 is inactive'),
(9, '2018-02-10', '2018-03-15', 2, 2001, '0', 'cat 2 is_active is unchanged despite new record.'),
(10, '2018-02-15', '2018-03-30', 2, 2002, '1', 'cat 2 should be active again'),
(11, '2018-03-15', '2018-03-30', 2, 2001, '1', 'cat 2 is_active is unchanged despite new record.'),
(12, '2018-04-01', '2018-04-30', 2, 2001, '0', 'cat 2 ends in 0.')
) test (id, start_dt, end_dt, cat_id, sub_cat_id, is_active, comment)
)
SELECT cat_id, start_date, end_date, active_state
FROM (
    -- Each remaining change point starts a range that ends the day before the next one.
    SELECT cat_id, date as start_date, lead(date-1) over w as end_date
        , active_state, prev_active
        , nonactive_state, prev_nonactive
    FROM (
        SELECT cat_id, date
            , active_state, prev_active
            , nonactive_state
            , lag(nonactive_state, 1, 0) over w as prev_nonactive
        FROM (
            SELECT cat_id, date, active_state, lag(active_state, 1, 0) over w as prev_active
                -- Active trumps nonactive: drop "off" wherever "on" also holds.
                , (nonactive_state > active_state)::int as nonactive_state
            FROM (
                -- One row per (cat_id, date), with the running counters reduced to 0/1 flags.
                SELECT DISTINCT ON (cat_id, date)
                    cat_id, date
                    , (CASE WHEN sum(type) over w > 0 THEN 1 ELSE 0 END) as active_state
                    , (CASE WHEN sum(nonactive_type) over w > 0 THEN 1 ELSE 0 END) as nonactive_state
                FROM (
                    -- An is_active = 1 interval opens: +1 on the active counter.
                    SELECT start_dt as date
                        , 1 as type
                        , cat_id
                        , 0 as nonactive_type
                    FROM test
                    WHERE is_active = 1
                    UNION ALL
                    -- An is_active = 1 interval closes the day after end_dt: -1.
                    SELECT end_dt + 1 as date
                        , -1 as type
                        , cat_id
                        , 0 as nonactive_type
                    FROM test
                    WHERE is_active = 1
                    UNION ALL
                    -- An is_active = 0 interval opens: +1 on the nonactive counter.
                    SELECT start_dt as date
                        , 0 as type
                        , cat_id
                        , 1 as nonactive_type
                    FROM test
                    WHERE is_active = 0
                    UNION ALL
                    -- An is_active = 0 interval closes the day after end_dt: -1.
                    SELECT end_dt + 1 as date
                        , 0 as type
                        , cat_id
                        , -1 as nonactive_type
                    FROM test
                    WHERE is_active = 0
                ) t
                WINDOW w as (partition by cat_id order by date)
                ORDER BY cat_id, date
            ) t2
            WINDOW w as (partition by cat_id order by date)
        ) t3
        WINDOW w as (partition by cat_id order by date)
    ) t4
    -- Keep only the dates where either flag changes.
    WHERE (active_state != prev_active) OR (nonactive_state != prev_nonactive)
    WINDOW w as (partition by cat_id order by date)
) t5
-- Discard the change points that open a gap (neither "on" nor "off").
WHERE active_state = 1 OR nonactive_state = 1
ORDER BY cat_id, start_date
yields
| cat_id | start_date | end_date | active_state |
|--------+------------+------------+--------------|
| 1 | 2018-01-01 | 2018-01-31 | 1 |
| 1 | 2018-02-01 | 2018-02-28 | 0 |
| 1 | 2018-03-01 | 2018-03-30 | 1 |
| 2 | 2018-01-01 | 2018-01-31 | 1 |
| 2 | 2018-02-01 | 2018-02-14 | 0 |
| 2 | 2018-02-15 | 2018-03-30 | 1 |
| 2 | 2018-04-01 | 2018-04-30 | 0 |
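This combines the start_dt and end_dt dates into a single column and introduces a type column, which is 1 for start dates and -1 for end dates (in the query, the -1 is placed on end_dt + 1, the day after the interval ends). The running sum of type is positive when the corresponding date falls within a [start_dt, end_dt] interval, and 0 otherwise.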
This is one of the ideas presented in Itzik Ben-Gan's Packing Intervals, though I first learned it from DSM, in the context of Python/Pandas programming.
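As an illustration of the running-sum idea in isolation, the following minimal sketch sweeps over only the is_active = 1 intervals of cat_id = 2 from the sample data. The names intervals, events, event_date, delta and open_intervals are hypothetical and chosen for this example; the full query above uses date and type instead.

```sql
WITH intervals (start_dt, end_dt) AS (
    VALUES ('2018-01-01'::date, '2018-01-15'::date),
           ('2018-01-01'::date, '2018-01-31'::date),
           ('2018-02-15'::date, '2018-03-30'::date),
           ('2018-03-15'::date, '2018-03-30'::date)
), events AS (
    -- +1 on the day an interval opens ...
    SELECT start_dt AS event_date, 1 AS delta FROM intervals
    UNION ALL
    -- ... and -1 on the day after it closes.
    SELECT end_dt + 1 AS event_date, -1 AS delta FROM intervals
)
SELECT DISTINCT ON (event_date)
       event_date,
       -- Rows sharing a date are peers in the default window frame,
       -- so they all receive the same running total.
       sum(delta) OVER (ORDER BY event_date) AS open_intervals
FROM events
ORDER BY event_date;
-- event_date | open_intervals
-- 2018-01-01 | 2
-- 2018-01-16 | 1
-- 2018-02-01 | 0
-- 2018-02-15 | 1
-- 2018-03-15 | 2
-- 2018-03-31 | 0
```

A positive open_intervals means at least one interval covers the dates from that event up to the next one, so the "on" stretches here are 2018-01-01 through 2018-01-31 and 2018-02-15 through 2018-03-30.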
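Usually when dealing with intervals using the above technique, the intervals define when a date is "on", and not being "on" automatically implies "off".
However, in this problem, it appears that rows with active_state = 1 imply the final active_state is "on", but dates outside of these intervals are not necessarily "off". 2018-03-31 is an example of a date that is outside of an active_state = 1 interval but is not "off".
Similarly, rows with active_state = 0 imply the final active_state is "off", as long as the date does not intersect an interval where active_state = 1.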
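To handle these two different kinds of intervals, I applied the above technique (summing the +1/-1 type) twice: once for the rows with is_active = 1 and once for the rows with is_active = 0.
This gives us a handle on which dates are definitely in an active_state ("on") and which dates are definitely in a nonactive_state ("off").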
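The two counters can be seen side by side in a small sketch built from two of the cat_id = 2 sample rows (one active, one inactive). This is a compact variation rather than the query above: it derives both deltas arithmetically from is_active instead of using four UNION ALL branches, and the names src, events, event_date, on_delta and off_delta are assumptions made for the example.

```sql
WITH src (start_dt, end_dt, is_active) AS (
    VALUES ('2018-01-01'::date, '2018-01-31'::date, 1),  -- sub-category 2002, active
           ('2018-01-15'::date, '2018-02-10'::date, 0)   -- sub-category 2001, inactive
), events AS (
    -- Opening event: +1 on the counter matching the row's is_active value.
    SELECT start_dt AS event_date,
           is_active     AS on_delta,
           1 - is_active AS off_delta
    FROM src
    UNION ALL
    -- Closing event, the day after end_dt: -1 on the same counter.
    SELECT end_dt + 1, -is_active, is_active - 1
    FROM src
)
SELECT DISTINCT ON (event_date)
       event_date,
       (sum(on_delta)  OVER w > 0)::int AS active_state,
       (sum(off_delta) OVER w > 0)::int AS nonactive_state
FROM events
WINDOW w AS (ORDER BY event_date)
ORDER BY event_date;
-- event_date | active_state | nonactive_state
-- 2018-01-01 | 1            | 0
-- 2018-01-15 | 1            | 1
-- 2018-02-01 | 0            | 1
-- 2018-02-11 | 0            | 0
```

From 2018-01-15 both flags are 1 at once, which is exactly the case where one of them has to give way.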
Since active trumps nonactive, the dates that are considered nonactive are pruned using:
(nonactive_state > active_state)::int as nonactive_state
(i.e., when active_state = 1 and nonactive_state = 1, the assignment above changes nonactive_state to 0.)
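The effect of that expression on all four flag combinations can be checked with a short sketch; the alias s and the output column pruned_nonactive_state are hypothetical names used only here.

```sql
SELECT active_state,
       nonactive_state,
       -- true only when the date is "off" and not also "on"
       (nonactive_state > active_state)::int AS pruned_nonactive_state
FROM (VALUES (0, 0), (0, 1), (1, 0), (1, 1)) AS s (active_state, nonactive_state)
ORDER BY active_state, nonactive_state;
-- active_state | nonactive_state | pruned_nonactive_state
-- 0            | 0               | 0
-- 0            | 1               | 1
-- 1            | 0               | 0
-- 1            | 1               | 0
```

Only the (0, 1) combination keeps nonactive_state = 1, so a date ends up "off" only when no active interval covers it.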
Regarding sql - getting deltas of a custom aggregate with date ranges, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55151513/