sql - Rolling counts based on rolling cohorts

Tags: sql postgresql crosstab window-functions generate-series

Using Postgres 9.5. Test data:

create temp table rental (
    customer_id smallint
    ,rental_date timestamp without time zone
    ,customer_name text
);

insert into rental values
    (1, '2006-05-01', 'james'),
    (1, '2006-06-01', 'james'),
    (1, '2006-07-01', 'james'),
    (1, '2006-07-02', 'james'),
    (2, '2006-05-02', 'jacinta'),
    (2, '2006-05-03', 'jacinta'),
    (3, '2006-05-04', 'juliet'),
    (3, '2006-07-01', 'juliet'),
    (4, '2006-05-03', 'julia'),
    (4, '2006-06-01', 'julia'),
    (5, '2006-05-05', 'john'),
    (5, '2006-06-01', 'john'),
    (5, '2006-07-01', 'john'),
    (6, '2006-07-01', 'jacob'),
    (7, '2006-07-02', 'jasmine'),
    (7, '2006-07-04', 'jasmine');

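I am trying to understand the behaviour of existing customers. I am trying to answer this question:

What is the likelihood of a customer ordering again, based on when their last order was (current month, previous month (m-1) ... through m-12)?

The likelihood is calculated as: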

distinct count of people who ordered in current month /
distinct count of people in their cohort.

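So I need to generate a table that lists, for each cohort, the count of people who ordered in the current month and belong to that cohort.

So, what are the rules for being in a cohort?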

- current month cohort: >1 order in month OR (1 order in month given no previous orders)
- m-1 cohort: <=1 order in current month and >=1 order in m-1
- m-2 cohort: <=1 order in current month and 0 orders in m-1 and >=1 order in m-2
- etc

I am using the DVD Store database as sample data to develop the query: http://linux.dell.com/dvdstore/

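Here is an example of the cohort rules and aggregations, with July as the "month's orders being analysed" (please note: the "month's orders being analysed" column is the first column in the "desired output" table below):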

customer_id | jul-16| jun-16| may-16|
------------|-------|-------|-------|
james       | 1  1  | 1     | 1     | <- member of jul cohort, made order in jul
jasmine     | 1  1  |       |       | <- member of jul cohort, made order in jul
jacob       | 1     |       |       | <- member of jul cohort, did NOT make order in jul
john        | 1     | 1     | 1     | <- member of jun cohort, made order in jul
julia       |       | 1     | 1     | <- member of jun cohort, did NOT make order in jul
juliet      | 1     |       | 1     | <- member of may cohort, made order in jul
jacinta     |       |       | 1 1   | <- member of may cohort, did NOT make order in jul

This data would produce the following table:

--where m = month's orders being analysed

month's orders |how many people |how many people from  |how many people   |how many people from    |how many people   |how many people from    |
being analysed |are in cohort m |cohort m ordered in m |are in cohort m-1 |cohort m-1 ordered in m |are in cohort m-2 |cohort m-2 ordered in m |...m-12
---------------|----------------|----------------------|------------------|------------------------|------------------|------------------------|
may-16         |5               |1                     |                  |                        |                  |                        |
jun-16         |                |                      |5                 |3                       |                  |                        |
jul-16         |3               |2                     |2                 |1                       |2                 |1                       |

My attempts so far have been variations on:

generate_series()

row_number() over (partition by customer_id order by rental_id desc)

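I have not been able to put it all together yet (I have tried for many hours without solving it).

For readability, I think posting my work in parts is better (if anyone wants me to post the SQL query in its entirety, please comment and I will add it).

Series query: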

(select
    generate_series(date_trunc('month', min(rental_date)), date_trunc('month', max(rental_date)), interval '1 month') as month_being_analysed
from
    rental) as series

Ranking query:

(select
    *,
    row_number() over (partition by customer_id order by rental_id desc) as rnk
from
    rental
where
    date_trunc('month',rental_date) <= series.month_being_analysed) as orders_ranked

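I want to do something like this: run the orders_ranked query for every row returned by the series query, and then base the aggregations on each result of orders_ranked.

Something like: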

(--this query counts the customers in cohort m-1
select
    count(distinct customer_id)
from
    (--this query ranks the orders that have occurred <= the date in the row of the 'series' table
    select
        *,
        row_number() over (partition by customer_id order by rental_id desc) as rnk
    from
        rental
    where
        date_trunc('month',rental_date)<=series.month_being_analysed) as orders_ranked
where
    (rnk=1 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
    OR
    (rnk=2 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
) as people_2nd_last_booking_in_m_1,


(--this query counts the customers in cohort m-1 who ordered in month m
select
    count(distinct customer_id)
from
    (--this query returns the orders by customers in cohort m-1
    select
        count(distinct customer_id)
    from
        (--this query ranks the orders that have occurred <= the date in the row of the 'series' table
        select
            *,
            row_number() over (partition by customer_id order by rental_id desc) as rnk
        from
            rental
        where
            date_trunc('month',rental_date)<=series.month_being_analysed) as orders_ranked
    where
        (rnk=1 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
        OR
        (rnk=2 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
where
    rnk=1 in series.month_being_analysed
) as people_who_booked_in_m_whose_2nd_last_booking_was_in_m_1,
...
from
    (select
        generate_series(date_trunc('month', min(rental_date)), date_trunc('month', max(rental_date)), interval '1 month') as month_being_analysed
    from
        rental) as series
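
For reference, the idea of running orders_ranked once per row of the series can be written with a LATERAL subquery (available since Postgres 9.3). This is only a minimal sketch against the test table above, not the full cohort logic: the temp table has no rental_id, so it orders by rental_date instead, and the aliases s, o and r are made up for illustration.

```sql
-- Sketch: for each analysed month, rank every customer's orders
-- that happened in or before that month (most recent order = rnk 1).
-- Assumes the temp table "rental" from the test data above.
SELECT s.month_being_analysed
     , o.customer_id
     , o.rental_date
     , o.rnk
FROM  (
   SELECT generate_series(date_trunc('month', min(rental_date))
                        , date_trunc('month', max(rental_date))
                        , interval '1 month') AS month_being_analysed
   FROM   rental
   ) s
CROSS  JOIN LATERAL (
   -- LATERAL lets this subquery reference s.month_being_analysed
   SELECT r.customer_id
        , r.rental_date
        , row_number() OVER (PARTITION BY r.customer_id
                             ORDER BY r.rental_date DESC) AS rnk
   FROM   rental r
   WHERE  date_trunc('month', r.rental_date) <= s.month_being_analysed
   ) o
ORDER  BY s.month_being_analysed, o.customer_id, o.rnk;
```

This re-ranks the whole table once per month, so it is likely to scale poorly as the number of months grows.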

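Best answer

This query does everything. It operates on the whole table and works for any time range.

It is based on some assumptions, and assumes the current Postgres version 9.5. It should work with pg 9.1 at least. Since your definition of "cohort" is not clear to me, I skipped the "how many people are in cohort" columns.

I would expect it to be faster than anything you have tried so far. By orders of magnitude.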

SELECT *
FROM   crosstab (
   $$
   SELECT mon
        , sum(count(*)) OVER (PARTITION BY mon)::int AS m0
        , gap   -- count of months since last order
        , count(*) AS gap_ct
   FROM  (
      SELECT mon
           , mon_int - lag(mon_int) OVER (PARTITION BY c_id ORDER BY mon_int) AS gap
      FROM  (
         SELECT DISTINCT ON (1,2)
                date_trunc('month', rental_date)::date AS mon
              , customer_id                            AS c_id
              , extract(YEAR  FROM rental_date)::int * 12
              + extract(MONTH FROM rental_date)::int   AS mon_int
         FROM   rental
         ) dist_customer
      ) gap_to_last_month
   GROUP  BY mon, gap
   ORDER  BY mon, gap
   $$
 , 'SELECT generate_series(1,12)'
   ) ct (mon date, m0 int
       , m01 int, m02 int, m03 int, m04 int, m05 int, m06 int
       , m07 int, m08 int, m09 int, m10 int, m11 int, m12 int);
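
Note that crosstab() is not built in. It is provided by the additional module tablefunc, which has to be installed once per database before the query above can run. Assuming you have the required privileges:

```sql
-- Install the tablefunc module (provides crosstab()); a no-op if already installed
CREATE EXTENSION IF NOT EXISTS tablefunc;
```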

Result:

    mon     | m0 | m01 | m02 | m03 | m04 | m05 | m06 | m07 | m08 | m09 | m10 | m11 | m12
------------+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----
 2015-01-01 | 63 |  36 |  15 |   5 |   3 |   3 |     |     |     |     |     |     |
 2015-02-01 | 56 |  35 |   9 |   9 |   2 |     |   1 |     |     |     |     |     |
...

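m0  .. customers with >= 1 order this month
m01 .. customers with >= 1 order this month and >= 1 order 1 month before (nothing in between)
m02 .. customers with >= 1 order this month and >= 1 order 2 months before, and no order in between
etc.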
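
To see what crosstab() receives, the inner query can be run on its own. The following is a sketch against the test data from the question (the temp table rental); the rows in the trailing comment are what that data should produce by my reading, with one row per (mon, gap) before pivoting.

```sql
-- The source query of the crosstab, run stand-alone on the question's test data
SELECT mon
     , sum(count(*)) OVER (PARTITION BY mon)::int AS m0   -- distinct customers per month
     , gap                                                -- months since the previous order month
     , count(*) AS gap_ct                                 -- customers with that gap
FROM  (
   SELECT mon
        , mon_int - lag(mon_int) OVER (PARTITION BY c_id ORDER BY mon_int) AS gap
   FROM  (
      SELECT DISTINCT ON (1,2)
             date_trunc('month', rental_date)::date AS mon
           , customer_id                            AS c_id
           , extract(YEAR  FROM rental_date)::int * 12
           + extract(MONTH FROM rental_date)::int   AS mon_int
      FROM   rental
      ) dist_customer
   ) gap_to_last_month
GROUP  BY mon, gap
ORDER  BY mon, gap;

--     mon     | m0 | gap | gap_ct
-- ------------+----+-----+--------
--  2006-05-01 |  5 |     |      5
--  2006-06-01 |  3 |   1 |      3
--  2006-07-01 |  5 |   1 |      2
--  2006-07-01 |  5 |   2 |      1
--  2006-07-01 |  5 |     |      2
```

Rows with an empty gap are customers with no earlier order. They count towards m0, but they match none of the categories 1 to 12, so they do not fill any of the columns m01 .. m12.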

How?

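  1. In subquery dist_customer, reduce to one row per month and customer_id (mon, c_id) with DISTINCT ON.

    To simplify later calculations, add a running count of months for the date (mon_int = year * 12 + month), so that the difference between two values is the number of months between them.

    If there are many orders per (month, customer), there are faster query techniques for this first step.

  2. In subquery gap_to_last_month, add the column gap, indicating the gap in months between this month and the last month with any order by the same customer. This uses the window function lag().

  3. In the outer SELECT, aggregate per (mon, gap) to get the counts you are after. In addition, get the total count of distinct customers for this month as m0.

  4. Feed this query to crosstab() to pivot the result into the desired tabular form.

    m0 is passed through as an "extra" column: crosstab() takes it from the first row of each mon group, which works because the value is the same for all rows of one month.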
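
Steps 1 and 2 can be inspected per customer before any aggregation. A minimal sketch, again assuming the temp table rental from the question; the commented rows are what I would expect for customers 1 and 3 from that test data.

```sql
-- Steps 1 and 2 only: one row per (customer, month) plus the gap to the previous order month
SELECT c_id
     , mon
     , mon_int - lag(mon_int) OVER (PARTITION BY c_id ORDER BY mon_int) AS gap
FROM  (
   SELECT DISTINCT ON (1,2)                           -- collapse multiple orders per month
          date_trunc('month', rental_date)::date AS mon
        , customer_id                            AS c_id
        , extract(YEAR  FROM rental_date)::int * 12
        + extract(MONTH FROM rental_date)::int   AS mon_int
   FROM   rental
   ) dist_customer
ORDER  BY c_id, mon;

--  c_id |    mon     | gap
-- ------+------------+-----
--     1 | 2006-05-01 |        <- first order month, lag() returns NULL
--     1 | 2006-06-01 |   1
--     1 | 2006-07-01 |   1    <- two July orders collapsed into one row
--     3 | 2006-05-01 |
--     3 | 2006-07-01 |   2    <- skipped June, so the gap is 2
```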

Regarding sql - Rolling counts based on rolling cohorts, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38412672/
