Using Postgres 9.5. Test data:
create temp table rental (
customer_id smallint
,rental_date timestamp without time zone
,customer_name text
);
insert into rental values
(1, '2006-05-01', 'james'),
(1, '2006-06-01', 'james'),
(1, '2006-07-01', 'james'),
(1, '2006-07-02', 'james'),
(2, '2006-05-02', 'jacinta'),
(2, '2006-05-03', 'jacinta'),
(3, '2006-05-04', 'juliet'),
(3, '2006-07-01', 'juliet'),
(4, '2006-05-03', 'julia'),
(4, '2006-06-01', 'julia'),
(5, '2006-05-05', 'john'),
(5, '2006-06-01', 'john'),
(5, '2006-07-01', 'john'),
(6, '2006-07-01', 'jacob'),
(7, '2006-07-02', 'jasmine'),
(7, '2006-07-04', 'jasmine');
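I am trying to understand the behaviour of existing customers. The question I want to answer is:
What is the likelihood of a customer ordering again, based on when their last order was (current month, previous month (m-1), ... back to m-12)?
Likelihood is calculated as: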
distinct count of people who ordered in current month /
distinct count of people in their cohort.
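So I need to generate a table that lists, for each cohort, the count of people who ordered in the current month and belong to that cohort.
What are the rules for being in a cohort?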
- current month cohort: >1 order in month OR (1 order in month given no previous orders)
- m-1 cohort: <=1 order in current month and >=1 order in m-1
- m-2 cohort: <=1 order in current month and 0 orders in m-1 and >=1 order in m-2
- etc
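I am using the DVD Store database as sample data to develop the query: http://linux.dell.com/dvdstore/
Here is an example of the cohort rules and aggregations, with July as the "month's orders being analysed" (please note: the "month's orders being analysed" column is the first column in the desired output table below):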
customer_id | jul-16| jun-16| may-16|
------------|-------|-------|-------|
james | 1 1 | 1 | 1 | <- member of jul cohort, made order in jul
jasmine | 1 1 | | | <- member of jul cohort, made order in jul
jacob | 1 | | | <- member of jul cohort, did NOT make order in jul
john | 1 | 1 | 1 | <- member of jun cohort, made order in jul
julia | | 1 | 1 | <- member of jun cohort, did NOT make order in jul
juliet | 1 | | 1 | <- member of may cohort, made order in jul
jacinta | | | 1 1 | <- member of may cohort, did NOT make order in jul
This data would produce the following table:
--where m = month's orders being analysed
month's orders |how many people |how many people from |how many people |how many people from |how many people |how many people from |
being analysed |are in cohort m |cohort m ordered in m |are in cohort m-1 |cohort m-1 ordered in m |are in cohort m-2 |cohort m-2 ordered in m |...m-12
---------------|----------------|----------------------|------------------|------------------------|------------------|------------------------|
may-16 |5 |1 | | | | |
jun-16 | | |5 |3 | | |
jul-16 |3 |2 |2 |1 |2 |1 |
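My attempts so far have been variations on:
generate_series()
and
row_number() over (partition by customer_id order by rental_id desc)
I haven't been able to bring everything together yet (I have tried for many hours without solving it).
For readability, I think posting my work in parts is better (if anyone wants me to post the SQL query in its entirety, please comment and I'll add it).
Series query: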
(select
generate_series(date_trunc('month',min(rental_date)),date_trunc('month',max(rental_date)),'1 month') as month_being_analysed
from
rental) as series
Rank query:
(select
*,
row_number() over (partition by customer_id order by rental_id desc) as rnk
from
rental
where
date_trunc('month',rental_date) <= series.month_being_analysed) as orders_ranked
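What I want to do is something like this: run the orders_ranked query for every row returned by the series query, and then base the aggregations on each result of orders_ranked.
Something like: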
(--this query counts the customers in cohort m-1
select
count(distinct customer_id)
from
(--this query ranks the orders that have occurred <= to the date in the row of the 'series' table
select
*,
row_number() over (partition by customer_id order by rental_id desc) as rnk
from
rental
where
date_trunc('month',rental_date)<=series.month_being_analysed) as orders_ranked
where
(rnk=1 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
OR
(rnk=2 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
) as people_2nd_last_booking_in_m_1,
(--this query counts the customers in cohort m-1 who ordered in month m
select
count(distinct customer_id)
from
(--this query returns the orders by customers in cohort m-1
select
count(distinct customer_id)
from
(--this query ranks the orders that have occurred <= to the date in the row of the 'series' table
select
*,
row_number() over (partition by customer_id order by rental_id desc) as rnk
from
rental
where
date_trunc('month',rental_date)<=series.month_being_analysed) as orders_ranked
where
(rnk=1 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
OR
(rnk=2 between series.month_being_analysed - interval '2 months' and series.month_being_analysed - interval '1 months')
where
rnk=1 in series.month_being_analysed
) as people_who_booked_in_m_whose_2nd_last_booking_was_in_m_1,
...
from
(select
generate_series(date_trunc('month',min(rental_date)),date_trunc('month',max(rental_date)),'1 month') as month_being_analysed
from
rental) as series
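The "run orders_ranked once per row of series" idea could be sketched with a LATERAL join (available since Postgres 9.3). This is only an illustration under assumptions: it inlines a small subset of the test data, it ranks by rental_date because the test table above has no rental_id column, and the aliases s and o are made up for this sketch.

```sql
WITH rental (customer_id, rental_date) AS (
   VALUES
     (1, timestamp '2006-05-01')
   , (1, '2006-06-01')
   , (1, '2006-07-01')
   , (3, '2006-05-04')
   , (3, '2006-07-01')
   )
SELECT s.month_being_analysed, o.customer_id, o.rental_date, o.rnk
FROM  (
   -- one row per month between the first and the last rental
   SELECT generate_series(date_trunc('month', min(rental_date))
                        , date_trunc('month', max(rental_date))
                        , interval '1 month') AS month_being_analysed
   FROM   rental
   ) s
CROSS  JOIN LATERAL (
   -- evaluated once per row of s: rank each customer's orders
   -- up to and including the month being analysed, newest first
   SELECT r.customer_id, r.rental_date
        , row_number() OVER (PARTITION BY r.customer_id
                             ORDER BY r.rental_date DESC) AS rnk
   FROM   rental r
   WHERE  date_trunc('month', r.rental_date) <= s.month_being_analysed
   ) o
ORDER  BY s.month_being_analysed, o.customer_id, o.rnk;
```

The cohort counts would then be aggregated per month_being_analysed on top of this result.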
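Best answer
This query does everything. It operates on the whole table and works for any time range.
It is based on some assumptions, including the current Postgres version 9.5. It should work with pg 9.1 at least. Since your definition of "cohort" is unclear to me, I skipped the "how many people are in cohort" columns.
I would expect it to be faster than anything you have tried so far, by orders of magnitude.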
SELECT *
FROM crosstab (
$$
SELECT mon
, sum(count(*)) OVER (PARTITION BY mon)::int AS m0
, gap -- count of months since last order
, count(*) AS gap_ct
FROM (
SELECT mon
, mon_int - lag(mon_int) OVER (PARTITION BY c_id ORDER BY mon_int) AS gap
FROM (
SELECT DISTINCT ON (1,2)
date_trunc('month', rental_date)::date AS mon
, customer_id AS c_id
, extract(YEAR FROM rental_date)::int * 12
+ extract(MONTH FROM rental_date)::int AS mon_int
FROM rental
) dist_customer
) gap_to_last_month
GROUP BY mon, gap
ORDER BY mon, gap
$$
, 'SELECT generate_series(1,12)'
) ct (mon date, m0 int
, m01 int, m02 int, m03 int, m04 int, m05 int, m06 int
, m07 int, m08 int, m09 int, m10 int, m11 int, m12 int);
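Note that crosstab() is not a built-in function. It is provided by the additional module tablefunc, which has to be installed once per database. A minimal sketch, assuming the module is available on the server and you have the privileges to create extensions:

```sql
-- Install the tablefunc module, which provides crosstab().
-- IF NOT EXISTS makes the statement safe to run repeatedly.
CREATE EXTENSION IF NOT EXISTS tablefunc;
```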
Result:
    mon     | m0 | m01 | m02 | m03 | m04 | m05 | m06 | m07 | m08 | m09 | m10 | m11 | m12
------------+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----+-----
 2015-01-01 | 63 |  36 |  15 |   5 |   3 |   3 |     |     |     |     |     |     |
 2015-02-01 | 56 |  35 |   9 |   9 |   2 |     |   1 |     |     |     |     |     |
...
m0 .. customers with >= 1 order this month
m01 .. customers with >= 1 order this month and >= 1 order 1 month before (nothing in between)
m02 .. customers with >= 1 order this month and >= 1 order 2 months before and no order in between
etc.
How?
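- In subquery dist_customer, reduce to one row per month and customer_id (mon, c_id) with DISTINCT ON. To simplify later calculations, add a count of months for the date (mon_int). If there are many orders per (month, customer), there are faster query techniques for this first step.
- In subquery gap_to_last_month, add the column gap, indicating the time gap between this month and the last month with any orders from the same customer. The window function lag() does this.
- In the outer SELECT, aggregate per (mon, gap) to get the counts you are after. In addition, get the total count of distinct customers for this month as m0.
- Feed this query to crosstab() to pivot the result into the desired tabular form. m0 is passed through as an "extra" column.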
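To see what is fed to crosstab(), the inner query can be run on its own. The sketch below rewrites the subqueries as CTEs and inlines the test data from the question (customer_id and rental_date only) so that it is self-contained. The CTE form is my choice for readability here, not part of the query above.

```sql
WITH rental (customer_id, rental_date) AS (
   VALUES
     (1, timestamp '2006-05-01'), (1, '2006-06-01'), (1, '2006-07-01'), (1, '2006-07-02')
   , (2, '2006-05-02'), (2, '2006-05-03')
   , (3, '2006-05-04'), (3, '2006-07-01')
   , (4, '2006-05-03'), (4, '2006-06-01')
   , (5, '2006-05-05'), (5, '2006-06-01'), (5, '2006-07-01')
   , (6, '2006-07-01')
   , (7, '2006-07-02'), (7, '2006-07-04')
   )
, dist_customer AS (  -- step 1: one row per (month, customer)
   SELECT DISTINCT ON (1, 2)
          date_trunc('month', rental_date)::date AS mon
        , customer_id AS c_id
        , extract(year  FROM rental_date)::int * 12
        + extract(month FROM rental_date)::int AS mon_int
   FROM   rental
   )
, gap_to_last_month AS (  -- step 2: months since the customer's previous order month
   SELECT mon
        , mon_int - lag(mon_int) OVER (PARTITION BY c_id ORDER BY mon_int) AS gap
   FROM   dist_customer
   )
SELECT mon  -- step 3: counts per (mon, gap) plus the monthly total m0
     , sum(count(*)) OVER (PARTITION BY mon)::int AS m0
     , gap
     , count(*) AS gap_ct
FROM   gap_to_last_month
GROUP  BY mon, gap
ORDER  BY mon, gap;

-- Rows returned for the test data (gap is NULL for a customer's first order month):
--     mon     | m0 | gap | gap_ct
--  2006-05-01 |  5 |     |      5
--  2006-06-01 |  3 |   1 |      3
--  2006-07-01 |  5 |   1 |      2
--  2006-07-01 |  5 |   2 |      1
--  2006-07-01 |  5 |     |      2
```

In the full query, crosstab() uses gap as the category and only pivots the values 1 to 12, so the rows with a NULL gap contribute to m0 but not to any of the columns m01 to m12.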
About "sql - rolling counts based on rolling cohorts": a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38412672/