I have the following structure:
table user
user_id | month_year | fruit
------------------------------
1 | 2021-01 | apple
1 | 2021-01 | melon
1 | 2021-01 | orange
1 | 2021-02 | grape
1 | 2021-02 | orange
1 | 2021-02 | kiwi
1 | 2021-03 | grape
1 | 2021-03 | pear
1 | 2021-03 | banana
1 | 2021-04 | orange
1 | 2021-04 | kiwi
1 | 2021-04 | banana
1 | 2021-05 | grape
1 | 2021-05 | pear
1 | 2021-05 | kiwi
I want the following result:
user | month_year | fruits | two_months_most_freq
-------------------------------------------------------------------------
1 | 2021-01 | apple, melon, orange | orange
1 | 2021-02 | grape, orange, kiwi | orange
1 | 2021-03 | grape, pear, banana | grape
1 | 2021-04 | orange, kiwi, banana | banana
1 | 2021-05 | grape, pear, kiwi | kiwi
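Clarification: in the last column I want the fruit that appears most often over the last 2 months, in other words the fruit repeated most often across the current month and the previous month. Note that the first row should return orange: when the preceding window frame is not available, the following window frame should be used instead.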
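As a concrete check of this rule, the 2021-03 row looks at the fruits of 2021-02 and 2021-03 together. A minimal BigQuery sketch, assuming the six sample values are simply typed into an inline array rather than read from the table:

```sql
-- Fruits from 2021-02 and 2021-03 in the sample data, listed by hand.
select fruit, count(*) as occurrences
from unnest(['grape', 'orange', 'kiwi', 'grape', 'pear', 'banana']) as fruit
group by fruit
order by occurrences desc
limit 1
-- grape is the only value that appears twice, so it is the row returned
```

If two fruits were tied for the highest count, `limit 1` would pick one of them arbitrarily; the sample data has no such tie.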
With the code below, I get the most frequent fruit across the whole dataset.
select * from (
  select user_id, month_year,
    string_agg(distinct fruit) as fruits
  from user
  group by user_id, month_year
) join (
  select user_id, fruit
  from user
  group by user_id, fruit
  qualify 1 = row_number() over(partition by user_id order by count(*) desc)
)
using (user_id)
How can I apply this logic over a specific time window?
Best answer
Consider the approach below:
select user_id, month_year, fruits,
  -- no previous month: fall back to the value computed for the next month
  if(prev_month_exists, two_months_most_freq, first_value(two_months_most_freq) over next_month) as two_months_most_freq
from (
  select user_id, month_year, fruits,
    -- most frequent fruit among those collected for the two-month frame
    ( select fruit from unnest(split(two_month_fruits)) fruit
      group by fruit order by count(*) desc limit 1
    ) as two_months_most_freq,
    month, prev_month_exists
  from (
    select distinct user_id, month_year, month,
      string_agg(fruit) over(partition by user_id, month_year) fruits,
      string_agg(fruit) over last_two_months as two_month_fruits,
      0 < count(*) over prev_month as prev_month_exists
    -- month is a running month index (12 * year + month), so consecutive months differ by exactly 1
    from users, unnest([struct(
      12 * extract(year from date(month_year || '-01')) + extract(month from date(month_year || '-01')) as month
    )])
    window
      last_two_months as (partition by user_id order by month range between 1 preceding and current row),
      prev_month as (partition by user_id order by month range between 1 preceding and 1 preceding)
  )
)
window next_month as (partition by user_id order by month range between 1 following and 1 following)
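Applied to the sample data in the question, the two_months_most_freq column comes out as in the desired result shown there: orange, orange, grape, banana, kiwi.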
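To see which rows the two range frames in the query above actually cover, the sketch below rebuilds the sample data inline and counts the rows in each frame per month. It is only an illustration under a few assumptions: the `users` CTE stands in for the real table, the `indexed` CTE and the `two_month_rows` / `prev_month_rows` columns are names made up for this example, and the month index is computed with `parse_date` instead of the `date(...)` call used in the answer.

```sql
with users as (
  -- Sample data from the question, rebuilt inline (one struct per month).
  select 1 as user_id, m.month_year, fruit
  from unnest([
    struct('2021-01' as month_year, ['apple', 'melon', 'orange'] as fruits),
    struct('2021-02', ['grape', 'orange', 'kiwi']),
    struct('2021-03', ['grape', 'pear', 'banana']),
    struct('2021-04', ['orange', 'kiwi', 'banana']),
    struct('2021-05', ['grape', 'pear', 'kiwi'])
  ]) as m, unnest(m.fruits) as fruit
),
indexed as (
  -- Running month index: 12 * year + month, so adjacent months differ by 1.
  select user_id, month_year, fruit,
    12 * extract(year from parse_date('%Y-%m', month_year))
      + extract(month from parse_date('%Y-%m', month_year)) as month
  from users
)
select distinct user_id, month_year,
  count(*) over last_two_months as two_month_rows,  -- rows in previous + current month
  count(*) over prev_month as prev_month_rows       -- rows in previous month only
from indexed
window
  last_two_months as (partition by user_id order by month range between 1 preceding and current row),
  prev_month as (partition by user_id order by month range between 1 preceding and 1 preceding)
order by month_year
```

For 2021-01 this gives 3 and 0, since there is no previous month; every later month gives 6 and 3. The zero in `prev_month_rows` is the condition the answer tests with `prev_month_exists` before falling back to the next month's value.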
For more on SQL and finding the most frequently repeated value within a specific time window, see the similar question on Stack Overflow: https://stackoverflow.com/questions/73616022/