sql - Most repeated value within a specific time window

Tags: sql google-bigquery

I have the following structure:

table user
user_id | month_year | fruit
------------------------------
1       | 2021-01    | apple
1       | 2021-01    | melon
1       | 2021-01    | orange
1       | 2021-02    | grape
1       | 2021-02    | orange
1       | 2021-02    | kiwi
1       | 2021-03    | grape
1       | 2021-03    | pear
1       | 2021-03    | banana
1       | 2021-04    | orange
1       | 2021-04    | kiwi
1       | 2021-04    | banana
1       | 2021-05    | grape
1       | 2021-05    | pear
1       | 2021-05    | kiwi

I want the following result:

user     | month_year |            fruits            |  two_months_most_freq
-------------------------------------------------------------------------
1        | 2021-01    | apple, melon, orange         | orange
1        | 2021-02    | grape, orange, kiwi          | orange
1        | 2021-03    | grape, pear, banana          | grape
1        | 2021-04    | orange, kiwi, banana         | banana
1        | 2021-05    | grape, pear, kiwi            | kiwi

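Clarification: in the last column I want the fruit that appears most often over the last 2 months, in other words the most repeated fruit across the current row and the previous row. Note that the first row should return orange: when there is no previous month to look back at, the window should look ahead to the following month instead.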
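
To make the expected column concrete, here is a minimal Python sketch of that rule, run outside BigQuery on the sample rows above. It is only an illustration under a few assumptions: the helper name `two_months_most_freq` is made up for this example, every user has consecutive months, and ties (which the sample data does not contain) are resolved by whichever fruit was seen first.

```python
from collections import Counter

# Sample rows from the question: (user_id, month_year, fruit)
rows = [
    (1, "2021-01", "apple"), (1, "2021-01", "melon"), (1, "2021-01", "orange"),
    (1, "2021-02", "grape"), (1, "2021-02", "orange"), (1, "2021-02", "kiwi"),
    (1, "2021-03", "grape"), (1, "2021-03", "pear"), (1, "2021-03", "banana"),
    (1, "2021-04", "orange"), (1, "2021-04", "kiwi"), (1, "2021-04", "banana"),
    (1, "2021-05", "grape"), (1, "2021-05", "pear"), (1, "2021-05", "kiwi"),
]


def two_months_most_freq(rows):
    """Hypothetical helper: most frequent fruit over the current and previous month."""
    # Collect the fruits of every (user, month) pair.
    months = {}
    for user_id, month_year, fruit in rows:
        months.setdefault(user_id, {}).setdefault(month_year, []).append(fruit)

    result = {}
    for user_id, per_month in months.items():
        ordered = sorted(per_month)  # "YYYY-MM" strings sort chronologically
        for i, month_year in enumerate(ordered):
            if i > 0:
                window = ordered[i - 1 : i + 1]  # previous month + current month
            else:
                window = ordered[:2]  # no previous month: current + following month
            counts = Counter(f for m in window for f in per_month[m])
            result[(user_id, month_year)] = counts.most_common(1)[0][0]
    return result


for key, fruit in two_months_most_freq(rows).items():
    print(key, fruit)
```

On the sample data this yields orange, orange, grape, banana, kiwi for 2021-01 through 2021-05, which matches the two_months_most_freq column in the desired result.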

With the code below, I get the fruit that appears most often for each user across the entire dataset.

select * from (
  select user_id, month_year,
    string_agg(distinct fruit) as fruits
  from user
  group by user_id, month_year
) join (
  -- most frequent fruit per user across all months
  select user_id, fruit
  from user
  group by user_id, fruit
  qualify 1 = row_number() over(partition by user_id order by count(*) desc)
)
using (user_id)

How can I apply this logic to a specific time window?

Best answer

Consider the query below:

select user_id, month_year, fruits,
  -- a month with no preceding month (e.g. the first one) takes the result of the following month
  if(prev_month_exists, two_months_most_freq, first_value(two_months_most_freq) over next_month) as two_months_most_freq
from (
  select user_id, month_year, fruits,
    -- most frequent fruit in the two-month window (ties are broken arbitrarily)
    ( select fruit from unnest(split(two_month_fruits)) fruit
      group by fruit order by count(*) desc limit 1
    ) as two_months_most_freq,
    month, prev_month_exists
  from (
    select distinct user_id, month_year, month,
      string_agg(fruit) over(partition by user_id, month_year) fruits,
      string_agg(fruit) over last_two_months as two_month_fruits,
      0 < count(*) over prev_month as prev_month_exists
    -- month is a running month number (12 * year + month), so consecutive months differ by 1
    from user, unnest([struct(
      12 * extract(year from date(month_year || '-01')) + extract(month from date(month_year || '-01')) as month
    )])
    window
      last_two_months as (partition by user_id order by month range between 1 preceding and current row),
      prev_month as (partition by user_id order by month range between 1 preceding and 1 preceding)
  )
)
window next_month as (partition by user_id order by month range between 1 following and 1 following)

If applied to the sample data in the question, the output is:

[image: query output, not preserved]
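
One detail of the query that is easy to miss: its windows are ordered by a numeric month (12 * year + month) rather than by the month_year string, because a RANGE frame with a numeric offset needs a numeric ordering key. The short Python sketch below mirrors that arithmetic to show what it buys; `month_index` is a hypothetical name used only for this illustration.

```python
from datetime import date


def month_index(month_year: str) -> int:
    """Hypothetical helper mirroring the query's 12 * year + month expression."""
    d = date.fromisoformat(month_year + "-01")
    return 12 * d.year + d.month


# Consecutive months differ by exactly 1, even across a year boundary.
print(month_index("2022-01") - month_index("2021-12"))

# A skipped month shows up as a difference of 2, so a frame of
# "1 preceding" would not treat 2021-03 as the month before 2021-05.
print(month_index("2021-05") - month_index("2021-03"))
```

A row-based frame (rows between 1 preceding and current row) would instead pair each month with whatever row comes before it, even if a month is missing for that user.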

Regarding sql - Most repeated value within a specific time window, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/73616022/
