sql - 滚动日期范围内的 BigQuery 非重复计数,列上有分区

标签 sql database count google-bigquery window-functions

我有一个包含电子邮件、日期 (TIMESTAMP)、ID 和多个其他列的表格。

对于每个电子邮件条目,我想计算在过去 3 天内与该电子邮件关联的唯一 ID 的数量。

+-------------------+-------------------------+------------+----+
|       email       |           day           | other cols | id |
+-------------------+-------------------------+------------+----+
| user1@gmail.com   | 2020-06-21 16:31:00 UTC |        ... |  0 |
| user1@gmail.com   | 2020-06-22 14:54:00 UTC |        ... |  1 |
| user1@gmail.com   | 2020-06-23 08:23:00 UTC |        ... |  0 |
| user1@gmail.com   | 2020-06-24 13:51:00 UTC |        ... |  0 |
| user1@gmail.com   | 2020-06-25 09:54:00 UTC |        ... |  2 |
| user1@gmail.com   | 2020-06-25 12:25:00 UTC |        ... |  0 |
| user1@gmail.com   | 2020-06-26 15:21:00 UTC |        ... |  2 |
| user2@hotmail.com | 2020-06-21 12:23:00 UTC |        ... |  0 |
| user2@hotmail.com | 2020-06-21 16:54:00 UTC |        ... |  0 |
| user2@hotmail.com | 2020-06-22 08:23:00 UTC |        ... |  0 |
| user2@hotmail.com | 2020-06-22 12:13:00 UTC |        ... |  1 |
| user2@hotmail.com | 2020-06-24 09:32:00 UTC |        ... |  1 |
| user2@hotmail.com | 2020-06-25 05:45:00 UTC |        ... |  1 |
| user2@hotmail.com | 2020-06-26 12:32:00 UTC |        ... |  2 |
| user2@hotmail.com | 2020-06-27 19:53:00 UTC |        ... |  1 |
+-------------------+-------------------------+------------+----+

附加列应如下所示:

+-------------------+-------------------------+------------+----+-----------------------------+
|       email       |           day           | other cols | id | distinct ids in last 3 days |
+-------------------+-------------------------+------------+----+-----------------------------+
| user1@gmail.com   | 2020-06-21 16:31:00 UTC |        ... |  0 |                           1 |
| user1@gmail.com   | 2020-06-22 14:54:00 UTC |        ... |  1 |                           2 |
| user1@gmail.com   | 2020-06-23 08:23:00 UTC |        ... |  0 |                           2 |
| user1@gmail.com   | 2020-06-24 13:51:00 UTC |        ... |  0 |                           2 |
| user1@gmail.com   | 2020-06-25 09:54:00 UTC |        ... |  2 |                           3 |<- 3, because ids 0, 1 and 2 have been seen in previous 3 days
| user1@gmail.com   | 2020-06-25 12:25:00 UTC |        ... |  0 |                           3 |
| user1@gmail.com   | 2020-06-26 15:21:00 UTC |        ... |  2 |                           2 |
| user2@hotmail.com | 2020-06-21 12:23:00 UTC |        ... |  0 |                           1 |
| user2@hotmail.com | 2020-06-21 16:54:00 UTC |        ... |  0 |                           1 |
| user2@hotmail.com | 2020-06-22 08:23:00 UTC |        ... |  0 |                           1 |
| user2@hotmail.com | 2020-06-22 12:13:00 UTC |        ... |  1 |                           2 |
| user2@hotmail.com | 2020-06-24 09:32:00 UTC |        ... |  1 |                           2 |
| user2@hotmail.com | 2020-06-25 05:45:00 UTC |        ... |  1 |                           2 |
| user2@hotmail.com | 2020-06-26 12:32:00 UTC |        ... |  1 |                           1 |
| user2@hotmail.com | 2020-06-27 19:53:00 UTC |        ... |  1 |                           1 |
+-------------------+-------------------------+------------+----+-----------------------------+

在过去的 3 天里,我尝试使用窗口函数按电子邮件进行分区并计算不同的 ID。

COUNT(DISTINCT id) OVER (PARTITION BY email ORDER BY UNIX_DATE(PARSE_DATE('%Y-%m-%d', day))*24*3600 RANGE BETWEEN 3*24*3600 PRECEDING AND CURRENT ROW)

但是这是不允许的:

Window ORDER BY is not allowed if DISTINCT is specified

栈溢出有解决办法,比如this .但是,我不确定在计算唯一 ID 之前是否需要按电子邮件进行分区。

如果有任何关于此的指示,我将不胜感激。如果它更容易,我也会接受使用 DATE 而不是 TIMESTAMP 的解决方案。

最佳答案

以下是 BigQuery 标准 SQL

#standardSQL
SELECT * EXCEPT(ids),
  (SELECT COUNT(DISTINCT id) FROM t.ids AS id) distinct_ids
FROM (
  SELECT *, ARRAY_AGG(id) OVER(preceding_days) ids
  FROM `project.dataset.table`
  WINDOW preceding_days AS (
    PARTITION BY email 
    ORDER BY UNIX_DATE(DATE(day)) 
    ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
  )
) t   

您可以使用问题中的样本数据进行测试,如以下示例所示

#standardSQL
WITH `project.dataset.table` AS (
  SELECT 'user1@gmail.com' email, TIMESTAMP'2020-06-21 16:31:00 UTC' day, '...' other_cols, 0 id UNION ALL
  SELECT 'user1@gmail.com', '2020-06-22 14:54:00 UTC', '...', 1 UNION ALL
  SELECT 'user1@gmail.com', '2020-06-23 08:23:00 UTC', '...', 0 UNION ALL
  SELECT 'user1@gmail.com', '2020-06-24 13:51:00 UTC', '...', 0 UNION ALL
  SELECT 'user1@gmail.com', '2020-06-25 09:54:00 UTC', '...', 2 UNION ALL
  SELECT 'user1@gmail.com', '2020-06-25 12:25:00 UTC', '...', 0 UNION ALL
  SELECT 'user1@gmail.com', '2020-06-26 15:21:00 UTC', '...', 2 UNION ALL
  SELECT 'user2@hotmail.com', '2020-06-21 12:23:00 UTC', '...', 0 UNION ALL
  SELECT 'user2@hotmail.com', '2020-06-21 16:54:00 UTC', '...', 0 UNION ALL
  SELECT 'user2@hotmail.com', '2020-06-22 08:23:00 UTC', '...', 0 UNION ALL
  SELECT 'user2@hotmail.com', '2020-06-22 12:13:00 UTC', '...', 1 UNION ALL
  SELECT 'user2@hotmail.com', '2020-06-24 09:32:00 UTC', '...', 1 UNION ALL
  SELECT 'user2@hotmail.com', '2020-06-25 05:45:00 UTC', '...', 1 UNION ALL
  SELECT 'user2@hotmail.com', '2020-06-26 12:32:00 UTC', '...', 2 UNION ALL
  SELECT 'user2@hotmail.com', '2020-06-27 19:53:00 UTC', '...', 1 
)
SELECT * EXCEPT(ids),
  (SELECT COUNT(DISTINCT id) FROM t.ids AS id) distinct_ids
FROM (
  SELECT *, ARRAY_AGG(id) OVER(preceding_days) ids
  FROM `project.dataset.table`
  WINDOW preceding_days AS (
    PARTITION BY email 
    ORDER BY UNIX_DATE(DATE(day)) 
    ROWS BETWEEN 3 PRECEDING AND CURRENT ROW
  )
) t

关于sql - 滚动日期范围内的 BigQuery 非重复计数,列上有分区,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/63158386/

相关文章:

mysql 值大于值数量的百分比

mysql - 从 MySQL 中的时间戳获取最近 'X' 天的记录

mysql - 为每个客户 X 编写一个查询,另一位客户 Y 至少租借了一部与 X 共同的电影

sql - 别名链接服务器?

sql - 可扩展的 CRUD 应用程序。储存技术

mysql - 查找MySQL中两个表的行数差异

sql - 如何在忽略两列在 Postgres 中的顺序的同时限制它们的唯一性

ruby-on-rails - 如何使用 PSequel GUI 在 Mac 上连接到本地主机 PostgreSQL?

sql - 为什么我的3个INNER JOIN在同一张 table 上具有不同的组和计数不能与sqlite一起使用?

python - 在python 3中查找字符串中某个单词的出现