sql - 带分组+时间戳列的最新数据的物化 View

标签 sql google-bigquery

我在 Bigquery 中对特征(或属性)进行建模。这是模型的示例

uid         string, link to owner id
uuid        string, unique among all rows
trait_name  string, name of the trait
trait_value string
added_at    timestamp, when the trait was added

我正在尝试构建一个物化 View ,该 View 包含每个 uid 的每个特征的最新特征。我能够通过此查询获得结果:

WITH traits AS (
  SELECT 'u1' uid, 'uu1' uuid, 't1' trait_name, 't1v1' trait_value, timestamp("2021-10-01 10:00:00") as added_at UNION ALL
  SELECT 'u1' uid, 'uu2' uuid, 't1' trait_name, 't1v2' trait_value, timestamp("2021-10-02 10:00:00") as added_at UNION ALL
  SELECT 'u1' uid, 'uu3' uuid, 't2' trait_name, 't2v1' trait_value, timestamp("2021-10-03 10:00:00") as added_at UNION ALL
  SELECT 'u1' uid, 'uu4' uuid, 't2' trait_name, 't2v2' trait_value, timestamp("2021-10-04 10:00:00") as added_at UNION ALL
  SELECT 'u2' uid, 'uu5' uuid, 't1' trait_name, 't1v1' trait_value, timestamp("2021-10-05 10:00:00") as added_at UNION ALL
  SELECT 'u2' uid, 'uu6' uuid, 't1' trait_name, 't1v2' trait_value, timestamp("2021-10-06 10:00:00") as added_at UNION ALL
  SELECT 'u2' uid, 'uu7' uuid, 't2' trait_name, 't2v1' trait_value, timestamp("2021-10-07 10:00:00") as added_at UNION ALL
  SELECT 'u2' uid, 'uu8' uuid, 't2' trait_name, 't2v2' trait_value, timestamp("2021-10-08 10:00:00") as added_at
)
SELECT * FROM (
  SELECT *,
  MAX(added_at) OVER (PARTITION BY uid, trait_name) as latest_added_at FROM traits
) WHERE latest_added_at = added_at

#  Row  uid uuid    trait_name  trait_value added_at                 latest_added_at    
#  1    u1  uu2     t1          t1v2        2021-10-02 10:00:00 UTC  2021-10-02 10:00:00 UTC
#  2    u1  uu4     t2          t2v2        2021-10-04 10:00:00 UTC  2021-10-04 10:00:00 UTC
#  3    u2  uu6     t1          t1v2        2021-10-06 10:00:00 UTC  2021-10-06 10:00:00 UTC
#  4    u2  uu8     t2          t2v2        2021-10-08 10:00:00 UTC  2021-10-08 10:00:00 UTC

但是我不能将它用于物化 View ,因为它们不支持它:

Materialized views do not support analytic functions or WITH OFFSET.

我也尝试过使用连接

SELECT * FROM traits t
JOIN (
    SELECT
    uid,
    trait_name,
    MAX(added_at) AS max_added_at 
    FROM traits GROUP BY uid, trait_name
) grouped
ON t.uid = grouped.uid 
AND t.trait_name = grouped.trait_name
AND t.added_at = grouped.max_added_at

但是他们也不支持

Materialized views queries may not reference the same table more than once. Table traits was seen multiple times.

有没有办法把它做成物化 View ?

最佳答案

物化 View 有一些限制,可能无法实现(有关更多信息,请参阅 this 公共(public)文档)。

因此目前看来您正在尝试实现的目标并未得到完全支持,使用 ARRAY_AGG 似乎是最接近的可能,缺点之一是结果是数组类型,而不是值本身。

我设法通过此查询检索到预期结果:

CREATE MATERIALIZED VIEW test_traits.mv_sample_traits as
SELECT uid,
       ARRAY_AGG(uuid IGNORE NULLS ORDER BY added_at DESC, uuid DESC, trait_value DESC LIMIT 1) as uuid,
       trait_name,
       ARRAY_AGG(trait_value IGNORE NULLS ORDER BY added_at DESC, uuid DESC, trait_value DESC LIMIT 1) as trait_value,
       ARRAY_AGG(added_at IGNORE NULLS ORDER BY added_at DESC, uuid DESC, trait_value DESC LIMIT 1) as added_at,
       MAX(added_at) AS latest_added_at,
FROM test_traits.traits
GROUP BY uid,trait_name


/*
SELECT * FROM test_traits.mv_sample_traits;
*/

sample

请注意,代码仅供引用,您需要运行测试来验证它是否适合您的用例,不能保证这会按预期工作,因此我不推荐将其用于生产。

关于sql - 带分组+时间戳列的最新数据的物化 View ,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/69857647/

相关文章:

mysql - 如何从单个表中找到共同值(value)?

.net - ADO.Net 最佳实践 - 进行异步数据库调用时的单连接与多连接

php - 您知道任何用于导出/导入 MySQL 数据库转储的 PHP 脚本或库吗?

sql - 如何在 Data Studio BigQuery 社区连接器的 SQL 查询中包含日期范围

mysql - mysql查询中独立列的分组

sql - 动态生成列的总和

google-bigquery - 如何将多个文件(相同架构)加载到 BigQuery 中的表中?

google-bigquery - 带循环的 BigQuery UDF

google-cloud-platform - 连接来自不同区域的 BigQuery 表失败

google-bigquery - 在bigquery中,有没有办法通过浏览器界面将数据字段设置为可为空