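google-bigquery - Is there a way to use an ORDER BY clause in a COUNT aggregate analytic function? If not, what is a suitable alternative?

Tags: google-bigquery

I have an orders table that looks like this: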

WITH my_table_of_orders AS (
  SELECT
    1 AS order_id,
    DATE(2019, 5, 12) AS date,
    5 AS customer_id,
    TRUE AS is_from_particular_store

  UNION ALL SELECT
    2 AS order_id,
    DATE(2019, 5, 11) AS date,
    5 AS customer_id,
    TRUE AS is_from_particular_store

  UNION ALL SELECT
    3 AS order_id,
    DATE(2019, 5, 11) AS date,
    4 AS customer_id,
    FALSE AS is_from_particular_store
)

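My actual table contains about 59 million rows.

What I want is one row per order date, with a second column giving the share of customers who ordered in the past year (relative to the current row's date) who ordered from a particular store (the made-up is_from_particular_store column is there for that purpose).

Ideally I would use the following query and not run into resource problems. The catch is that an analytic function cannot combine DISTINCT with ORDER BY, so I get the error "Window ORDER BY is not allowed if DISTINCT is specified":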

SELECT
  date,
  last_year_customer_id_that_ordered_from_a_particular_store / last_year_customer_id_that_ordered AS number_i_want
FROM (
  SELECT
    date,
    ROW_NUMBER() OVER (
      PARTITION BY
        date
    ) AS row_num,
    COUNT(DISTINCT customer_id) OVER(
      ORDER BY
        UNIX_SECONDS(TIMESTAMP(date))
      -- 31,536,000 = 365 days in seconds
      RANGE BETWEEN 31536000 PRECEDING AND CURRENT ROW
    ) AS last_year_customer_id_that_ordered,
    COUNT(DISTINCT IF(is_from_particular_store, customer_id, NULL)) OVER(
      ORDER BY
        UNIX_SECONDS(TIMESTAMP(date))
      -- 31,536,000 = 365 days in seconds
      RANGE BETWEEN 31536000 PRECEDING AND CURRENT ROW
    ) AS last_year_customer_id_that_ordered_from_a_particular_store,
  FROM my_table_of_orders
)
WHERE
  -- only return one row per date
  row_num = 1

Then I tried using ARRAY_AGG and UNNEST instead:

SELECT
  date,
  SAFE_DIVIDE((SELECT COUNT(DISTINCT customer_id)
    FROM UNNEST(last_year_customer_id_that_ordered_from_a_particular_store) AS customer_id
  ), (SELECT COUNT(DISTINCT customer_id)
    FROM UNNEST(last_year_customer_id_that_ordered) AS customer_id
  )) AS number_i_want_to_calculate
FROM (
  SELECT
    date,
    ROW_NUMBER() OVER (
      PARTITION BY
        date
    ) AS row_num,
    ARRAY_AGG(customer_id) OVER(
      ORDER BY
        UNIX_SECONDS(TIMESTAMP(date))
      -- 31,536,000 = 365 days in seconds
      RANGE BETWEEN 31536000 PRECEDING AND CURRENT ROW
    ) AS last_year_customer_id_that_ordered,
    ARRAY_AGG(IF(is_from_particular_store, customer_id, NULL)) OVER(
      ORDER BY
        UNIX_SECONDS(TIMESTAMP(date))
      -- 31,536,000 = 365 days in seconds
      RANGE BETWEEN 31536000 PRECEDING AND CURRENT ROW
    ) AS last_year_customer_id_that_ordered_from_a_particular_store,
  FROM my_table_of_orders
)
WHERE
  -- only return one row per date
  row_num = 1

The only problem is that I run into the following resource error:

Resources exceeded during query execution: The query could not be executed in the allotted memory.

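This answer to a very similar question, https://stackoverflow.com/a/42567839/3902555, suggests using ARRAY_AGG + UNNEST, but as I said, that gives me resource problems :(

Does anyone know a more resource-efficient way to compute the statistic I'm after?

Best answer

Another, completely refactored version (BigQuery Standard SQL):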

#standardSQL
WITH temp AS (
  SELECT DISTINCT DATE, customer_id, is_from_particular_store
  FROM my_table_of_orders
)
SELECT a.date, 
  SAFE_DIVIDE(
    COUNT(DISTINCT IF(b.is_from_particular_store, b.customer_id, NULL)),
    COUNT(DISTINCT b.customer_id)
  ) AS number_i_want_to_calculate
FROM temp a
CROSS JOIN temp b
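-- Keep orders from the 365 days up to and including a.date.
-- DATE_DIFF(..., YEAR) counts calendar-year boundaries and would also admit later dates.
WHERE b.date BETWEEN DATE_SUB(a.date, INTERVAL 365 DAY) AND a.date
GROUP BY a.date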
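
To try the self-join approach without touching the real table, here is a minimal, self-contained sketch that inlines the three sample rows from the question. It assumes the intended window is the 365 days up to and including each date, mirroring the RANGE frame in the question; the ORDER BY at the end is only there to make the output easier to read.

```sql
#standardSQL
-- Sample rows copied from the question; the real table has far more rows.
WITH my_table_of_orders AS (
  SELECT 1 AS order_id, DATE(2019, 5, 12) AS date, 5 AS customer_id, TRUE AS is_from_particular_store
  UNION ALL SELECT 2, DATE(2019, 5, 11), 5, TRUE
  UNION ALL SELECT 3, DATE(2019, 5, 11), 4, FALSE
),
temp AS (
  -- One row per (date, customer, store flag) keeps the self-join as small as possible.
  SELECT DISTINCT date, customer_id, is_from_particular_store
  FROM my_table_of_orders
)
SELECT
  a.date,
  SAFE_DIVIDE(
    COUNT(DISTINCT IF(b.is_from_particular_store, b.customer_id, NULL)),
    COUNT(DISTINCT b.customer_id)
  ) AS number_i_want_to_calculate
FROM temp a
CROSS JOIN temp b
-- Assumption: the window is the 365 days up to and including a.date.
WHERE b.date BETWEEN DATE_SUB(a.date, INTERVAL 365 DAY) AND a.date
GROUP BY a.date
ORDER BY a.date
```

On these three rows both dates should come out as 0.5: customers 4 and 5 ordered within the window, and only customer 5 ordered from the particular store.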

An alternative to the above is to use approximate aggregation, as in the example below:

#standardSQL
WITH temp AS (
  SELECT DISTINCT DATE, customer_id, is_from_particular_store
  FROM my_table_of_orders
)
SELECT a.date, 
  SAFE_DIVIDE(
    APPROX_COUNT_DISTINCT(IF(b.is_from_particular_store, b.customer_id, NULL)),
    APPROX_COUNT_DISTINCT(b.customer_id)
  ) AS number_i_want_to_calculate
FROM temp a
CROSS JOIN temp b
-- Keep orders from the 365 days up to and including a.date.
WHERE b.date BETWEEN DATE_SUB(a.date, INTERVAL 365 DAY) AND a.date
GROUP BY a.date
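
If the self-join over customer rows is still too heavy, one further option is to reduce each date to a HyperLogLog++ sketch first and join only the per-date rows. The sketch below is not part of the original answer and has not been run against the real table; it assumes approximate counts are acceptable, and the names daily, all_sketch and store_sketch are made up for illustration.

```sql
#standardSQL
WITH daily AS (
  -- One row per date holding two HLL++ sketches instead of raw customer ids.
  SELECT
    date,
    HLL_COUNT.INIT(customer_id) AS all_sketch,
    HLL_COUNT.INIT(IF(is_from_particular_store, customer_id, NULL)) AS store_sketch
  FROM my_table_of_orders
  GROUP BY date
)
SELECT
  a.date,
  SAFE_DIVIDE(
    -- MERGE unions the sketches in the window and returns an approximate distinct count.
    HLL_COUNT.MERGE(b.store_sketch),
    HLL_COUNT.MERGE(b.all_sketch)
  ) AS number_i_want_to_calculate
FROM daily a
CROSS JOIN daily b
-- Assumption: the window is the 365 days up to and including a.date.
WHERE b.date BETWEEN DATE_SUB(a.date, INTERVAL 365 DAY) AND a.date
GROUP BY a.date
```

The join here touches at most one row per date pair inside the window, not one row per customer, which is where the savings would come from; the price is the estimation error inherent in HLL++ sketches.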

The original question is on Stack Overflow: https://stackoverflow.com/questions/62582377/
