google-bigquery - Logic checks inside COUNT(DISTINCT foo) and performance issues

Tags: google-bigquery

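I need to run a routine and very expensive query, and unfortunately I have to join its result with the result of a nearly identical query to get a ratio. The combined query takes over 3 minutes to run. This is what I would like to do (assuming that avoiding the JOIN would speed up the query):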

SELECT
    date,
    meal,
    country,
    COUNT(DISTINCT person, WHERE UPPER(ingredient) CONTAINS "SUN BUTTER", 10000000) as total_sunbutter_meals_per_day,
    COUNT(DISTINCT person, 10000000) as total_meals,
    ROUND(100*total_sunbutter_meals_per_day/total_meals,1) as percentage_meals_sunbutter
FROM [project:dataset.menu]
GROUP BY date, meal, country

This is what I am forced to do instead:

SELECT
    total.date as date,
    total.meal as meal,
    total.country as country,
    total_sunbutter_meals_per_day,
    total_meals_per_day,
    ROUND(100*total_sunbutter_meals_per_day/total_meals_per_day,1) as percentage_meals_sunbutter
FROM
    (    
    SELECT
        date,
        meal,
        country,
        COUNT(DISTINCT person, 100000) as total_sunbutter_meals_per_day
    FROM [project:dataset.menu]
    WHERE    
        UPPER(ingredient) CONTAINS "SUN BUTTER"
    GROUP BY date, meal, country
    ) as sunbutter
JOIN
    (
    SELECT
        date,
        meal,
        country,
        COUNT(DISTINCT person, 100000) as total_meals_per_day
    FROM [project:dataset.menu]
    GROUP BY date, meal, country
    ) as total
ON total.date = sunbutter.date AND total.meal = sunbutter.meal AND total.country = sunbutter.country

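Three questions:

  1. It seems there should be a way for BigQuery to perform a COUNT(DISTINCT field) with some embedded conditional logic. Is there a way to avoid the join in a case like the one above?
  2. COUNT DISTINCT with a value greater than 100,000 fails for me. I would like to be able to count 10,000,000 distinct values. Is there a known performance issue with COUNT DISTINCT and large values? Is it being worked on?
  3. Are there plans to allow a field name declared/computed in a SELECT to be used in another expression in that same SELECT? In the example above, I would like to use the names of the results rather than repeat the formulas inside the ROUND expression. (That is, I would like to write

    total_sunbutter_meals_per_day/total_meals rather than

    COUNT(DISTINCT person, WHERE UPPER(ingredient) CONTAINS "SUN BUTTER", 100000)/COUNT(DISTINCT person, 10000000))

Thanks in advance for your help!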

Best answer

Question 1:

You can create an inner query that exposes two separate fields to count, like this:

SELECT
  date,
  meal,
  country,
  COUNT(DISTINCT person) total_meals,
  COUNT(DISTINCT sunbutter_person) total_sunbutter_meals
FROM
  (SELECT
     date,
     meal,
     country,
     person,
     IF(UPPER(ingredient) CONTAINS "SUN BUTTER", person, NULL) sunbutter_person
   FROM [project:dataset.menu])
GROUP BY date, meal, country
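
To see why the NULL trick works without a JOIN, here is a minimal sketch that runs the same shape of query locally. It is an illustration under stated assumptions, not BigQuery code: it uses Python's built-in sqlite3 module as a stand-in engine, the sample rows are made up, and CASE/LIKE replace the legacy-SQL IF/CONTAINS. The point it relies on is that COUNT(DISTINCT ...) ignores NULLs, so only people with a matching ingredient are counted in the second column.

```python
import sqlite3

# Hypothetical sample rows: (date, meal, country, person, ingredient).
ROWS = [
    ("2014-02-01", "lunch", "US", "ann", "Sun Butter"),
    ("2014-02-01", "lunch", "US", "ann", "bread"),
    ("2014-02-01", "lunch", "US", "bob", "jam"),
    ("2014-02-01", "lunch", "US", "cat", "SUN BUTTER"),
    ("2014-02-01", "dinner", "US", "bob", "rice"),
]


def conditional_distinct_counts(rows):
    """Count all distinct people and distinct sun-butter people in one pass."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE menu (date TEXT, meal TEXT, country TEXT, "
        "person TEXT, ingredient TEXT)"
    )
    conn.executemany("INSERT INTO menu VALUES (?, ?, ?, ?, ?)", rows)
    query = """
        SELECT date, meal, country,
               COUNT(DISTINCT person) AS total_meals,
               COUNT(DISTINCT sunbutter_person) AS total_sunbutter_meals
        FROM (SELECT date, meal, country, person,
                     -- NULL when the ingredient does not match, so the
                     -- outer COUNT(DISTINCT ...) skips that row.
                     CASE WHEN UPPER(ingredient) LIKE '%SUN BUTTER%'
                          THEN person END AS sunbutter_person
              FROM menu)
        GROUP BY date, meal, country
        ORDER BY date, meal, country
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


counts = conditional_distinct_counts(ROWS)
for row in counts:
    print(row)
```

Each output row carries both counts for one (date, meal, country) group, which is what the two joined subqueries in the question produce separately.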

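Question 2:

In BigQuery, COUNT(DISTINCT) returns an approximate result. If you raise the threshold below which exact results are returned (the optional second argument), performance suffers and the query eventually fails, because a single worker has to keep track of all of those distinct values. See "BigQuery COUNT(DISTINCT value) vs COUNT(value)" for more information.

If your need for exact results goes beyond what COUNT(DISTINCT) can scale to, the alternative is to combine GROUP EACH BY with COUNT(*), which gives you an exact count of distinct elements in a scalable way.

Note that you then need to solve the problem from question 1 in a slightly different way. Something like this: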

SELECT
  date,
  meal,
  country,
  COUNT(*) total_meals,
  SUM(sunbutter) total_sunbutter_meals
FROM
  (SELECT
     date,
     meal,
     country,
     MAX(IF(UPPER(ingredient) CONTAINS "SUN BUTTER", 1, 0)) sunbutter
   FROM [project:dataset.menu]
   GROUP EACH BY date, meal, country, person)
GROUP BY date, meal, country
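
The two-level grouping can be checked locally with a small sketch. As before, this is an assumption-laden illustration rather than BigQuery code: sqlite3 stands in for the engine, a plain GROUP BY stands in for GROUP EACH BY, and the rows are invented. The inner query collapses the data to one row per person per group, with a 0/1 flag that is 1 if any of that person's ingredients matched; the outer query then only needs COUNT(*) and SUM, so no distinct-value tracking is required.

```python
import sqlite3

# Hypothetical sample rows: (date, meal, country, person, ingredient).
ROWS = [
    ("2014-02-01", "lunch", "US", "ann", "Sun Butter"),
    ("2014-02-01", "lunch", "US", "ann", "bread"),
    ("2014-02-01", "lunch", "US", "bob", "jam"),
    ("2014-02-01", "lunch", "US", "cat", "SUN BUTTER"),
    ("2014-02-01", "dinner", "US", "bob", "rice"),
]


def exact_counts_via_grouping(rows):
    """Exact distinct counts using two GROUP BY levels instead of DISTINCT."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE menu (date TEXT, meal TEXT, country TEXT, "
        "person TEXT, ingredient TEXT)"
    )
    conn.executemany("INSERT INTO menu VALUES (?, ?, ?, ?, ?)", rows)
    query = """
        SELECT date, meal, country,
               COUNT(*) AS total_meals,
               SUM(sunbutter) AS total_sunbutter_meals
        FROM (SELECT date, meal, country, person,
                     -- 1 if any ingredient row for this person matched.
                     MAX(CASE WHEN UPPER(ingredient) LIKE '%SUN BUTTER%'
                              THEN 1 ELSE 0 END) AS sunbutter
              FROM menu
              GROUP BY date, meal, country, person)
        GROUP BY date, meal, country
        ORDER BY date, meal, country
    """
    result = conn.execute(query).fetchall()
    conn.close()
    return result


grouped_counts = exact_counts_via_grouping(ROWS)
for row in grouped_counts:
    print(row)
```

On this sample data the result matches the COUNT(DISTINCT) formulation from question 1, which is the property the answer depends on.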

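Question 3:

Currently you cannot refer to other fields defined in the same SELECT statement, and we do not yet have plans to add that feature. But you can always wrap the query in another query.

Instead of: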

SELECT 17 AS a, a + 1 AS b

you can write:

SELECT a, a + 1 AS b FROM (SELECT 17 AS a)
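
Applied to the percentage from the question, the same wrapping pattern might look like the sketch below. It is only an illustration: sqlite3 is used as a stand-in engine, the function name is hypothetical, and the two counts are passed in as parameters instead of being computed from a table. The inner SELECT names the values once, and the outer SELECT reuses those names inside ROUND.

```python
import sqlite3


def percentage_via_wrapper(total_sunbutter_meals, total_meals):
    """Name the counts in an inner SELECT, then reuse the names outside."""
    conn = sqlite3.connect(":memory:")
    query = """
        SELECT total_sunbutter_meals,
               total_meals,
               ROUND(100.0 * total_sunbutter_meals / total_meals, 1)
                   AS percentage_meals_sunbutter
        FROM (SELECT ? AS total_sunbutter_meals, ? AS total_meals)
    """
    row = conn.execute(query, (total_sunbutter_meals, total_meals)).fetchone()
    conn.close()
    return row


wrapped = percentage_via_wrapper(1, 4)
print(wrapped)
```

In a real query the inner SELECT would be the aggregation itself, so each formula is written only once.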

This question, "Logic checks inside COUNT(DISTINCT foo) and performance issues", comes from Stack Overflow: https://stackoverflow.com/questions/22075809/
