sql - BigQuery - Complex correlated query

Tags: sql google-bigquery

I am trying to query the Google BigQuery public Reddit dataset. My goal is to compute the similarity of subreddits using the Jaccard index, which is defined as:

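$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$

where A and B are the sets of authors who commented in each of the two subreddits.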

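My plan is to select the top N=1000 subreddits by number of comments in August 2016, then compute their Cartesian product to get every combination of subreddits in the shape subreddit1, subreddit2.

I would then use those combination rows to query the union and the intersection of users between subreddit1 and subreddit2.

My query so far looks like this: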

SELECT 
  subreddit1,
  subreddit2,
  (SELECT 
    COUNT(DISTINCT author)
  FROM `fh-bigquery.reddit_comments.2016_08`
  WHERE subreddit = subreddit1
    OR subreddit = subreddit2
  LIMIT 1
  ) as subreddits_union,

  (
    SELECT 
      COUNT(DISTINCT author)
    FROM `fh-bigquery.reddit_comments.2016_08`
    WHERE subreddit = subreddit1
    AND author IN ( 
       SELECT author 
       FROM `fh-bigquery.reddit_comments.2016_08`
       WHERE subreddit= subreddit2
       GROUP BY author 
       )
  ) as subreddits_intersection

FROM

(SELECT a.subreddit as subreddit1, b.subreddit as subreddit2
 FROM  (
   SELECT subreddit, count(*) as n_comments
   FROM `fh-bigquery.reddit_comments.2016_08`
   GROUP BY subreddit
   ORDER BY n_comments DESC
   LIMIT 1000
   ) a
 CROSS JOIN (
   SELECT subreddit, count(*) as n_comments
   FROM `fh-bigquery.reddit_comments.2016_08`
   GROUP BY subreddit
   ORDER BY n_comments DESC
   LIMIT 1000
   ) b
 WHERE a.subreddit < b.subreddit
  )

Ideally this would give results like:

subreddit1 | subreddit2 | subreddits_union | subreddits_intersection
---------------------------------------------------------------------
   Art     |  Politics  |      50000       |      21000
   Art     |  Science   |      92320       |      15000
   ...     |  ...       |      ...         |      ...
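
As a quick check on what the final similarity column should look like, the Jaccard index of each row is the intersection count divided by the union count. Plugging in the illustrative values from the two sample rows above (they are example numbers, not real query results):

```latex
\begin{align*}
J(\text{Art}, \text{Politics}) &= \frac{21000}{50000} = 0.42 \\
J(\text{Art}, \text{Science})  &= \frac{15000}{92320} \approx 0.162
\end{align*}
```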

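However, this query fails with the following BigQuery error: Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.

I understand the error, but I don't think this query can be transformed into an efficient join. Given that BQ has no APPLY operator, is there any way to set up this query without resorting to individual queries? Maybe with PARTITION BY?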

Best answer

Thanks for your answer. This one works pretty well for returning the subreddit union; however, how would you implement the intersection?

Maybe something like this:

WITH top_most AS (
   SELECT subreddit, count(*) as n_comments
   FROM `fh-bigquery.reddit_comments.2016_08`
   GROUP BY subreddit
   ORDER BY n_comments DESC
   LIMIT 20
),
authors AS (
  SELECT DISTINCT author, subreddit
  FROM `fh-bigquery.reddit_comments.2016_08`
)
SELECT 
count(DISTINCT a1.author) AS subreddits_intersection,
subreddit1, subreddit2
FROM
(
  SELECT t1.subreddit subreddit1, t2.subreddit subreddit2
  FROM top_most t1 CROSS JOIN top_most t2 LIMIT 1000000
)
INNER JOIN authors a1 on a1.subreddit = subreddit1
INNER JOIN authors a2 on a2.subreddit = subreddit2
WHERE a1.author = a2.author
GROUP BY subreddit1, subreddit2
ORDER BY subreddit1, subreddit2
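
The query above only returns the intersection. One way to get the union without a second pass over the comments is inclusion-exclusion: |A ∪ B| = |A| + |B| − |A ∩ B|. The following is a rough sketch of that idea, not a tested or tuned solution. The CTE names (top_subs, authors, sizes, overlap) and the output column names are made up for illustration; only the table name comes from the question.

```sql
-- Sketch only: CTE and column names are illustrative assumptions.
WITH top_subs AS (
  -- Top 1000 subreddits by comment count, as in the question.
  SELECT subreddit
  FROM `fh-bigquery.reddit_comments.2016_08`
  GROUP BY subreddit
  ORDER BY COUNT(*) DESC
  LIMIT 1000
),
authors AS (
  -- One row per (author, subreddit), restricted to the top subreddits.
  SELECT DISTINCT c.author, c.subreddit
  FROM `fh-bigquery.reddit_comments.2016_08` AS c
  JOIN top_subs AS t ON t.subreddit = c.subreddit
),
sizes AS (
  -- |A|: number of distinct authors per subreddit.
  SELECT subreddit, COUNT(*) AS n_authors
  FROM authors
  GROUP BY subreddit
),
overlap AS (
  -- |A ∩ B|: self-join on author; the < comparison keeps each
  -- unordered pair once and drops self-pairs.
  SELECT a1.subreddit AS subreddit1,
         a2.subreddit AS subreddit2,
         COUNT(*) AS subreddits_intersection
  FROM authors AS a1
  JOIN authors AS a2
    ON a1.author = a2.author
   AND a1.subreddit < a2.subreddit
  GROUP BY subreddit1, subreddit2
)
SELECT o.subreddit1,
       o.subreddit2,
       -- |A ∪ B| = |A| + |B| - |A ∩ B|
       s1.n_authors + s2.n_authors - o.subreddits_intersection AS subreddits_union,
       o.subreddits_intersection,
       o.subreddits_intersection
         / (s1.n_authors + s2.n_authors - o.subreddits_intersection) AS jaccard
FROM overlap AS o
JOIN sizes AS s1 ON s1.subreddit = o.subreddit1
JOIN sizes AS s2 ON s2.subreddit = o.subreddit2
ORDER BY jaccard DESC
```

Because overlap is built with an inner join, pairs of subreddits that share no authors do not appear in the result at all (their Jaccard index would be 0). If the data contains placeholder authors such as [deleted], it is probably worth filtering them out in the authors CTE, since they would otherwise count as one shared user across many subreddits.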

Regarding sql - BigQuery - Complex correlated query, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39827823/
