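I am trying to query the Google BigQuery public Reddit dataset. My goal is to compute the similarity of subreddits using the Jaccard index, which is defined as:
J(A, B) = |A ∩ B| / |A ∪ B|
where A and B are the sets of users (comment authors) of the two subreddits.
My plan is to pick the top N=1000 subreddits by number of comments in August 2016, then compute their Cartesian product to get all subreddit combinations in the form subreddit1, subreddit2.
I would then use these combination rows to query the union and the intersection of users between subreddit1 and subreddit2.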
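To make the target numbers concrete, here is a minimal Python sketch of the plan on a hypothetical toy data set. The subreddit names and authors are made up, and `jaccard` and `pairwise_similarity` are illustrative helpers, not part of any library. It only shows which quantities each pair of subreddits needs, not how to compute them at BigQuery scale.

```python
from itertools import combinations

# Hypothetical toy data: subreddit -> set of distinct comment authors.
authors_by_subreddit = {
    "Art": {"ann", "bob", "cat"},
    "Politics": {"bob", "cat", "dan", "eve"},
    "Science": {"eve", "fay"},
}


def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; returns 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pairwise_similarity(authors_by_subreddit):
    """Return (subreddit1, subreddit2, union size, intersection size, Jaccard)
    for every unordered pair of subreddits, with subreddit1 < subreddit2."""
    rows = []
    for s1, s2 in combinations(sorted(authors_by_subreddit), 2):
        a, b = authors_by_subreddit[s1], authors_by_subreddit[s2]
        rows.append((s1, s2, len(a | b), len(a & b), jaccard(a, b)))
    return rows


for row in pairwise_similarity(authors_by_subreddit):
    print(row)
```

Using `combinations` on the sorted names plays the same role as the `a.subreddit < b.subreddit` filter on a Cartesian product: each unordered pair appears exactly once.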
My query so far looks like this:
SELECT
  subreddit1,
  subreddit2,
  -- distinct authors who commented in either subreddit
  (
    SELECT COUNT(DISTINCT author)
    FROM `fh-bigquery.reddit_comments.2016_08`
    WHERE subreddit = subreddit1
       OR subreddit = subreddit2
    LIMIT 1
  ) AS subreddits_union,
  -- distinct authors who commented in both subreddits
  (
    SELECT COUNT(DISTINCT author)
    FROM `fh-bigquery.reddit_comments.2016_08`
    WHERE subreddit = subreddit1
      AND author IN (
        SELECT author
        FROM `fh-bigquery.reddit_comments.2016_08`
        WHERE subreddit = subreddit2
        GROUP BY author
      )
  ) AS subreddits_intersection
FROM (
  -- every unordered pair of the top 1000 subreddits by comment count
  SELECT a.subreddit AS subreddit1, b.subreddit AS subreddit2
  FROM (
    SELECT subreddit, COUNT(*) AS n_comments
    FROM `fh-bigquery.reddit_comments.2016_08`
    GROUP BY subreddit
    ORDER BY n_comments DESC
    LIMIT 1000
  ) a
  CROSS JOIN (
    SELECT subreddit, COUNT(*) AS n_comments
    FROM `fh-bigquery.reddit_comments.2016_08`
    GROUP BY subreddit
    ORDER BY n_comments DESC
    LIMIT 1000
  ) b
  WHERE a.subreddit < b.subreddit
)
Ideally this would give a result like:
subreddit1 | subreddit2 | subreddits_union | subreddits_intersection
--------------------------------------------------------------------
Art        | Politics   | 50000            | 21000
Art        | Science    | 92320            | 15000
...        | ...        | ...              | ...
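However, this query fails with the following BigQuery error:
Error: Correlated subqueries that reference other tables are not supported unless they can be de-correlated, such as by transforming them into an efficient JOIN.
I understand what that means, but I don't see how this query could be turned into an efficient join. Given that BQ has no apply method, is there any way to set up this query without resorting to individual queries? Maybe with PARTITION BY?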
Best answer
Thanks for your answer. This one works pretty well for returning the subreddit union; however, how would you implement the intersection?
Maybe something like this:
WITH top_most AS (
  -- top 20 subreddits by comment count
  SELECT subreddit, COUNT(*) AS n_comments
  FROM `fh-bigquery.reddit_comments.2016_08`
  GROUP BY subreddit
  ORDER BY n_comments DESC
  LIMIT 20
),
authors AS (
  -- one row per (author, subreddit) pair
  SELECT DISTINCT author, subreddit
  FROM `fh-bigquery.reddit_comments.2016_08`
)
SELECT
  -- authors who commented in both subreddits of the pair
  COUNT(DISTINCT a1.author) AS common_authors,
  subreddit1,
  subreddit2
FROM (
  -- all ordered pairs, including a subreddit paired with itself
  SELECT t1.subreddit subreddit1, t2.subreddit subreddit2
  FROM top_most t1 CROSS JOIN top_most t2
  LIMIT 1000000
)
INNER JOIN authors a1 ON a1.subreddit = subreddit1
INNER JOIN authors a2 ON a2.subreddit = subreddit2
WHERE a1.author = a2.author
GROUP BY subreddit1, subreddit2
ORDER BY subreddit1, subreddit2
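Once the number of shared authors is available from a join, the union does not need its own subquery, because |A ∪ B| = |A| + |B| - |A ∩ B|. Below is a minimal sketch of that idea, run against an in-memory SQLite database through Python's `sqlite3` module so it can be tried locally. The `comments` table and its rows are made up for illustration, the top-N filter is left out, and the SQL has not been run on BigQuery, so treat it as an assumption that the same join structure carries over.

```python
import sqlite3

# Hypothetical toy table standing in for the Reddit comments table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE comments (author TEXT, subreddit TEXT)")
conn.executemany(
    "INSERT INTO comments VALUES (?, ?)",
    [
        ("ann", "Art"), ("bob", "Art"), ("bob", "Art"), ("cat", "Art"),
        ("bob", "Politics"), ("cat", "Politics"),
        ("dan", "Politics"), ("eve", "Politics"),
        ("eve", "Science"), ("fay", "Science"),
    ],
)

QUERY = """
WITH authors AS (
  -- one row per (author, subreddit) pair
  SELECT DISTINCT author, subreddit FROM comments
),
sizes AS (
  -- number of distinct authors per subreddit
  SELECT subreddit, COUNT(*) AS n_authors FROM authors GROUP BY subreddit
),
intersections AS (
  -- self-join on author; '<' keeps each unordered pair once
  SELECT a1.subreddit AS subreddit1, a2.subreddit AS subreddit2,
         COUNT(*) AS n_common
  FROM authors a1
  JOIN authors a2
    ON a1.author = a2.author AND a1.subreddit < a2.subreddit
  GROUP BY a1.subreddit, a2.subreddit
)
SELECT i.subreddit1, i.subreddit2,
       s1.n_authors + s2.n_authors - i.n_common AS subreddits_union,
       i.n_common AS subreddits_intersection
FROM intersections i
JOIN sizes s1 ON s1.subreddit = i.subreddit1
JOIN sizes s2 ON s2.subreddit = i.subreddit2
ORDER BY i.subreddit1, i.subreddit2
"""

rows = conn.execute(QUERY).fetchall()
for subreddit1, subreddit2, n_union, n_common in rows:
    # Jaccard index = intersection size / union size
    print(subreddit1, subreddit2, n_union, n_common, n_common / n_union)
```

Pairs that share no author never survive the self-join, so they are missing from the result; their Jaccard index is 0 by definition.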
Regarding "sql - BigQuery - complex correlated query", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/39827823/