I have two datasets, author_data and paper_author.
author_data:
author_id author_name author_affiliation
25 William H. Nailon
37 P. B. Littlewood Cavendish Laboratory|Cambridge University
44 A. Kuroiwa Department of Molecular Biology
paper_author:
paper_id author_id author_name author_affiliation
1 521630 Ayman Kaheel Cairo Microsoft Innovation Lab
1 972575 Mahmoud Refaat Cairo Microsoft Innovation Lab
I ran the following query in R:
author_data[which(author_data$author_id %in% paper_author$author_id &
author_data$author_name %in% paper_author$author_name &
author_data$author_affiliation %in% paper_author$author_affiliation), ]
That is, I want to find the rows of author_data that match paper_author on all three columns: author_id, author_name and author_affiliation.
I wrote a query to get the same result in SQL, but it does not return the right rows. The query I tried is:
statement <- "select
author_data.author_id,
author_data.author_name,
author_data.author_affiliation
FROM author_data
INNER JOIN paper_author
ON author_data.author_id = paper_author.author_id
AND author_data.author_name = paper_author.author_name
AND author_data.author_affiliation = paper_author.author_affiliation"
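This query returns more rows than author_data contains, although the result should be a subset of author_data. I am new to SQL and cannot work out where the problem is.
What is wrong with this query?
Thanks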
Best answer
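There is a difference between which in R and join in SQL. which subsets the given data frame, so each row of author_data appears at most once, while join returns one row for every pair of rows that satisfies the join condition. I am almost certain that in your case the same combination of author_id, author_name and author_affiliation occurs more than once in paper_author. As a result, each row of author_data is repeated once for every matching row in paper_author.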
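The row multiplication described above can be reproduced with a small sketch. It uses Python's built-in sqlite3 module with an in-memory database instead of the asker's R setup, and the paper_author rows are hypothetical: they assume that author 37 appears on three papers.

```python
import sqlite3

# In-memory database; the table layout follows the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author_data (
    author_id INTEGER, author_name TEXT, author_affiliation TEXT);
CREATE TABLE paper_author (
    paper_id INTEGER, author_id INTEGER,
    author_name TEXT, author_affiliation TEXT);

INSERT INTO author_data VALUES
    (37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University'),
    (44, 'A. Kuroiwa', 'Department of Molecular Biology');

-- Hypothetical data: author 37 is listed on three different papers.
INSERT INTO paper_author VALUES
    (1, 37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University'),
    (2, 37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University'),
    (3, 37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University');
""")

# The join from the question, unchanged.
JOIN_QUERY = """
SELECT author_data.author_id,
       author_data.author_name,
       author_data.author_affiliation
FROM author_data
INNER JOIN paper_author
    ON author_data.author_id = paper_author.author_id
   AND author_data.author_name = paper_author.author_name
   AND author_data.author_affiliation = paper_author.author_affiliation
"""

joined_rows = conn.execute(JOIN_QUERY).fetchall()
author_count = conn.execute("SELECT COUNT(*) FROM author_data").fetchone()[0]

# One author_data row matched three paper_author rows, so it comes back
# three times and the result is larger than author_data itself.
print("rows from join:", len(joined_rows))
print("rows in author_data:", author_count)
```

With this sample data the join returns more rows than author_data holds, even though only one author actually matches.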
Your query is almost correct; you need to add distinct or group by, or use exists.
Using distinct:
select
distinct
author_data.author_id,
author_data.author_name,
author_data.author_affiliation
from
author_data
INNER JOIN paper_author
ON author_data.author_id = paper_author.author_id
AND author_data.author_name = paper_author.author_name
AND author_data.author_affiliation = paper_author.author_affiliation
Using group by:
select
author_data.author_id,
author_data.author_name,
author_data.author_affiliation
from
author_data
INNER JOIN paper_author
ON author_data.author_id = paper_author.author_id
AND author_data.author_name = paper_author.author_name
AND author_data.author_affiliation = paper_author.author_affiliation
group by
author_data.author_id,
author_data.author_name,
author_data.author_affiliation
You can also use exists:
select
author_data.author_id,
author_data.author_name,
author_data.author_affiliation
from
author_data
where
exists (select 1 from paper_author where
author_data.author_id = paper_author.author_id
AND author_data.author_name = paper_author.author_name
AND author_data.author_affiliation = paper_author.author_affiliation
)
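As a rough check, the three variants can be run side by side. This is a sketch using Python's built-in sqlite3 module rather than R; the sample rows are hypothetical and assume that author 37 appears on two papers while author 44 appears on none.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE author_data (
    author_id INTEGER, author_name TEXT, author_affiliation TEXT);
CREATE TABLE paper_author (
    paper_id INTEGER, author_id INTEGER,
    author_name TEXT, author_affiliation TEXT);

INSERT INTO author_data VALUES
    (37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University'),
    (44, 'A. Kuroiwa', 'Department of Molecular Biology');

-- Hypothetical data: author 37 on two papers, author 44 on none.
INSERT INTO paper_author VALUES
    (1, 37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University'),
    (2, 37, 'P. B. Littlewood', 'Cavendish Laboratory|Cambridge University');
""")

# Shared fragments so the three queries differ only in their shape.
COLUMNS = ("author_data.author_id, author_data.author_name, "
           "author_data.author_affiliation")
MATCH = """
    author_data.author_id = paper_author.author_id
    AND author_data.author_name = paper_author.author_name
    AND author_data.author_affiliation = paper_author.author_affiliation
"""

queries = {
    "distinct": f"SELECT DISTINCT {COLUMNS} FROM author_data "
                f"INNER JOIN paper_author ON {MATCH}",
    "group by": f"SELECT {COLUMNS} FROM author_data "
                f"INNER JOIN paper_author ON {MATCH} GROUP BY {COLUMNS}",
    "exists": f"SELECT {COLUMNS} FROM author_data "
              f"WHERE EXISTS (SELECT 1 FROM paper_author WHERE {MATCH})",
}

# Run each variant and collect its rows.
results = {name: conn.execute(sql).fetchall() for name, sql in queries.items()}
for name, rows in results.items():
    print(name, rows)
```

On this data all three variants return the single matching author once. One difference worth keeping in mind: distinct and group by also collapse duplicate rows that exist in author_data itself, whereas exists keeps every author_data row that has a match, which is closer to what subsetting a data frame in R does.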
This question about the SQL equivalent of an R query comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/22537466/