SQL equivalent of an R query

Tags: sql r postgresql

I have two data sets, author_data and paper_author.

author_data:

author_id   author_name        author_affiliation
25          William H. Nailon
37          P. B. Littlewood   Cavendish Laboratory|Cambridge University
44          A. Kuroiwa         Department of Molecular Biology

paper_author:

paper_id   author_id   author_name      author_affiliation
1          521630      Ayman Kaheel     Cairo Microsoft Innovation Lab
1          972575      Mahmoud Refaat   Cairo Microsoft Innovation Lab

I ran the following query in R:

author_data[which(author_data$author_id %in% paper_author$author_id &
                  author_data$author_name %in% paper_author$author_name & 
                  author_data$author_affiliation %in% paper_author$author_affiliation), ]

That is, I want to find the matches between author_data and paper_author where the three columns author_id, author_name, and author_affiliation all match.

I wrote a query to get the same result in SQL, but it does not return the correct result. The query I tried is:

statement <- "select
              author_data.author_id,
              author_data.author_name,
              author_data.author_affiliation
        FROM author_data
        INNER JOIN paper_author
          ON author_data.author_id = paper_author.author_id
            AND author_data.author_name = paper_author.author_name
            AND author_data.author_affiliation = paper_author.author_affiliation"

With this query I get more rows than there are in author_data, even though the query should return a subset of author_data. I am new to SQL, so I can't work out where the problem is.

What is wrong with this query?

Thanks

Best answer

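There is a difference between which in R and a join in SQL. which effectively subsets the given data frame, whereas a join returns every combination of rows that satisfies the join condition. I am almost certain that, in your case, the same author_id, author_name, author_affiliation combination occurs more than once in paper_author. As a result, each matching row in author_data is repeated once for every matching row in paper_author.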
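To make the multiplication concrete, here is a minimal sketch that reproduces it. It runs the question's join through Python's built-in sqlite3 module on an in-memory database, which stands in for PostgreSQL here, and the rows are made up for illustration (they are not the question's data). One hypothetical author appears on three papers, so the join returns that author three times, which is more rows than author_data contains.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL; all rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author_data (author_id INTEGER, author_name TEXT, author_affiliation TEXT);
    CREATE TABLE paper_author (paper_id INTEGER, author_id INTEGER,
                               author_name TEXT, author_affiliation TEXT);
    INSERT INTO author_data VALUES
        (1, 'Author A', 'Lab X'),
        (2, 'Author B', 'Lab Y');
    -- Author A is listed on three papers; Author B is listed on none.
    INSERT INTO paper_author VALUES
        (10, 1, 'Author A', 'Lab X'),
        (11, 1, 'Author A', 'Lab X'),
        (12, 1, 'Author A', 'Lab X');
""")

# The join from the question, unchanged apart from keyword case.
join_sql = """
    SELECT author_data.author_id, author_data.author_name, author_data.author_affiliation
    FROM author_data
    INNER JOIN paper_author
        ON author_data.author_id = paper_author.author_id
        AND author_data.author_name = paper_author.author_name
        AND author_data.author_affiliation = paper_author.author_affiliation
"""

author_count = conn.execute("SELECT COUNT(*) FROM author_data").fetchone()[0]
joined_rows = conn.execute(join_sql).fetchall()

print(author_count)      # number of rows in author_data
print(len(joined_rows))  # one row per matching (author, paper) pair
```

With these rows, author_data has 2 rows but the join returns 3, all of them for the same author.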

Your query is almost correct; you need to add distinct or group by, or use exists:

distinct:

select
   distinct
   author_data.author_id,
   author_data.author_name,
   author_data.author_affiliation
from
   author_data
   INNER JOIN paper_author
          ON author_data.author_id = paper_author.author_id
            AND author_data.author_name = paper_author.author_name
            AND author_data.author_affiliation = paper_author.author_affiliation

group by:

select
   author_data.author_id,
   author_data.author_name,
   author_data.author_affiliation
from
   author_data
   INNER JOIN paper_author
          ON author_data.author_id = paper_author.author_id
            AND author_data.author_name = paper_author.author_name
            AND author_data.author_affiliation = paper_author.author_affiliation
group by
   author_data.author_id,
   author_data.author_name,
   author_data.author_affiliation

You can also use exists:

select
   author_data.author_id,
   author_data.author_name,
   author_data.author_affiliation
from
   author_data
where
   exists (select 1 from paper_author where
       author_data.author_id = paper_author.author_id
       AND author_data.author_name = paper_author.author_name
       AND author_data.author_affiliation = paper_author.author_affiliation
       )
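As a rough check that the three variants agree, the sketch below runs each of them against the same kind of toy data. The assumptions are the same as before: Python's built-in sqlite3 with an in-memory database in place of PostgreSQL, and hypothetical rows in which one author appears on three papers. The helper names cols, match and queries are only for illustration.

```python
import sqlite3

# Assumption: in-memory SQLite stands in for PostgreSQL; all rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE author_data (author_id INTEGER, author_name TEXT, author_affiliation TEXT);
    CREATE TABLE paper_author (paper_id INTEGER, author_id INTEGER,
                               author_name TEXT, author_affiliation TEXT);
    INSERT INTO author_data VALUES
        (1, 'Author A', 'Lab X'),
        (2, 'Author B', 'Lab Y');
    INSERT INTO paper_author VALUES
        (10, 1, 'Author A', 'Lab X'),
        (11, 1, 'Author A', 'Lab X'),
        (12, 1, 'Author A', 'Lab X');
""")

# Shared pieces of the three queries from the answer.
cols = "author_data.author_id, author_data.author_name, author_data.author_affiliation"
match = """author_data.author_id = paper_author.author_id
           AND author_data.author_name = paper_author.author_name
           AND author_data.author_affiliation = paper_author.author_affiliation"""

queries = {
    "distinct": f"SELECT DISTINCT {cols} FROM author_data INNER JOIN paper_author ON {match}",
    "group by": f"SELECT {cols} FROM author_data INNER JOIN paper_author ON {match} GROUP BY {cols}",
    "exists": f"SELECT {cols} FROM author_data WHERE EXISTS (SELECT 1 FROM paper_author WHERE {match})",
}

# Sort each result so the comparison does not depend on row order.
results = {name: sorted(conn.execute(sql).fetchall()) for name, sql in queries.items()}
for name, rows in results.items():
    print(name, rows)
```

On this data all three return the single matching author once. One difference to keep in mind: if author_data itself contained exact duplicate rows, distinct and group by would collapse them, while exists would return each of them.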

Source: the original question on Stack Overflow: https://stackoverflow.com/questions/22537466/
