I have a large dataset that closely resembles the following dummy dataset:
df = data.frame(coursecode = c("WBPH001","WBPH001","WBPH001","WBPH058","WBAS007"),
                coursename = c("Mechanics","Mechanics","Mechanics", "Calculus 2","Introduction"),
                courseurl = c("url1","url1","url1","url2","url3"),
                programme_faculty = c("FSE","FSE","FSE", "FSE", "FSE"),
                programme_name = c( "Mat","Bio","Ast","Ast","Ast"),
                programme_ects = c("180", "180", "210", "180", "180")
)
This gives (all values are strings):
#> print(df):
  coursecode   coursename courseurl programme_faculty programme_name programme_ects
1    WBPH001    Mechanics      url1               FSE            Mat            180
2    WBPH001    Mechanics      url1               FSE            Bio            180
3    WBPH001    Mechanics      url1               FSE            Ast            210
4    WBPH058   Calculus 2      url2               FSE            Ast            180
5    WBAS007 Introduction      url3               FSE            Ast            180
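I have exported all courses of the entire faculty, but some courses are listed under several programmes (in this example, "Mechanics" is associated with the programmes "Mat", "Bio" and "Ast").
In short, what I want to achieve is to remove all these duplicate courses while retaining the programme information (i.e. name, ECTS, faculty).
So, if there are duplicates in the columns "coursecode", "coursename" and "courseurl", the programme information (columns "programme_faculty", "programme_name" and "programme_ects") should automatically be collapsed into a separate list for each column.
The dataset should then look like this: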
#> print(modified_df):
  coursecode   coursename courseurl programme_faculty   programme_name   programme_ects
1    WBPH001    Mechanics      url1  c(FSE, FSE, FSE) c(Mat, Bio, Ast) c(180, 180, 210)
2    WBPH058   Calculus 2      url2               FSE              Ast              180
3    WBAS007 Introduction      url3               FSE              Ast              180
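The course information is mainly used for downstream analysis, but it is important that the programmes associated with a course can always be retrieved. That is why I need a data frame like this, but I cannot figure out which functions I need to achieve it.
Importantly, the strings must not simply be pasted together and separated by something like "|".
I have tried functions such as aggregate() and collapse(), as well as other suggestions from other Stack Overflow questions, but those solutions did not work for my specific dataset.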
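Best answer
You can group_by the key columns and summarise each group, using across to apply paste with a collapse argument to the columns you want to merge, like this: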
library(dplyr)
df %>%
  group_by(coursecode, coursename, courseurl) %>%
  summarise(across(programme_faculty:programme_ects, ~ paste(.x, collapse = ", ")))
#> # A tibble: 3 × 6
#> # Groups:   coursecode, coursename [3]
#>   coursecode coursename   courseurl programme_faculty programme_name programme…¹
#>   <chr>      <chr>        <chr>     <chr>             <chr>          <chr>
#> 1 WBAS007    Introduction url3      FSE               Ast            180
#> 2 WBPH001    Mechanics    url1      FSE, FSE, FSE     Mat, Bio, Ast  180, 180, …
#> 3 WBPH058    Calculus 2   url2      FSE               Ast            180
#> # … with abbreviated variable name ¹programme_ects
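Note that with paste each cell holds one single string, so getting the individual programmes back means splitting that string again. The following is a minimal sketch of that round trip, assuming the same df as in the question; the names pasted and mechanics_programmes are only illustrative, and .groups = "drop" is added here just to return an ungrouped tibble.

```r
library(dplyr)

# Same dummy data as in the question
df <- data.frame(
  coursecode = c("WBPH001", "WBPH001", "WBPH001", "WBPH058", "WBAS007"),
  coursename = c("Mechanics", "Mechanics", "Mechanics", "Calculus 2", "Introduction"),
  courseurl = c("url1", "url1", "url1", "url2", "url3"),
  programme_faculty = c("FSE", "FSE", "FSE", "FSE", "FSE"),
  programme_name = c("Mat", "Bio", "Ast", "Ast", "Ast"),
  programme_ects = c("180", "180", "210", "180", "180")
)

# Collapse each programme column into one comma-separated string per course
pasted <- df %>%
  group_by(coursecode, coursename, courseurl) %>%
  summarise(across(programme_faculty:programme_ects, ~ paste(.x, collapse = ", ")),
            .groups = "drop")

# Recover the individual programme names for one course by splitting the string
mechanics_string <- pasted$programme_name[pasted$coursecode == "WBPH001"]
mechanics_programmes <- strsplit(mechanics_string, ", ", fixed = TRUE)[[1]]
```

This only works as long as the separator never occurs inside a value, which is one reason a list-column may be the safer choice for downstream analysis.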
You can also collect them into a list, like this:
library(dplyr)
df %>%
  group_by(coursecode, coursename, courseurl) %>%
  summarise(across(programme_faculty:programme_ects, ~ list(.x)))
#> # A tibble: 3 × 6
#> # Groups:   coursecode, coursename [3]
#>   coursecode coursename   courseurl programme_faculty programme_name programme…¹
#>   <chr>      <chr>        <chr>     <list>            <list>         <list>
#> 1 WBAS007    Introduction url3      <chr [1]>         <chr [1]>      <chr [1]>
#> 2 WBPH001    Mechanics    url1      <chr [3]>         <chr [3]>      <chr [3]>
#> 3 WBPH058    Calculus 2   url2      <chr [1]>         <chr [1]>      <chr [1]>
#> # … with abbreviated variable name ¹programme_ects
Created on 2023-03-25 with reprex v2.0.2
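Since the question stresses that the programmes belonging to a course must stay retrievable, here is a minimal sketch of how the list-columns could be read back. It assumes the same df as in the question and that tidyr is installed; the names collapsed, mechanics_programmes and long_again are only illustrative.

```r
library(dplyr)
library(tidyr)

# Same dummy data as in the question
df <- data.frame(
  coursecode = c("WBPH001", "WBPH001", "WBPH001", "WBPH058", "WBAS007"),
  coursename = c("Mechanics", "Mechanics", "Mechanics", "Calculus 2", "Introduction"),
  courseurl = c("url1", "url1", "url1", "url2", "url3"),
  programme_faculty = c("FSE", "FSE", "FSE", "FSE", "FSE"),
  programme_name = c("Mat", "Bio", "Ast", "Ast", "Ast"),
  programme_ects = c("180", "180", "210", "180", "180")
)

# One row per course, with the programme columns stored as list-columns
collapsed <- df %>%
  group_by(coursecode, coursename, courseurl) %>%
  summarise(across(programme_faculty:programme_ects, ~ list(.x)), .groups = "drop")

# [[1]] pulls the character vector out of the single matching list cell
mechanics_programmes <- collapsed$programme_name[collapsed$coursecode == "WBPH001"][[1]]

# unnest() expands the three list-columns in parallel, back to one row per programme
long_again <- collapsed %>%
  unnest(cols = c(programme_faculty, programme_name, programme_ects))
```

Unnesting the three columns together keeps each programme name aligned with its own faculty and ECTS value.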
As @zephryl pointed out, you can replace ~ list(.x) with just list.
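For completeness, a sketch of that shorter form, again assuming the df from the question. The .groups = "drop" argument is an addition not in the original answer; it returns an ungrouped tibble and suppresses the grouping message.

```r
library(dplyr)

# Same dummy data as in the question
df <- data.frame(
  coursecode = c("WBPH001", "WBPH001", "WBPH001", "WBPH058", "WBAS007"),
  coursename = c("Mechanics", "Mechanics", "Mechanics", "Calculus 2", "Introduction"),
  courseurl = c("url1", "url1", "url1", "url2", "url3"),
  programme_faculty = c("FSE", "FSE", "FSE", "FSE", "FSE"),
  programme_name = c("Mat", "Bio", "Ast", "Ast", "Ast"),
  programme_ects = c("180", "180", "210", "180", "180")
)

# Passing the bare function `list` instead of the lambda ~ list(.x)
modified_df <- df %>%
  group_by(coursecode, coursename, courseurl) %>%
  summarise(across(programme_faculty:programme_ects, list), .groups = "drop")

# ECTS values stored for the Mechanics course
mechanics_ects <- modified_df$programme_ects[modified_df$coursecode == "WBPH001"][[1]]
```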
Source: the Stack Overflow question "r - Collapse rows into lists across multiple columns based on identified duplicates in a set of other columns": https://stackoverflow.com/questions/75842285/