I read some data from a badly formatted PDF table in which cells sometimes span several pages. This left me with a data frame that looks similar to this:
company_name <- c("company_a", NA, "company_a", "company_b", "company_b", NA)
text <- c("some_text", "text that should be in the above cell","some_text", "some_text", "some_text","text that should be in the above cell")
more_text <- c("some_text", "text that should be in the above cell", "some_text", "some_text", "some_text","text that should be in the above cell")
df <- data.frame(company_name, text, more_text)
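How can I merge every row whose company_name is missing (NA) into the row above it, so that the text ends up in the cell where it belongs, and do this for all rows that start with NA?
I have already tried the unheadr package, but I can't seem to figure out the right function to use.
Edit: rewrote the example to make it clearer.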
Best answer
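We create a logical column (ind) that flags the NA elements of company_name, then build 'grp' by combining 'ind' with the lead of that column using OR (|) and converting the result to a numeric index with rleid(). Next, fill() replaces the NA elements with the previous non-NA company_name. Finally, we group by company_name and 'grp' and summarise across the other columns by pasting their elements together.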
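To make the grouping step easier to follow, here is a minimal sketch that builds the intermediate columns on their own, using the company_name vector from the question. The names flag and steps are introduced here only for illustration and are not part of the answer.

```r
# Same company_name vector as in the question
company_name <- c("company_a", NA, "company_a", "company_b", "company_b", NA)

# TRUE for continuation rows, i.e. rows whose company_name is missing
ind <- is.na(company_name)

# TRUE for a continuation row or for the row directly before one
# (lead() returns NA for the last row, and TRUE | NA evaluates to TRUE)
flag <- ind | dplyr::lead(ind)

# Run-length id: the counter goes up every time flag changes value
grp <- data.table::rleid(flag)

steps <- data.frame(company_name, ind, flag, grp)
print(steps)
```

Rows 3 and 4 end up with the same grp value even though they belong to different companies, which is why the pipeline groups by both company_name and grp.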
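library(dplyr)
library(tidyr)
library(stringr)
library(data.table)  # only needed for rleid()

df %>%
  # ind flags continuation rows; grp numbers each run of rows that belong together
  mutate(ind = is.na(company_name),
         grp = rleid(ind | lead(ind))) %>%
  # carry the last non-NA company_name down into the continuation rows
  fill(company_name) %>%
  group_by(company_name, grp) %>%
  # paste the text columns of each group together
  # (a lambda is used because passing extra arguments through across() is deprecated since dplyr 1.1.0)
  summarise(across(contains('text'), ~ str_c(.x, collapse = " + ")), .groups = 'drop') %>%
  select(-grp)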
# A tibble: 4 x 3
# company_name text more_text
# <chr> <chr> <chr>
#1 company_a some_text + text that should be in the above cell some_text + text that should be in the above cell
#2 company_a some_text some_text
#3 company_b some_text some_text
#4 company_b some_text + text that should be in the above cell some_text + text that should be in the above cell
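One caveat of the rleid() grouping: two consecutive complete rows of the same company, neither followed by an NA row, would get the same grp and be pasted together. As an alternative sketch, not the answerer's method, the group index can instead be a running count of non-NA company_name values, so each complete row starts a new group. This assumes the first row is never a continuation row; the name merged is hypothetical.

```r
library(dplyr)
library(tidyr)

# Example data from the question
company_name <- c("company_a", NA, "company_a", "company_b", "company_b", NA)
text <- c("some_text", "text that should be in the above cell", "some_text",
          "some_text", "some_text", "text that should be in the above cell")
more_text <- c("some_text", "text that should be in the above cell", "some_text",
               "some_text", "some_text", "text that should be in the above cell")
df <- data.frame(company_name, text, more_text)

merged <- df %>%
  # every non-NA company_name starts a new group; NA rows inherit the current group
  mutate(grp = cumsum(!is.na(company_name))) %>%
  # carry the company name down into the continuation rows
  fill(company_name) %>%
  group_by(grp, company_name) %>%
  # collapse the text columns of each group into a single string
  summarise(across(contains("text"), ~ paste(.x, collapse = " + ")), .groups = "drop") %>%
  select(-grp)

print(merged)
```

Because grp increases from top to bottom, the merged rows stay in their original order, and data.table is not needed.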
Data
# company_name, text and more_text are the vectors defined in the question
df <- data.frame(company_name, text, more_text)
Regarding "r - merging rows upward while column cells have missing values", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/65908928/