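I extracted a table from a PDF file using extract_tables(), but the text ended up spread across multiple rows. The number of rows per record varies. I want to combine the text into a single value per record.
What I want to do is similar to this post. The difference is that my text sits in multiple columns, and the number of rows an entry uses varies, driven by a different column each time.
Example: one entry may take up four rows because the "Name & location" column is spread over four rows (while the other columns take up only two rows for that entry; the rest are filled with NA). For another entry, the text may be spread over six rows because of the length of the text in the "Expertise" column.
A new record starts every time the "Level" column contains a value rather than NA. Edit: the "Level" values are not unique.
My data looks like this: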
    Name & location        Expertise         Type        Sector          Payment   Level
 1: Ms. Jane               Student           Higher      Government and  payment   1
 2: Doe,                   <NA>              Education   education       has been  <NA>
 3: NUS                    <NA>              institute   <NA>            received  <NA>
 4: Andrew Saunders Phd.,  Chief             Municipal   Government and  payment   5
 5: Municipality of        Education         government  education       has not   <NA>
 6: Amsterdam              Officer           <NA>        <NA>            been      <NA>
 7: <NA>                   <NA>              <NA>        <NA>            received  <NA>
 8: Mr. Stephen            Spokesperson for  Municipal   Government and  payment   3
 9: Johnson,               Sustainability,   government  education       has not   <NA>
10: Orange County          Health &          <NA>        <NA>            been      <NA>
11: <NA>                   Wellbeing and     <NA>        <NA>            received  <NA>
12: <NA>                   Wellfare          <NA>        <NA>            <NA>      <NA>
13: Mrs. Susan             Junior            national    Government and  payment   4
14: Andrews,               Research          government  education       has not   <NA>
15: Police                 Manager           <NA>        <NA>            been      <NA>
16: <NA>                   Money             <NA>        <NA>            received  <NA>
17: <NA>                   Laundering        <NA>        <NA>            <NA>      <NA>
Reproducible example:
structure(list(`Name & location` = c("1: Ms. Jane", "2: Doe,",
"3: NUS", "4: Andrew Saunders Phd.,", "5: Municipality of",
"6: Amsterdam", "7: <NA>", "8: Mr. Stephen", "9: Johnson,",
"10: Orange County", "11: <NA>", "12: <NA>", "13: Mrs. Susan",
"14: Andrews,", "15: Police", "16: <NA>", "17: <NA>"),
Expertise = c("Student", NA, NA, "Chief", "Education", "Officer",
NA, "Spokesperson for", "Sustainability,", "Health &", "Wellbeing and",
"Wellfare", "Junior", "Research", "Manager", "Money", "Laundering"
), Type = c("Higher", "Education", "Insititute", "Municipal",
"Government", NA, NA, "Municipal", "Government", NA, NA,
NA, "National", "Government", NA, NA, NA), Sector = c("Government and",
"education", NA, "Government and", "education", NA, NA, "Government and",
"education", NA, NA, NA, "Government and", "education", NA,
NA, NA), Payment = c("payment", "has been", "received", "Payment",
"has not", "been", "received", "Payment", "has not", "been",
"received", NA, "Payment", "has not", "been", "received",
NA), Level = c(1, NA, NA, 5, NA, NA, NA, 3, NA, NA, NA, NA,
4, NA, NA, NA, NA)), row.names = c(NA, -17L), class = c("tbl_df",
"tbl", "data.frame"))
What I have tried so far is different versions of the code below:
DF_clean <- DF %>% mutate(Level = ifelse(grepl(NA, Level))) %>%
group_by(id = cumsum(!is.na(Level))) %>%
mutate(Level = first(Level)) %>%
group_by(Level) %>%
summarise(Name = paste(Name, collapse = " "),
Expertise = paste(Expertise, collapse = " "),
Type = paste(Type, collapse = " "),
Sector = paste(Sector, collapse = " "),
Level = paste(Level, collapse = " "))
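But this seems to collapse all of the text into a single record.
Any ideas on how to solve this?
Best answer
There is surely a prettier solution, but this seems to work. It also works when Level contains duplicate values.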
library(dplyr)

# Remove row numbers and <NA> from Name & location
df <- df %>%
  mutate(`Name & location` = gsub("[0-9]+:\\s+", "", `Name & location`)) %>%
  mutate(`Name & location` = gsub("<NA>", "", `Name & location`))

# Compute ranges to merge: each record runs from one non-NA Level
# to the row just before the next non-NA Level
starts <- c(which(!is.na(df$Level)), nrow(df) + 1)
# lapply (not sapply) keeps the result a list even if all records have equal length
ranges <- lapply(
  seq_len(length(starts) - 1),
  function(i) starts[i]:(starts[i + 1] - 1)
)

# Merge lines based on ranges; gsub strips trailing spaces and pasted-in " NA"
combined_df <- lapply(
  ranges,
  function(rows)
    lapply(df[rows, ], function(col) gsub(" +$| NA", "", paste0(col, collapse = " ")))
) %>%
  bind_rows()
# A tibble: 4 x 6
  `Name & location`                                Expertise                                                         Type                         Sector                    Payment                        Level
  <chr>                                            <chr>                                                             <chr>                        <chr>                     <chr>                          <chr>
1 Ms. Jane Doe, NUS                                Student                                                           Higher Education Insititute  Government and education  payment has been received      1
2 Andrew Saunders Phd., Municipality of Amsterdam  Chief Education Officer                                           Municipal Government         Government and education  Payment has not been received  5
3 Mr. Stephen Johnson, Orange County               Spokesperson for Sustainability, Health & Wellbeing and Wellfare  Municipal Government         Government and education  Payment has not been received  3
4 Mrs. Susan Andrews, Police                       Junior Research Manager Money Laundering                          National Government          Government and education  Payment has not been received  4
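To make the range computation easier to follow, here is a minimal base R sketch on a made-up `level` vector (the vector is an assumption for illustration, not the asker's data). It uses `lapply` so that the result is always a list, one element per record:

```r
# Hypothetical Level column: a non-NA value marks the first row of a record
level <- c(1, NA, NA, 5, NA, NA, NA, 3, NA)

# Row indices where records start, plus a sentinel one past the last row
starts <- c(which(!is.na(level)), length(level) + 1)

# One vector of row indices per record: from its start row up to the
# row just before the next record starts
ranges <- lapply(
  seq_len(length(starts) - 1),
  function(i) starts[i]:(starts[i + 1] - 1)
)

print(starts)
print(ranges)
```

Because the ranges depend only on where Level is non-NA, not on its value, repeated Level values do not merge separate records.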
Edit:
I used @Andrew's solution to compute a new unique_level column and got it to work. IMHO it is prettier than my first solution:
library(tidyverse)
df <- df %>%
mutate(`Name & location` = gsub("[0-9]+:\\s+", "", `Name & location`)) %>%
mutate(`Name & location` = gsub("<NA>", "", `Name & location`)) %>%
mutate(unique_level = ifelse(!is.na(Level), 1, NA) * 1:nrow(df)) %>%
fill(unique_level, .direction = "down") %>%
group_by(unique_level) %>%
summarise_all(~ gsub(" +$| NA", "", paste(., collapse = " "))) %>%
select(-unique_level)
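The first two mutate calls remove the row numbers and <NA> from the Name & location column. The gsub call inside summarise_all removes the trailing spaces and the NA strings that are added when the rows are pasted together.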
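One caveat with the `" NA"` pattern is that it also matches inside real text (for example `gsub(" NA", "", "Health NATO")` gives `"HealthTO"`). The following is a possible variant, not the answer's original code, that drops real `NA` values before pasting instead of removing them afterwards. It assumes dplyr 1.0 or later for `across()`, and the `toy` data frame and the `record` column name are made up for illustration:

```r
library(dplyr)

# Hypothetical toy data: two records that share the same Level value
toy <- tibble(
  Name      = c("Ms. Jane", "Doe,", "NUS", "Mr. Stephen", "Johnson,"),
  Expertise = c("Student", NA, NA, "Spokesperson for", "Sustainability"),
  Level     = c(1, NA, NA, 1, NA)
)

combined <- toy %>%
  # A new record starts at every non-NA Level, so the running count of
  # non-NA values gives each row a record id even when Level repeats
  mutate(record = cumsum(!is.na(Level))) %>%
  group_by(record) %>%
  # Drop real NA values before pasting, so no "NA" text needs removing later
  summarise(across(everything(), ~ paste(na.omit(.x), collapse = " ")),
            .groups = "drop") %>%
  select(-record)

print(combined)
```

This only works when the gaps are real `NA` values; literal strings such as "7: <NA>" in the reproducible example would still need the two `gsub` cleanup steps first. As with the answer's code, every column, including Level, comes back as character.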
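This question, "R: merge a variable number of rows in multiple columns based on non-empty rows in another column", comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/58434130/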