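I have the following function, which lets me scrape the content of a Wikipedia page from its URL (the exact content is irrelevant to this question):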
getPageContent <- function(url) {
  library(rvest)
  library(magrittr)
  # html() is deprecated in newer rvest releases; read_html() replaces it
  pc <- html(url) %>%
    # select the main article container
    html_node("#mw-content-text") %>%
    # strip tags
    html_text() %>%
    # concatenate vector of texts into one string
    paste(collapse = "")
  pc
}
This works when the function is called on a single URL:
getPageContent("https://en.wikipedia.org/wiki/Balance_(game_design)")
[1] "In game design, balance is the concept and the practice of tuning a game's rules, usually with the goal of preventing any of its component systems from being ineffective or otherwise undesirable when compared to their peers. An unbalanced system represents wasted development resources at the very least, and at worst can undermine the game's entire ruleset by making impo (...)
However, when I pass the function to dplyr to fetch the content of several pages, I get an error:
example <- data.frame(url = c("https://en.wikipedia.org/wiki/Balance_(game_design)",
                              "https://en.wikipedia.org/wiki/Koncerthuset",
                              "https://en.wikipedia.org/wiki/Tifama_chera",
                              "https://en.wikipedia.org/wiki/Difference_theory"),
                      stringsAsFactors = FALSE
)
library(dplyr)
example <- mutate(example, content = getPageContent(url))
Error: length(url) == 1 is not TRUE
In addition: Warning message:
In mutate_impl(.data, dots) :
the condition has length > 1 and only the first element will be used
Looking at the error, I think the problem is that getPageContent cannot handle a vector of URLs, but I don't know how to fix it.
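Edit: the two suggested solutions, 1) using rowwise() and 2) using sapply() (timed below as unlist(lapply(...))), both work well. In a simulation with 10 random WP articles, the second approach was about 25% faster in elapsed time:
> system.time(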
+ example <- example %>%
+ rowwise() %>%
+ mutate(content = getPageContent(url))
+ )
   user  system elapsed
   0.39    0.14    1.21
>
>
> system.time(
+ example$content <- unlist(lapply(example$url, getPageContent))
+ )
   user  system elapsed
   0.49    0.11    0.90
Best answer
You can use rowwise(), which makes mutate() call the function once per row, so each call receives a single URL:
res <- example %>%
  rowwise() %>%
  mutate(content = getPageContent(url))
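As an alternative to rowwise(), the function can be applied element by element inside mutate() itself. The sketch below is only an illustration under stated assumptions: fetch_one is a hypothetical stand-in for getPageContent that, like the original, insists on a single URL but does no network access, and the names urls, fetch_many, res_vapply and res_vectorized are made up for this example.

```r
library(dplyr)

# Hypothetical stand-in for getPageContent(): it accepts exactly one URL,
# as the original does, but returns a dummy string so the sketch runs offline.
fetch_one <- function(url) {
  stopifnot(length(url) == 1)
  paste("content of", url)
}

urls <- data.frame(
  url = c("https://en.wikipedia.org/wiki/Koncerthuset",
          "https://en.wikipedia.org/wiki/Difference_theory"),
  stringsAsFactors = FALSE
)

# mutate() evaluates the expression once with the whole url column, so a
# scalar function has to be applied to each element explicitly.
# vapply() does that and checks that every result is one character string.
res_vapply <- mutate(
  urls,
  content = vapply(url, fetch_one, character(1), USE.NAMES = FALSE)
)

# Vectorize() wraps the scalar function so it can take a vector directly.
fetch_many <- Vectorize(fetch_one, USE.NAMES = FALSE)
res_vectorized <- mutate(urls, content = fetch_many(url))
```

Both variants still make one call per URL, so they should not be expected to reduce the time spent on the downloads; they only avoid the length check failing on a vector.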
Regarding "R - passing a vector to a custom function in dplyr::mutate", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/32033815/