r - Looping over URLs and HTML nodes for web scraping with rvest in R

Tags: r loops web-scraping rvest

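I have a data frame pubs with two columns: url and html.node. I want to write a loop that reads each URL, retrieves the HTML content, extracts the information indicated by the html.node column, and accumulates it in a data frame or a list.
All the URLs are different, and all the HTML nodes are different.
My code so far is: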

score <- vector()
k <- 1
for (r in 1:nrow(pubs)){
  art.url <- pubs[r, 1] # column 1 contains URL
  art.node <- pubs[r, 2] # column 2 contains html nodes as characters

  art.contents <- read_html(art.url)
  score <- art.contents %>% html_nodes(art.node) %>% html_text()
  k<-k+1
  print(score)
}

Thanks for your help.

Best answer

First, make sure that every site you want to scrape allows scraping; breaking a site's rules could cause legal problems.

(Note: since you did not provide data, I only used http://toscrape.com/, a sandbox site for scraping.)

After that, you can proceed as follows; I hope it helps:

# first, your data; I assume it looks similar to this
pubs <- data.frame(site = c("http://quotes.toscrape.com/",
                            "http://quotes.toscrape.com/"),
                   html.node = c(".text", ".author"), stringsAsFactors = FALSE)

Then the loop you need:

library(rvest)
# an empty list, to fill with the scraped data
empty_list <- list()

# fill the list with the scraped data, one element per row of pubs
for (i in seq_len(nrow(pubs))) {
  art.url <- pubs[i, 1]   # choose the site as you did
  art.node <- pubs[i, 2]  # choose the node as you did
  # scrape it: read the page, select the nodes, keep their text
  empty_list[[i]] <- read_html(art.url) %>% html_nodes(art.node) %>% html_text()
}

The result is now a list, but with:

names(empty_list) <- pubs$site

you name each element of the list after its site, which gives:

$`http://quotes.toscrape.com/`
 [1] "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”"                
 [2] "“It is our choices, Harry, that show what we truly are, far more than our abilities.”"                                              
 [3] "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”"
 [4] "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”"                           
 [5] "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”"                    
 [6] "“Try not to become a man of success. Rather become a man of value.”"                                                                
 [7] "“It is better to be hated for what you are than to be loved for what you are not.”"                                                 
 [8] "“I have not failed. I've just found 10,000 ways that won't work.”"                                                                  
 [9] "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”"                                              
[10] "“A day without sunshine is like, you know, night.”"                                                                                 

$`http://quotes.toscrape.com/`
 [1] "Albert Einstein"   "J.K. Rowling"      "Albert Einstein"   "Jane Austen"       "Marilyn Monroe"    "Albert Einstein"   "André Gide"       
 [8] "Thomas A. Edison"  "Eleanor Roosevelt" "Steve Martin"   
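
Both elements above end up with the same name because the two rows share one URL, so names alone do not tell the results apart. Since the question also asks about accumulating into a data frame, here is a minimal sketch of one way to flatten such a list into a long data frame. The `scraped` list and its placeholder strings are assumptions standing in for `empty_list`, so that the sketch runs without network access; `results` and `n_found` are illustrative names.

```r
# Stand-in for the scraped list; in practice this would be `empty_list`
# from the loop above. The strings are placeholders, not real output.
scraped <- list(
  c("quote one", "quote two"),
  c("Author A", "Author B")
)

# Same two-column layout as `pubs` (URL first, selector second).
pubs <- data.frame(site = c("http://quotes.toscrape.com/",
                            "http://quotes.toscrape.com/"),
                   html.node = c(".text", ".author"),
                   stringsAsFactors = FALSE)

# Repeat each row's site and node once per extracted string, so every
# value keeps track of the page and selector it came from.
n_found <- lengths(scraped)
results <- data.frame(site = rep(pubs$site, times = n_found),
                      html.node = rep(pubs$html.node, times = n_found),
                      text = unlist(scraped, use.names = FALSE),
                      stringsAsFactors = FALSE)
print(results)
```

Each row of `results` then carries its own site and selector, which may be easier to filter than a list with duplicated names.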

Obviously, it should also work with different sites and different nodes.
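
With genuinely different sites, a single unreachable page or bad selector would stop the loop partway through. The following is a rough sketch, not part of the original answer, of wrapping each iteration in `tryCatch()` so a failure is recorded as `NA` and the loop continues. The helper name `scrape_rows`, its `scrape_one` and `delay` arguments, and the one-second default pause are all assumptions made for illustration; the default `scrape_one` uses the same rvest calls as the loop above.

```r
# Hypothetical helper: scrape every row of a `pubs`-style data frame
# (URL in column 1, selector in column 2), storing NA where a row fails.
scrape_rows <- function(pubs,
                        scrape_one = function(url, node) {
                          page <- rvest::read_html(url)
                          rvest::html_text(rvest::html_nodes(page, node))
                        },
                        delay = 1) {
  out <- vector("list", nrow(pubs))
  for (i in seq_len(nrow(pubs))) {
    out[[i]] <- tryCatch(
      scrape_one(pubs[i, 1], pubs[i, 2]),
      error = function(e) {
        # report the failure and keep going with the next row
        message("Row ", i, " failed: ", conditionMessage(e))
        NA_character_
      }
    )
    Sys.sleep(delay)  # pause between requests to go easy on the servers
  }
  out
}

# Offline demonstration with a stand-in scraper that fails on one selector.
demo_pubs <- data.frame(site = c("a", "b"),
                        html.node = c(".ok", ".bad"),
                        stringsAsFactors = FALSE)
demo_scraper <- function(url, node) {
  if (node == ".bad") stop("selector not found")
  paste(url, node)
}
demo <- scrape_rows(demo_pubs, scrape_one = demo_scraper, delay = 0)
```

With real data, calling `scrape_rows(pubs)` would use the rvest-based default; the stand-in scraper is only there so the sketch can be tried without touching any site.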

Regarding "r - Looping over URLs and HTML nodes for web scraping with rvest in R", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/54917918/