r - If error, then... in R

Tags: r, error-handling, web-scraping

I am downloading weather data from the web. To do this I created a simple for loop that appends a data frame of data to a list (one list per city). It works fine, but when there is no data (no weather observations table on the page for a given date) it throws an error, for example at this URL ("https://www.wunderground.com/history/airport/EPLB/2015/12/25/DailyHistory.html?req_city=Abramowice%20Koscielne&req_statename=Poland"):

Error in Lublin[i] <- url4 %>% read_html() %>% html_nodes(xpath = "//*[@id=\"obsTable\"]") %>%  : 
  replacement has length zero

How can I add an if statement so that when this error occurs, a row of NAs (13 observations) is put into the list instead?

Also, is there a faster way to download all the data than a for loop?

My code:

library(rvest)  # for read_html(), html_nodes(), html_table(), and the %>% pipe

c <- seq(as.Date("2015/1/1"), as.Date("2016/12/31"), "days")  # note: this masks base::c()
Warszawa <- list()
Wroclaw <- list()
Bydgoszcz <- list()
Lublin <- list()
Gorzow <- list()
Lodz <- list()
Krakow <- list()
Opole <- list()
Rzeszow <- list()
Bialystok <- list()
Gdansk <- list()
Katowice <- list()
Kielce <- list()
Olsztyn <- list()
Poznan <- list()
Szczecin <- list()
date <- list()
for(i in 1:length(c)) {
y<-as.numeric(format(c[i],'%Y'))
m<-as.numeric(format(c[i],'%m'))
d<-as.numeric(format(c[i],'%d'))
date[i] <- c[i]
url1 <- sprintf("https://www.wunderground.com/history/airport/EPWA/%d/%d/%d/DailyHistory.html?req_city=Warszawa&req_state=MZ&req_statename=Poland", y, m, d)
url2 <- sprintf("https://www.wunderground.com/history/airport/EPWR/%d/%d/%d/DailyHistory.html?req_city=Wrocław&req_statename=Poland", y, m, d)
url3 <- sprintf("https://www.wunderground.com/history/airport/EPBY/%d/%d/%d/DailyHistory.html?req_city=Bydgoszcz&req_statename=Poland", y, m, d)
url4 <- sprintf("https://www.wunderground.com/history/airport/EPLB/%d/%d/%d/DailyHistory.html?req_city=Abramowice%%20Koscielne&req_statename=Poland", y, m, d)
url5 <- sprintf("https://www.wunderground.com/history/airport/EPZG/%d/%d/%d/DailyHistory.html?req_city=Gorzow%%20Wielkopolski&req_statename=Poland", y, m, d)
url6 <- sprintf("https://www.wunderground.com/history/airport/EPLL/%d/%d/%d/DailyHistory.html?req_city=Lodz&req_statename=Poland", y, m, d)
url7 <- sprintf("https://www.wunderground.com/history/airport/EPKK/%d/%d/%d/DailyHistory.html?req_city=Krakow&req_statename=Poland", y, m, d)
url8 <- sprintf("https://www.wunderground.com/history/airport/EPWR/%d/%d/%d/DailyHistory.html?req_city=Opole&req_statename=Poland", y, m, d)
url9 <- sprintf("https://www.wunderground.com/history/airport/EPRZ/%d/%d/%d/DailyHistory.html?req_city=Rzeszow&req_statename=Poland", y, m, d)
url10 <- sprintf("https://www.wunderground.com/history/airport/UMMG/%d/%d/%d/DailyHistory.html?req_city=Dojlidy&req_statename=Poland", y, m, d)
url11 <- sprintf("https://www.wunderground.com/history/airport/EPGD/%d/%d/%d/DailyHistory.html?req_city=Gdansk&req_statename=Poland", y, m, d)
url12 <- sprintf("https://www.wunderground.com/history/airport/EPKM/%d/%d/%d/DailyHistory.html?req_city=Katowice&req_statename=Poland", y, m, d)
url13 <- sprintf("https://www.wunderground.com/history/airport/EPKT/%d/%d/%d/DailyHistory.html?req_city=Chorzow%%20Batory&req_statename=Poland", y, m, d)
url14 <- sprintf("https://www.wunderground.com/history/airport/EPSY/%d/%d/%d/DailyHistory.html", y, m, d)
url15 <- sprintf("https://www.wunderground.com/history/airport/EPPO/%d/%d/%d/DailyHistory.html?req_city=Poznan%%20Old%%20Town&req_statename=Poland", y, m, d)
url16 <- sprintf("https://www.wunderground.com/history/airport/EPSC/%d/%d/%d/DailyHistory.html?req_city=Szczecin&req_statename=Poland", y, m, d)

Warszawa[i] <- url1 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Wroclaw[i] <- url2 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Bydgoszcz[i] <- url3 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Lublin[i] <- url4 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Gorzow[i] <- url5 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Lodz[i] <- url6 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Krakow[i] <- url7 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Opole[i] <- url8 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Rzeszow[i] <- url9 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Bialystok[i] <- url10 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Gdansk[i] <- url11 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Katowice[i] <- url12 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Kielce[i] <- url13 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Olsztyn[i] <- url14 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Poznan[i] <- url15 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()
Szczecin[i] <- url16 %>%
  read_html() %>%
  html_nodes(xpath='//*[@id="obsTable"]') %>%
  html_table()

}

Thanks for any help.

Best Answer

First off, I got a bit carried away, so this answer is somewhat longer than originally planned. I decided to tackle three problems for you: the repetition in building the valid URLs; the repetition in fetching the relevant information from those URLs; and the errors that occur while scraping.

So first, here is a simpler way to build the links you want to scrape:

library(httr)
library(rvest)

## All the dates:
dates <- seq(as.Date("2015/1/1"), as.Date("2016/12/31"), "days")
dates <- gsub("-", "/", x = dates)

## All the regions and links:
abbreviations <- c("EPWA", "EPWR", "EPBY", "EPLB", "EPZG", "EPLL", "EPKK",
                   "EPWR", "EPRZ", "UMMG", "EPGD", "EPKM", "EPKT",
                   "EPSY", "EPPO", "EPSC")

links <- paste0("https://www.wunderground.com/history/airport/",
                abbreviations, "/")
links <- lapply(links, function(x) paste0(x, dates, "/DailyHistory.html"))
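
As a quick sanity check (hypothetical output, assuming the code above ran as-is), the first link of the first city should look like this:

links[[1]][1]
# [1] "https://www.wunderground.com/history/airport/EPWA/2015/01/01/DailyHistory.html"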

Now that we have all the links in links, we define a function that takes a link, scrapes the HTML, and extracts whatever information we want: in your case the city name, the date, and the weather table. I decided to use the city name and date as the name of the returned object, so you can easily tell which weather table belongs to which city and date:
## Get the weather report & name it
get_table <- function(link){
  # Get the html from a link; try() keeps one bad link from stopping everything
  html <- try(link %>%
                read_html())
  if("try-error" %in% class(html)){
    print("HTML not found, skipping to next link")
    return("HTML not found, skipping to next link")
  }

  # Get the weather table from that page
  weather_table <- html %>%
    html_nodes(xpath = '//*[@id="obsTable"]') %>%
    html_table()
  if(length(weather_table) == 0){
    print("No weather table available for this day")
    return("No weather table available for this day")
  }

  # Use info from the html to get the city, for naming the list
  region <- html %>%
    html_nodes(xpath = '//*[@id="location"]') %>%
    html_text()
  region <- strsplit(region, "[1-9]")[[1]][1]
  region <- gsub("\n", "", region)
  region <- gsub("\t\t", "", region)

  # Use info from the html to get the date, and name the list
  which_date <- html %>%
    html_nodes(xpath = '//*[@class="history-date"]') %>%
    html_text()

  city_date <- paste0(region, which_date)

  # Name the output
  names(weather_table) <- city_date

  print(paste0("Just scraped ", city_date))
  return(weather_table)
}
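
Since your question specifically asked for a row of NAs (13 observations) rather than a message, here is a minimal sketch of a variant of the function above; na_row and get_table_na are hypothetical names, and it assumes the observations table always has 13 columns:

na_row <- function(n_cols = 13) {
  # One row of NAs, so failed days can later be row-bound with good ones
  as.data.frame(matrix(NA, nrow = 1, ncol = n_cols))
}

get_table_na <- function(link) {
  html <- try(read_html(link), silent = TRUE)
  if (inherits(html, "try-error")) {
    return(na_row())   # request failed -> NA row
  }
  weather_table <- html %>%
    html_nodes(xpath = '//*[@id="obsTable"]') %>%
    html_table()
  if (length(weather_table) == 0) {
    return(na_row())   # page exists but has no table -> NA row
  }
  weather_table[[1]]
}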

Running this function should work for all the URLs we built, including the faulty URL you posted in your question:
# A little test-run, to see if your faulty URL works:
  testlink      <- "https://www.wunderground.com/history/airport/EPLB/2015/12/25/DailyHistory.html?req_city=Abramowice%20Koscielne&req_statename=Poland"
  links[[1]][5] <- testlink
  tested        <- sapply(links[[1]][1:6], get_table, USE.NAMES = FALSE)
  # [1] "Just scraped Warsaw, Poland Thursday, January 1, 2015"
  # [1] "Just scraped Warsaw, Poland Friday, January 2, 2015"
  # [1] "Just scraped Warsaw, Poland Saturday, January 3, 2015"
  # [1] "Just scraped Warsaw, Poland Sunday, January 4, 2015"
  # [1] "No weather table available for this day"
  # [1] "Just scraped Warsaw, Poland Tuesday, January 6, 2015"

Works like a charm, so you can fetch the weather data for all of Poland with this loop:
# For all sublists in links (corresponding to cities)
# scrape all links (corresponding to days)
city <- rep(list(list()), length(abbreviations))
for(i in 1:length(links)){
  city[[i]] <- sapply(links[[i]], get_table, USE.NAMES = FALSE)
}
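
As for the second part of your question, speed: the bottleneck here is the network, not the for loop itself, so parallelising the requests is what actually helps. A rough sketch using the parallel package (mclapply relies on forking, so this assumes a Unix-alike; on Windows you would use parLapply with a cluster instead):

library(parallel)
# Scrape the cities in parallel, one worker per core (minus one)
city <- mclapply(links,
                 function(l) sapply(l, get_table, USE.NAMES = FALSE),
                 mc.cores = detectCores() - 1)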

The original question, "r - If error, then... in R", can be found on Stack Overflow: https://stackoverflow.com/questions/43389773/
