r - Error when trying to scrape data from FiveThirtyEight

Tags: r web-scraping rvest

I'm trying to scrape data from FiveThirtyEight's presidential approval ratings page so that the dates, pollsters, sample sizes, and percentages end up in a data frame in R. My first attempt used html_nodes:

library(rvest)
library(dplyr)  # provides nth()

pres_approval <- read_html("https://projects.fivethirtyeight.com/trump-approval-ratings/")

pres_approval <- pres_approval %>%
  html_nodes(css = "table") %>%
  nth(2) %>%
  html_table(header = TRUE, fill = TRUE)

which returned

Error in nodes_duplicated(nodes) : Expecting an external pointer: [type=NULL].

I then tried again with selectors picked using SelectorGadget:

pres_approval <- read_html("https://projects.fivethirtyeight.com/trump-approval-ratings/")

pres_approval <- pres_approval %>%
  html_nodes(css = "td , .heat-map , .pollster a") %>%
  nth(2) %>%
  html_table(header = TRUE, fill = TRUE)

which returned

Error in html_table.xml_node(., header = TRUE, fill = TRUE) : html_name(x) == "table" is not TRUE

Where can I go from here?

Best answer

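They typically load the data asynchronously via XHR requests. If you open the developer tools in your browser and reload the page, you can see these requests. Under Network -> XHR you'll see lots of lovely JSON: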

[Screenshot: the Network -> XHR panel in the browser developer tools]

I'm not sure which one you want (I only skimmed the question), but you can easily grab all the main JSON files:

polls <- jsonlite::fromJSON("https://projects.fivethirtyeight.com/trump-approval-ratings/polls.json")

str(polls, 1)
## 'data.frame': 3401 obs. of  14 variables:
##  $ id           : int  77261 77265 77272 77249 77257 77266 77596 77246 77263 77253 ...
##  $ subgroup     : chr  "All polls" "All polls" "All polls" "All polls" ...
##  $ sampleSize   : int  1992 1500 1190 1043 1500 2692 1712 1500 1500 1991 ...
##  $ population   : chr  "rv" "a" "rv" "rv" ...
##  $ weight       : num  0.946 0.245 1.645 1.166 0.639 ...
##  $ grade        : chr  "B-" "B" "A-" "B" ...
##  $ multiversions: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ url          : chr  "http://www.politico.com/story/2017/01/poll-voters-liked-trumps-inaugural-address-234148" "http://www.gallup.com/poll/201617/gallup-daily-trump-job-approval.aspx" "https://poll.qu.edu/national/release-detail?ReleaseID=2415" "http://www.publicpolicypolling.com/pdf/2015/PPP_Release_National_12617.pdf" ...
##  $ created_at   : chr  "2017-01-23" "2017-01-23" "2017-01-26" "2017-01-25" ...
##  $ startDate    : chr  "2017-01-20" "2017-01-20" "2017-01-20" "2017-01-23" ...
##  $ endDate      : chr  "2017-01-22" "2017-01-22" "2017-01-25" "2017-01-24" ...
##  $ pollster     : chr  "Morning Consult" "Gallup" "Quinnipiac University" "Public Policy Polling" ...
##  $ tracking     : chr  "" "T" "" "" ...
##  $ answers      :List of 3401
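
As a rough sketch of how the fields from the question map onto this output: the dates, pollster, and sample size are ordinary columns, so they can be selected directly. The example below uses a small stand-in data frame (`polls_demo`, a hypothetical name) built from the first four rows printed above so that it runs offline; with the real data you would use `polls` instead. The percentages appear to live in the `answers` list column, whose layout is not shown in the output, so it is left out here.

```r
# Toy stand-in for `polls`, copied from the first four rows of the str() output.
polls_demo <- data.frame(
  pollster   = c("Morning Consult", "Gallup",
                 "Quinnipiac University", "Public Policy Polling"),
  sampleSize = c(1992L, 1500L, 1190L, 1043L),
  startDate  = c("2017-01-20", "2017-01-20", "2017-01-20", "2017-01-23"),
  endDate    = c("2017-01-22", "2017-01-22", "2017-01-25", "2017-01-24"),
  stringsAsFactors = FALSE
)

# Keep only the columns of interest and parse the ISO-formatted dates.
wanted <- polls_demo[, c("startDate", "endDate", "pollster", "sampleSize")]
wanted$startDate <- as.Date(wanted$startDate)
wanted$endDate   <- as.Date(wanted$endDate)

str(wanted)

# With the real data, inspect one element of the list column before
# deciding how to unpack the percentages:
# str(polls$answers[[1]])
```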

approval <- jsonlite::fromJSON("https://projects.fivethirtyeight.com/trump-approval-ratings/approval.json")

str(approval, 1)
## 'data.frame': 2751 obs. of  9 variables:
##  $ date               : chr  "2017-01-23" "2017-01-23" "2017-01-23" "2017-01-24" ...
##  $ future             : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
##  $ subgroup           : chr  "Adults" "All polls" "Voters" "Adults" ...
##  $ approve_estimate   : chr  "45" "45.46693" "46" "45" ...
##  $ approve_hi         : chr  "51.1347" "50.88971" "52.29238" "50.98562" ...
##  $ approve_lo         : chr  "38.8653" "40.04416" "39.70762" "39.01438" ...
##  $ disapprove_estimate: chr  "45" "41.26452" "37" "45.74659" ...
##  $ disapprove_hi      : chr  "51.1347" "46.68729" "43.29238" "51.73221" ...
##  $ disapprove_lo      : chr  "38.8653" "35.84175" "30.70762" "39.76097" ...

historic_approval <- jsonlite::fromJSON("https://projects.fivethirtyeight.com/trump-approval-ratings/historical-approval.json")

str(historic_approval, 1)
## 'data.frame': 26001 obs. of  6 variables:
##  $ president          : chr  "Harry S. Truman" "Harry S. Truman" "Harry S. Truman" "Harry S. Truman" ...
##  $ date               : chr  "1945-06-06" "1945-06-07" "1945-06-08" "1945-06-09" ...
##  $ days               : int  55 56 57 58 59 60 61 62 63 64 ...
##  $ subgroup           : chr  "All polls" "All polls" "All polls" "All polls" ...
##  $ approve_estimate   : chr  "87" "87" "87" "87" ...
##  $ disapprove_estimate: chr  "3" "3" "3" "3" ...

I'd run the resulting data frames through readr::type_convert() to get better column types.
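
A minimal sketch of that conversion, using a stand-in data frame (`approval_demo`, a hypothetical name) copied from the first three rows of the `approval` output above so it runs without a network connection. `type_convert()` re-guesses the type of every character column, so the estimate columns should come back as numbers rather than strings.

```r
library(readr)

# Toy stand-in for `approval`: every column is character, as in the JSON.
approval_demo <- data.frame(
  date                = c("2017-01-23", "2017-01-23", "2017-01-23"),
  subgroup            = c("Adults", "All polls", "Voters"),
  approve_estimate    = c("45", "45.46693", "46"),
  disapprove_estimate = c("45", "41.26452", "37"),
  stringsAsFactors = FALSE
)

# Re-guess column types; readr prints the column specification it chose.
converted <- type_convert(approval_demo)

str(converted)
```

If the guessed types are not what you want, `type_convert()` also accepts a `col_types` argument for specifying them explicitly.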

Source: the Stack Overflow question "r - Error when trying to scrape data from FiveThirtyEight": https://stackoverflow.com/questions/53439989/
