I'm trying to scrape data from FiveThirtyEight's presidential approval ratings page and put the date, pollster, sample size, and percentages into a data frame in R. My first attempt used html_nodes:
pres_approval <- read_html("https://projects.fivethirtyeight.com/trump-approval-ratings/")
pres_approval <- pres_approval %>%
html_nodes(css = "table") %>%
nth(2) %>%
html_table(header = TRUE, fill = TRUE)
which returned:
Error in nodes_duplicated(nodes) : Expecting an external pointer: [type=NULL].
Then I tried again with selectors picked using SelectorGadget:
pres_approval <- read_html("https://projects.fivethirtyeight.com/trump-approval-ratings/")
pres_approval <- pres_approval %>%
html_nodes(css = "td , .heat-map , .pollster a") %>%
nth(2) %>%
html_table(header = TRUE, fill = TRUE)
which returned:
Error in html_table.xml_node(., header = TRUE, fill = TRUE) : html_name(x) == "table" is not TRUE
Where can I go from here?
Best answer
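They typically load the data asynchronously via XHR requests, which you can see if you open Developer Tools in your browser and reload the page. Under Network -> XHR you'll find lots of lovely JSON.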
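To illustrate why html_nodes() and html_table() find nothing to work with in that situation, here is a minimal sketch. The inline HTML is a hypothetical stand-in for a JavaScript-driven page (it is not the real FiveThirtyEight markup): the static document contains only an empty container, so a parser that does not run JavaScript never sees a table.

```r
library(rvest)

# Hypothetical stand-in for a page whose table is built by JavaScript after
# an XHR request; the static HTML holds only an empty container.
static_page <- read_html('
  <html><body>
    <div id="polls"></div>
    <script>/* JavaScript would fetch JSON and build the table here */</script>
  </body></html>
')

# rvest only parses the static markup, so no <table> nodes are found.
tables <- html_nodes(static_page, "table")
length(tables)
```

With nothing matching "table", there is no node to hand to html_table(), which is why going straight to the JSON behind the page tends to be the simpler route.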
I'm not sure which one you want (I only skimmed the question), but you can easily grab all the main JSON files:
polls <- jsonlite::fromJSON("https://projects.fivethirtyeight.com/trump-approval-ratings/polls.json")
str(polls, 1)
## 'data.frame': 3401 obs. of 14 variables:
## $ id : int 77261 77265 77272 77249 77257 77266 77596 77246 77263 77253 ...
## $ subgroup : chr "All polls" "All polls" "All polls" "All polls" ...
## $ sampleSize : int 1992 1500 1190 1043 1500 2692 1712 1500 1500 1991 ...
## $ population : chr "rv" "a" "rv" "rv" ...
## $ weight : num 0.946 0.245 1.645 1.166 0.639 ...
## $ grade : chr "B-" "B" "A-" "B" ...
## $ multiversions: logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ url : chr "http://www.politico.com/story/2017/01/poll-voters-liked-trumps-inaugural-address-234148" "http://www.gallup.com/poll/201617/gallup-daily-trump-job-approval.aspx" "https://poll.qu.edu/national/release-detail?ReleaseID=2415" "http://www.publicpolicypolling.com/pdf/2015/PPP_Release_National_12617.pdf" ...
## $ created_at : chr "2017-01-23" "2017-01-23" "2017-01-26" "2017-01-25" ...
## $ startDate : chr "2017-01-20" "2017-01-20" "2017-01-20" "2017-01-23" ...
## $ endDate : chr "2017-01-22" "2017-01-22" "2017-01-25" "2017-01-24" ...
## $ pollster : chr "Morning Consult" "Gallup" "Quinnipiac University" "Public Policy Polling" ...
## $ tracking : chr "" "T" "" "" ...
## $ answers :List of 3401
approval <- jsonlite::fromJSON("https://projects.fivethirtyeight.com/trump-approval-ratings/approval.json")
str(approval, 1)
## 'data.frame': 2751 obs. of 9 variables:
## $ date : chr "2017-01-23" "2017-01-23" "2017-01-23" "2017-01-24" ...
## $ future : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
## $ subgroup : chr "Adults" "All polls" "Voters" "Adults" ...
## $ approve_estimate : chr "45" "45.46693" "46" "45" ...
## $ approve_hi : chr "51.1347" "50.88971" "52.29238" "50.98562" ...
## $ approve_lo : chr "38.8653" "40.04416" "39.70762" "39.01438" ...
## $ disapprove_estimate: chr "45" "41.26452" "37" "45.74659" ...
## $ disapprove_hi : chr "51.1347" "46.68729" "43.29238" "51.73221" ...
## $ disapprove_lo : chr "38.8653" "35.84175" "30.70762" "39.76097" ...
historic_approval <- jsonlite::fromJSON("https://projects.fivethirtyeight.com/trump-approval-ratings/historical-approval.json")
str(historic_approval, 1)
## 'data.frame': 26001 obs. of 6 variables:
## $ president : chr "Harry S. Truman" "Harry S. Truman" "Harry S. Truman" "Harry S. Truman" ...
## $ date : chr "1945-06-06" "1945-06-07" "1945-06-08" "1945-06-09" ...
## $ days : int 55 56 57 58 59 60 61 62 63 64 ...
## $ subgroup : chr "All polls" "All polls" "All polls" "All polls" ...
## $ approve_estimate : chr "87" "87" "87" "87" ...
## $ disapprove_estimate: chr "3" "3" "3" "3" ...
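The question asks for date, pollster, sample size, and percentages in one data frame. The following is a rough sketch of one way to build that from a polls-style object. It runs on a made-up two-row sample rather than the live file, and it assumes that each element of the answers list-column is a data frame with choice and pct columns; that structure is not shown in the output above, so check it with str(polls$answers[[1]]) first. The helper name pct_for is hypothetical.

```r
library(jsonlite)

# Made-up sample that mimics some fields shown in str(polls, 1).
# ASSUMPTION: each `answers` entry holds choice/pct pairs. The pollster names
# and pct values are placeholders, not real poll results.
sample_json <- '[
  {"pollster": "Pollster A", "sampleSize": 1000, "endDate": "2017-01-22",
   "answers": [{"choice": "approve", "pct": 50},
               {"choice": "disapprove", "pct": 40}]},
  {"pollster": "Pollster B", "sampleSize": 1500, "endDate": "2017-01-23",
   "answers": [{"choice": "approve", "pct": 45},
               {"choice": "disapprove", "pct": 45}]}
]'
polls_sample <- fromJSON(sample_json)

# Hypothetical helper: pull the percentage for one choice out of a single
# answers data frame.
pct_for <- function(ans, choice) ans$pct[ans$choice == choice][1]

# Flatten the list-column into one plain column per choice.
result <- data.frame(
  date       = as.Date(polls_sample$endDate),
  pollster   = polls_sample$pollster,
  sampleSize = polls_sample$sampleSize,
  approve    = sapply(polls_sample$answers, pct_for, choice = "approve"),
  disapprove = sapply(polls_sample$answers, pct_for, choice = "disapprove")
)
result
```

If the real answers entries turn out to have a different shape, only pct_for should need to change.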
I'd run the resulting data frames through readr::type_convert() to get better column types.
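As a small sketch of that conversion step, the example below rebuilds two rows of the approval data by hand, using values copied from the str(approval, 1) output, and passes them through type_convert(). The object names approval_sample and converted are only for illustration; on the real data you would presumably call type_convert(approval) directly.

```r
library(readr)

# Two rows copied from the str(approval, 1) output, kept as character
# columns the way fromJSON() returned them.
approval_sample <- data.frame(
  date             = c("2017-01-23", "2017-01-23"),
  subgroup         = c("Adults", "All polls"),
  approve_estimate = c("45", "45.46693"),
  approve_hi       = c("51.1347", "50.88971"),
  stringsAsFactors = FALSE
)

# type_convert() re-guesses the type of every character column.
converted <- type_convert(approval_sample)
str(converted)
```

The estimate columns should come back as numeric and the date column as Date, while subgroup stays character.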
Source: a similar question on Stack Overflow, "r - Error when trying to scrape data from FiveThirtyEight": https://stackoverflow.com/questions/53439989/