javascript - Using R to scrape (rvest): accepting terms and conditions on a JS web page

Tags: javascript r web-scraping rvest

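I have another web-scraping question. I am using rvest to try to scrape some data from a police report website. I have been looking around, but I can't seem to find a way to get past the site's "Accept Terms and Conditions" page and its "I Agree" button. How do I submit "I Agree" so that I can access the site?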

Website: http://www.wspdp2c.org/Summary_Disclaimer.aspx

require(httr)
require(XML)
library(RCurl)
library(rvest)

wspd.url<- "http://www.wspdp2c.org/Summary_Disclaimer.aspx"

wspd.session<-html_session(wspd.url)
wspd.form<-html_form(read_html(wspd.session))
wspd.form

R output:

> wspd.form
[[1]]
<form> 'Form1' (POST ./Summary_Disclaimer.aspx)
  <input hidden> '_popupBlockerExists': true
  <input hidden> '__VIEWSTATE': /wEPDwUKLTUwMDM5Nzk4OA9....
  <input hidden> '__VIEWSTATEGENERATOR': 27903AD3
  <input hidden> '__EVENTVALIDATION': /wEdAAky7XCY2Cjbe0DHcJ....
  <select> 'ctl00$MasterPage$DDLSiteMap1$ddlQuickLinks' [1/7]
  <input submit> 'ctl00$MasterPage$mainContent$CenterColumnContent$btnContinue': I Agree

Best answer

You will need to figure out how to get Selenium running on your system and how to make the remoteDr(...) call work. After that, this should help you get started:

library(seleniumPipes)
library(rvest)
library(dplyr)
library(stringi)
library(purrr)

remDr  <- remoteDr(...)

remDr %>% go("http://www.wspdp2c.org/Summary_Disclaimer.aspx")

submit <- remDr %>% findElement("xpath", ".//input[@type='submit']")
submit %>% elementClick()

from_date <- remDr %>% findElement("xpath", ".//input[@name='MasterPage$mainContent$txtDateFrom2']")
from_date %>% elementClear()
from_date %>% elementSendKeys("12/22/2016")

to_date <- remDr %>% findElement("xpath", ".//input[@name='MasterPage$mainContent$txtDateTo2']")
to_date %>% elementClear()
to_date %>% elementSendKeys("12/23/2016", selKeys$escape) # esc clears the popup calendar

search <- remDr %>% findElement("class name", "ui-icon-search")
search %>% elementClick()

remDr %>% getPageSource() -> pg
html_nodes(pg, "table.DataGridText") -> tab

html_nodes(tab, xpath=".//td[2]")[1:9] %>% 
  html_text() %>% 
  as.POSIXct(format="%m/%d/%Y %H:%M") -> occurred

html_nodes(tab, xpath=".//td[3]")[1:9] %>% 
  html_text() -> incident_or_arrest

html_nodes(tab, xpath=".//td[4]")[1:9] %>%
  html_text() %>% 
  stri_trim_both() -> case_or_arrestee

stri_match_all_regex(case_or_arrestee,
                     paste0(c("Case #: ([[:digit:]]+)",
                       "Primary Offense: ([[:print:]]+)",
                       "Arrestee: ([[:print:]]+)",
                       "Charge: ([[:print:]]+)"), collapse="|")) %>% 
  map(~apply(.[,2:5], 1, discard, is.na)) %>% 
  map_df(function(x) {
    x <- as.list(x)
    if (stri_detect_regex(x[[1]], "[[:alpha:]]")) {
      setNames(x, c("arrestee", "charge"))
    } else {
      setNames(x, c("case_number", "primary_offense"))
    }
  }) -> case_or_arrestee

html_nodes(tab, xpath=".//td[5]")[1:9] %>% 
  html_text() -> location

tibble(occurred, incident_or_arrest, location) %>% # data_frame() is deprecated; dplyr re-exports tibble()
  bind_cols(case_or_arrestee) %>% 
  glimpse()
## Observations: 9
## Variables: 7
## $ occurred           <dttm> 2016-12-22 00:00:00, 2016-12-22 00:00:00, 2016-12-22 00:0...
## $ incident_or_arrest <chr> "Incident", "Incident", "Arrest", "Incident", "Incident", ...
## $ location           <chr> "2600-BLK    TODDLER PLACE DR", "300-BLK    ALSPAUGH DR", ...
## $ case_number        <chr> "1667276", "1667273", NA, "1667249", "1667248", NA, NA, "1...
## $ primary_offense    <chr> "BREAKING & ENTERING WITH FORCE", "MALICIOUS INJURY TO PRO...
## $ arrestee           <chr> NA, NA, "THOMAS, KERRY MARTIN", NA, NA, "LOZANO, MIGUEL AR...
## $ charge             <chr> NA, NA, "PANHANDLING W/ NO PRIVLEDGE LICENSE", NA, NA, "AN...
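
The regex step in the middle of the code above is the least obvious part, so here is a small offline sketch of what it does. The sample cell text, including the assumption that the two fields in a cell are separated by a line break, is hypothetical, as are the names `cells`, `pattern`, and `fields`. Base R subsetting stands in for `purrr::discard` here so that only stringi is needed.

```r
library(stringi)

# Hypothetical cell text; the line-break layout is an assumption,
# not something taken from the live site.
cells <- c(
  "Case #: 1667276\nPrimary Offense: BREAKING & ENTERING WITH FORCE",
  "Arrestee: DOE, JOHN\nCharge: EXAMPLE CHARGE"
)

# Same alternation as in the answer: four alternatives, one capture group each.
pattern <- paste0(c("Case #: ([[:digit:]]+)",
                    "Primary Offense: ([[:print:]]+)",
                    "Arrestee: ([[:print:]]+)",
                    "Charge: ([[:print:]]+)"), collapse = "|")

# One matrix per cell: one row per match, column 1 is the full match,
# columns 2-5 are the capture groups (NA where a group did not take part).
matches <- stri_match_all_regex(cells, pattern)

# Keep the single non-NA capture from each row.
fields <- lapply(matches, function(m) {
  groups <- m[, 2:5, drop = FALSE]
  unname(apply(groups, 1, function(row) row[!is.na(row)]))
})

print(fields)
```

Each element of `fields` ends up as a two-element character vector. That is why the answer can tell incident rows from arrest rows by checking whether the first element contains letters.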

Regarding "javascript - Using R to scrape (rvest): accepting terms and conditions on a JS web page", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/41317373/
