selenium - Web scraping with PhantomJS/Selenium (from R): setting element values

Tags: selenium web-scraping phantomjs automated-tests rselenium

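I am having trouble scraping table data from http://www.nasdaqomx.com/commodities/market-prices

I can fetch the data, but I cannot seem to change or set the parameters on the page, so I cannot retrieve any other data.

These are the ids I can find on the page:

'#marketSelectId, #typesSelectId, #productsSelectId, #dateId, #isTraded, #excelId'

The ones I need to change seem to be (according to SelectorGadget in Chrome):

'#marketSelectId, #isTraded' (see the web page code at the end of this post)

Any help on how to change these would be appreciated.

My PhantomJS attempt is below:

// phantomNasdaqOmx.js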

var webPage = require('webpage');
var page = webPage.create();

var fs = require('fs');
var path = 'NasdaqOmx.html';

page.open('http://www.nasdaqomx.com/commodities/market-prices/history/',
function (status) {

// no luck
//  page.evaluate(function(){
// document.getElementById("#isTraded").value = false;
//  });

// no luck
//  $('.myCheckbox').removeAttr('checked');

// no luck
page.evaluate(function(){
document.getElementById('marketSelectId').value='EUK';

});

var content = page.content;
fs.write(path,content,'w');

phantom.exit();
});

My RSelenium attempt:

require('RSelenium')
library('XML')

remDr <- remoteDriver(remoteServerAddr = "localhost" 
                  , port = 32770L
                  , browserName = "firefox"
)

remDr$open()

site <- "http://www.nasdaqomx.com/commodities/market-prices" # create URL for each page to scrape
remDr$navigate(site) # navigates to webpage
## remDr$findElements(using = 'xpath', value = '//*@id')
remDr$executeScript("document.getElementById('marketSelectId').setAttribute('value', 'EUK')")

remDr$executeScript("document.getElementById('isTraded').setAttribute('value', '')");
##a <- remDr$executeScript("document.getElementById('isTraded').getAttribute('value')")
## remDR$ findElement(By.id("isTraded")).getAttribute("value");
##
##  Throws error
##  remDr$click(buttonId = 'isTraded')

elem <- remDr$findElement(using="id", value="derivatesNordicOutput") # get big table in text string

## elem$highlightElement() # just for interactive use in browser.  not necessary.
elemtxt <- elem$getElementAttribute("outerHTML")[[1]] # gets us the HTML
elemxml <- htmlTreeParse(elemtxt, useInternalNodes=T) # parse string into HTML tree to allow for querying with XPath
readHTMLTable(elemxml)

head(master)

marketSelectId - desired values and script information: 'eno', 'ede', 'euk'

//*[(@id = "marketSelectId")]
webpage js code
<label>Market:</label> <select id="marketSelectId">
    <!--optgroup label="Electricity"-->
    <option selected="selected" value="ENO">Electricity Nordic</option>
    <option value="EBE">Electricity Belgium</option>
    <option value="EFR">Electricity France</option>
    <option value="EDE">Electricity Germany</option>
    <option value="EIT">Electricity Italy</option>
    <option value="ENL">Electricity Netherlands</option>
    <option value="EES">Electricity Spain</option>
    <option value="EUK">Electricity UK</option>
    <!--/optgroup-->
    <option value="EUA">Carbon Market</option>
    <option value="ZEE">Natural Gas Belgium</option>        
    <option value="PNO">Natural Gas France</option>
    <option value="GPO">Natural Gas Germany</option>
    <option value="TTF">Natural Gas Netherlands</option>
    <option value="NGUK">Natural Gas UK</option>
    <!--option value="ELEUR">Electricity Certificates</option-->
    <option value="ELSEK">Swedish Electricity Certificate</option>
    <option value="NCFO">Fuel Oil</option>
    <option value="NCDF">Freight - Dry</option>
    <option value="NCTC">Freight - Tankers Clean</option>
    <option value="NCTD">Freight - Tankers Dirty</option>
    <!--option value="COAL">Coal</option-->
    <option value="NCSF">Seafood</option>
    <option value="STEEL">Steel</option>
    <option value="NCIO">Iron Ore</option>
    <option value="RWEU">Renewables</option>
    <option value="COKCOAL">Coking Coal</option>
</select>

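isTraded - script information; I want to change it from checked to unchecked (I do not know the correct value for this field; the code seems to test for "checked", but setting that does not work)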

//*[(@id = "isTraded")]
webpage js code
        // only those who have oi or volume
    if ( $("#isTraded").is(":checked")) {
        xpath += "[ph/hi/@rv!='' or ph/hi/@tv!='']"; //or ph/hi/@oi!=''
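
The fragment above tests the checkbox's checked state (`:checked`), not its `value` attribute, which would explain why `setAttribute('value', '')` changes nothing. As a minimal sketch of that logic only: the function name `buildXpath` and the `basePath` argument are hypothetical, since the rest of the page script is not shown here.

```javascript
// Hypothetical reconstruction of the filter logic in the fragment above.
// `basePath` stands in for whatever XPath the page has built so far.
function buildXpath(basePath, isTradedChecked) {
  var xpath = basePath;
  // Only the checked state decides whether the filter is appended.
  if (isTradedChecked) {
    xpath += "[ph/hi/@rv!='' or ph/hi/@tv!='']";
  }
  return xpath;
}
```

Under this reading, unchecking the box is what removes the filter, so any automation has to change the checked state (for example by clicking the element) rather than rewrite the value.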

Best answer

You need to use the clickElement method. You can also use the selectTag method to manipulate the select menu.

library(RSelenium)
library(XML)
rD <- rsDriver()
remDr <- rD[["client"]]
remDr$navigate("http://www.nasdaqomx.com/commodities/market-prices")
isTraded <- remDr$findElement("id", "isTraded")
isTraded$clickElement()
waitforupdate(remDr)
marketSelect <- remDr$findElement("id", "marketSelectId")
msSelect <- marketSelect$selectTag()
# select seafood market
seafood <- msSelect$elements[msSelect$text == "Seafood"][[1]]
# switch to seafood market
seafood$clickElement()
waitforupdate(remDr)
elem <- remDr$findElement(using="id", value="derivatesNordicOutput") # get big table in text string

## elem$highlightElement() # just for interactive use in browser.  not necessary.
elemtxt <- elem$getElementAttribute("outerHTML")[[1]] # gets us the HTML
elemxml <- htmlTreeParse(elemtxt, useInternalNodes=T) # parse string into HTML tree to allow for querying with XPath
readHTMLTable(elemxml)

# function to wait for update to appear
# (define this before running the calls to waitforupdate() above)
waitforupdate <- function(remDr, maxwait = 30){
  chk <- FALSE
  count <- 0L
  while(!chk && count < maxwait){
    count <- count + 1L
    res <- suppressMessages(
      tryCatch({
        remDr$findElement("css", "#derivatesNordicOutput span[title = 'Last update']")
      },
      error = function(e){e}
      )
    )
    chk <- !inherits(res, "error")
    Sys.sleep(1L)
  }
  # fail only if the element never appeared, even on the last attempt
  if(!chk){
    stop("table has not updated in allotted time")
  }
}

# UPDATE get german electric prices
gerElec <- msSelect$elements[msSelect$text == "Electricity Germany"][[1]]
gerElec$clickElement()
waitforupdate(remDr)
elem <- remDr$findElement(using="id", value="derivatesNordicOutput") # get big table in text string
elemtxt <- elem$getElementAttribute("outerHTML")[[1]] # gets us the HTML
elemxml <- htmlTreeParse(elemtxt, useInternalNodes=T) # parse string into HTML tree to allow for querying with XPath
readHTMLTable(elemxml)


# close browser and stop server
remDr$close()
rD[["server"]]$stop()
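
For the PhantomJS attempt in the question, a rough equivalent might look like the sketch below. It is not part of the original answer and rests on assumptions: that the page reacts to native `change` events on `#marketSelectId` and `#isTraded`, and that a fixed 5-second delay is long enough for the table to reload. The helper names `setSelectValue` and `setChecked` are hypothetical. Note that `getElementById` takes the bare id, without the leading `#` used in the question's commented-out attempt.

```javascript
// Sketch only: assumes the page listens for native 'change' events.
// The helpers use the global `document`, so they can be handed to page.evaluate.

// Choose an <option> by value and notify the page's listeners.
function setSelectValue(id, value) {
  var el = document.getElementById(id); // bare id, no leading '#'
  if (!el) { return false; }
  el.value = value;
  var evt = document.createEvent('HTMLEvents');
  evt.initEvent('change', true, false); // bubbles, not cancelable
  el.dispatchEvent(evt);
  return true;
}

// Set a checkbox's checked property (not its value attribute) and notify listeners.
function setChecked(id, checked) {
  var el = document.getElementById(id);
  if (!el) { return false; }
  if (el.checked !== checked) {
    el.checked = checked;
    var evt = document.createEvent('HTMLEvents');
    evt.initEvent('change', true, false);
    el.dispatchEvent(evt);
  }
  return true;
}

// Runs only under PhantomJS; elsewhere the helpers are merely defined.
if (typeof phantom !== 'undefined') {
  var page = require('webpage').create();
  var fs = require('fs');
  page.open('http://www.nasdaqomx.com/commodities/market-prices', function (status) {
    if (status !== 'success') { phantom.exit(1); return; }
    page.evaluate(setChecked, 'isTraded', false);
    page.evaluate(setSelectValue, 'marketSelectId', 'EUK');
    // Assumed delay: give the page time to reload the table before saving.
    setTimeout(function () {
      fs.write('NasdaqOmx.html', page.content, 'w');
      phantom.exit();
    }, 5000);
  });
}
```

The question's script wrote `page.content` immediately after setting the value, so even a successful change would not have been reflected in the saved file. Polling for the updated table, as `waitforupdate` does in the R code, would be more robust than a fixed delay.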

This question, "selenium - Web scraping with PhantomJS/Selenium (from R): setting element values", comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/41869835/
