javascript - Web scraping with R and phantomjs: a small js script returns an error

I need to get the content of the following page, which includes some scripts:

https://grouper.swissdrg.org/swissdrg/single?version=7.3&pc=1337_70_0_0_M_11_00_15_0_2018/08/07_2018/08/22_C18.4_C07_-_45.81.11$$&provider=acute&locale=de

This works fine for other pages that contain js, but not for the page I need.

phantomjs.exe is in the root directory, and the system call invokes it successfully (Win7, 64-bit):

system("phantomjs WebScrapeV1.js")

The JavaScript file WebScrapeV1.js looks like this:

var url ='https://grouper.swissdrg.org/swissdrg/single?version=7.3&pc=1337_70_0_0_M_11_00_15_0_2018/08/07_2018/08/22_C18.4_C07_-_45.81.11$$&provider=acute&locale=de';
var page = new WebPage()
var fs = require('fs');
page.open(url, function (status) {
  just_wait();
});
function just_wait() {
  setTimeout(function() {
    fs.write('WebScrapeV1.html', page.content, 'w');
    phantom.exit();
  }, 2500);
}

This is the error I get:

ERROR: [mobx.array] Index out of bounds, function (t) {return{key:t.version,text:t["name_"+e.root.navigation.lang],value:t.version }} is larger than 30

https://grouper.swissdrg.org/packs/App-3dd15966701d9f6fd4db.js:1 in br Unhandled promise rejection TypeError: undefined is not a constructor (evaluating 'n.push(this.pdx)')

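Best answer

You likely need a longer timeout. I had to use 3600 ms to get all the content (the site is super slow for me). Here is a way to change the timeout when an error occurs, without manually editing the phantomjs script.

First, we'll create a function that wraps up all the complexity: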

#' Read contents from a URL with phantomjs
#' 
#' @param url the URL to scrape
#' @param timeout how long to wait, default is `2500` (ms)
#' @param .verbose if `TRUE` (the default), display the generated 
#'        scraping script and any `stdout` output from phantomjs
read_phantom <- function(url, timeout=2500, .verbose = TRUE) {

  suppressPackageStartupMessages({
    require("glue", character.only = TRUE, quietly = TRUE)
    require("crayon", character.only = TRUE, quietly = TRUE)
  })

  phantom_template <- "
var url = {url};
var page = new WebPage()
var fs = require('fs');
page.open(url, function (status) {{
  just_wait();
}});
function just_wait() {{
  setTimeout(function() {{
    fs.write({output_file}, page.content, 'w');
    phantom.exit();
  }}, {timeout});
}}
" 

  url <- shQuote(url)

  phantom_bin <- Sys.which("phantomjs")

  tf_in <- tempfile(fileext = ".js")
  on.exit(unlink(tf_in), add=TRUE)

  tf_out <- tempfile(fileext = ".html")
  on.exit(unlink(tf_out), add=TRUE)

  output_file <- shQuote(tf_out)

  phantom_script <- glue(phantom_template)

  if (.verbose) {
    cat(
      crayon::white("Using the following generated scraping script:\n"),
      crayon::green(phantom_script), "\n", sep=""
    )
  }

  writeLines(phantom_script, tf_in)

  system2(
    command = phantom_bin, 
    args = tf_in,
    stdout = if (.verbose) "" else NULL
  )

  paste0(readLines(tf_out, warn = FALSE), collapse="\n")

}
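The template above relies on `shQuote()` to turn the URL and the temp-file path into JavaScript string literals. That works for typical URLs, but a Windows path usually contains backslashes, which JavaScript would read as escape sequences. As a rough sketch of the same templating idea in plain JavaScript, the hypothetical helper `buildPhantomScript()` below (not part of the original answer) uses `JSON.stringify()` to emit escaped string literals. It also uses `require('webpage').create()`, the documented way to create a page object in PhantomJS.

```javascript
// Hypothetical helper (an illustration, not the answer's code): build the
// PhantomJS script text from a URL, an output path and a timeout in ms.
function buildPhantomScript(url, outputFile, timeoutMs) {
  if (!Number.isFinite(timeoutMs) || timeoutMs < 0) {
    throw new RangeError('timeoutMs must be a non-negative number');
  }
  // JSON.stringify returns a double-quoted literal with quotes and
  // backslashes escaped, so it can be pasted into the generated script.
  return [
    'var url = ' + JSON.stringify(url) + ';',
    "var page = require('webpage').create();",
    "var fs = require('fs');",
    'page.open(url, function (status) {',
    '  setTimeout(function () {',
    '    fs.write(' + JSON.stringify(outputFile) + ", page.content, 'w');",
    '    phantom.exit();',
    '  }, ' + Math.round(timeoutMs) + ');',
    '});'
  ].join('\n');
}

// Example with made-up values: print the generated script.
console.log(buildPhantomScript('https://example.com/?a=1&b=2', 'C:\\Temp\\out.html', 3600));
```

The generated text is what would be written to the temporary `.js` file before calling phantomjs on it.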

Now we'll use it on your URL with a longer timeout:

read_phantom(
  url = "https://grouper.swissdrg.org/swissdrg/single?version=7.3&pc=1337_70_0_0_M_11_00_15_0_2018/08/07_2018/08/22_C18.4_C07_-_45.81.11$$&provider=acute&locale=de",
  timeout = 3600
) -> doc

substr(doc, 1, 100)
## [1] "<html><head>\n<script src=\"https://js-agent.newrelic.com/nr-1071.min.js\"></script><script type=\" text"

nchar(doc)
## [1] 26858
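Another way to change the timeout without editing the script is to have a fixed PhantomJS script read its settings from the command line. The sketch below assumes a hypothetical file name `scrape.js` and a hypothetical argument order (URL, output file, optional timeout in ms); `parseArgs()` is an illustrative helper, not something from the original answer. The PhantomJS-specific part is guarded so the file also loads outside PhantomJS.

```javascript
// Hypothetical helper: interpret [url, outputFile, timeoutMs] from the CLI.
function parseArgs(args, defaultTimeoutMs) {
  var timeout = args.length > 2 ? parseInt(args[2], 10) : defaultTimeoutMs;
  if (!args[0] || !args[1] || isNaN(timeout) || timeout < 0) {
    throw new Error('usage: phantomjs scrape.js <url> <output-file> [timeout-ms]');
  }
  return { url: args[0], outputFile: args[1], timeoutMs: timeout };
}

// This part only runs inside PhantomJS, where the global `phantom` exists.
if (typeof phantom !== 'undefined') {
  var opts = null;
  try {
    // system.args[0] is the script name, so skip it.
    var cliArgs = Array.prototype.slice.call(require('system').args, 1);
    opts = parseArgs(cliArgs, 2500);
  } catch (e) {
    console.log(e.message);
    phantom.exit(1);
  }
  if (opts) {
    var page = require('webpage').create();
    page.open(opts.url, function (status) {
      if (status !== 'success') {
        console.log('Failed to load ' + opts.url);
        phantom.exit(1);
        return;
      }
      // Give the page's own scripts time to render before saving.
      setTimeout(function () {
        require('fs').write(opts.outputFile, page.content, 'w');
        phantom.exit();
      }, opts.timeoutMs);
    });
  }
}
```

From R, such a script could be called through `system2()` with the URL, output path and timeout passed in `args`. The URL should be quoted (for example with `shQuote()`) because it contains `&` and `$$`.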

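Note that phantomjs is considered a legacy tool, since its main developer stepped away once headless Chrome appeared. Unfortunately, there is no way to set a timeout for headless Chrome from its simple command-line interface, so you're somewhat stuck with phantomjs for now.

I'd suggest trying splashr, but you're on Windows and splashr requires Docker. Alternatively, decapitated has an orchestration counterpart, gepetto, but that requires nodejs. Either of those combinations seems like a pain to get going for folks on that legacy operating system.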

For "javascript - Web scraping with R and phantomjs: a small js script returns an error", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/52684996/
