javascript - Scraping a JavaScript-rendered web page that references an external JavaScript script, from R

Tags: javascript r web-scraping phantomjs

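I am trying to scrape this web page: https://www.mustardbet.com/sports/events/302698

Because the page appears to be rendered dynamically, I am following this tutorial: https://www.datacamp.com/community/tutorials/scraping-javascript-generated-data-with-r#gs.dZEqev8

As the tutorial suggests, I saved the following code in a file named "scrape_mustard.js":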

// scrape_mustard.js

var webPage = require('webpage');
var page = webPage.create();

var fs = require('fs');
var path = 'mustard.html'

page.open('https://www.mustardbet.com/sports/events/302698', function (status) {
  var content = page.content;
  fs.write(path,content,'w')
  phantom.exit();
});

Then I run, from R:

system("./phantomjs scrape_mustard.js")

But I get this error:

ReferenceError: Can't find variable: Set

  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1 in t
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1 in t
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1 in t
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1 in t
  https://www.mustardbet.com/assets/js/index.dfd873fb.js:1

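When I paste "https://www.mustardbet.com/assets/js/index.dfd873fb.js" into a browser, I can see that it is JavaScript, so I probably need to either (1) save it as a file, or (2) include it in scrape_mustard.js.

However, for (1) I don't know how to reference the new file, and for (2) I don't know how to define all of that JavaScript correctly so that it can be used.

I am new to JavaScript, but perhaps this problem is not too hard?

Thanks for your help!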

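Best answer

I was able to scrape the page using the Node.js module puppeteer.

Download and install Node.js. Node.js ships with npm, which makes installing modules easier. You need to install puppeteer using npm.

In RStudio, make sure you are in your working directory when you install puppeteer. Once Node.js is installed, run: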

system("npm i puppeteer")

scrape_mustard.js:

// load modules
const fs = require("fs");
const puppeteer = require("puppeteer");

// page url
const url = "https://www.mustardbet.com/sports/events/302698";

const scrape = async () => {
    const browser = await puppeteer.launch({headless: false}); // open a visible browser window
    const page = await browser.newPage(); // open new page
    await page.goto(url, {waitUntil: "networkidle2", timeout: 0}); // go to page; timeout: 0 disables the navigation timeout
    // give the JavaScript-rendered content time to load
    // (older Puppeteer versions offered page.waitFor(5000) for this; it has since been removed)
    await new Promise((resolve) => setTimeout(resolve, 5000));
    const html = await page.content(); // copy page contents
    await browser.close(); // close chromium
    return html; // return the HTML string
};

scrape().then((value) => {
    fs.writeFileSync("./stackoverflow/page.html", value); // write the HTML returned by scrape()
});
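
The fixed five-second pause in the script is a guess at how long rendering takes. One possible alternative, sketched here, is to wait until an element matching a selector appears, using Puppeteer's page.waitForSelector. The helper names renderedHtml and scrapeToFile and the 30-second timeout are assumptions made for illustration; the selector ".odds-major" is the one used in the R code of this answer.

```javascript
// Sketch: wait for a specific element instead of sleeping for a fixed time.
// renderedHtml and scrapeToFile are hypothetical helper names; the 30 s timeout is an assumption.
const fs = require("fs");

// Works with any object exposing Puppeteer's page.goto / page.waitForSelector / page.content.
async function renderedHtml(page, url, selector, timeoutMs = 30000) {
    await page.goto(url, { waitUntil: "networkidle2" });
    // Resolves once an element matching `selector` is in the DOM; rejects on timeout.
    await page.waitForSelector(selector, { timeout: timeoutMs });
    return page.content();
}

async function scrapeToFile(url, selector, outPath) {
    const puppeteer = require("puppeteer"); // loaded lazily, only when actually scraping
    const browser = await puppeteer.launch();
    try {
        const page = await browser.newPage();
        fs.writeFileSync(outPath, await renderedHtml(page, url, selector));
    } finally {
        await browser.close(); // always close Chromium, even after an error
    }
}

// Example call (not executed here):
// scrapeToFile("https://www.mustardbet.com/sports/events/302698", ".odds-major", "page.html");
```

If the selector never appears, waitForSelector rejects after the timeout, so a failed render surfaces as an error rather than as an incomplete HTML file.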

To run scrape_mustard.js from R:

library(magrittr)

system("node ./stackoverflow/scrape_mustard.js")

html <- xml2::read_html("./stackoverflow/page.html")

oddsMajor <- html %>% 
  rvest::html_nodes(".odds-major")

betNames <- html %>% 
  rvest::html_nodes("h3")

Console output:

{xml_nodeset (60)}
 [1] <span class="odds-major">2</span>
 [2] <span class="odds-major">14</span>
 [3] <span class="odds-major">15</span>
 [4] <span class="odds-major">16</span>
 [5] <span class="odds-major">17</span>
 [6] <span class="odds-major">23</span>
 [7] <span class="odds-major">25</span>
 [8] <span class="odds-major">32</span>
 [9] <span class="odds-major">33</span>
[10] <span class="odds-major">39</span>
[11] <span class="odds-major">47</span>
[12] <span class="odds-major">54</span>
[13] <span class="odds-major">55</span>
[14] <span class="odds-major">58</span>
[15] <span class="odds-major">58</span>
[16] <span class="odds-major">64</span>
[17] <span class="odds-major">73</span>
[18] <span class="odds-major">73</span>
[19] <span class="odds-major">92</span>
[20] <span class="odds-major">98</span>
...
> betNames
{xml_nodeset (60)}
 [1] <h3>Charles Howell III</h3>\n
 [2] <h3>Brian Harman</h3>\n
 [3] <h3>Austin Cook</h3>\n
 [4] <h3>J.J. Spaun</h3>\n
 [5] <h3>Webb Simpson</h3>\n
 [6] <h3>Cameron Champ</h3>\n
 [7] <h3>Peter Uihlein</h3>\n
 [8] <h3>Seung-Jae Im</h3>\n
 [9] <h3>Nick Watney</h3>\n
[10] <h3>Graeme McDowell</h3>\n
[11] <h3>Zach Johnson</h3>\n
[12] <h3>Lucas Glover</h3>\n
[13] <h3>Corey Conners</h3>\n
[14] <h3>Luke List</h3>\n
[15] <h3>David Hearn</h3>\n
[16] <h3>Adam Schenk</h3>\n
[17] <h3>Kevin Kisner</h3>\n
[18] <h3>Brian Gay</h3>\n
[19] <h3>Patton Kizzire</h3>\n
[20] <h3>Brice Garnett</h3>\n
...
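
The node sets above still contain markup. If plain values are more convenient, one option (an assumption, not part of the original answer) is to extract the text inside the page with Puppeteer's page.$$eval and save it as JSON for R to read. The helper names pairBets and extractBets are hypothetical, and the sketch assumes that the n-th h3 belongs to the n-th .odds-major element, which the equal counts (60 each) suggest but do not guarantee.

```javascript
// Sketch: extract names and odds as plain values instead of HTML nodes.
// pairBets and extractBets are hypothetical helper names.

// Pair names and odds by position; throws if the two lists differ in length.
function pairBets(names, odds) {
    if (names.length !== odds.length) {
        throw new Error(`length mismatch: ${names.length} names vs ${odds.length} odds`);
    }
    return names.map((name, i) => ({ name: name.trim(), oddsMajor: Number(odds[i]) }));
}

// `page` is assumed to be a Puppeteer Page that already shows the rendered event.
async function extractBets(page) {
    // This function is serialized and run inside the browser by $$eval.
    const text = (els) => els.map((el) => el.textContent);
    const names = await page.$$eval("h3", text);
    const odds = await page.$$eval(".odds-major", text);
    return pairBets(names, odds);
}
```

Inside scrape(), something like fs.writeFileSync("bets.json", JSON.stringify(await extractBets(page))) would produce a file that jsonlite::fromJSON can load in R.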

I'm sure this can be done with PhantomJS, but I find puppeteer easier for scraping JavaScript-rendered web pages. Also keep in mind that PhantomJS is no longer being developed.
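
The "ReferenceError: Can't find variable: Set" in the question suggests that the JavaScript engine in that PhantomJS build does not provide the ES2015 Set object that the site's bundle relies on, whereas Puppeteer drives a current Chromium. A minimal sketch for checking which globals an environment lacks follows; missingGlobals is a hypothetical helper, written in ES5 syntax on the assumption that an older engine has to parse it.

```javascript
// Sketch: report which of the given global names are undefined in this environment.
// missingGlobals is a hypothetical helper; ES5 syntax only, so older engines can parse it.
function missingGlobals(scope, names) {
    return names.filter(function (name) {
        return typeof scope[name] === "undefined";
    });
}

// Function("return this")() yields the global object in non-strict code.
var globalScope = Function("return this")();
var missing = missingGlobals(globalScope, ["Set", "Map", "Promise"]);
console.log(missing.length ? "Missing: " + missing.join(", ") : "All present");
```

Saving or inlining index.dfd873fb.js, as the question proposes, would not by itself help if the engine lacks the features that file depends on.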

Regarding "javascript - Scraping a JavaScript-rendered web page that references an external JavaScript script, from R", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/53339598/
