node.js - Scraping a page that redirects

Tags: node.js cheerio

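I am trying to scrape a simple page (this requires cheerio and request): https://www.ishares.com/uk/individual/en/products/251824/

The code fails. I believe this is because, to reach the page above, the user is first prompted on a previous page to choose "individual" or "institutional", and is then redirected.

I have tried different variants of the URL, but they all fail.

How can I get the raw HTML using node.js?

Here is the code: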

var express = require('express');
var path = require('path');
var request = require('request');
var cheerio = require('cheerio');   // fast, flexible implementation of core jQuery for the server.
var fs = require('fs');

var app = express();
var port = 8000;
var timeLog = [];  // for dl to measure the time of events.

// var startTime = Date.now();


timeLog[0] = Date.now();
console.log('program initiated at time: '+new Date());


// example 1:  pull the webpage and print to console
var url ="https://www.ishares.com/uk/individual/en/products/251824/ishares-jp-morgan-emerging-markets-bond-ucits-etf";
url = "https://www.ishares.com/uk/individual/en/products/251824/";
url="https://www.ishares.com/uk/individual/en/products/251824/ishares-jp-morgan-emerging-markets-bond-ucits-etf?siteEntryPassthrough=true&locale=en_GB&userType=individual";


request(url, function (err, resp, body) {
  // Check for a transport error first: body is undefined when err is set,
  // so parsing it before this check would fail.
  if (err) {
    console.log(err);
    return;
  }

  var $ = cheerio.load(body);

  var distYield = $('.col-distYield');
  var distYieldText = distYield.text();
  console.log('we got to line 24');
  console.log(distYieldText);

  timeLog[2] = Date.now();
  console.log('data capture time: ' + (timeLog[2] - timeLog[0]) / 1000 + ' seconds');

  //console.log(body);
  console.log('the body was written: success');
});

// example 2:  download webpage and save file
var destination = fs.createWriteStream('./downloads/iSharesSEMB.html');
request(url)
  .pipe(destination);


// example 3:
var destination = fs.createWriteStream('./downloads/iSharesSEMB2.html');
request(url)
  .pipe(destination)
  .on("finish",function () {
    console.log('done');
  })
  .on('error',function (err) {
    console.log(err);
  });



timeLog[1] = Date.now();
console.log('program completed at time: '+new Date());
console.log('Asynchronous program run time: '+(timeLog[1] - timeLog[0])/1000+' seconds');

Best answer

OK, I got it working. I enabled cookie support for request, but then ran into a redirect loop. Adding a promise solved it. Here is just the relevant HTML request part:

const request = require('request'),
    cheerio = require('cheerio');


const url = "https://www.ishares.com/uk/individual/en/products/251824/ishares-jp-morgan-emerging-markets-bond-ucits-etf?siteEntryPassthrough=true&locale=en_GB&userType=individual";

// jar: true makes request remember cookies between requests and redirects.
const options = {
    jar: true
};

const getDistYield = url => {
    return new Promise((resolve, reject) => {
        request(url, options, function(err,resp,body) {
            if (err) return reject(err);  // stop here; body is undefined on error
            let $ = cheerio.load(body);
            resolve($('.col-distYield'));
        })
    })
}

getDistYield(url)
    .then((tag) => {
        console.log(tag.text())
    }).catch((e) => {
        console.error(e)
    })

Output:

Distribution Yield
The distribution yield represents the ratio of distributed income over the last 12 months to the fund’s current NAV.
as of 20-Feb-2018
4.82
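
To see why cookie support can matter on a page like this, here is a minimal sketch using only Node's built-in `http` module. Everything in it is a stand-in: the local server, the `/gate` and `/product` paths, and the `userType=individual` cookie are hypothetical, and I am assuming (not verifying) that the real site's entry page works by setting a cookie before redirecting back. The client follows redirects by hand, once remembering cookies and once not.

```javascript
const http = require('http');

// Hypothetical stand-in for an entry gate: without the cookie, every page
// redirects to /gate, which sets the cookie and redirects back to /product.
function createGateServer() {
  return http.createServer((req, res) => {
    const hasCookie = (req.headers.cookie || '').includes('userType=individual');
    if (req.url === '/gate') {
      res.writeHead(302, {
        'Set-Cookie': 'userType=individual; Path=/',
        Location: '/product',
      });
      return res.end();
    }
    if (!hasCookie) {
      res.writeHead(302, { Location: '/gate' });
      return res.end();
    }
    res.writeHead(200, { 'Content-Type': 'text/html' });
    res.end('<div class="col-distYield">4.82</div>');
  });
}

// Follow redirects manually. When useJar is true, cookies from Set-Cookie
// headers are kept in the jar array and sent back on the next hop.
function fetchFollowing(port, path, useJar, maxRedirects = 5, jar = []) {
  return new Promise((resolve, reject) => {
    const headers = useJar && jar.length ? { Cookie: jar.join('; ') } : {};
    const req = http.get(
      { host: '127.0.0.1', port, path, headers, agent: false },
      (res) => {
        if (useJar) {
          for (const c of res.headers['set-cookie'] || []) {
            jar.push(c.split(';')[0]); // keep only the name=value part
          }
        }
        const isRedirect = res.statusCode >= 300 && res.statusCode < 400;
        if (isRedirect && res.headers.location) {
          res.resume(); // discard the redirect body
          if (maxRedirects === 0) return reject(new Error('redirect loop'));
          return resolve(
            fetchFollowing(port, res.headers.location, useJar, maxRedirects - 1, jar)
          );
        }
        let body = '';
        res.setEncoding('utf8');
        res.on('data', (chunk) => { body += chunk; });
        res.on('end', () => resolve(body));
      }
    );
    req.on('error', reject);
  });
}

// Run both variants against the local gate server and report the results.
async function demo() {
  const server = createGateServer();
  await new Promise((resolve) => server.listen(0, '127.0.0.1', resolve));
  const { port } = server.address();
  const withJar = await fetchFollowing(port, '/product', true);
  const withoutJar = await fetchFollowing(port, '/product', false)
    .catch((e) => e.message);
  server.close();
  return { withJar, withoutJar };
}
```

Calling `demo().then(console.log)` shows the difference: with the jar the client reaches the page content, and without it the client bounces between `/product` and `/gate` until the redirect limit is hit. This is only a model of the behaviour; the `jar: true` option of request is what does the equivalent job in the answer's code.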

Also, note that I used the last URL you provided.
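
That last URL differs from the others only in its query string (`siteEntryPassthrough`, `locale`, `userType`). If you want to build it from parts rather than hand-writing the string, here is a small sketch using Node's built-in `URL` class. The helper name `buildProductUrl` is my own, and what each parameter does on the server side is an assumption based on its name.

```javascript
// Hypothetical helper: append query parameters to a base URL.
function buildProductUrl(base, params) {
  const url = new URL(base);
  for (const [key, value] of Object.entries(params)) {
    url.searchParams.set(key, value);
  }
  return url.toString();
}

const productUrl = buildProductUrl(
  'https://www.ishares.com/uk/individual/en/products/251824/ishares-jp-morgan-emerging-markets-bond-ucits-etf',
  { siteEntryPassthrough: 'true', locale: 'en_GB', userType: 'individual' }
);

console.log(productUrl);
```

The result is the same string as the `url` constant in the code above, so it can be passed to `getDistYield` unchanged.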

I hope this works for you :)

Source: the original question "node.js - Scraping a page that redirects" on Stack Overflow: https://stackoverflow.com/questions/48883024/
