javascript - I can't get the numeric value contained in a <span> tag using JavaScript and Puppeteer

Tags: javascript node.js web-scraping puppeteer

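When I run my code, the priceGen page evaluation throws a type error that says: "Cannot read property 'innerHTML' of null". The span tag it targets holds the price as a numeric value, and that value is what I am trying to get. How do I access the number contained in the span tag I am targeting? Any help or insight would be appreciated. The element I am targeting looks like this: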

<span id="priceblock_ourprice" class="a-size-medium a-color-price priceBlockBuyingPriceString">
    $44.99
</span>

const puppeteer = require('puppeteer');

let nameArr = [];
const rand1 = Math.random().toString(16).substr(2, 8);
nameArr.push({ id: 1, link: `<img src ="${rand1}">` });
//creates a random string to use as the image name and pushes it to an array

amazonScraper = (url) =>{
  (async () => {
    let imageUrl = url ;
    let path = `./scrapers/amazonScrapers/imageScraper/screenshots`;
    //assign a name to url and the path for saving images

    let browser =  await puppeteer.launch({headless: false});
    let page = await browser.newPage();
    //launch puppeteer

    await page.goto(imageUrl), { waitUntil: 'networkidle2' };
    //sends puppeteer to the url and waits until everything is rendered

    await page.waitForSelector('#landingImage');
    let element1 = await page.$('#landingImage');
    await element1.screenshot({ path: `${path}/${rand1}.png` });
    //screenshot the image

    let nameGen =await page.evaluate(() => {
      let name = document.getElementById('productTitle').innerHTML;
      return name;
    });
    // grabs name of the item

      let priceGen =await page.evaluate(() => {
      let price =  document.getElementById('priceblock_ourprice').innerHTML;
      return price;
    });
    //Broken: attempts to grab item price

    console.log(nameGen);
    console.log(priceGen);

    await browser.close();
    //closes puppeteer
})();
};

amazonScraper ("https://www.amazon.com/TOMLOV-Microscope-50X-1300X-Magnification-Ultra-Precise/dp/B08MVKKSLY/?_encoding=UTF8&pd_rd_w=yqTTn&pf_rd_p=2eed4166-2052-4602-96d1-514e72c433c6&pf_rd_r=8E0WGYYVYE5017ECAJPG&pd_rd_r=03b5a7f9-3f43-4f72-b9c8-d3ec581b450c&pd_rd_wg=jBNiN&ref_=pd_gw_crs_wish");
//calling scraper function

This is the error:

(node:11276) UnhandledPromiseRejectionWarning: Error: Evaluation failed: TypeError: Cannot read property 'innerHTML' of null
    at __puppeteer_evaluation_script__:2:66
    at ExecutionContext._evaluateInternal (c:\Users\grung\node_modules\puppeteer\lib\cjs\puppeteer\common\ExecutionContext.js:221:19)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async ExecutionContext.evaluate (c:\Users\grung\node_modules\puppeteer\lib\cjs\puppeteer\common\ExecutionContext.js:110:16)
    at async c:\Users\grung\javaScriptPractice\jsPractice\scrapers\amazonScrapers\imageScraper\scraper.js:32:21
(Use `node --trace-warnings ...` to show where the warning was created)
(node:11276) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:11276) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

Best answer

There are several problems in your code:

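  • You need to wait for the element to be available on the page. It looks like priceblock_ourprice is generated after the page has been sent to the client.

    Puppeteer has a built-in function that waits for a selector. Await it first, then evaluate:

    // waitForSelector returns a Promise, so it cannot be chained with .evaluate()
    await page.waitForSelector('#priceblock_ourprice');
    let priceGen = await page.evaluate(() => {
      let price = document.getElementById('priceblock_ourprice').innerHTML;
      return price;
    });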
    
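  • Amazon does not use a single ID for prices. Several are in use. Some examples:

    • priceblock_ourprice
    • priceblock_dealprice

    So you probably need to account for those as well. You can wait for any of several elements like this: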

    await page.waitForFunction((priceSelectors) =>
      document.querySelectorAll(priceSelectors).length, {}, priceSelectors
    )
    
const puppeteer = require('puppeteer');

(async () => {
  let browser = await puppeteer.launch({ headless: false, });
  let page = await browser.newPage();
  await page.goto('https://www.amazon.com/Insect-Lore-Butterfly-Growing-Kit/dp/B00000ISC5?ref_=Oct_DLandingS_D_a46a25b3_60&smid=ATVPDKIKX0DER');

  const priceSelectors = [
    '#priceblock_ourprice',
    '#priceblock_dealprice' /* more here if you find more selectors */
  ];

  await page.waitForFunction((priceSelectors) =>
    document.querySelectorAll(priceSelectors).length,
    {},
    priceSelectors // pass priceSelectors to waitForFunction
  )
  const pricer = await page.evaluate((priceSelectors) => {
    const priceRegex = /^\D\d+(\.\d+)?$/;
    const asSingleSelector = priceSelectors.join(',');
    const priceElements = document.querySelectorAll(asSingleSelector);
    let price;
    priceElements.forEach((item) => {
      // trim first: the markup has whitespace around the price text,
      // which would otherwise make the anchored regex fail
      const text = item.textContent.trim();
      if (priceRegex.test(text)) { // make sure the string is a price
        price = text;
      }
    });
    return price;
  }, priceSelectors); // pass priceSelectors to evaluate

  console.log(pricer);

  await browser.close();

})();
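
The TypeError in the question comes from reading a property of null: the lookup returns null when the element does not exist, and `.innerHTML` is then read from it. As a minimal sketch, the lookup can be made null-safe. The helper name `firstMatchingText` is hypothetical, not part of Puppeteer; it relies only on `page.$()`, which resolves to null when nothing matches, and on `elementHandle.evaluate()`.

```javascript
// Hypothetical helper (not part of Puppeteer): returns the trimmed text of
// the first selector that matches, or null when none of them match,
// instead of throwing "Cannot read property ... of null".
async function firstMatchingText(page, selectors) {
  for (const selector of selectors) {
    // page.$() resolves to null when the selector matches nothing
    const handle = await page.$(selector);
    if (handle) {
      // runs in the browser context with the matched element as argument
      return handle.evaluate((el) => el.textContent.trim());
    }
  }
  return null;
}
```

In the example above it could be called as `await firstMatchingText(page, priceSelectors)` once the wait has finished. A null result then signals that none of the known selectors matched.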

If you can't find the price on a particular page, you have probably missed the price selector used in that scenario.
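
The question asks for the numeric value, while the code above returns a string such as "$44.99". One way to convert it is sketched below. `parsePrice` is a hypothetical helper, and it assumes US-style formatting (comma as thousands separator, dot as decimal point); other locales would need different handling.

```javascript
// Hypothetical helper: converts price text such as "$44.99" into a number.
// Assumes US-style formatting (comma thousands separator, dot decimal point).
function parsePrice(text) {
  // drop everything except digits and the decimal point
  const cleaned = String(text).replace(/[^0-9.]/g, '');
  const value = Number.parseFloat(cleaned);
  // no digits found: report null rather than NaN
  return Number.isNaN(value) ? null : value;
}

console.log(parsePrice('\n    $44.99\n'));        // 44.99
console.log(parsePrice('$1,299.00'));             // 1299
console.log(parsePrice('Currently unavailable')); // null
```

Returning null for text without digits keeps a missing price distinguishable from a price of zero.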

Regarding "javascript - I can't get the numeric value contained in a <span> tag using JavaScript and Puppeteer", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/67646044/