javascript - 在 headless 服务器云上运行的特定网站上的 Puppeteer 超时

标签 javascript node.js web-scraping google-cloud-platform puppeteer

我制作了一个在我的计算机上运行良好的 node.js 网络抓取程序代码,但是,当我部署到运行 Debian 的 Google Cloud VM 实例时,它返回特定网站的超时错误。我为 puppeteer 操作者尝试过许多不同的设置,但似乎都不起作用。我相信当我从谷歌云服务器运行时,我试图抓取的网站会阻止我的代码,但当我从我的计算机运行时不会。抓取部分在我的电脑上运行良好。 Puppeteer 找到 HTML 标签并检索信息。

const puppeteer = require('puppeteer');
const GoogleSpreadsheet = require('google-spreadsheet');
const { promisify } = require('util');
const credentials = require('./credentials.json');

async function main(){

    const scrapCopasa = await scrapCopasaFunction();

    console.log('Done!')

}



async function scrapCopasaFunction() {

    const browser = await puppeteer.launch({
        args: ['--no-sandbox'], 
    });
    const page = await browser.newPage();
    //await page.setDefaultNavigationTimeout(0);
    //await page.setViewport({width: 1366, height: 768});
    await page.setUserAgent('Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36');
    await page.goto('http://www.copasa.com.br/wps/portal/internet/abastecimento-de-agua/nivel-dos-reservatorios');
    //await new Promise(resolve => setTimeout(resolve, 5000));
    
    let isUsernameNotFound = await page.evaluate(() => {
        if(document.getElementsByClassName('h2')[0]) {
            if(document.getElementsByTagName('h2')[0].textContent == "Sorry, this page isn't available.") {
                return true;
            }
        }
    });

    if(isUsernameNotFound) {
        console.log('Account not exists!');        
        await browser.close();
        return;
    }


    let reservoirLevelsCopasa = await page.evaluate(() => {
        const tds = Array.from(document.querySelectorAll('table tr td'))
        return tds.map(td => td.innerText)        
    });


    const riomanso = reservoirLevelsCopasa[13].replace(",",".").substring(0,5);
    const serraazul = reservoirLevelsCopasa[17].replace(",",".").substring(0,5);
    const vargemdasflores = reservoirLevelsCopasa[21].replace(",",".").substring(0,5);

    await browser.close();

    return[riomanso, serraazul, vargemdasflores];

}



main();

我得到的错误如下:

(node:6425) UnhandledPromiseRejectionWarning: TimeoutError: Navigation Timeout Exceeded: 30000ms exceeded
    at /home/xxx/reservoirs/node_modules/puppeteer/lib/LifecycleWatcher.js:142:21
    at async FrameManager.navigateFrame (/home/xxx/reservoirs/node_modules/puppeteer/lib/FrameManager.js:94:17)
    at async Frame.goto (/home/xxx/reservoirs/node_modules/puppeteer/lib/FrameManager.js:406:12)
    at async Page.goto (/home/xxx/reservoirs/node_modules/puppeteer/lib/Page.js:674:12)
    at async scrapCopasaFunction (/home/xxx/reservoirs/reservatorios.js:129:5)
    at async main (/home/xxx/reservoirs/reservatorios.js:9:25)
  -- ASYNC --
    at Frame.<anonymous> (/home/xxx/reservoirs/node_modules/puppeteer/lib/helper.js:111:15)
    at Page.goto (/home/xxx/reservoirs/node_modules/puppeteer/lib/Page.js:674:49)
    at Page.<anonymous> (/home/xxx/reservoirs/node_modules/puppeteer/lib/helper.js:112:23)
    at scrapCopasaFunction (/home/xxx/reservoirs/reservatorios.js:129:16)
    at processTicksAndRejections (internal/process/task_queues.js:93:5)
    at async main (/home/xxx/reservoirs/reservatorios.js:9:25)
(Use `node --trace-warnings ...` to show where the warning was created)
(node:6425) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async f
unction without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled
 promise rejection, use the CLI flag `--unhandled-rejections=strict` (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode)
. (rejection id: 1)
(node:6425) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not ha
ndled will terminate the Node.js process with a non-zero exit code.

最佳答案

对于 puppeteer 来说,云函数有点慢。有一个 GitHub issue #3120 .对此。如果可能的话,您可以为该功能分配更多的 CPU/内存。您为 chrome 提供的 CPU 和 RAM 越多,它就会越快。

你可以给 goto 添加超时,这是以毫秒为单位的最大导航时间,默认为 30 秒,传递 0 以禁用超时。

await page.goto('http://www.copasa.com.br', { timeout: 60000 });

您还可以使用 setDefaultTimeout 设置导航超时和 setDefaultNavigationTimeout它优先于 setDefaultTimeout。

page.setDefaultNavigationTimeout(60000)

关于javascript - 在 headless 服务器云上运行的特定网站上的 Puppeteer 超时,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/65657965/

相关文章:

javascript - 如何覆盖 Javascript 中的方法并将其应用于原始对象,而不是子对象?

node.js - 如何在NodeMailer的附件邮件选项中使用PDFMake的pdf输出?

javascript - Puppeteer 始终未定义,但 devtools 适用于嵌套 Node 列表

linux - 共享无 gui 主机上的 Selenium

javascript - 如何从 Accuweather api 获取城市的唯一 ID?

javascript - 如何通过 jquery 选择器字符串查找 Div 元素和输入元素

javascript - 为什么 Node.js querystring.parse() 返回一个不是从 JavaScript `Object` 继承的对象?

python - 如何使用 scrapy 从具有数据库的网页中提取数据

javascript - 从维基百科解析请求中检索 pageid

node.js - 将 nguniversal/express-engine 添加到 Angular proj : "Cannot find BrowserModule import in/src/app/app.module.ts"