javascript - How to scrape a QlikView table with Node.js?

Tags: javascript node.js web-scraping puppeteer

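A Brazilian government website publishes salary data for judges from different instances and courts. I want to download all of the tables, but the data behind them is not in the HTML I get back when I use request.
To get around this, I used puppeteer and cheerio to open a browser, wait for the table to load, and then pull the data out with jQuery-style selectors. This is my code: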

const puppeteer = require("puppeteer");
const cheerio = require("cheerio");


const main = async () => {
    const browser = await puppeteer.launch({ headless: false });
    const page = await browser.newPage();
    await page.goto("https://paineis.cnj.jus.br/QvAJAXZfc/opendoc.htm?document=qvw_l%2FPainelCNJ.qvw&host=QVS%40neodimio03&anonymous=true&sheet=shPORT63Relatorios");
    // Fixed 10 s pause to give the QlikView table time to render.
    await sleep(10 * 1000);
    const html = await page.content();
    const $ = cheerio.load(html);
    console.log($(".injected").text());
    await browser.close();
}

function sleep(milliseconds) {
    return new Promise(resolve => setTimeout(resolve, milliseconds));
}

main();
The problem is that the table I get back is incomplete, with only a few rows and partially filled cells:
P63_CE_TRIBUNALCNJTribunalMagistradoMês/Ano Ref.CNJADHAILTON LACET CORREIA PORTO12/2018ADRIANA FRANCO MELO MACHADO02/202103/202104/2021ADRIANA LINS DE OLIVEIRA BEZERRA12/2018ADRIANO DA SILVA ARAUJO08/201909/201910/201911/201912/201901/202002/202003/202004/202005/202006/202007/202008/202009/202010/202011/202012/202001/202102/202103/202104/2021ALESSANDRA VARANDAS PAIVA MA...12/2018ALEXANDRE CHINI NETO09/201810/2018Subsídio (R$)Direitos Pessoais (1)Indenizações (2)Direitos Eventuais (3)Total de Rendimentos (4)Previdência Pública (5) (R$)Imposto de Renda (6) (R$)Descontos Diversos (7) (R$)Retenção por Teto Constitucional (8) (R$)Total de Descontos (9)Rendimento Líquido (10)Remuneração do órgão de origem (11) (R$)Diárias (12) (R$)0,000,000,00463,16463,160,000,000,000,000,00463,160,000,001.698,450,000,000,001.698,450,000,000,000,000,001.698,4533.689,110,003.639,540,0067.378,220,0071.017,760,00191,130,000,00191,1370.826,6333.689,110,003.639,540,000,000,003.639,540,00191,130,000,00191,133.448,4133.689,110,000,000,000,004.631,614.631,610,001.272,050,000,001.272,053.359,560,000,003.371,830,000,000,003.371,830,00150,970,000,00150,973.220,8632.004,710,005.323,940,000,000,005.323,940,00594,720,000,00594,724.729,2232.004,719.100,005.323,940,000,000,005.323,940,00594,720,000,00594,724.729,2232.004,717.700,005.323,940,000,000,005.323,940,00594,720,000,00594,724.729,2232.004,711.750,005.323,940,000,002.218,317.542,250,00618,290,000,00618,296.923,9632.004,719.100,005.323,940,000,002.661,977.985,910,00594,720,000,00594,727.391,1932.004,715.600,005.323,940,000,000,005.323,940,00594,720,000,00594,724.729,2232.004,714.550,005.323,940,000,000,005.323,940,00594,720,000,00594,724.729,2232.004,714.550,005.323,940,0032.004,710,0037.328,650,00594,720,000,00594,7236.733,9332.004,714.550,005.323,940,000,000,005.323,940,00594,720,000,00594,724.729,2232.004,710,005.323,940,004.158,850,009.482,790,00594,720,000,00594,728.888,0732.004,710,005.323,940,004.158,850,009.482,790,00673,560,000,00673,568.8
09,2332.004,710,005.323,940,004.158,85286,699.769,480,00673,560,000,00673,569.095,9232.004,710,005.323,940,000,000,005.323,940,00594,720,000,00594,724.729,2232.004,719.100,005.323,940,000,000,005.323,940,00594,720,000,00594,724.729,2232.004,714.550,005.323,940,000,000,005.323,940,00594,720,000,00594,724.729,2232.004,714.550,005.323,940,000,004.436,629.760,560,001.189,440,000,001.189,448.571,1232.004,714.550,005.323,940,000,002.661,977.985,910,00594,720,000,00594,727.391,1932.004,714.550,005.323,940,000,000,005.323,940,00594,720,000,00594,724.729,2232.004,714.550,005.323,940,000,000,005.323,940,00594,720,000,00594,724.729,2232.004,714.550,005.323,940,000,000,005.323,940,00594,720,000,00594,724.729,2232.004,714.550,000,000,000,004.631,614.631,610,001.272,050,000,001.272,053.359,560,000,003.127,300,000,000,003.127,300,00161,200,000,00161,202.966,1028.947,550,003.127,300,000,000,003.127,300,00114,300,000,00114,303.013,0028.947,5511.900,00
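I tried several variations of the jQuery selector, without success.
I read that I could use enigmajs to communicate with QlikView and then make my requests. However, it turned out that even the most basic example from the documentation does not work against the site I am using. Now I am stuck.
How can I retrieve the data from a QlikView table?
Edit: unfortunately, this particular URL does not seem to work from some countries outside Brazil. However, I think any site with a QlikView table can serve as an example for an answer. The author of this (Python) question ran into the same problem with a different website. Maybe their URL does not have the same access issue.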

Best answer

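First, use page.waitForSelector() on any element you know becomes visible once the content starts to appear. This is needed because page.content() can fire while the loader is still showing, when there may be no interesting content to select yet.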
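As a rough illustration of this first step, the fixed sleep in the question could be replaced by a helper along these lines. It is only a sketch: `waitForTable` is a hypothetical name, the 30 second default is arbitrary, and using `.injected` (the class the question's code selects) as the readiness marker is an assumption about this page.

```javascript
// Wait until an element matching `selector` is visible, instead of sleeping
// for a fixed time. Returns true if it appeared, false if the wait timed out.
// ".injected" comes from the question's code; treating it as a reliable
// "table is ready" marker is an assumption.
async function waitForTable(page, selector = ".injected", timeoutMs = 30000) {
  try {
    await page.waitForSelector(selector, { visible: true, timeout: timeoutMs });
    return true;
  } catch (err) {
    // Puppeteer signals an expired wait with an error named "TimeoutError".
    if (err && err.name === "TimeoutError") {
      return false;
    }
    throw err; // anything else (closed page, bad selector) is a real failure
  }
}
```

Returning a boolean lets the caller decide whether to retry or give up, rather than parsing a page that never finished loading.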
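You will then notice that the site lazy-loads the table you want to access, which means you need to scroll down the table several times to reveal all of the rows.
You can do that like this: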

  async function autoScroll(page) {
    await page.evaluate(async () => {
      await new Promise((resolve, reject) => {
        var totalHeight = 0;
        var distance = 300;
        var timer = setInterval(() => {
          var scrollHeight = document.querySelector("yourScrollableElement").scrollHeight;
          document.querySelector("yourScrollableElement").scrollBy(0, distance);
          totalHeight += distance;

          if (totalHeight >= scrollHeight) {
            clearInterval(timer);
            resolve();
          }
        }, 250);
      });
    });
  }
*Change the interval and/or the distance (in pixels) to whatever values suit your load time and the scroll length you need.
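To make that tuning easier, the same loop can take the selector, distance, and interval as parameters. This is a minimal sketch, not a drop-in replacement: `autoScrollElement` is a hypothetical name, and the selector you pass must be the table's real scroll container, which is not identified here.

```javascript
// Variant of autoScroll with the tunable values exposed as parameters.
// `selector` is an assumption: pass the selector of the scrollable table container.
async function autoScrollElement(page, selector, distance = 300, intervalMs = 250) {
  await page.evaluate(
    (sel, step, delay) =>
      new Promise((resolve, reject) => {
        const el = document.querySelector(sel);
        if (!el) {
          reject(new Error(`No element matches ${sel}`));
          return;
        }
        let scrolled = 0;
        const timer = setInterval(() => {
          el.scrollBy(0, step);
          scrolled += step;
          // scrollHeight is re-read on every tick because lazy loading can grow it.
          if (scrolled >= el.scrollHeight) {
            clearInterval(timer);
            resolve();
          }
        }, delay);
      }),
    selector,
    distance,
    intervalMs
  );
}
```

A call such as `await autoScrollElement(page, "yourScrollableElement", 500, 400)` would then scroll in larger steps with a longer pause between them.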
You should then be able to retrieve all of the data after calling
await autoScroll(page);
like this:
// Run the steps one after another: each depends on the previous one having finished.
await page.goto(SITE_LINK);
await page.waitForSelector("MyTableIDorWhatever");
await autoScroll(page);
// pick data
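For the "pick data" step, one option is to read each cell separately inside the page rather than calling cheerio's `.text()` on the whole selection, which joins the text of every matched element into one string (as in the output shown in the question). The sketch below uses Puppeteer's `page.$$eval`; `extractCellTexts` is a hypothetical name, and `.injected` is taken from the question on the assumption that it matches the table cells.

```javascript
// Return the text of every matching cell as a separate array entry,
// so the values are not concatenated into one long string.
// ".injected" comes from the question; that it matches table cells is an assumption.
async function extractCellTexts(page, cellSelector = ".injected") {
  return page.$$eval(cellSelector, (nodes) =>
    nodes
      .map((node) => node.textContent.trim())
      .filter((text) => text.length > 0)
  );
}
```

Called after the scrolling has finished, for example `const cells = await extractCellTexts(page);`, this yields a flat array of cell strings that can then be grouped into rows according to the table's column count.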

Regarding "javascript - How to scrape a QlikView table with Node.js?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/68448157/
