javascript - How do I split text scraped from a website?

Tags: javascript web-scraping

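On the website I'm trying to scrape, all of the information sits under the same class, .panel-row-text. I'm not sure how to split this information so that each piece is displayed only under its relevant heading; right now every line displays all of the data.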

    const axios = require('axios');
    const cheerio = require('cheerio');

    const url = 'https://www.lseg.com/resources/1000-companies-inspire/2018-report-1000-companies-uk/search-1000-companies-uk-2018?results_per_page=100';

    axios(url)
      .then(response => {
        const html = response.data;
        const $ = cheerio.load(html);
        const dataTable = $('.tabular-data-panel > ul');
        const companyData = [];
        //console.log(dataTable.length);

        dataTable.each(function(){
            const companyName = $(this).find('.panel-row-text').text();
            const website = $(this).find('.panel-row-text').text();
            const sector = $(this).find('.panel-row-text').text();
            const region = $(this).find('.panel-row-text').text();
            const revenueBand = $(this).find('.panel-row-text').text();

            companyData.push({
                companyName,
                website,
                sector,
                region,
                revenueBand,
            });

        });

        console.log(companyData);

      })
      .catch(console.error);

Best answer

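You can be smart about this by querying the label associated with each field. Query the labels first, then use the .next() function to get the value that belongs to each label.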

Note: I added an extra package, camelcase, to make the queried labels/properties easier to read.

    const axios = require('axios');
    const cheerio = require('cheerio');
    const camelCase = require('camelcase'); // added this to make properties readable

    // use the async / await feature
    async function scrape(url) {

        // get the html page
        const { data } = await axios.get(url);

        // convert the html string to a cheerio instance
        const $ = cheerio.load(data);

        // query all lists (one per company)
        return $('.tabular-data-panel > ul')
            // convert the cheerio collection to an array for easier manipulation
            .toArray()
            // transform each list into an object of key/value pairs
            .map(list => $(list)
                // query the label elements
                .find('.panel-row-title')
                // convert to an array for easier manipulation
                .toArray()
                // use reduce to build the object
                .reduce((fields, labelElement) => {
                    // get the cheerio instance of the element
                    const $labelElement = $(labelElement);
                    // get the label of the field
                    const key = $labelElement.text().trim();
                    // get the value of the field from the label's next sibling
                    const value = $labelElement.next().text().trim();
                    // assign the key/value pair to the reduced object
                    // note that we use camelCase() to make the property easy to read
                    fields[camelCase(key)] = value;
                    // return the object
                    return fields;
                }, {})
            );
    }

    async function main() {
        const url = 'https://www.lseg.com/resources/1000-companies-inspire/2018-report-1000-companies-uk/search-1000-companies-uk-2018?results_per_page=100';
        const companies = await scrape(url);
        console.log(companies);
    }

    // log request or parsing errors instead of leaving the rejection unhandled
    main().catch(console.error);
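
The reduce step can also be tried without network access or any packages by feeding it plain [label, value] pairs. The sketch below is an illustration under assumptions: `toCamelCase` and `toFields` are hypothetical names, the sample data is made up, and `toCamelCase` is a simplified stand-in that only handles plain labels, not a replacement for everything the camelcase package does. It may be useful if `require('camelcase')` fails in your setup, since, as far as I know, recent major versions of that package are published as ES modules.

```javascript
// Hypothetical stand-in for the camelcase package; handles simple labels only.
function toCamelCase(label) {
  const words = label
    .trim()
    .split(/[^A-Za-z0-9]+/) // split on spaces, dashes and other separators
    .filter(Boolean)        // drop empty pieces
    .map(word => word.toLowerCase());

  return words
    .map((word, index) =>
      index === 0 ? word : word[0].toUpperCase() + word.slice(1))
    .join('');
}

// Same reduce pattern as the answer, applied to plain [label, value] pairs.
function toFields(pairs) {
  return pairs.reduce((fields, [label, value]) => {
    fields[toCamelCase(label)] = value.trim();
    return fields;
  }, {});
}

// Made-up sample data
const company = toFields([
  ['Company Name', ' Acme Ltd '],
  ['Revenue Band', '£6m to £10m'],
]);

console.log(company);
```

With cheerio, the pairs would come from each label's text and the text of its `.next()` sibling, as in the answer's code.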

For more on "javascript - How do I split text scraped from a website?", see the similar question on Stack Overflow: https://stackoverflow.com/questions/59403149/
