python - Why do I get an empty list when scraping a website?

import requests
from bs4 import BeautifulSoup

url = 'https://inshorts.com/en/read/technology'
news_data = []
news_category = url.split('/')[-1]

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36'}
data = requests.get(url, headers=headers)

if data.status_code == 200:
    soup = BeautifulSoup(data.content, 'html.parser')

    headlines = soup.find('div', class_=['news-card-title', 'news-right-box'])
    articles = soup.find('div', class_=['news-card-content', 'news-right-box'])

    if headlines and articles and len(headlines) == len(articles):
        news_articles = [
            {
                'news_headline': headline.find_all('span', attrs={'itemprop': 'headline'}).string,
                'news_article': article.find_all('div', attrs={'itemprop': 'articleBody'}).string,
                'news_category': news_category
            }
            for headline, article in zip(headlines, articles)
        ]
        news_data.extend(news_articles)

print(news_data)

The code above tries to scrape data from the inshorts website and split it into three fields: news_headline, news_article, and news_category.

Best answer

You get an empty list because your if condition fails: headlines and articles are None, which means your selectors did not match anything on the page.
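To see why the condition never passes, here is a minimal sketch that runs the same find() calls against a small inline HTML string instead of the live site. The markup in sample_html is an assumption made up for illustration; the only point is that it contains none of the class names from the question.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration; it is not copied from inshorts.com.
sample_html = (
    '<div itemtype="http://schema.org/NewsArticle">'
    '<span itemprop="headline">Example headline</span>'
    '</div>'
)

soup = BeautifulSoup(sample_html, 'html.parser')

# find() returns None when no element matches the given classes.
headlines = soup.find('div', class_=['news-card-title', 'news-right-box'])
articles = soup.find('div', class_=['news-card-content', 'news-right-box'])

news_data = []
# None is falsy, so the body of this if block is skipped.
if headlines and articles and len(headlines) == len(articles):
    news_data.append('never reached')

print(headlines, articles)
print(news_data)
```

Printing the intermediate values like this before the if block is a quick way to check whether a selector matched at all.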

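Try to select the elements more specifically, and avoid zipping several lists together; collect all the information for an article in one pass. I use CSS selectors here, but you could also use find()/find_all().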

Select all article elements, iterate over them, and pick the information from each one:

...
if data.status_code == 200:
    soup = BeautifulSoup(data.content, 'html.parser')

    for article in soup.select('[itemtype="http://schema.org/NewsArticle"]'):
        news_data.append(
            {
                'news_headline': article.select_one('[itemprop="headline"]').get_text(),
                'news_article': article.select_one('[itemprop="articleBody"]').get_text(),
                'news_category': news_category
            }
        )

print(news_data)
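The answer mentions that find()/find_all() would work as well. A rough sketch of that variant follows, run against inline sample markup so it does not depend on the live page. The extract_articles helper and the sample_html string are hypothetical names introduced for this example, and the None check is an added precaution rather than part of the original answer.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the itemtype/itemprop attributes used above.
sample_html = """
<div itemscope itemtype="http://schema.org/NewsArticle">
  <span itemprop="headline">First headline</span>
  <div itemprop="articleBody">First body</div>
</div>
<div itemscope itemtype="http://schema.org/NewsArticle">
  <span itemprop="headline">Second headline</span>
</div>
"""


def extract_articles(html, news_category):
    """Collect headline and body per article element (illustrative helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    results = []
    # Match article containers by attribute, regardless of tag name.
    for article in soup.find_all(attrs={'itemtype': 'http://schema.org/NewsArticle'}):
        headline = article.find(attrs={'itemprop': 'headline'})
        body = article.find(attrs={'itemprop': 'articleBody'})
        # Skip incomplete articles instead of calling get_text() on None.
        if headline is None or body is None:
            continue
        results.append({
            'news_headline': headline.get_text(strip=True),
            'news_article': body.get_text(strip=True),
            'news_category': news_category,
        })
    return results


parsed = extract_articles(sample_html, 'technology')
print(parsed)
```

Because each headline and body is looked up inside its own article element, the two values cannot drift out of step the way two separately zipped lists can.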

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/77651555/
