python - Web scraping with BeautifulSoup findAll

I want to download the href of each of the 4 articles under "Need to Know" on the following website:

http://www.marketwatch.com/

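But I cannot identify them uniquely with findAll. The following approaches give me those articles, but they also return many other articles that match the same criteria.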

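# Attempt 1: every link with class "link" (matches far too many elements)
trend_articles = soup1.findAll("a", {"class": "link"})
for article in trend_articles:
    href = article["href"]

# Attempt 2: every div with class "content--secondary", then its first link
trend_articles = soup1.findAll("div", {"class": "content--secondary"})
for article in trend_articles:
    href = article.a["href"]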

Does anyone have a suggestion for how I can get these 4 articles, and only these 4?

Best answer

This seems to work for me:

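from bs4 import BeautifulSoup
import requests

# Fetch the home page and parse it (the 'lxml' parser must be installed)
page = requests.get("http://www.marketwatch.com/").content
soup = BeautifulSoup(page, 'lxml')

# Locate the secondary header, then take the first list block that follows it
header_secondary = soup.find('header', {'class': 'header--secondary'})
trend_articles = header_secondary.find_next_siblings('div', {'class': 'group group--list '})[0].findAll('a')

# Keep the first child of each link, which is normally its headline text
trend_articles = [article.contents[0] for article in trend_articles]
print(trend_articles)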
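
Note that the snippet above prints the link text, while the question asks for the href values. Below is a minimal offline sketch of the same idea (anchor on a header, then search only inside the block that follows it) that returns the hrefs instead. The sample markup, the function name section_hrefs and the "/story/..." paths are all hypothetical and only imitate the structure the answer assumes; the live MarketWatch page may look different or change over time.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics the structure the answer relies on;
# the real MarketWatch page may differ or change over time.
SAMPLE_HTML = """
<div class="column">
  <header class="header--secondary"><h2>Need to Know</h2></header>
  <div class="group group--list">
    <a class="link" href="/story/a">Story A</a>
    <a class="link" href="/story/b">Story B</a>
  </div>
</div>
<div class="column">
  <a class="link" href="/story/unrelated">Unrelated story</a>
</div>
"""


def section_hrefs(html, header_class, list_class):
    """Return hrefs of links in the first list block that follows the header."""
    soup = BeautifulSoup(html, "html.parser")
    header = soup.find("header", class_=header_class)
    if header is None:
        return []
    # A single class name passed to class_ matches any tag carrying that
    # class, so "group--list" also matches class="group group--list".
    block = header.find_next_sibling("div", class_=list_class)
    if block is None:
        return []
    # Only links inside this block are considered; href=True skips
    # anchors that have no href attribute.
    return [a["href"] for a in block.find_all("a", href=True)]


hrefs = section_hrefs(SAMPLE_HTML, "header--secondary", "group--list")
print(hrefs)
```

Scoping the search to one block is what excludes the unrelated link, even though it has the same class "link". Matching on a single class name rather than the full attribute string 'group group--list ' is a design choice here: it does not depend on the exact order or spacing of the class attribute.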

Regarding python - Web scraping with BeautifulSoup findAll, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/43314445/
