python - Unable to scrape data inside a "div" class on a Wall Street Journal page

Tags: python web-scraping beautifulsoup python-requests

I am trying to scrape the text content of articles on the Wall Street Journal website. For example, consider the following HTML source:

<div class="article-content ">
       <p>BEIRUT—
      Carlos Ghosn, 
       who is seeking to clear his name in Lebanon, would face a very different path to vindication here, where endemic corruption and the former auto executive’s widespread popularity could influence the outcome of a potential trial. </p> <p>Mr. Ghosn, the former chief of auto makers

I am using the following code:

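import requests
from bs4 import BeautifulSoup

# `url` is the address of the WSJ article being scraped
res = requests.get(url)
html = BeautifulSoup(res.text, "lxml")
classid = "article-content "
item = html.find_all("div", {"class": classid})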

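This returns an empty result. I have seen other posts where people suggested adding delays and other workarounds, but none of those worked for me. I plan to use the scraped text for a machine learning project.

I have a Wall Street Journal subscription and am logged in when I run the script above.

Any help with this would be greatly appreciated! Thanks.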

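Best answer

Your code works fine for me. Just make sure you are searching for the correct "classid". I don't think it will make a difference, but you can try this as an alternative: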

item = html.find_all("div", class_ = classid)
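If the lookup still comes back empty, it may help to separate two questions: does the class name appear in the HTML you actually received, and does the search value match the way Beautiful Soup stores classes? The following is only a minimal sketch under some assumptions. `SAMPLE_HTML`, `find_article_divs` and `class_in_response` are hypothetical names made up for illustration, the sample markup is modeled on the snippet in the question, and `html.parser` is used instead of `lxml` so that nothing beyond `bs4` is required. Beautiful Soup splits the `class` attribute on whitespace, so the stored class is `article-content` without the trailing space, and stripping the search value is the safer choice.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, modeled on the snippet in the question.
SAMPLE_HTML = """
<html><body>
<div class="article-content ">
  <p>Carlos Ghosn, who is seeking to clear his name in Lebanon.</p>
  <p>Mr. Ghosn, the former chief of auto makers</p>
</div>
</body></html>
"""


def class_in_response(html_text, class_name):
    """Return True if the class name occurs anywhere in the raw HTML text."""
    return class_name.strip() in html_text


def find_article_divs(html_text, class_name):
    """Return every <div> whose class list contains class_name."""
    soup = BeautifulSoup(html_text, "html.parser")
    # Class tokens never contain whitespace, so strip the search value.
    return soup.find_all("div", class_=class_name.strip())


items = find_article_divs(SAMPLE_HTML, "article-content ")
paragraphs = [
    p.get_text(" ", strip=True)
    for div in items
    for p in div.find_all("p")
]
print(len(items), "matching div(s),", len(paragraphs), "paragraph(s)")
```

On a real page you would pass `res.text` instead of `SAMPLE_HTML`. If `class_in_response(res.text, classid)` is false, the article body is not in the response at all. One possible reason is that `requests.get` does not share the login session of your browser, so the server may be returning a different page to the script than the one you see when logged in.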

Regarding "python - Unable to scrape data inside a "div" class on a Wall Street Journal page", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/59597753/
