python - Extracting only the text from a div tag on a page with Python and Beautiful Soup

Tags: python html css web-scraping beautifulsoup

I am trying to scrape a static news website with Beautiful Soup, but I am stuck on a page where the text of the news article sits inside a div tag.

The link to the site is http://economictimes.indiatimes.com/magazines/panache/smoking-aces-chef-irshad-qureshis-interesting-stories-related-to-celebrities/articleshow/48712333.cms

The news text is contained in the following format:

<html>
<body>
<div class="normal" id="foo">
      " Many "
 <a href ='/some link' target = 'blank'>Bollywood</a>
 " stars today  are avowed foodies "
 <a href = 'link2'>Ranbir Kapoor</a>
 " Alia Bhat "
</div>
</body>
</html>

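The text I want is "Many Bollywood stars today are avowed foodies. Alia Bhat"

That is, I want all of the text, wherever it sits inside the div.

I can reach the div with find_all('div','normal'), but after that I do not know how to retrieve all of the text elements from the page.

Let me know if you need more information.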

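Best answer

To extract the text from an element in Beautiful Soup, you can use the .text attribute, which concatenates the element's own text with the text of all its descendants: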

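>>> from bs4 import BeautifulSoup
>>> t = """<div class="normal" id="foo">  Many  <a href ='/some link' target = 'blank'>Bollywood</a>  stars today  are avowed foodies  <a href = 'link2'>Ranbir Kapoor</a>  Alia Bhat  </div>"""
>>> bs = BeautifulSoup(t, "html.parser")  # name the parser explicitly to avoid the "no parser specified" warning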
>>> print(bs.find('div').text)
  Many  Bollywood  stars today  are avowed foodies  Ranbir Kapoor  Alia Bhat
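
The output above keeps the original spacing between fragments. If a single clean string per div is preferred, the sketch below is one possible approach: it selects every div with class "normal" (as in the question's find_all call) and collapses runs of whitespace. The helper name div_texts and the multi-line sample markup are illustrative assumptions, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the snippet in the question (assumed layout).
html = """<html><body>
<div class="normal" id="foo">
  Many
  <a href='/some link' target='blank'>Bollywood</a>
  stars today  are avowed foodies
  <a href='link2'>Ranbir Kapoor</a>
  Alia Bhat
</div>
</body></html>"""

soup = BeautifulSoup(html, "html.parser")


def div_texts(soup, css_class="normal"):
    """Return the whitespace-normalized text of every <div> with the given class.

    div_texts is a hypothetical helper written for this example.
    """
    texts = []
    for div in soup.find_all("div", class_=css_class):
        # get_text() gathers the text of the div and all of its descendants;
        # split() followed by join() collapses newlines and repeated spaces.
        texts.append(" ".join(div.get_text().split()))
    return texts


print(div_texts(soup))
# ['Many Bollywood stars today are avowed foodies Ranbir Kapoor Alia Bhat']
```

Because the text of nested tags is included, the link text ("Bollywood", "Ranbir Kapoor") appears in the result along with the bare text of the div.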

A similar question about extracting only the text from a div tag with Python and Beautiful Soup can be found on Stack Overflow: https://stackoverflow.com/questions/40789117/
