python - How to scrape the elements that come right after a given element?

I have an HTML document that looks like this:

<div id="whatever">
  <a href="unwanted link"></a>
  <a href="unwanted link"></a>
  ...
  <code>blah blah</code>
  ...
  <a href="interesting link"></a>
  <a href="interesting link"></a>
  ...
</div>

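I only want to scrape the links that come after the code tag. If I call soup.findAll('a'), it returns every hyperlink.

How can I make BS4 start scraping after that specific code element?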

Best answer

Try find_all_next(), called on the code tag itself:

>>> tag = soup.find('div', {'id': "whatever"})
>>> tag.find('code').find_all_next('a')
[<a href="interesting link"></a>, <a href="interesting link"></a>]
>>>

It works like find_all(), but it only returns matching tags that appear after the given tag in the document.
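One detail worth checking: find_all_next() keeps searching through the rest of the document, not just inside the enclosing div. The sketch below is a minimal illustration under assumed input: the trailing "outside link" anchor and the use of the built-in "html.parser" are additions for demonstration and are not part of the original question. It compares find_all_next() with find_next_siblings(), which only looks at later siblings of the code tag.

```python
from bs4 import BeautifulSoup

# Sample markup based on the question; the last <a> (outside the div)
# is a hypothetical addition used to show the difference in scope.
html = """
<div id="whatever">
  <a href="unwanted link"></a>
  <code>blah blah</code>
  <a href="interesting link"></a>
</div>
<a href="outside link"></a>
"""

soup = BeautifulSoup(html, "html.parser")
code = soup.find("div", {"id": "whatever"}).find("code")

# Every <a> that appears after <code> anywhere in the document.
after_all = [a["href"] for a in code.find_all_next("a")]

# Only the <a> tags that are later siblings of <code> (same parent div).
after_siblings = [a["href"] for a in code.find_next_siblings("a")]

print(after_all)       # ['interesting link', 'outside link']
print(after_siblings)  # ['interesting link']
```

If the links of interest are always direct siblings of the code tag, find_next_siblings() may be the safer choice; if they can be nested deeper, find_all_next() combined with a check on the parent is probably needed instead.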

If you want to remove the <a> tags that come before <code>, there is a method called find_all_previous():

>>> tag.find('code').find_all_previous('a')
[<a href="unwanted link"></a>, <a href="unwanted link"></a>]

>>> for i in tag.find('code').find_all_previous('a'):
...     i.extract()
...
<a href="unwanted link"></a>
<a href="unwanted link"></a>

>>> tag
<div id="whatever">


  ...
  <code>blah blah</code>
  ...
  <a href="interesting link"></a>
<a href="interesting link"></a>
  ...
</div>
>>>

In short:

Find all <a> tags that come before the <code> tag.
Remove them by calling extract() on each one in a for loop.
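The two steps above can be put together as a small self-contained script. This is a sketch rather than the answer's exact code: the numbered href values are hypothetical, chosen only so the result is easy to read, and "html.parser" is an assumed parser choice. Note that find_all_previous() also searches the whole document before the code tag, so on a larger page it could match links outside the div.

```python
from bs4 import BeautifulSoup

# Markup modelled on the question; href values are made distinct
# here purely for illustration.
html = """
<div id="whatever">
  <a href="unwanted link 1"></a>
  <a href="unwanted link 2"></a>
  <code>blah blah</code>
  <a href="interesting link 1"></a>
  <a href="interesting link 2"></a>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", {"id": "whatever"})
code = div.find("code")

# Step 1: collect every <a> that appears before <code>.
# Step 2: detach each one from the tree with extract().
for link in code.find_all_previous("a"):
    link.extract()

# Only the links after <code> are left inside the div.
remaining = [a["href"] for a in div.find_all("a")]
print(remaining)
```

extract() returns the removed tag, which is handy if the links are needed later; decompose() is an alternative when they can simply be discarded.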

Source: a similar question on Stack Overflow, "python - How to scrape the elements that come right after a given element?": https://stackoverflow.com/questions/34478734/
