python - How to unwrap parent tags with BeautifulSoup 4

Tags: python html web-scraping beautifulsoup

Given an HTML page source such as:

<html>
  <head></head>
  <body>
    <p><nobr><a href="...">Some link text</a></nobr></p>
  </body>
</html>

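Without knowing in advance which tags wrap the <a> element (it could be anything, not just nobr), how can I write a loop that keeps unwrapping the parent tags of a given <a> until its parent is a paragraph?

Something like: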

import urllib3
from bs4 import BeautifulSoup as bs

http = urllib3.PoolManager()
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

page = "https://www.snopes.com/fact-check/rebuilding-iraq/"
link="http://www.elca.org/ScriptLib/OS/Congregations/cdsDetail.asp?congrno=12962"

r = http.request('get', page)
body = r.data
soup = bs(body, 'lxml')
a = soup.find('a', href=link)

while True:
    if a.parent.name == "p":
        break
    else:
        a.parent.name.unwrap()  # doesn't work because name is a string
print(soup)

Best answer

Call find_parents on the child tag to get the <p> that encloses it.

import requests
from bs4 import BeautifulSoup

page = "https://www.snopes.com/fact-check/rebuilding-iraq/"
link = "http://www.elca.org/ScriptLib/OS/Congregations/cdsDetail.asp?congrno=12962"

# Download the page, parse it, and locate the link by its exact href
r = requests.get(page)
soup = BeautifulSoup(r.content, 'lxml')
a = soup.find('a', href=link)

for tag in a.find_parents('p'):
    print(tag)

Output:

<p><font class="copyright_text_color" color="" face=""><b>Origins:</b></font>   This item is “true” in the sense that Eric Rydbom is indeed an engineer stationed in Iraq with the Army’s <nobr>4th Infantry</nobr> Division, and he sends monthly <nobr>e-mail</nobr> dispatches such as the one quoted above to fellow members of his congregation at the <nobr><a href="http://www.elca.org/ScriptLib/OS/Congregations/cdsDetail.asp?congrno=12962" target="_blank">First Lutheran</a></nobr> Church of Richmond Beach in Shorline, Washington.  This piece was one of those messages, forwarded to the church’s prayer chain and thence to the larger world via the Internet.</p>
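To try the same lookup without a network request, here is a minimal sketch that runs against the sample HTML from the question. It assumes the built-in "html.parser" backend instead of lxml, and the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Sample markup from the question, with the paragraph properly closed.
html = (
    "<html><head></head><body>"
    '<p><nobr><a href="...">Some link text</a></nobr></p>'
    "</body></html>"
)

# "html.parser" ships with Python, so no lxml install is needed here.
soup = BeautifulSoup(html, "html.parser")
a = soup.find("a")

# find_parent returns the nearest matching ancestor, or None if there is none.
paragraph = a.find_parent("p")

# find_parents with no filter lists the ancestors from innermost to outermost.
parent_names = [tag.name for tag in a.find_parents()]

print(paragraph)
print(parent_names[:2])
```

find_parent is convenient when only the closest enclosing paragraph is needed, while find_parents returns every matching ancestor.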

If you only want the text, use the .text attribute:

for tag in a.find_parents('p'):
    print(tag.text)
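Note that find_parents only locates the enclosing paragraph; it leaves the wrapper tags in place. If the goal is to remove the wrappers as the question describes, one possible approach is to call unwrap() on the parent Tag itself (not on its name string) until the parent is a <p>. This is a sketch under stated assumptions: the sample markup and the "html.parser" backend are illustrative, not taken from the answer.

```python
from bs4 import BeautifulSoup

# Illustrative markup: the link is wrapped in two arbitrary tags inside a <p>.
html = '<p>Visit <b><nobr><a href="...">Some link text</a></nobr></b> today.</p>'

soup = BeautifulSoup(html, "html.parser")
a = soup.find("a")

# Only loop if a <p> ancestor exists; otherwise the loop would climb
# all the way to the document root.
if a.find_parent("p") is not None:
    # unwrap() replaces a tag with its own contents, so after each call
    # the former grandparent becomes the new parent of <a>.
    while a.parent.name != "p":
        a.parent.unwrap()

result = str(soup)
print(result)
```

The guard with find_parent keeps the loop from running when the link is not inside any paragraph.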

Regarding python - How to unwrap parent tags with BeautifulSoup 4, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/56665921/
