I am using beautifulsoup4 with Python to scrape content from the web, and I am trying to extract the content of specific HTML tags while ignoring others.
I have the following HTML:
<div class="the-one-i-want">
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
<div class="random-inserted-element-i-dont-want">
<content>
</div>
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
</div>
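My goal is to work out how to tell Python to get only the <p> elements from the parent <div class="the-one-i-want">, and to ignore any <div> nested inside it.
Currently, I locate the contents of the parent div like this: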
content = soup.find('div', class_='the-one-i-want')
However, I can't figure out how to narrow this down further so that only the <p> tags are extracted, without running into errors.
Best answer
h = """<div class="the-one-i-want">
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
<div class="random-inserted-element-i-dont-want">
<content>
</div>
<p>
"random text content here and about"
</p>
<p>
"random text content here and about"
</p>
</div>"""
Once you have found the div, you can call find_all("p") on it:
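from bs4 import BeautifulSoup
# Name the parser explicitly; omitting it makes bs4 guess and emit a warning.
soup = BeautifulSoup(h, "html.parser")
# Find the parent div by class, then search it for <p> tags.
print(soup.find("div", "the-one-i-want").find_all("p"))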
Or use a CSS selector:
print(soup.select("div.the-one-i-want p"))
Both will give you:
[<p>\n "random text content here and about"\n </p>, <p>\n "random text content here and about"\n </p>, <p>\n "random text content here and about"\n </p>, <p>\n "random text content here and about"\n </p>, <p>\n "random text content here and about"\n </p>]
find_all only searches the descendants of the div with the class the-one-i-want, and the same applies to the select call.
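Note that "descendants" includes any <p> nested inside the unwanted inner div. If that div can hold paragraphs of its own, one option is to restrict the search to direct children. The following is a minimal sketch, assuming a made-up snippet (not the asker's real page) in which the inner div wraps a <p>. It uses find_all with recursive=False and the CSS child combinator (>):

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration: the unwanted inner div has its own <p>.
html = """<div class="the-one-i-want">
<p>first</p>
<div class="random-inserted-element-i-dont-want"><p>nested</p></div>
<p>second</p>
</div>"""

soup = BeautifulSoup(html, "html.parser")
parent = soup.find("div", class_="the-one-i-want")

# The default search is recursive, so the nested <p> is included.
all_p = [p.get_text(strip=True) for p in parent.find_all("p")]

# recursive=False limits the search to direct children of the parent.
direct_p = [p.get_text(strip=True) for p in parent.find_all("p", recursive=False)]

# The CSS child combinator (>) expresses the same restriction.
direct_css = [p.get_text(strip=True) for p in soup.select("div.the-one-i-want > p")]

print(all_p)       # ['first', 'nested', 'second']
print(direct_p)    # ['first', 'second']
print(direct_css)  # ['first', 'second']
```

With the HTML from the question, where the inner div contains no <p>, the recursive and non-recursive searches return the same five elements, so the difference only matters when the unwanted div has paragraphs of its own.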
Source: "python - Get certain tags within a parent tag using Beautifulsoup4", a similar question found on Stack Overflow: https://stackoverflow.com/questions/38021706/