Python BeautifulSoup: get the text of the first tag

Tags: python beautifulsoup screen-scraping

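I need to use BeautifulSoup in Python to get the text of the tags at the first level of each li tag.

The problem is that the li tags contain other li tags, which in turn contain other tags.

Sample HTML: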

<li>
   <a href="http://lol.lol">Text1</a><-- GET THIS
   <li>
      <a href="http://lol.lol">Text1</a><-- DON'T GET THIS
   </li>
</li>
<li>
   <a href="http://lol.lol">Text2</a><-- GET THIS
   <li>
      <a href="http://lol.lol">Text2-2</a><-- DON'T GET THIS
   </li>
</li>

Edit:

I have been testing, and I cannot manage to extract only the first a tag.

This is the original section I am trying to extract from:

<div id="categories_block_left" class="block block-highlighted">
<h4 class="title_block">
<span class="icon-box fa fa-bars"></span>
RELOJES
</h4>
<div class="block_content" style="">
<ul class="list-block list-group bullet tree dynamized" style="display: block;">
<li>
<span class="grower CLOSE"> </span><a href="http://www.joyeriasanchez.com/50-outlet" title="OUTLET">
OUTLET
<span id="leo-cat-50" style="display:none" class="leo-qty badge pull-right"></span>
</a>
<ul style="display: none;">
<li>
<a href="http://www.joyeriasanchez.com/47-adidas" title="Adidas">
Adidas
<span id="leo-cat-47" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/125-miss-sixty" title="Miss Sixty">
Miss Sixty
<span id="leo-cat-125" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/49-converse" title="Converse">
Converse
<span id="leo-cat-49" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/61-armand-basi" title="Armand Basi">
Armand Basi
<span id="leo-cat-61" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/79-marea" title="Marea">
Marea
<span id="leo-cat-79" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/86-marc-ecko" title="Marc Ecko">
Marc Ecko
<span id="leo-cat-86" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/107-festina" title="Festina">
Festina
<span id="leo-cat-107" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/135-seiko" title="Seiko">
Seiko
<span id="leo-cat-135" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/221-relojes-swatch-liquidar" title="Relojes Swatch liquidar">
Relojes Swatch liquidar
<span id="leo-cat-221" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</li>
<li>
<span class="grower CLOSE"> </span><a href="http://www.joyeriasanchez.com/184-lotus" title="Lotus">
Lotus
<span id="leo-cat-184" style="display:none" class="leo-qty badge pull-right"></span>
</a>
<ul style="display: none;">
<li>
<a href="http://www.joyeriasanchez.com/195-lotus-hombre" title="Lotus Hombre">
Lotus Hombre
<span id="leo-cat-195" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/196-lotus-mujer" title="Lotus Mujer">
Lotus Mujer
<span id="leo-cat-196" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/236-lotus-infantil" title="Lotus Infantil">
Lotus Infantil
<span id="leo-cat-236" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</li>
<li>
<a href="http://www.joyeriasanchez.com/218-daniel-wellington" title="Daniel Wellington">
Daniel Wellington
<span id="leo-cat-218" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<span class="grower CLOSE"> </span><a href="http://www.joyeriasanchez.com/197-viceroy" title="Viceroy">
Viceroy
<span id="leo-cat-197" style="display:none" class="leo-qty badge pull-right"></span>
</a>
<ul style="display: none;">
<li>
<a href="http://www.joyeriasanchez.com/198-viceroy-hombre" title="Viceroy Hombre">
Viceroy Hombre
<span id="leo-cat-198" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/199-viceroy-mujer" title="Viceroy Mujer">
Viceroy Mujer
<span id="leo-cat-199" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/235-viceroy-infantil" title="Viceroy Infantil">
Viceroy Infantil
<span id="leo-cat-235" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</li>
<li>
<a href="http://www.joyeriasanchez.com/51-ice-watch" title="Ice watch">
Ice watch
<span id="leo-cat-51" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/64-relojes-swatch" title="Relojes Swatch">
Relojes Swatch
<span id="leo-cat-64" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/80-mark-maddox" title="Mark Maddox">
Mark Maddox
<span id="leo-cat-80" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/81-ferrari" title="Ferrari">
Ferrari
<span id="leo-cat-81" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/173-relojes-cadete" title="Relojes Cadete">
Relojes Cadete
<span id="leo-cat-173" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<span class="grower CLOSE"> </span><a href="http://www.joyeriasanchez.com/200-tous" title="Tous">
Tous
<span id="leo-cat-200" style="display:none" class="leo-qty badge pull-right"></span>
</a>
<ul style="display: none;">
<li>
<a href="http://www.joyeriasanchez.com/201-tous-kids" title="Tous Kids">
Tous Kids
<span id="leo-cat-201" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li>
<a href="http://www.joyeriasanchez.com/203-tous-mujer" title="Tous Mujer">
Tous Mujer
<span id="leo-cat-203" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/204-tous-hombre" title="Tous Hombre">
Tous Hombre
<span id="leo-cat-204" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/220-certina" title="Certina">
Certina
<span id="leo-cat-220" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</div>
</div>

This is the code I tried:

import requests
from bs4 import BeautifulSoup

req2 = requests.get(url2)  # url2 is set earlier in the script (not shown)
html2 = BeautifulSoup(req2.text, 'html.parser')
catmenu = html2.find('div', {'id': 'categories_block_left'})
categorys = catmenu.find_all('li', recursive=False)
for cat in categorys:
    categor = cat.find('a').getText()
    print("   SubCategor:%s" % categor)

But nothing is returned. I only need the text of the first a tag of each top-level entry.
Example:

OUTLET,
Lotus,
Daniel Wellington,
Viceroy,
Ice watch,
Relojes Swatch,
Mark Maddox,
Ferrari,
Relojes Cadete,
Tous,
Certina

Best answer

You can pass recursive=False to the find_all method, which returns only the top-level li tags:

In [62]: soup.find_all('li', recursive=False)
Out[62]: 
[<li>
 <a href="http://lol.lol">Text1</a>
 <li>
 <a href="http://lol.lol">Text1</a>
 </li>
 </li>, <li>
 <a href="http://lol.lol">Text2</a>
 <li>
 <a href="http://lol.lol">Text2-2</a>
 </li></li>]

Then you can retrieve the text from the first a tag of each li:

In [63]: [li.find('a').text for li in soup.find_all('li', recursive=False)]
Out[63]: ['Text1', 'Text2']
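
Applied to the markup from the edit, the attempt in the question most likely returns nothing because the li elements are not direct children of the div with id categories_block_left: they sit one level deeper, inside div.block_content and its ul. With recursive=False the search has to start from that ul instead. The following is a minimal sketch, assuming the built-in html.parser and a trimmed copy of the menu; the helper name top_level_categories and the HTML constant are made up for illustration and do not come from the original answer.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the category menu from the question (two top-level entries).
HTML = """
<div id="categories_block_left" class="block block-highlighted">
<h4 class="title_block">RELOJES</h4>
<div class="block_content" style="">
<ul class="list-block list-group bullet tree dynamized" style="display: block;">
<li>
<span class="grower CLOSE"> </span><a href="http://www.joyeriasanchez.com/50-outlet" title="OUTLET">
OUTLET
<span id="leo-cat-50" style="display:none" class="leo-qty badge pull-right"></span>
</a>
<ul style="display: none;">
<li>
<a href="http://www.joyeriasanchez.com/47-adidas" title="Adidas">
Adidas
<span id="leo-cat-47" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</li>
<li class="last">
<a href="http://www.joyeriasanchez.com/220-certina" title="Certina">
Certina
<span id="leo-cat-220" style="display:none" class="leo-qty badge pull-right"></span>
</a>
</li>
</ul>
</div>
</div>
"""


def top_level_categories(html):
    """Return the link text of each first-level <li> in the category menu."""
    soup = BeautifulSoup(html, "html.parser")
    menu = soup.find("div", id="categories_block_left")
    if menu is None:
        return []
    # The first <ul> inside the block is the top-level list.
    top_list = menu.find("ul")
    if top_list is None:
        return []
    names = []
    # recursive=False restricts the search to direct children of the <ul>,
    # so the <li> tags of the nested sub-menus are skipped.
    for li in top_list.find_all("li", recursive=False):
        link = li.find("a")  # first <a> in document order: the entry's own link
        if link is not None:
            names.append(link.get_text(strip=True))
    return names


print(top_level_categories(HTML))
```

Run against the full page, the same function should produce the list requested in the question (OUTLET, Lotus, Daniel Wellington, and so on), provided the live markup still matches the snippet shown there.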

Source: a similar question on Stack Overflow, "Python BeautifulSoup get text first a tag": https://stackoverflow.com/questions/34379889/
