Python, BeautifulSoup - <div> text and <img> attributes in correct order

I have a small piece of HTML that I want to run through BeautifulSoup. I have basic navigation down, but this one has me stumped.

Here is a sample of the HTML (completely made up):

<div class="textbox">
    Buying this item will cost you 
    <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
    silver credits and
    <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
    golden credits
</div>

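Using the "alt" attribute of the img tags, I would like to see the following result: Buying this item will cost you 1 silver credits and 1 golden credits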

I don't know how to loop through the contents of the div tag in order. I can do the following to extract all the text contained in the div tag:

from bs4 import BeautifulSoup

html = BeautifulSoup(string, 'html.parser')  # string holds the HTML shown above
print(html.get_text())

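This gets all the text contained in the div tag, but it gives me a result like this: Buying this item will cost you silver credits and golden credits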

Likewise, I can get the value of the alt attribute from an img tag by doing this:

html = BeautifulSoup(string, 'html.parser').img  # .img returns only the first <img> tag
print(html['alt'])

But of course this only gives me the attribute value.

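How can I iterate through all of these elements in the correct order? Is it possible to read the text in the div element and the attributes of the img elements in the order they appear?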

Best answer

You can loop over all children of a tag, including text nodes; test their type to see whether they are Tag or NavigableString objects:

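from bs4 import Tag

result = []
for child in html.find('div', class_='textbox').children:
    if isinstance(child, Tag):
        # Element child (here an <img>): take its alt attribute, if present
        result.append(child.get('alt', ''))
    else:
        # NavigableString child: plain text, with surrounding whitespace trimmed
        result.append(child.strip())

print(' '.join(result))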

Demo:

>>> from bs4 import BeautifulSoup, Tag
>>> sample = '''\
... <div class="textbox">
...     Buying this item will cost you 
...     <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
...     silver credits and
...     <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
...     golden credits
... </div>
... '''
>>> html = BeautifulSoup(sample, 'html.parser')
>>> result = []
>>> for child in html.find('div', class_='textbox').children:
...     if isinstance(child, Tag):
...         result.append(child.get('alt', ''))
...     else:
...         result.append(child.strip())
... 
>>> print(' '.join(result))
Buying this item will cost you 1 silver credits and 1 golden credits
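
The loop above can also be wrapped in a small reusable function. The following is only a sketch under a few assumptions: the function name text_with_alt is hypothetical, the built-in 'html.parser' is chosen explicitly, and empty pieces are skipped so that tags without an alt attribute or whitespace-only text nodes do not produce doubled spaces. Like the loop above, it only looks at the direct children of the container.

```python
from bs4 import BeautifulSoup, Tag


def text_with_alt(container):
    """Join a tag's direct text children and its child tags' alt values, in document order.

    Hypothetical helper name; only direct children are considered.
    """
    parts = []
    for child in container.children:
        if isinstance(child, Tag):
            # Element child (e.g. <img>): use its alt attribute, or '' if missing
            piece = child.get('alt', '')
        else:
            # NavigableString child: the text between tags, trimmed
            piece = child.strip()
        if piece:
            # Skip empty pieces so the joined result has single spaces only
            parts.append(piece)
    return ' '.join(parts)


# Same sample HTML as in the question
sample = '''\
<div class="textbox">
    Buying this item will cost you
    <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
    silver credits and
    <img align="adsbottom" alt="1" src="/1.jpg;type=symbol"/>
    golden credits
</div>
'''

soup = BeautifulSoup(sample, 'html.parser')
message = text_with_alt(soup.find('div', class_='textbox'))
print(message)
```

If the text or images can be nested inside further tags within the div, this direct-children approach would miss them and a recursive walk would be needed instead.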

A similar question about "Python, BeautifulSoup - <div> text and <img> attributes in correct order" can be found on Stack Overflow: https://stackoverflow.com/questions/20590624/
