python - Extracting a tag's value in BeautifulSoup when it cannot be matched by position or attribute

I'm scraping a web page with BeautifulSoup (BS) and have run into a small problem. Here is the relevant HTML snippet from the page.

<span style="font-family: arial;"><span style="font-weight: bold;">Artist:</span> M.I.A.<br>
</span>

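Once I have the soup, how can I find this tag and get the artist name, i.e. M.I.A.? I can't match the tag on its style attribute, because the same style is used in a dozen places on the page. I don't know the exact position of the span tag either, because it moves around from page to page, so I can't match by position. The artist name changes, but the structure of the label span is always the same.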

I only want to extract the artist name (the M.I.A. bit).

Best answer

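BeautifulSoup (version 3, which was current when this was written) is more or less dead, because the SGMLParser it is built on has been deprecated. I suggest using the better lxml library instead; it even has XPath support!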

from lxml import html

text = '''
<span style="font-family: arial;">
    <span style="font-weight: bold;">Artist:</span>M.I.A.<br>
</span>
'''

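# Parse the fragment, then join the text nodes of the span that holds the "Artist:" label.
doc = html.fromstring(text)
# strip() drops the indentation and newlines that surround the name.
print(''.join(doc.xpath("//span/span[text()='Artist:']/../text()")).strip())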

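This XPath expression means "find a span tag that sits inside another span tag and whose text is 'Artist:', then get all the text nodes of its parent tag". It correctly prints M.I.A., as one would expect.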
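The parent-text approach above returns every text node of the enclosing span, so it would also pick up any other loose text that happens to sit in the same parent. As a variation (a minimal sketch, not part of the original answer), the text node that immediately follows the label can be selected instead. The helper name `field_after_label` is hypothetical, and the sketch assumes the value always comes directly after the label span, as in the question's snippet.

```python
from lxml import html

# Sample markup copied from the question.
SNIPPET = '''
<span style="font-family: arial;"><span style="font-weight: bold;">Artist:</span> M.I.A.<br>
</span>
'''


def field_after_label(doc, label):
    """Return the text that directly follows <span>label</span>, or None.

    Hypothetical helper for illustration; assumes the value is the first
    text node after the label span.
    """
    # following-sibling::text()[1] selects only the first text node after the
    # label span, so other text inside the parent is ignored.
    # $label is an XPath variable filled in from the keyword argument.
    nodes = doc.xpath(
        "//span[normalize-space(text())=$label]/following-sibling::text()[1]",
        label=label,
    )
    return nodes[0].strip() if nodes else None


doc = html.fromstring(SNIPPET)
print(field_after_label(doc, "Artist:"))
```

Passing the label as an XPath variable avoids building the expression by string concatenation, and returning None lets the caller detect pages where the label is missing.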

A similar question to "python - Extracting a tag's value in BeautifulSoup when it cannot be matched by position or attribute" can be found on Stack Overflow: https://stackoverflow.com/questions/3422770/
