python - How do I extract an HTML tag pattern from a string multiple times?

Tags: python html regex

I already have a pattern, and I want to search a string for every match of it. After calling findall(), only the last match is returned.

The string I want to process is:

'<inventor sequence="001" designation="us-only"><addressbook><last-name>Li</last-name><first-name>Shuo</first-name><address><city>Beijing</city><country>CN</country></address></addressbook></inventor><inventor sequence="002" designation="us-only"><addressbook><last-name>Liu</last-name><first-name>Xin Peng</first-name><address><city>Beijing</city><country>CN</country></address></addressbook></inventor><inventor sequence="003" designation="us-only"><addressbook><last-name>Sun</last-name><first-name>Sheng Yan</first-name><address><city>Beijing</city><country>CN</country></address></addressbook></inventor><inventor sequence="004" designation="us-only"><addressbook><last-name>Wang</last-name><first-name>Hua</first-name><address><city>Littleton</city><state>MA</state><country>US</country></address></addressbook></inventor><inventor sequence="005" designation="us-only"><addressbook><last-name>Wang</last-name><first-name>Jun</first-name><address><city>Littleton</city><state>MA</state><country>US</country></address></addressbook></inventor>'

I tried to extract all inventors from the string with the following code.

INVENTORS_CONTENT_PATTERN = re.compile('<inventor sequence=".*" designation=".*">(.*?)</inventor>')

re.findall(INVENTORS_CONTENT_PATTERN, data)

The result I get is only the last match, not all of the inventors in the data:

['<addressbook><last-name>Wang</last-name><first-name>Jun</first-name><address><city>Littleton</city><state>MA</state><country>US</country></address></addressbook>']

Best answer

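This expression may be closer to what you have in mind. In the original pattern, each .* inside the attribute values is greedy and can run past quotes and tags, so the attribute part of the match swallows everything up to the last inventor and only that one is captured. Replacing .* with [^"]* keeps each attribute value inside its own quotes: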

<inventor sequence="[^"]*" designation="[^"]*">(.*?)<\/inventor>
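The difference between a greedy .* and the bounded [^"]* can be checked on a small input. The following is a minimal sketch, assuming a shortened two-inventor string (the sample and the names greedy and bounded are made up for illustration and are not part of the original data):

```python
import re

# Shortened, hypothetical sample with two inventors (not the full data).
sample = (
    '<inventor sequence="001" designation="us-only">'
    '<last-name>Li</last-name></inventor>'
    '<inventor sequence="002" designation="us-only">'
    '<last-name>Liu</last-name></inventor>'
)

# Greedy .* in the attribute values can cross quotes and tags.
greedy = re.compile(r'<inventor sequence=".*" designation=".*">(.*?)</inventor>')

# [^"]* stops at the closing quote of each attribute value.
bounded = re.compile(r'<inventor sequence="[^"]*" designation="[^"]*">(.*?)</inventor>')

greedy_matches = greedy.findall(sample)
bounded_matches = bounded.findall(sample)

print(greedy_matches)   # only the last inventor's content
print(bounded_matches)  # one entry per inventor
```

With the greedy pattern, the first .* runs to the last designation attribute in the string, which is why a single match remains.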

Test

import re

regex = r'<inventor sequence="[^"]*" designation="[^"]*">(.*?)<\/inventor>'
test_str = """
<inventor sequence="001" designation="us-only"><addressbook><last-name>Li</last-name><first-name>Shuo</first-name><address><city>Beijing</city><country>CN</country></address></addressbook></inventor><inventor sequence="002" designation="us-only"><addressbook><last-name>Liu</last-name><first-name>Xin Peng</first-name><address><city>Beijing</city><country>CN</country></address></addressbook></inventor><inventor sequence="003" designation="us-only"><addressbook><last-name>Sun</last-name><first-name>Sheng Yan</first-name><address><city>Beijing</city><country>CN</country></address></addressbook></inventor><inventor sequence="004" designation="us-only"><addressbook><last-name>Wang</last-name><first-name>Hua</first-name><address><city>Littleton</city><state>MA</state><country>US</country></address></addressbook></inventor><inventor sequence="005" designation="us-only"><addressbook><last-name>Wang</last-name><first-name>Jun</first-name><address><city>Littleton</city><state>MA</state><country>US</country></address></addressbook></inventor>

"""
print(re.findall(regex, test_str))

Output

['<addressbook><last-name>Li</last-name><first-name>Shuo</first-name><address><city>Beijing</city><country>CN</country></address></addressbook>', '<addressbook><last-name>Liu</last-name><first-name>Xin Peng</first-name><address><city>Beijing</city><country>CN</country></address></addressbook>', '<addressbook><last-name>Sun</last-name><first-name>Sheng Yan</first-name><address><city>Beijing</city><country>CN</country></address></addressbook>', '<addressbook><last-name>Wang</last-name><first-name>Hua</first-name><address><city>Littleton</city><state>MA</state><country>US</country></address></addressbook>', '<addressbook><last-name>Wang</last-name><first-name>Jun</first-name><address><city>Littleton</city><state>MA</state><country>US</country></address></addressbook>']
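If the attribute values are needed alongside the inner content, the same expression can be given named groups and used with re.finditer. This is only a sketch under assumptions: the group names sequence, designation and content, and the shortened sample, are illustrative choices rather than anything from the original question.

```python
import re

# Same bounded pattern as above, with hypothetical group names added.
INVENTOR_RE = re.compile(
    r'<inventor sequence="(?P<sequence>[^"]*)" '
    r'designation="(?P<designation>[^"]*)">'
    r'(?P<content>.*?)</inventor>'
)

# Shortened, made-up sample with two inventors.
sample = (
    '<inventor sequence="001" designation="us-only">'
    '<last-name>Li</last-name></inventor>'
    '<inventor sequence="002" designation="us-only">'
    '<last-name>Liu</last-name></inventor>'
)

# finditer yields match objects, so every named group is available.
records = [m.groupdict() for m in INVENTOR_RE.finditer(sample)]

for record in records:
    print(record["sequence"], record["content"])
```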

If you wish to explore, simplify, or modify the expression, it is explained in the top-right panel of regex101.com. There you can also watch how it matches against sample inputs.
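As a different technique from the regex answer above, and only if the data is well-formed XML, a parser can read the fields directly. The sketch below uses the standard library's xml.etree.ElementTree; the wrapping root element, the dictionary keys, and the shortened two-inventor sample are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Shortened sample with two of the inventors from the question's data.
data = (
    '<inventor sequence="001" designation="us-only"><addressbook>'
    '<last-name>Li</last-name><first-name>Shuo</first-name>'
    '<address><city>Beijing</city><country>CN</country></address>'
    '</addressbook></inventor>'
    '<inventor sequence="004" designation="us-only"><addressbook>'
    '<last-name>Wang</last-name><first-name>Hua</first-name>'
    '<address><city>Littleton</city><state>MA</state><country>US</country></address>'
    '</addressbook></inventor>'
)

# The fragment has several top-level elements, so wrap it in a
# hypothetical <root> element to make it a single XML document.
root = ET.fromstring(f"<root>{data}</root>")

inventors = [
    {
        "sequence": inv.get("sequence"),
        "last_name": inv.findtext("addressbook/last-name"),
        "first_name": inv.findtext("addressbook/first-name"),
        "city": inv.findtext("addressbook/address/city"),
        # findtext returns None when the element is absent.
        "state": inv.findtext("addressbook/address/state"),
    }
    for inv in root.findall("inventor")
]

for inventor in inventors:
    print(inventor)
```

A parser does not depend on attribute order or spacing, which the regular expression does, but it will raise an error on input that is not well-formed.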


Regarding "python - How do I extract an HTML tag pattern from a string multiple times?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/57634546/
