How can I use regular expressions, or a toolkit such as BeautifulSoup or lxml, to parse a sentence like this:
input = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""
into this:
Yesterday
<person>Peter Smith</person>
drove
to
<location>New York</location>
I can't use re.findall("<person>(.*?)</person>", input)
because the tag names differ.
Best answer
See how easy it is with BeautifulSoup:
from bs4 import BeautifulSoup
data = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""
soup = BeautifulSoup(data, 'html.parser')
for item in soup:
    print(item)
Prints:
Yesterday
<person>Peter Smith</person>
drove to
<location>New York</location>
UPD (split the non-tag items on whitespace and print each part on its own line):
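from bs4 import Tag  # needed for the isinstance check below

soup = BeautifulSoup(data, 'html.parser')
for item in soup:
    if not isinstance(item, Tag):
        # plain text node: print each whitespace-separated word on its own line
        for part in item.split():
            print(part)
    else:
        # tag element: print it whole
        print(item)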
Prints:
Yesterday
<person>Peter Smith</person>
drove
to
<location>New York</location>
Hope that helps.
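Since the question also asks about regular expressions, here is a minimal regex-only sketch for comparison. It is not part of the answer above. The names TOKEN and tokenize are made up for illustration. It assumes tags carry no attributes and that a tag is never nested inside another tag of the same name; for anything less regular, a real parser such as BeautifulSoup is the safer choice.

```python
import re

data = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""

# Two alternatives, tried in order at each position:
#   1. a complete element <name>...</name>; the backreference \1 forces the
#      closing tag to carry the same name as the opening tag
#   2. a run of characters containing neither whitespace nor "<" (a plain word)
# TOKEN and tokenize are hypothetical names used only for this sketch.
TOKEN = re.compile(r"<(\w+)>.*?</\1>|[^\s<]+")


def tokenize(text):
    """Return tagged elements and plain words in document order."""
    # finditer + group(0) is used instead of findall, because findall would
    # return only the captured tag name rather than the whole match.
    return [match.group(0) for match in TOKEN.finditer(text)]


for token in tokenize(data):
    print(token)
```

The backreference is what gets around the problem of differing tag names: one pattern covers both person and location without listing them.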
Regarding "python - How to parse a sentence into tokens using regular expressions or a toolkit", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/22665065/