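python - How to parse a sentence into tokens using regex or a toolkit

Tags: python regex xml-parsing beautifulsoup lxml

How can I use regular expressions, or a toolkit such as beautifulsoup or lxml, to parse a sentence like this: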

input = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""

into this:

Yesterday
<person>Peter Smith</person>
drove
to
<location>New York</location>

I can't use re.findall("<person>(.*?)</person>", input) because the tags differ.

Best answer

See how easy it is with BeautifulSoup:

from bs4 import BeautifulSoup

data = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""

soup = BeautifulSoup(data, 'html.parser')
# Iterating over the soup yields its top-level nodes: text strings and tags.
for item in soup:
    print(item)

Prints:

Yesterday
<person>Peter Smith</person>
drove to
<location>New York</location>

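UPD (splitting non-tag items on whitespace and printing each part on a new line):

from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup(data, 'html.parser')
for item in soup:
    if not isinstance(item, Tag):
        # Plain text node: split on whitespace and print each word.
        for part in item.split():
            print(part)
    else:
        # Tag node: print the whole element unchanged.
        print(item)

Prints: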

Yesterday
<person>Peter Smith</person>
drove
to
<location>New York</location>
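
If installing a third-party package is not an option, the same text-versus-tag split can be sketched with the standard library's xml.etree.ElementTree. This is only an illustration under some assumptions: the fragment must be well-formed XML once wrapped in a made-up <root> element, and the helper name tokenize_xml is hypothetical, not part of the original answer.

```python
import xml.etree.ElementTree as ET

data = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""

def tokenize_xml(fragment):
    """Split a tagged sentence into words and whole top-level elements."""
    # Wrap the fragment in a hypothetical <root> so it parses as one XML document.
    root = ET.fromstring("<root>" + fragment + "</root>")
    # Text before the first child element lives in root.text.
    tokens = (root.text or "").split()
    for child in root:
        # Text following an element is stored in its .tail; detach it first,
        # because ET.tostring() would otherwise append it to the element.
        tail, child.tail = child.tail, None
        tokens.append(ET.tostring(child, encoding="unicode"))
        tokens.extend((tail or "").split())
    return tokens

for token in tokenize_xml(data):
    print(token)
```

Unlike the html.parser approach above, this raises xml.etree.ElementTree.ParseError on input that is not well-formed.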

Hope that helps.
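
For completeness, since the question also asks about regular expressions: a regex can cope with differing tag names by capturing the opening tag name and reusing it through a backreference. The following is a rough sketch, not a general XML parser. It assumes tags have no attributes and are not nested inside a tag of the same name, and the names TOKEN_RE and tokenize_regex are made up for this example.

```python
import re

data = """Yesterday<person>Peter Smith</person>drove to<location>New York</location>"""

# Alternative 1: a complete <tag>...</tag> element; the backreference \1
# forces the closing tag name to equal the opening one.
# Alternative 2: a run of characters that are neither whitespace nor "<".
TOKEN_RE = re.compile(r"<(\w+)>.*?</\1>|[^<\s]+")

def tokenize_regex(text):
    """Return plain words and whole tagged elements in document order."""
    return [match.group(0) for match in TOKEN_RE.finditer(text)]

for token in tokenize_regex(data):
    print(token)
```

finditer with group(0) is used rather than findall because, with a capturing group in the pattern, findall would return only the captured tag names instead of the full matches.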

This question and answer come from a similar question on Stack Overflow: https://stackoverflow.com/questions/22665065/
