python - Traversing TEI in Python 3, text comes up empty for some entities

I have a TEI-encoded XML file with elements like the following:

<sp>
    <speaker rend="italic">Sampson.</speaker>
    <ab>
         <lb n="5"/>
         <hi rend="italic">Gregory:</hi>
         <seg type="homograph">A</seg> my word wee'l not carry coales.<lb n="6"/>
    </ab>
</sp>
<sp>
     <speaker rend="italic">Greg.</speaker>
     <ab>No, for then we should be Colliars.
         <lb n="7" rend="rj"/>
     </ab>
</sp>

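The full file is very large, but it can be accessed here: http://ota.ox.ac.uk/desc/5721 . I am trying to use Python 3 to traverse the XML and get all the text associated with the <ab> tags, which is where the dialogue is found.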

import xml.etree.ElementTree as etree
tree = etree.parse('romeo_juliet_5721.xml')
doc = tree.getroot()
for i in doc.iter(tag='{http://www.tei-c.org/ns/1.0}ab'):
    print(i.tag, i.text)
>>> {http://www.tei-c.org/ns/1.0}ab
>>>
>>> {http://www.tei-c.org/ns/1.0}ab No, for then we should be Colliars.

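The output finds the <ab> elements correctly, but it does not treat "my word wee'l not carry coales" as text of the first <ab>. If the text sits after or inside a different element, I cannot see it. I have considered converting the whole element to a string and extracting the text with a regular expression (or by stripping all the XML tags), but I would rather understand what is happening here. Thanks for any help you can offer.

Best answer

That is because, in the ElementTree model, the text "my word wee'l not carry coales." is treated as the tail of the <seg> element rather than as the text of <ab>. An element's text attribute holds only the text between its opening tag and its first child; text that follows a child's closing tag is stored in that child's tail attribute. To get the element's text together with the tails of its children, you can try this: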

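for i in doc.iter(tag='{http://www.tei-c.org/ns/1.0}ab'):
    # i.text may be None; each child's tail is the text that follows its closing tag
    tails = ''.join((child.tail or '') for child in i)
    inner_text = ((i.text or '') + tails).strip()
    print(i.tag, inner_text)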
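
The loop above collects only the text of <ab> and the tails of its direct children, so text held inside the children themselves (such as "Gregory:" in <hi>) is left out. If that text is wanted as well, Element.itertext() walks every text and tail fragment inside an element in document order. The following is a minimal sketch, not part of the original answer: the <body> wrapper, the SAMPLE string, and the full_text helper are assumptions made so the example runs without the full file.

```python
import xml.etree.ElementTree as ET

# Fragment from the question, wrapped in a hypothetical <body> root that
# declares the TEI namespace so the snippet parses on its own.
SAMPLE = """<body xmlns="http://www.tei-c.org/ns/1.0">
<sp>
    <speaker rend="italic">Sampson.</speaker>
    <ab>
         <lb n="5"/>
         <hi rend="italic">Gregory:</hi>
         <seg type="homograph">A</seg> my word wee'l not carry coales.<lb n="6"/>
    </ab>
</sp>
<sp>
     <speaker rend="italic">Greg.</speaker>
     <ab>No, for then we should be Colliars.
         <lb n="7" rend="rj"/>
     </ab>
</sp>
</body>"""

TEI = "{http://www.tei-c.org/ns/1.0}"


def full_text(element):
    """Join all text and tail fragments inside element, collapsing whitespace.

    (Hypothetical helper; the element's own tail is not included.)
    """
    return " ".join("".join(element.itertext()).split())


root = ET.fromstring(SAMPLE)

# The "missing" words are stored as the tail of <seg>, not as text of <ab>.
seg = root.find(".//" + TEI + "seg")
print(repr(seg.tail))

# itertext() gathers text and tails of all descendants of each <ab>.
lines = [full_text(ab) for ab in root.iter(TEI + "ab")]
for line in lines:
    print(line)
```

Collapsing whitespace with split() and join() is a design choice for readable output; drop it if the original line breaks and indentation matter.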

Regarding "python - Traversing TEI in Python 3, text comes up empty for some entities", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/37062825/
