python-3.x - How to extract text inserted with track changes in python-docx

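I want to extract text from a Word document that was edited in "Track Changes" mode. I want to extract the inserted text and ignore the deleted text.

Running the code below, I see that paragraphs inserted in "Track Changes" mode return an empty Paragraph.text: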

import docx

doc = docx.Document('C:\\test track changes.docx')

for para in doc.paragraphs:
    print(para)
    print(para.text)

Is there a way to retrieve the text inside tracked insertions (w:ins elements)?

I am using python-docx 0.8.6, lxml 3.4.0, Python 3.4, on Windows 7.

Thanks

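Best answer

I had been running into the same problem for years (perhaps for as long as this question has existed).

By looking at the "etienned" code posted by @yiftah and at the attributes of Paragraph, I found a solution that retrieves the text as it reads after accepting the changes.

The trick is to use p._p.xml to get the paragraph's XML, and then apply the "etienned" code to it (that is, retrieve all <w:t> elements from the XML, which covers both regular runs and <w:ins> blocks). Deleted text is stored in <w:delText> elements rather than <w:t>, so it is skipped automatically.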
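To make the idea concrete, here is a minimal sketch using only the standard library. The paragraph XML is hand-written for illustration (an assumption, not taken from a real document), and `collect_text` is a hypothetical helper name. It shows that iterating over `<w:t>` elements picks up normal and inserted runs while leaving out `<w:delText>`.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hand-written sample paragraph (illustrative assumption):
# "Hello " and " world" are normal runs, "old" was deleted, "new" was inserted.
SAMPLE_PARAGRAPH = f"""
<w:p xmlns:w="{W}">
  <w:r><w:t xml:space="preserve">Hello </w:t></w:r>
  <w:del w:id="1" w:author="A"><w:r><w:delText>old</w:delText></w:r></w:del>
  <w:ins w:id="2" w:author="A"><w:r><w:t>new</w:t></w:r></w:ins>
  <w:r><w:t xml:space="preserve"> world</w:t></w:r>
</w:p>
""".strip()


def collect_text(xml_string):
    """Join the text of every <w:t> element, in document order."""
    root = ET.fromstring(xml_string)
    # Deleted text lives in <w:delText>, so it never matches the <w:t> tag.
    return "".join(node.text for node in root.iter(f"{{{W}}}t") if node.text)


print(collect_text(SAMPLE_PARAGRAPH))
```

With python-docx, the same XML string would come from `p._p.xml` for each paragraph `p`.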

Hope it helps other lost souls like me:

from docx import Document

# xml.etree.cElementTree was removed in Python 3.9; ElementTree already
# uses the C accelerator when it is available.
from xml.etree.ElementTree import XML


WORD_NAMESPACE = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
TEXT = WORD_NAMESPACE + "t"


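def get_accepted_text(p):
    """Return text of a paragraph after accepting all changes"""
    xml = p._p.xml
    if "w:del" in xml or "w:ins" in xml:
        tree = XML(xml)
        # Element.getiterator() was removed in Python 3.9; iter() is the replacement.
        # <w:t> holds normal and inserted text; deleted text is in <w:delText>.
        runs = (node.text for node in tree.iter(TEXT) if node.text)
        return "".join(runs)
    else:
        # No tracked changes in this paragraph, so p.text is already complete.
        return p.text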


doc = Document("Hello.docx")

for p in doc.paragraphs:
    print(p.text)
    print("---")
    print(get_accepted_text(p))
    print("=========")
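The question also asks specifically for the inserted text. A possible variation, sketched below under assumptions, restricts the search to `<w:t>` elements nested inside `<w:ins>`. The function name `get_inserted_text` and the sample XML are hypothetical and not part of the original answer; the function takes the paragraph's XML as a string so it can be tried without a .docx file.

```python
import xml.etree.ElementTree as ET

W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def get_inserted_text(paragraph_xml):
    """Return only the text found inside <w:ins> elements of one paragraph."""
    root = ET.fromstring(paragraph_xml)
    parts = []
    for ins in root.iter(W_NS + "ins"):
        # An insertion wraps one or more runs; collect their <w:t> text.
        for t in ins.iter(W_NS + "t"):
            if t.text:
                parts.append(t.text)
    return "".join(parts)


# Hand-written sample (illustrative assumption): "old" deleted, "new" inserted.
SAMPLE = (
    '<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:r><w:t xml:space="preserve">Hello </w:t></w:r>'
    '<w:del w:id="1" w:author="A"><w:r><w:delText>old</w:delText></w:r></w:del>'
    '<w:ins w:id="2" w:author="A"><w:r><w:t>new</w:t></w:r></w:ins>'
    '</w:p>'
)

print(get_inserted_text(SAMPLE))
```

With python-docx this would be called as `get_inserted_text(p._p.xml)` for each paragraph `p` in `doc.paragraphs`. The sketch assumes `<w:ins>` elements are not nested inside one another; if they were, the nested text would be counted twice.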

A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/38247251/
