python - Word 文档中的正则表达式

我有一个大文本文档，引号不一致，即

...Dolore magna aliquam “lorem ipsum” dolor sit amet, 'consectetuer adipiscing" elit, volutpat. Ut "wisi" enim...

我想将每种样式的引号转换为 Guillemet 样式(» 和 «)，因此示例句子应该类似于

...Dolore magna aliquam »lorem ipsum« dolor sit amet, »consectetuer adipiscing« elit, volutpat. Ut »wisi« enim...

我正在考虑正则表达式

`` ["'”“](.*?)["'”“]``

但我只知道如何用Python编写代码。有没有办法在 Python 中实现这一点？如果没有，有人可以提供有关如何直接在 MS Word 中完成此操作的提示。我尝试使用查找/替换和通配符，但使用引号的不一致让我感到困扰。

最佳答案

尝试这个模式:

([“'"](?=[a-zA-Z\,\.\s])([a-zA-Z\,\.\s]*)[”'"])

替换:

»$2«

编辑:既然你提到了Python，我就想出了一些绝对可行的东西:

#!/usr/bin/python
# coding: utf-8
import os, sys
import re
import codecs

with codecs.open('/path/to/file.txt', 'r', 'utf-8') as f:
    encoded = f.read()
    encoded = encoded.replace( u'\u201c', u'\"')
    encoded = encoded.replace( u'\u201d', u'\"')
    encoded = encoded.encode('utf-8')
    result = re.sub('(\s[\“\'\"](?=[a-zA-Z\,\.\s]*)([a-zA-Z\,\.\s]*)[\”\'\"]\s)', ' »\\2« ', encoded)
    decoded_result = result.decode('utf-8')
    print format(decoded_result)

将 /path/to/file.txt 替换为文件的位置(使用 utf-8 编码保存)。

由于标点符号中使用的字符编码，上面的代码执行了一些与标准搜索和替换不同的操作。可能有一种更简洁的方法来获得相同的最终结果，尽管整个编码过程对于 Python 来说很棘手，所以这是任何人的猜测。

关于python - Word 文档中的正则表达式，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/22751134/

python - Word 文档中的正则表达式

上一篇：python - 将嵌套字典转换为数据框

下一篇：python - 为什么Python的pprint模块会检查sys.modules中的 `"区域设置？