python - Cleaning text with Python and re

I need to clean some text, as in the code below:

import re
def clean_text(text):
    text = text.lower()
    # replacement rules
    text = re.sub(r"i'm","i am",text)
    text = re.sub(r"she's","she is",text)
    text = re.sub(r"can't","cannot",text)
    text = re.sub(r"[-()\"#/@;:<>{}-=~|.?,]","",text)
    return text

clean_questions= []
for question in questions: 
    clean_questions.append(clean_text(question))

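This code should give me a cleaned list of questions, but clean_questions comes out empty. I reopened Spyder and the list was full but not cleaned; then I reopened it again and it was empty. The console error says: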

In [10] :clean_questions= [] 
   ...: for question in questions: 
   ...: clean_questions.append(clean_text(question))
Traceback (most recent call last):

  File "<ipython-input-6-d1c7ac95a43f>", line 3, in <module>
    clean_questions.append(clean_text(question))

  File "<ipython-input-5-8f5da8f003ac>", line 16, in clean_text
    text = re.sub(r"[-()\"#/@;:<>{}-=~|.?,]","",text)

  File "C:\Users\hp\Anaconda3\lib\re.py", line 192, in sub
    return _compile(pattern, flags).sub(repl, string, count)

  File "C:\Users\hp\Anaconda3\lib\re.py", line 286, in _compile
   p = sre_compile.compile(pattern, flags)

  File "C:\Users\hp\Anaconda3\lib\sre_compile.py", line 764, in compile
    p = sre_parse.parse(p, flags)

  File "C:\Users\hp\Anaconda3\lib\sre_parse.py", line 930, in parse
    p = _parse_sub(source, pattern, flags & SRE_FLAG_VERBOSE, 0)

  File "C:\Users\hp\Anaconda3\lib\sre_parse.py", line 426, in _parse_sub
    not nested and not items))

  File "C:\Users\hp\Anaconda3\lib\sre_parse.py", line 580, in _parse
    raise source.error(msg, len(this) + 1 + len(that))

error: bad character range }-=

I am using Python 3.6, specifically the Anaconda build Anaconda3-2018.12-Windows-x86_64.

Best answer

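Your character class (as shown in the traceback) is invalid. } comes after = in ordinal value (} is 125, = is 61), and the - between them means the pattern is asking for every character from the ordinal of } through the ordinal of =. Character ranges must run from the lower ordinal to the higher one, so 125->61 is meaningless, hence the error.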

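In a way you got lucky: if the characters around the - had been reversed, e.g. =-}, you would have silently removed every character from ordinal 61 through 125, which covers all the standard ASCII letters, both lowercase and uppercase, plus a good deal of punctuation.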
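Both behaviors are easy to reproduce. The following is a minimal sketch; the sample string and variable names are made up for illustration:

```python
import re

# The reversed range }-= (125 down to 61) is rejected at compile time.
try:
    re.compile(r"[}-=]")
    range_error = None
except re.error as exc:
    range_error = str(exc)

# The valid range =-} covers ordinals 61..125, so it silently deletes
# every ASCII letter along with punctuation such as = ? @ [ ] ^ _ { | }.
sample = "hello world = 42"
stripped = re.sub(r"[=-}]", "", sample)

print(range_error)
print(repr(stripped))
```

Only the spaces and digits survive in `stripped`, because they sit below ordinal 61.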

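You can fix this by removing the second - from your character class (you already included one at the start of the class, where it does not need escaping), changing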

text = re.sub(r"[-()\"#/@;:<>{}-=~|.?,]", "", text)

to

text = re.sub(r"[-()\"#/@;:<>{}=~|.?,]", "", text)
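If you do stay with a regex, one option that is not part of the fix above is to let `re.escape` build the character class from a plain literal, so nothing has to be escaped by hand. This is only a sketch; the names `PUNCT_TO_REMOVE`, `PUNCT_RE` and `strip_punct` are hypothetical:

```python
import re

# Characters to delete, written as an ordinary string literal
# (the same set as the character class in the question).
PUNCT_TO_REMOVE = '-()"#/@;:<>{}=~|.?,'

# re.escape backslash-escapes anything that could be special, so
# a - can never form an accidental range inside the [...] class.
PUNCT_RE = re.compile("[" + re.escape(PUNCT_TO_REMOVE) + "]")

def strip_punct(text):
    """Remove every character listed in PUNCT_TO_REMOVE from text."""
    return PUNCT_RE.sub("", text)

print(strip_punct("what-is (this)? a/b {x}=y"))
```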

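But I would suggest dropping the regex here. With this much literal punctuation the risk of mistakes is high, and there are approaches that do not involve regular expressions at all, that should work just fine, and that never leave you wondering whether you escaped everything that matters (the alternative is to over-escape, which makes the regex unreadable and is still error-prone).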

Instead, replace that line with a simple str.translate call. First, outside the function, make a translation table of the things to remove:

# The redundant - is harmless here since the result is a dict which dedupes anyway
killpunctuation = str.maketrans('', '', r'-()"#/@;:<>{}-=~|.?,')

Then replace the line:

text = re.sub(r"[-()\"#/@;:<>{}-=~|.?,]","",text)

with:

text = text.translate(killpunctuation)

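It should run at least as fast as the regex (probably faster), and it is far less error-prone because no character has a special meaning. A translation table is just a mapping from Unicode ordinals to None (delete the character), to another ordinal (single-character replacement), or to a string (character -> multi-character replacement); there is no concept of special escapes. If the goal is to remove all ASCII punctuation, you are better off using the string module constant to define the translation table. That also makes the code more self-documenting, so nobody has to wonder whether you meant to remove all punctuation or only some of it, and whether that was intentional: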

import string
killpunctuation = str.maketrans('', '', string.punctuation)
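To make the three kinds of table values concrete, here is a small sketch. The table `demo_table` and the sample strings are illustrative and not taken from the question:

```python
import string

# A translation table is a dict keyed by Unicode ordinal.
demo_table = {
    ord("?"): None,        # None: delete the character
    ord("-"): ord(" "),    # another ordinal: one-for-one replacement
    ord("&"): " and ",     # a string: one character becomes several
}
translated = "salt&pepper-mill?".translate(demo_table)

# The three-argument form of str.maketrans builds a delete-only table:
# every character in the third argument maps to None.
kill_all = str.maketrans("", "", string.punctuation)
no_punct = "Hello, world! (really?)".translate(kill_all)

print(translated)
print(no_punct)
```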

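As it happens, your existing string does not remove all punctuation (it misses ^, !, $ and others), so this change may not be correct for you; if it is, definitely make it. If the set is meant to be a subset of punctuation, be sure to add a comment explaining how that subset was chosen, so maintainers do not wonder whether you made a mistake.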
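Putting the pieces together, the function from the question might look roughly like this. It is a sketch under a few assumptions: the sample `questions` list is invented, the constant names are hypothetical, and the contraction rules use str.replace because they are plain literals:

```python
# Deliberate subset of ASCII punctuation, copied from the character
# class in the question (minus the duplicate -). It does not include
# ^, !, $ and others; extend it if those should go too.
PUNCT_TO_REMOVE = '-()"#/@;:<>{}=~|.?,'
KILL_PUNCTUATION = str.maketrans("", "", PUNCT_TO_REMOVE)

def clean_text(text):
    """Lowercase, expand a few contractions, then drop selected punctuation."""
    text = text.lower()
    # Literal replacements, so no regex is needed for these either.
    text = text.replace("i'm", "i am")
    text = text.replace("she's", "she is")
    text = text.replace("can't", "cannot")
    return text.translate(KILL_PUNCTUATION)

# Made-up sample input standing in for the real data.
questions = ["I'm here, can't you see?", "She's at home (again)."]
clean_questions = [clean_text(question) for question in questions]

print(clean_questions)
```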

Regarding python - Cleaning text with Python and re, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/55187374/
