Python regex: remove non-ASCII characters and words ending in number

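I am trying to replace all non-ASCII characters (accents, symbols) in a string with "", and then replace all words that end in a number.

I thought r'\W|\b[^a-z]*[^a-z]\b' would do it, because I read it as "remove non-ASCII characters, or remove whole words that start with zero or more non-letters and end with a non-letter". By non-letter I mean anything that does not match [a-z]. However, "hey2", "a2", and "1a3" are still there: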

>>> import re
>>> # these words should all be removed:
>>> re.sub(r'\W|\b[^a-z]*[^a-z]\b', ' ', "1 123 - hey2 a2 1a3 ".lower())
' hey2 a2 1a3 '
>>> # these words should all be kept (this one works):
>>> re.sub(r'\W|\b[^a-z]*[^a-z]\b', ' ', "1st first a2a 2bb esta' ".lower())
'1st first a2a 2bb esta  '

What am I missing?

Best answer

Remove non-unicode characters and words ending in number

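It seems you want to remove any non-word characters (matched by the \W pattern) and any "words" (sequences of letters, digits, or _, matched by the \w pattern) that end in a digit.

So, you can use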

re.sub(r'\W|\b\w*\d\b', ' ', s)
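
As a quick check, the pattern can be run against the two test strings from the question. This is a minimal sketch: the helper name `clean` is made up for illustration, and the final `split()` is an extra step (not part of the answer's pattern) that only makes the leftover whitespace easier to read.

```python
import re

# Pattern from the answer: a non-word character, or a whole word ending in a digit.
PATTERN = re.compile(r'\W|\b\w*\d\b')

def clean(text):
    """Hypothetical helper: apply the substitution, then split on whitespace."""
    replaced = PATTERN.sub(' ', text.lower())
    return replaced.split()

# Every token here is punctuation or ends in a digit, so nothing survives.
print(clean("1 123 - hey2 a2 1a3 "))        # []
# None of these words ends in a digit, so all are kept; the apostrophe is dropped.
print(clean("1st first a2a 2bb esta' "))    # ['1st', 'first', 'a2a', '2bb', 'esta']
```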

Note that if you are working with Unicode strings in Python 2.x, you need to pass the re.UNICODE flag to make \W and \w Unicode-aware.

Pattern details
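
In Python 3 the situation is reversed: `str` patterns are Unicode-aware by default, so `\W` does not match accented letters such as `é`. If the goal is literally to drop non-ASCII characters, one option (an assumption about the intent, not something the answer states) is the `re.ASCII` flag, which restricts `\w`, `\W`, `\d`, and `\b` to ASCII. A minimal sketch with a made-up sample string:

```python
import re

sample = "café niño abc1"  # illustrative input, not from the question

# Default in Python 3: é and ñ count as word characters and are kept.
unicode_aware = re.sub(r'\W|\b\w*\d\b', ' ', sample)

# With re.ASCII: é and ñ are non-word characters and are replaced too.
ascii_only = re.sub(r'\W|\b\w*\d\b', ' ', sample, flags=re.ASCII)

print(unicode_aware.split())  # ['café', 'niño']
print(ascii_only.split())     # ['caf', 'ni', 'o']
```

Note that in ASCII mode the accented letter splits the word around it, so whether this is acceptable depends on the data.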

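\W - a non-word character (any character that is not a letter, digit, or _)
| - or
\b - a leading word boundary
\w* - zero or more (*) word characters
\d - a digit
\b - a trailing word boundary.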
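
The breakdown above maps directly onto a commented pattern. The sketch below rewrites the same expression with `re.VERBOSE` and uses `re.findall` to list what each pattern matches. It also shows why the question's version skips `hey2`: `[^a-z]*[^a-z]` cannot consume the letters `hey`, whereas `\w*` can. The variable names are illustrative only.

```python
import re

# The answer's pattern, spelled out piece by piece.
ANSWER = re.compile(r"""
    \W          # a non-word character (not a letter, digit, or underscore)
  |             # or
    \b          # leading word boundary
    \w*         # zero or more word characters
    \d          # a digit
    \b          # trailing word boundary
""", re.VERBOSE)

# The pattern from the question, for comparison.
QUESTION = re.compile(r'\W|\b[^a-z]*[^a-z]\b')

text = "hey2 a2 1a3"

print(ANSWER.findall(text))    # ['hey2', ' ', 'a2', ' ', '1a3']
print(QUESTION.findall(text))  # [' ', ' ']
```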

Note that if you want to treat the _ character as a non-word character, replace \W with [\W_] and \w with [^\W_].
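
A sketch of the underscore variant follows. One caveat worth checking against your own data: `\b` still treats `_` as a word character, so a digit that directly follows an underscore (as in `a_2`) is not seen as a separate word. The lookaround-based pattern below is one possible workaround, not part of the original answer.

```python
import re

# Variant from the note: _ is treated as a non-word character.
UNDERSCORE = re.compile(r'[\W_]|\b[^\W_]*\d\b')

# Possible workaround (an assumption, not from the answer): replace \b with
# lookarounds that use the same "letter or digit" class as the word itself.
STRICT = re.compile(r'[\W_]|(?<![^\W_])[^\W_]*\d(?![^\W_])')

print(UNDERSCORE.sub(' ', "snake_case ab1").split())  # ['snake', 'case']
print(UNDERSCORE.sub(' ', "a_2").split())             # ['a', '2']
print(STRICT.sub(' ', "a_2").split())                 # ['a']
```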

Regarding "Python regex: remove non-ASCII characters and words ending in number", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/45451446/
