python - Matching Unicode characters in a Python regular expression

Tags: python regex unicode non-ascii-characters character-properties

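I have read the other questions on Stack Overflow but am still no closer to a solution. Apologies if this has already been answered; none of the suggestions I found worked for me.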

>>> import re
>>> m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/xmas/xmas1.jpg')
>>> print m.groupdict()
{'tag': 'xmas', 'filename': 'xmas1.jpg'}

All is well. Then I try something with Norwegian characters in it (or, more generally, non-ASCII Unicode characters):

>>> m = re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', '/by_tag/påske/øyfjell.jpg')
>>> print m.groupdict()
Traceback (most recent call last):
  File "<interactive input>", line 1, in <module>
AttributeError: 'NoneType' object has no attribute 'groupdict'

How can I match typical Unicode characters such as øæå? I would like to match these characters in both the tag group and the filename group above.

Best answer

You need to specify the re.UNICODE flag and pass your input as a Unicode string by using the u prefix:

>>> re.match(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$', u'/by_tag/påske/øyfjell.jpg', re.UNICODE).groupdict()
{'tag': u'p\xe5ske', 'filename': u'\xf8yfjell.jpg'}

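This is for Python 2. In Python 3 every str is already Unicode, so the u prefix is unnecessary (Python 3.0 to 3.2 reject it as a syntax error; 3.3 and later accept it as a no-op), and you can omit the re.UNICODE flag because it is the default for str patterns.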
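For Python 3, the same idea can be sketched roughly as follows. The pattern is the one from the question; the helper name parse_path and the variable ascii_pattern are hypothetical names introduced only for this illustration. The sketch assumes the source file is saved as UTF-8 (the Python 3 default) and returns None on a failed match instead of raising the AttributeError seen in the question.

```python
import re

# Pattern taken from the question. For str patterns in Python 3,
# \w is Unicode-aware by default, so no flag is needed.
PATTERN = re.compile(r'^/by_tag/(?P<tag>\w+)/(?P<filename>(\w|[.,!#%{}()@])+)$')


def parse_path(path):
    """Hypothetical helper: return the named groups, or None if there is no match."""
    m = PATTERN.match(path)
    # re.match returns None on failure, so check before calling groupdict().
    return m.groupdict() if m else None


result = parse_path('/by_tag/påske/øyfjell.jpg')
print(result)

# For contrast: with re.ASCII, \w only matches [a-zA-Z0-9_],
# so the same path no longer matches.
ascii_pattern = re.compile(PATTERN.pattern, re.ASCII)
print(ascii_pattern.match('/by_tag/påske/øyfjell.jpg'))
```

The re.ASCII variant is included only to show which behaviour the flag controls; it roughly reproduces the failed match from the question, where \w did not treat å and ø as word characters.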

This answer is adapted from a similar question on Stack Overflow: https://stackoverflow.com/questions/5028717/
