python - Treating an emoji as a single character in a regular expression

Tags: python regex python-2.7 python-unicode unicode-literals

Here is a small example:

reg = ur"((?P<initial>[+\-👍])(?P<rest>.+?))$"

(In both cases the file has -*- coding: utf-8 -*-)

In Python 2:

re.match(reg, u"👍hello").groupdict()
# => {u'initial': u'\ud83d', u'rest': u'\udc4dhello'}
# unicode why must you do this

And in Python 3:

re.match(reg, "👍hello").groupdict()
# => {'initial': '👍', 'rest': 'hello'}

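The Python 3 behavior above is exactly what I want, but switching to Python 3 is not possible right now. What is the best way to reproduce the Python 3 result in Python 2, in a way that works on both narrow and wide Python builds? The 👍 seems to arrive in the form "\ud83d\udc4d", which is what makes this tricky.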

Best answer

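On a narrow build of Python 2, a non-BMP character is stored as two surrogate code points, so it cannot be used correctly inside [] syntax. u'[👍]' is equivalent to u'[\ud83d\udc4d]', which means "match either \ud83d or \udc4d". Python 2.7 example: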

>>> u'\U0001f44d' == u'\ud83d\udc4d' == u'👍'
True
>>> re.findall(u'[👍]',u'👍')
[u'\ud83d', u'\udc4d']
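
The pair \ud83d\udc4d shown above follows from the UTF-16 encoding rule for code points above U+FFFF. As a minimal sketch of that arithmetic (the helper names `to_surrogate_pair` and `is_narrow_build` are hypothetical, not part of the standard library), the following may help when checking which code units a given emoji becomes on a narrow build:

```python
import sys

def to_surrogate_pair(code_point):
    """Split a non-BMP code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above U+FFFF have a surrogate pair")
    offset = code_point - 0x10000        # 20-bit value
    high = 0xD800 + (offset >> 10)       # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)      # bottom 10 bits
    return high, low

def is_narrow_build():
    """True on a narrow build, where sys.maxunicode is 0xFFFF."""
    return sys.maxunicode == 0xFFFF

high, low = to_surrogate_pair(0x1F44D)   # U+1F44D THUMBS UP SIGN
print(hex(high), hex(low))               # 0xd83d 0xdc4d
print(is_narrow_build())
```

On Python 3.3 and later `sys.maxunicode` is always 0x10FFFF, so `is_narrow_build()` can only return True on older interpreters such as a narrow Python 2.7.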

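To fix this for both Python 2 and 3, match u'👍' or [+-] as alternatives instead of placing the emoji inside the character class. This returns the correct result in both Python 2 and 3: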

#coding:utf8
from __future__ import print_function
import re

# Note the 'ur' syntax is an error in Python 3, so properly
# escape backslashes in the regex if needed.  In this case,
# the backslash was unnecessary.
reg = u"((?P<initial>👍|[+-])(?P<rest>.+?))$"

tests = u'👍hello',u'-hello',u'+hello',u'\\hello'
for test in tests:
    m = re.match(reg,test)
    if m:
        print(test,m.groups())
    else:
        print(test,m)

Output (Python 2.7):

👍hello (u'\U0001f44dhello', u'\U0001f44d', u'hello')
-hello (u'-hello', u'-', u'hello')
+hello (u'+hello', u'+', u'hello')
\hello None

Output (Python 3.6):

👍hello ('👍hello', '👍', 'hello')
-hello ('-hello', '-', 'hello')
+hello ('+hello', '+', 'hello')
\hello None
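
If the set of leading characters grows, the same alternation idea could be generalized. The sketch below assumes a hypothetical helper `one_of` (not from the answer) that joins each escaped character with `|` instead of building a [] class, so a character stored as a surrogate pair is still matched as a unit:

```python
import re

def one_of(chars):
    """Build a regex fragment that matches exactly one of the given characters.

    Alternation is used instead of a [] class so that a character made of
    two surrogate code units (narrow Python 2) is matched as a whole.
    """
    return u"(?:" + u"|".join(re.escape(c) for c in chars) + u")"

# u"\U0001f44d" is the thumbs-up emoji from the question.
reg = u"((?P<initial>" + one_of([u"+", u"-", u"\U0001f44d"]) + u")(?P<rest>.+?))$"

for test in (u"\U0001f44dhello", u"-hello", u"+hello", u"\\hello"):
    m = re.match(reg, test)
    print(m.groupdict() if m else None)
```

The u"" prefixes keep the snippet valid on Python 2.7 and on Python 3.3 or later, and `re.escape` takes care of characters such as `+` that are special outside a character class.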

This question and answer on treating an emoji as a single character in a regular expression come from Stack Overflow: https://stackoverflow.com/questions/48274890/
