python - How do I express this regular expression without getting a "bad character range" error?

Is there a better way?

$ python
Python 2.7.9 (default, Jul 16 2015, 14:54:10)
[GCC 4.1.2 20080704 (Red Hat 4.1.2-55)] on linux2

Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> re.sub(u'[\U0001d300-\U0001d356]', "", "")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/fast/services/lib/python2.7/re.py", line 155, in sub
    return _compile(pattern, flags).sub(repl, string, count)
  File "/home/fast/services/lib/python2.7/re.py", line 251, in _compile
    raise error, v # invalid expression
sre_constants.error: bad character range

Best answer

Python narrow and wide build (Python versions below 3.3)

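The error indicates that you are using a "narrow" (UCS-2) build, which only supports Unicode code points up to 65535 as one "Unicode character"¹. Characters whose code points are above 65535 are represented as surrogate pairs, which means that the Unicode string u'\U0001d300' consists of two "Unicode characters" in a narrow build.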

Python 2.7.8 (default, Jul 25 2014, 14:04:36)
[GCC 4.8.3] on cygwin
>>> import sys; sys.maxunicode
65535
>>> len(u'\U0001d300')
2
>>> [hex(ord(i)) for i in u'\U0001d300']
['0xd834', '0xdf00']
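
The two code units shown above follow from the standard UTF-16 surrogate arithmetic. As a minimal sketch (the helper name `surrogate_pair` is hypothetical and not part of the original answer), the split can be reproduced like this:

```python
def surrogate_pair(code_point):
    """Split an astral code point into its UTF-16 (high, low) surrogates.

    Hypothetical helper for illustration; it only mirrors the UTF-16
    encoding rule and is not a function from the Python standard library.
    """
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not an astral code point: %#x" % code_point)
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


# Same values as the narrow-build session above.
print([hex(unit) for unit in surrogate_pair(0x1D300)])  # ['0xd834', '0xdf00']
```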

In a "wide" (UCS-4) build, every code point up to 1114111 (U+10FFFF) is recognized as a Unicode character, so the Unicode string u'\U0001d300' consists of exactly one "Unicode character"/Unicode code point.

Python 2.6.6 (r266:84292, May  1 2012, 13:52:17)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
>>> import sys; sys.maxunicode
1114111
>>> len(u'\U0001d300')
1
>>> [hex(ord(i)) for i in u'\U0001d300']
['0x1d300']

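¹ I use "Unicode character" (in quotes) to refer to one character in a Python Unicode string, not one Unicode code point. The number of "Unicode characters" in a string is the len() of the string. In a "narrow" build, one "Unicode character" is a 16-bit code unit of UTF-16, so one astral character appears as two "Unicode characters". In a "wide" build, one "Unicode character" always corresponds to one Unicode code point.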

Matching astral plane characters with regex

Wide build

The regex in the question compiles correctly in a "wide" build:

Python 2.6.6 (r266:84292, May  1 2012, 13:52:17)
[GCC 4.4.6 20110731 (Red Hat 4.4.6-3)] on linux2
>>> import re; re.compile(u'[\U0001d300-\U0001d356]', re.DEBUG)
in
  range (119552, 119638)
<_sre.SRE_Pattern object at 0x7f9f110386b8>

Narrow build

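However, the same regex does not work in a "narrow" build, because the engine does not recognize surrogate pairs. It just treats \ud834 as one character, then tries to create a character range from \udf00 to \ud834 and fails, since the start of that range is greater than its end.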

Python 2.7.8 (default, Jul 25 2014, 14:04:36)
[GCC 4.8.3] on cygwin
>>> [hex(ord(i)) for i in u'[\U0001d300-\U0001d356]']
['0x5b', '0xd834', '0xdf00', '0x2d', '0xd834', '0xdf56', '0x5d']

The workaround is to use the same method as in ECMAScript: construct the regex so that it matches the surrogates that represent the code point.

Python 2.7.8 (default, Jul 25 2014, 14:04:36)
[GCC 4.8.3] on cygwin
>>> import re; re.compile(u'\ud834[\udf00-\udf56]', re.DEBUG)
literal 55348
in
  range (57088, 57174)
<_sre.SRE_Pattern object at 0x6ffffe52210>
>>> input =  u'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'
>>> input
u'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'
>>> re.sub(u'\ud834[\udf00-\udf56]', '', input)
u'Sample . Another . Leave alone \U00011000'
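
The session above hard-codes the narrow-build form of the pattern. If the same source file has to run on both kinds of build, one possible approach is to pick the pattern from `sys.maxunicode`. This is only a sketch; the names `ASTRAL_RANGE` and `strip_astral_range` are made up for illustration and do not come from the original answer.

```python
import re
import sys

# Choose the pattern form that the running interpreter can compile.
if sys.maxunicode > 0xFFFF:
    # Wide build (or Python 3.3+): one string element per code point.
    ASTRAL_RANGE = re.compile(u'[\U0001d300-\U0001d356]')
else:
    # Narrow build: spell out the UTF-16 surrogate pair explicitly.
    ASTRAL_RANGE = re.compile(u'\ud834[\udf00-\udf56]')


def strip_astral_range(text):
    """Remove every character in U+1D300..U+1D356 from a Unicode string."""
    return ASTRAL_RANGE.sub(u'', text)


sample = u'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'
print(repr(strip_astral_range(sample)))
```

The branch is evaluated once at import time, so callers never need to know which build they are running on.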

Using regexpu to derive astral plane regex for Python narrow build

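Since the construct for matching astral plane characters in a Python narrow build is the same as in ES5, you can use regexpu, a tool that converts RegExp literals from ES6 to ES5, to do the conversion for you.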

Just paste the equivalent regex in ES6 (note the u flag and the \u{hh...h} syntax):

/[\u{1d300}-\u{1d356}]/u

and you get back a regex that can be used in both a Python narrow build and ES5:

/(?:\uD834[\uDF00-\uDF56])/

Take care to remove the / delimiters of the JavaScript RegExp literal when you use the regex in Python.

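The tool is extremely useful when the range spans multiple high surrogates (U+D800 to U+DBFF). For example, suppose we want to match the character range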

/[\u{105c0}-\u{1cb40}]/u

The equivalent regex in a Python narrow build and ES5 is

/(?:\uD801[\uDDC0-\uDFFF]|[\uD802-\uD831][\uDC00-\uDFFF]|\uD832[\uDC00-\uDF40])/

which is quite complex and error-prone to derive by hand.
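
To make the derivation less mysterious, here is a rough sketch of how such an alternation can be built for a single astral range. It is not regexpu's actual implementation, and the function names are hypothetical. The idea is to split the range into a partial block under the first high surrogate, a run of complete blocks, and a partial block under the last high surrogate.

```python
def surrogate_pair(code_point):
    """Return the UTF-16 (high, low) surrogates of an astral code point."""
    offset = code_point - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


def _escape(unit):
    # Emit the code unit as a \uXXXX escape, as regexpu does.
    return '\\u%04X' % unit


def _unit_class(first, last):
    # A single unit needs no brackets; otherwise build a range class.
    if first == last:
        return _escape(first)
    return '[%s-%s]' % (_escape(first), _escape(last))


def astral_range_pattern(start, end):
    """Build surrogate-pair regex text for the range start..end (sketch)."""
    if not 0x10000 <= start <= end <= 0x10FFFF:
        raise ValueError("expected an astral range with start <= end")
    high1, low1 = surrogate_pair(start)
    high2, low2 = surrogate_pair(end)
    if high1 == high2:
        parts = [_escape(high1) + _unit_class(low1, low2)]
    else:
        parts, tail = [], None
        if low1 != 0xDC00:   # partial block under the first high surrogate
            parts.append(_escape(high1) + _unit_class(low1, 0xDFFF))
            high1 += 1
        if low2 != 0xDFFF:   # partial block under the last high surrogate
            tail = _escape(high2) + _unit_class(0xDC00, low2)
            high2 -= 1
        if high1 <= high2:   # complete blocks in between
            parts.append(_unit_class(high1, high2) + _unit_class(0xDC00, 0xDFFF))
        if tail is not None:
            parts.append(tail)
    return '(?:' + '|'.join(parts) + ')'


print(astral_range_pattern(0x105C0, 0x1CB40))
```

The function returns the pattern as text containing \uXXXX escapes. On a Python 2 narrow build, that text would need to be placed in a non-raw u'...' literal so that the escapes become actual code units before the pattern reaches the re module.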

Python 3.3 and above

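Python 3.3 implements PEP 393, which eliminates the distinction between narrow and wide builds; from then on, Python behaves like a wide build. This eliminates the problem in the question altogether.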
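
As a quick check of that claim, the following sketch (assuming Python 3.3 or later, and reusing the sample string from the narrow-build session) passes the original pattern from the question to re.sub unchanged:

```python
import re

# On Python 3.3+ every astral character is a single string element.
assert len('\U0001d300') == 1

text = 'Sample \U0001d340. Another \U0001d305. Leave alone \U00011000'

# The character class from the question compiles and matches as written.
cleaned = re.sub('[\U0001d300-\U0001d356]', '', text)
print(ascii(cleaned))
```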

Compatibility issues

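While it is possible to work around the problem and match astral plane characters in a Python narrow build, going forward it is best to change the execution environment by using a Python wide build, or to port the code to Python 3.3 and above.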

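The regex code for a narrow build is hard for the average programmer to read and maintain, and it has to be completely rewritten when porting to Python 3.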

References

How to find out if Python is compiled with UCS-2 or UCS-4?

Regarding python - How do I express this regular expression without getting a "bad character range" error?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31603075/