python - How to portably parse the (Unicode) degree symbol with a regular expression?

I'm writing a simple regular expression parser for the output of the sensors utility on Ubuntu. Here is an example of a line of text I'm parsing:

temp1:        +31.0°C  (crit = +107.0°C)

Here is the regular expression I'm using to match it (in Python):

temp_re = re.compile(r'(temp1:)\s+(\+|-)(\d+\.\d+)\W\WC\s+' 
                     r'\(crit\s+=\s+(\+|-)(\d+\.\d+)\W\WC\).*')

This code works as expected and matches the example text given above. The only bits I'm really interested in are the numbers, so this part:

(\+|-)(\d+\.\d+)\W\WC

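starts by matching the + or - sign and ends by matching °C.

My question is: why does it take two \W (non-alphanumeric) characters to match ° rather than one? Will the code break on systems where Unicode is represented differently from mine? If so, how can I make it portable?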

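Best answer

A possible portable solution:

Convert the input data to unicode and use the re.UNICODE flag in the regular expression (the code below is Python 2).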

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import re


data = u'temp1:        +31.0°C  (crit = +107.0°C)'
temp_re = re.compile(ur'(temp1:)\s+(\+|-)(\d+\.\d+)°C\s+' 
                     ur'\(crit\s+=\s+(\+|-)(\d+\.\d+)°C\).*', flags=re.UNICODE)

print temp_re.findall(data)

Output

[(u'temp1:', u'+', u'31.0', u'+', u'107.0')]
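
For readers on Python 3, the same idea can be sketched as follows. This is an adaptation, not the answer's original code: it assumes Python 3, where string literals are already Unicode and str patterns match Unicode by default, so the u/ur prefixes and the re.UNICODE flag are not needed. The character class [+-] is a stylistic substitute for (\+|-).

```python
import re

# Python 3: str literals are Unicode, and str patterns are Unicode-aware by default.
data = 'temp1:        +31.0°C  (crit = +107.0°C)'

# Same structure as the pattern above, with the degree sign written literally.
temp_re = re.compile(r'(temp1:)\s+([+-])(\d+\.\d+)°C\s+'
                     r'\(crit\s+=\s+([+-])(\d+\.\d+)°C\).*')

matches = temp_re.findall(data)
print(matches)  # [('temp1:', '+', '31.0', '+', '107.0')]
```

The pattern only works this way if data is text (str); raw bytes from a subprocess still have to be decoded first.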

Edit

@netvope already pointed this out in the comments on the question.

Update

Notes from J.F. Sebastian's comments about input encoding:

check_output() returns binary data that sometimes can be text (that should have a known character encoding in this case and you can convert it to Unicode). Anyway ord(u'°') == 176 so it can not be encoded using ASCII encoding.
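
A small Python 3 sketch of why the original pattern needed two \W: it assumes the sensors output arrived as UTF-8 bytes, in which the degree sign takes two bytes, each of which is a non-word byte. The sample line and variable names here are illustrative only.

```python
import re

degree = '\u00b0'                        # the degree sign, code point 176
utf8_bytes = degree.encode('utf-8')      # two bytes in UTF-8
latin1_bytes = degree.encode('latin-1')  # one byte in Latin-1
print(ord(degree), utf8_bytes, latin1_bytes)

# Matching on undecoded UTF-8 bytes: each of the two bytes needs its own \W.
raw = '+31.0\u00b0C'.encode('utf-8')
two = re.search(rb'(\d+\.\d+)\W\WC', raw)   # matches
one = re.search(rb'(\d+\.\d+)\WC', raw)     # does not match

# With a single-byte encoding the situation is reversed.
raw_latin1 = '+31.0\u00b0C'.encode('latin-1')
two_latin1 = re.search(rb'(\d+\.\d+)\W\WC', raw_latin1)  # does not match
one_latin1 = re.search(rb'(\d+\.\d+)\WC', raw_latin1)    # matches
```

So a byte-level pattern with \W\W is tied to one particular encoding of the input, which is the portability problem the question asks about.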

So, to decode the input data to unicode, basically* you should use the encoding from the system locale, obtained with locale.getpreferredencoding(), e.g.:

data = subprocess.check_output(...).decode(locale.getpreferredencoding())

With correctly decoded data:

you'll get the same output without re.UNICODE in this case.
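
The decoding step can be sketched in Python 3 like this. run_and_decode is a hypothetical helper name, and the child command is a stand-in for sensors that simply writes a UTF-8 encoded line; neither comes from the original answer.

```python
import locale
import subprocess
import sys


def run_and_decode(cmd, encoding=None):
    """Run cmd and return its stdout as text (hypothetical helper)."""
    raw = subprocess.check_output(cmd)  # check_output() returns bytes
    if encoding is None:
        # Default to the system locale's encoding, as suggested above.
        encoding = locale.getpreferredencoding()
    return raw.decode(encoding)


# Stand-in for `sensors`: a child process that writes UTF-8 bytes to stdout.
child = [sys.executable, '-c',
         "import sys; sys.stdout.buffer.write('+31.0\\u00b0C'.encode('utf-8'))"]

# The encoding is passed explicitly because this child is known to emit UTF-8.
text = run_and_decode(child, encoding='utf-8')
print(ascii(text))
```

Leaving encoding as None falls back to the locale's preferred encoding, which is the right default only when the child process actually writes in that encoding.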

Why basically? Because on a Russian Win7 with cp1251 as the preferred encoding, if we have, for example, a script.py that encodes its output as utf-8:

#!/usr/bin/env python
# -*- coding: utf8 -*-

print u'temp1: +31.0°C  (crit = +107.0°C)'.encode('utf-8')

and we need to parse its output, then:

subprocess.check_output(['python', 
                         'script.py']).decode(locale.getpreferredencoding())

will produce the wrong result: 'В°' instead of °.

So, in some cases you need to know the encoding of the input data.
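
The mismatch can be reproduced without a Russian Windows machine by decoding UTF-8 bytes with cp1251 directly. This is a Python 3 sketch; the variable names are illustrative.

```python
# Bytes as script.py would emit them: text encoded as UTF-8.
utf8_output = 'temp1: +31.0\u00b0C  (crit = +107.0\u00b0C)'.encode('utf-8')

# Decoding with the wrong (locale) encoding turns each degree sign into two characters:
# byte 0xC2 becomes Cyrillic 'В' (U+0412) and byte 0xB0 stays '°' (U+00B0).
wrong = utf8_output.decode('cp1251')

# Decoding with the encoding the producer actually used restores the text.
right = utf8_output.decode('utf-8')

print(ascii(wrong))
print(ascii(right))
```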

Regarding "python - How to portably parse the (Unicode) degree symbol with a regular expression?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/8952430/
