python - string 字符串中 utf-8 单词的索引

通过以下代码，我在 mac 和 ubuntu 上获得了不同的索引值。两者都是 64 位机器，运行 python 2.7.8。 messages.json 文件有一个字符串，开头有一些 utf-8 字符。文件内容为:

 🌟🌟🌟🌟🌟🌟🌟🌟🌟 #Bangalore fine dinning table bookings in best price ⚡⚡⚡⚡⚡⚡⚡⚡⚡

Python代码如下:

import re

f = open('messages.json', 'r')
text = f.read().decode('UTF-8')
f.close()

print type(text)

for m in re.finditer('#Bangalore', text): 
    s = m.start()
    e = m.end()
    print s, e
    print text[s:e]

在 Ubuntu 上:

<type 'unicode'>
11 21
#Bangalore

在 Mac 上:

<type 'unicode'>
20 30
#Bangalore

最佳答案

问题是您的字符串包含大于 0xFFFF 的代码点(“星体”字符)。 Python(3.3 之前)有两个版本:“窄”和“宽”。窄版本仅支持 16 位 unicode，并且需要两个星体单位:

Python 2.7.5 (default, Mar  9 2014, 22:15:05) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.0.68)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
65535
>>> s = u'🌟#Bangalore'
>>> s.index('#')
2

“wide”构建使用 32 位并用一个单位表示所有 unicode 字符:

Python 2.7.2+ (default, Jul 20 2012, 22:15:08) 
[GCC 4.6.1] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> sys.maxunicode
1114111
>>> s = u'🌟#Bangalore'
>>> s.index('#')
1

可能的解决方法是

使用现代 Python
install a wide python on OSX
重写代码，使其不需要绝对位置

关于python - string 字符串中 utf-8 单词的索引，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/29025617/

上一篇：python - 如何计算 Scipy 中多元高斯分布的概率？

下一篇：Python 字符串 : obtaining what's left after python string slice

相关文章：

java - 在数组索引处查找字符串？ java

Python编码问题(Utf-8，匈牙利语)

mysql - utf8_general_ci 和 utf8_unicode_ci 有什么区别？

c - 在 cmd 上打印到 stderr 无法打印非 ASCII UTF-8 文本的第一个字符

python - 如何使用 beautifulsoup 为 html 嵌套标签定义 findAll

python - 如何在 python 3 中以相反的顺序将行从输入文件写入输出文件

string - 如何在 Rust 中将字符串的字符串转换为字符串的数组/向量

Evernote XML 上的 Python LXML 解析错误

python - 如何防止 C 共享库在 python 中的标准输出上打印？

python:在一个字符后拆分字符串