python - UnicodeDecodeError when stemming in Python

This one has me really frustrated.

I have a list of several thousand words

x = ['company', 'arriving', 'wednesday', 'and', 'then', 'beach', 'how', 'are', 'you', 'any', 'warmer', 'there', 'enjoy', 'your', 'day', 'follow', 'back', 'please', 'everyone', 'go', 'watch', 's', 'new', 'video', 'you', 'know', 'the', 'deal', 'make', 'sure', 'to', 'subscribe', 'and', 'like', '<http>', 'you', 'said', 'next', 'week', 'you', 'will', 'be', 'the', 'one', 'picking', 'me', 'up', 'lol', 'hindi', 'na', 'tl', 'huehue', 'that', 'works', 'you', 'said', 'everyone', 'of', 'us', 'my', 'little', 'cousin', 'keeps', 'asking', 'if', 'i', 'wanna', 'play', 'and', "i'm", 'like', 'yes', 'but', 'with', 'my', 'pals', 'not', 'you', "you're", 'welcome', 'pas', 'quand', 'tu', 'es', 'vers', '<num>', 'i', 'never', 'get', 'good', 'mornng', 'texts', 'sad', 'sad', 'moment', 'i', 'think', 'ima', 'go', 'get', 'a', 'glass', 'of', 'milk', 'ahah', 'for', 'the', 'first', 'time', 'i', 'actually', 'know', 'what', 'their', 'doing', 'd', 'thank', 'you', 'happy', 'birthday', 'hope', "you're"...........]

Now, I have confirmed that the type of every element in this list is a string

types = []
for word in x:
    types.append(type(word))
print set(types)

>>>set([<type 'str'>])

Now I try to stem each word using NLTK's Porter stemmer

import nltk
porter = nltk.PorterStemmer()
stemmed_x = [porter.stem(word) for word in x]

I get this error, which is apparently related somehow to the stemming package and unicode:

File "/Library/Python/2.7/site-packages/nltk-3.0.0b2-py2.7.egg/nltk/stem/porter.py", line 633, in stem
    stem = self.stem_word(word.lower(), 0, len(word) - 1)
  File "/Library/Python/2.7/site-packages/nltk-3.0.0b2-py2.7.egg/nltk/stem/porter.py", line 591, in stem_word
    word = self._step1ab(word)
  File "/Library/Python/2.7/site-packages/nltk-3.0.0b2-py2.7.egg/nltk/stem/porter.py", line 289, in _step1ab
    if word.endswith("ied"):
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 12: ordinal not in range(128)

I have tried everything: using codecs.open, explicitly encoding each word as utf8. It still produces the same error.

Please advise.

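Edit:

I should have mentioned that this code runs perfectly on my PC running Ubuntu. I recently got a MacBook Pro, and that is where this error appears. I checked the terminal settings on my Mac, and it is set to utf8 encoding.

Edit 2:

Interestingly, with this bit of code I have isolated the problem words: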

for w in x:
    try:
        porter.stem(w)
    except UnicodeDecodeError:
        print w 

#sagittarius”
#instadane…
#bleedblue”
#précieux
#على_شرفة_الماضي
#exploringsf…
#fishing…
#sindhubestfriend…
#الإستعداد_لإنهيار_ال_سعود
#jaredpreslar…
#femalepains”
#gobillings”
#juicing…
#instamood…

It seems that what they have in common is extra punctuation at the end of the word, except for the word #précieux

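Best answer

You probably have a multi-byte UTF-8 character lurking in the data, because 0xe2 is one of the possible first bytes of a 16-bit code point encoded as UTF-8 (a three-byte sequence). The trailing ” and … in your problem words are both such characters: they encode as E2 80 9D and E2 80 A6. Since your program assumes ASCII characters, whose valid encoded values run from 0x00 to 0x7F, that byte is rejected.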
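To see the failure outside NLTK, here is a minimal sketch that decodes one of the problem words as ASCII by hand. It assumes the word reached the program as UTF-8 bytes; the `raw` value is my reconstruction of `#sagittarius”`, not something taken from the poster's file. It should run on either Python 2 or Python 3.

```python
# -*- coding: utf-8 -*-
# Assumed input: '#sagittarius' followed by U+201D (right double quote) as UTF-8.
raw = b'#sagittarius\xe2\x80\x9d'

try:
    # Python 2 performs this decode implicitly when a byte str meets unicode text.
    raw.decode('ascii')
    error = None
except UnicodeDecodeError as exc:
    error = exc  # keep a reference; Python 3 clears `exc` after the block

# Decoding with the matching codec succeeds.
text = raw.decode('utf-8')

print(error.start)  # offset of the first byte the ASCII codec rejected
print(len(text))    # number of characters after a correct decode
```

The rejected byte sits right after the twelve ASCII characters of the hashtag, which lines up with "position 12" in the traceback.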

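You may be able to identify the "bad" values with a simple comprehension and then fix them by hand (since I assume from your data that you only want to handle ASCII characters):

print [value for value in x if '\xe2' in value]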
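If stray punctuation is all that is wrong, the same check generalizes to any non-ASCII byte, and the offending bytes can be dropped before stemming. A rough sketch, assuming the words are UTF-8 byte strings as they would be on Python 2. `non_ascii_words` and `to_ascii` are hypothetical helper names, and the sample list is made up. Dropping characters is lossy (an accented letter simply disappears), so this only suits data where ASCII is all you care about.

```python
# -*- coding: utf-8 -*-
def non_ascii_words(words):
    """Return the byte strings that contain at least one byte >= 0x80."""
    return [w for w in words if any(b >= 0x80 for b in bytearray(w))]


def to_ascii(word):
    """Decode UTF-8 bytes, then drop every character outside ASCII (lossy)."""
    return word.decode('utf-8').encode('ascii', 'ignore').decode('ascii')


# Hypothetical sample; the real list has thousands of entries.
sample = [b'company', b'#fishing\xe2\x80\xa6', b'#pr\xc3\xa9cieux']

bad = non_ascii_words(sample)            # words that would trip the ASCII codec
cleaned = [to_ascii(w) for w in sample]  # text strings with only ASCII left

print(bad)
print(cleaned)
```

The cleaned values are text strings (unicode on Python 2), so passing them to `porter.stem` should no longer trigger the implicit ASCII decode.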

Regarding python - UnicodeDecodeError when stemming in Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/25565540/
