python - How to use the NLTK Snowball stemmer to stem a list of Spanish words in Python

Tags: python nltk

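I am trying to use the NLTK Snowball stemmer to stem Spanish text, but I am running into encoding problems that I don't understand.

This is the example sentence I want to process: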

En diciembre, los precios de la energía subieron un 1,4 por ciento, los de la vivienda aumentaron un 0,1 por ciento y los precios de la vestimenta se mantuvieron sin cambios, mientras que los de los automóviles nuevos bajaron un 0,1 por ciento y los de los pasajes de avión cayeron el 0,7 por ciento.

First, I read the sentence from an XML file with this code:

from nltk.stem.snowball import SnowballStemmer
import xml.etree.ElementTree as ET

stemmer = SnowballStemmer("spanish")
sentence = ET.tostring(context, encoding='utf-8', method="text").lower()

Then, after tokenizing the sentence into a list of words, I try to stem each word:

stem = stemmer.stem(words[headIndex - index])

The error comes from this line:

Traceback (most recent call last):
  File "main.py", line 150, in <module>
    main()
  File "main.py", line 142, in main
    vectorDict, vocabulary = englishXml(language)
  File "main.py", line 86, in englishXml
    stem = stemmer.stem(words[headIndex - index])
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/stem/snowball.py", line 3404, in stem
    r1, r2 = self._r1r2_standard(word, self.__vowels)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/nltk/stem/snowball.py", line 232, in _r1r2_standard
    if word[i] not in vowels and word[i-1] in vowels:
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

I also tried reading the sentence from the XML file without the 'utf-8' encoding, but then the line that calls .lower() fails:

sentence = ET.tostring(context, method="text").lower()

In that case the error becomes:

Traceback (most recent call last):
  File "main.py", line 154, in <module>
    main()
  File "main.py", line 146, in main
    vectorDict, vocabulary = englishXml(language)
  File "main.py", line 63, in englishXml
    sentence = ET.tostring(context, method="text").lower()
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1126, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 814, in write
    _serialize_text(write, self._root, encoding)
  File "/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/xml/etree/ElementTree.py", line 1006, in _serialize_text
    write(part.encode(encoding))
UnicodeEncodeError: 'ascii' codec can't encode character u'\xed' in position 18: ordinal not in range(128)

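Thanks in advance!

Best answer

Try decoding the sentence before stemming, so that the stemmer receives unicode text rather than a UTF-8 byte string: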

sentence = sentence.decode('utf8')
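
To see where the decode step fits, here is a minimal sketch of the whole pipeline. It is an illustration under assumptions, not the asker's actual code: the `context` element is built from a made-up XML fragment, `to_text` is a hypothetical helper, and whitespace splitting stands in for whatever tokenizer the question uses. The NLTK import is wrapped in a try block only so that the decoding part still runs where NLTK is not installed.

```python
import xml.etree.ElementTree as ET


def to_text(value, encoding="utf-8"):
    """Hypothetical helper: return text, decoding byte strings if needed."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Made-up fragment standing in for the question's `context` element.
xml_bytes = u"<context>Los precios de la energ\u00eda subieron</context>".encode("utf-8")
context = ET.fromstring(xml_bytes)

# With encoding='utf-8', tostring returns a UTF-8 encoded byte string.
raw = ET.tostring(context, encoding="utf-8", method="text")

# Decode first, then lowercase, so accented letters are handled as characters.
sentence = to_text(raw).lower()

# Simple whitespace tokenization (an assumption; the question's tokenizer is not shown).
words = sentence.split()

try:
    from nltk.stem.snowball import SnowballStemmer

    stemmer = SnowballStemmer("spanish")
    # Every word is now text, so the stemmer never mixes bytes with unicode.
    stems = [stemmer.stem(word) for word in words]
except ImportError:
    stems = None  # NLTK is not installed; the decoding steps above still apply.
```

On Python 3, calling `ET.tostring(context, encoding="unicode", method="text")` returns text directly, which makes the explicit decode unnecessary.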

A similar question appears on Stack Overflow: https://stackoverflow.com/questions/29184783/
