classification - NLTK Perceptron Tagger "TypeError: 'LazySubsequence' object does not support item assignment"

Tags: classification nltk anaconda python-3.5 perceptron

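I am trying to use the PerceptronTagger from the nltk package in Python 3.5, but I get the error TypeError: 'LazySubsequence' object does not support item assignment.

I want to train it on data from the Brown corpus with the universal tagset.

This is the code I was running when I hit the problem.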

import nltk,math
tagged_sentences = nltk.corpus.brown.tagged_sents(categories='news',tagset='universal')
i = math.floor(len(tagged_sentences)*0.2)
testing_sentences = tagged_sentences[0:i]
training_sentences = tagged_sentences[i:]
perceptron_tagger = nltk.tag.perceptron.PerceptronTagger(load=False)
perceptron_tagger.train(training_sentences)

It does not train properly and gives the following stack trace.

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-10-61332d63d2c3> in <module>()
      1 perceptron_tagger = nltk.tag.perceptron.PerceptronTagger(load=False)
----> 2 perceptron_tagger.train(training_sentences)

/home/nathan/anaconda3/lib/python3.5/site-packages/nltk/tag/perceptron.py in train(self, sentences, save_loc, nr_iter)
    192                     c += guess == tags[i]
    193                     n += 1
--> 194             random.shuffle(sentences)
    195             logging.info("Iter {0}: {1}/{2}={3}".format(iter_, c, n, _pc(c, n)))
    196         self.model.average_weights()

/home/nathan/anaconda3/lib/python3.5/random.py in shuffle(self, x, random)
    270                 # pick an element in x[:i+1] with which to exchange x[i]
    271                 j = randbelow(i+1)
--> 272                 x[i], x[j] = x[j], x[i]
    273         else:
    274             _int = int

TypeError: 'LazySubsequence' object does not support item assignment

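The error appears to come from the shuffle function in the random module, but that doesn't seem right.

Is there anything else that could be causing the problem? Has anyone else run into this?

I am running this with Anaconda Python 3.5 on Ubuntu 16.04.1. The nltk version is 3.2.1.

Best answer

Debugging

A bit of grepping through the nltk source code turned up the answer.

The class is declared in the file site-packages/nltk/util.py.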

class LazySubsequence(AbstractLazySequence):
    """
    A subsequence produced by slicing a lazy sequence.  This slice
    keeps a reference to its source sequence, and generates its values
    by looking them up in the source sequence.
    """

After another quick test in the interpreter, I saw the following about the type() of tagged_sentences:

>>> import nltk
>>> tagged_sentences = nltk.corpus.brown.tagged_sents(categories='news',tagset='universal')
>>> type(tagged_sentences)
<class 'nltk.corpus.reader.util.ConcatenatedCorpusView'>

In the file site-packages/nltk/corpus/reader/util.py I found:

class ConcatenatedCorpusView(AbstractLazySequence):
    """
    A 'view' of a corpus file that joins together one or more
    ``StreamBackedCorpusViews<StreamBackedCorpusView>``.  At most
    one file handle is left open at any time.
    """

A final test with the random package confirmed that the problem lies in how I created tagged_sentences:

>>> import random
>>> random.shuffle(training_sentences)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-30-0b03f0366949> in <module>()
      1 import random
----> 2 random.shuffle(training_sentences)
      3 
      4 
      5 

/home/nathan/anaconda3/lib/python3.5/random.py in shuffle(self, x, random)
    270                 # pick an element in x[:i+1] with which to exchange x[i]
    271                 j = randbelow(i+1)
--> 272                 x[i], x[j] = x[j], x[i]
    273         else:
    274             _int = int

TypeError: 'LazySubsequence' object does not support item assignment
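
The same failure can be reproduced without NLTK, which helps show that the cause is the read-only container rather than the tagger. The sketch below uses a hypothetical ReadOnlyView class (not part of nltk) that, like a lazy slice, looks its items up in a source sequence and defines no __setitem__. It is an illustration under those assumptions, not a copy of the nltk implementation.

```python
import random


class ReadOnlyView:
    """Hypothetical stand-in for a lazy slice: readable, but not writable."""

    def __init__(self, source, start, stop):
        self._source = source
        self._start = start
        self._stop = stop

    def __len__(self):
        return self._stop - self._start

    def __getitem__(self, index):
        # Values are looked up in the source sequence on demand.
        if not 0 <= index < len(self):
            raise IndexError(index)
        return self._source[self._start + index]

    # No __setitem__ is defined, so `view[i] = value` raises TypeError.


source = ["s0", "s1", "s2", "s3", "s4"]
view = ReadOnlyView(source, 1, 5)

try:
    random.shuffle(view)  # shuffle swaps items in place, so it needs __setitem__
    shuffle_error = None
except TypeError as exc:
    shuffle_error = str(exc)

print(shuffle_error)

materialized = list(view)     # copy the items into a real list
random.shuffle(materialized)  # succeeds: lists support item assignment
```

Reading from the view works, so iteration and indexing give no hint of trouble; only an in-place operation such as random.shuffle exposes the missing item assignment.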

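Solution

Slicing the corpus view returns a LazySubsequence, which can be read but not modified, while random.shuffle swaps elements in place. To fix the error, explicitly build a list of the sentences from nltk.corpus.brown; random can then shuffle the data correctly.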

import nltk,math
# explicitly make list, then LazySequence will traverse all items
tagged_sentences = [sentence for sentence in nltk.corpus.brown.tagged_sents(categories='news',tagset='universal')]
i = math.floor(len(tagged_sentences)*0.2)
testing_sentences = tagged_sentences[0:i]
training_sentences = tagged_sentences[i:]
perceptron_tagger = nltk.tag.perceptron.PerceptronTagger(load=False)
perceptron_tagger.train(training_sentences)
# no error, yea!

Tagging now works as expected.

>>> perceptron_tagger_preds = []
>>> for test_sentence in testing_sentences:
...    perceptron_tagger_preds.append(perceptron_tagger.tag([word for word,_ in test_sentence]))
>>> print(perceptron_tagger_preds[676])
[('Formula', 'NOUN'), ('is', 'VERB'), ('due', 'ADJ'), ('this', 'DET'), ('week', 'NOUN')]
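
The fix can also be wrapped in a small helper so that the test/train split always yields plain lists. The function name split_as_lists and the toy data are made up for this sketch; with NLTK, the argument would be the result of nltk.corpus.brown.tagged_sents(...) as in the code above.

```python
import math
import random


def split_as_lists(sentences, test_fraction=0.2):
    """Copy any iterable of sentences into a list, then split it.

    Returns (testing, training). Both are ordinary lists, so they
    support the in-place shuffling done during training.
    """
    materialized = list(sentences)
    cut = math.floor(len(materialized) * test_fraction)
    return materialized[:cut], materialized[cut:]


# Toy data standing in for tagged sentences (a tuple, so it is read-only).
toy_sentences = tuple([("word%d" % n, "NOUN")] for n in range(10))

testing, training = split_as_lists(toy_sentences)
random.shuffle(training)  # no TypeError: training is a real list
```

Materializing the whole corpus gives up the memory savings of the lazy view, which is usually acceptable for a corpus the size of the Brown news category.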

This question, NLTK Perceptron Tagger "TypeError: 'LazySubsequence' object does not support item assignment", comes from Stack Overflow: https://stackoverflow.com/questions/39622121/
