python - NLTK BigramTagger does not tag half of the sentence

Can someone explain the behaviour of NLTK's BigramTagger in these examples?

I instantiated the tagger with

bi = BigramTagger(brown.tagged_sents(categories='news')[:500])

Now I want to use it on a specific sentence.

>>> bi.tag(brown_sents[2])
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', u'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]

That works fine, but hey, this is all known data. Let me change one word and see whether that breaks anything.

>>> sent=brown_sents[2]
>>> sent[5]
u'been'
>>> sent[5] = u'was'
>>> bi.tag(sent)
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', u'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'was', None), (u'charged', None), (u'by', None), (u'Fulton', None), (u'Superior', None), (u'Court', None), (u'Judge', None), (u'Durwood', None), (u'Pye', None), (u'to', None), (u'investigate', None), (u'reports', None), (u'of', None), (u'possible', None), (u'``', None), (u'irregularities', None), (u"''", None), (u'in', None), (u'the', None), (u'hard-fought', None), (u'primary', None), (u'which', None), (u'was', None), (u'won', None), (u'by', None), (u'Mayor-nominate', None), (u'Ivan', None), (u'Allen', None), (u'Jr.', None), (u'.', None)]

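I expected only the changed tuple to be affected, i.e. (u'been', u'BEN') becoming (u'was', None). Why is everything after it in the sentence now untagged as well? Those words were tagged relative to some other word, not 'been'.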

Any advice on working with tagged sentences would also be much appreciated.

Best answer

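When you use any of the *gramTagger classes, you have to set a backoff tagger, so that when a particular n-gram was not seen in the training data, the tagger backs off to one trained on lower-order n-grams. Without a backoff, the first unseen context is tagged None, and because that None then becomes part of the context for the next word, every following word is left untagged too. See the "Combining Taggers" section of http://www.nltk.org/book/ch05.html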
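To make the cascade concrete, here is a minimal pure-Python sketch of a bigram tagger with an optional unigram/default fallback. It is a toy re-implementation for illustration only, not NLTK's code: the function names (`train_bigram`, `train_unigram`, `tag`), the `START` marker and the five-word training sentence are all made up for this example.

```python
from collections import Counter, defaultdict

START = "<s>"  # hypothetical marker for "no previous tag" at sentence start


def train_bigram(tagged_sents):
    """Map (previous tag, word) to the tag seen most often in that context."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = START
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}


def train_unigram(tagged_sents):
    """Map each word to its most frequent tag, ignoring context."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {w: c.most_common(1)[0][0] for w, c in counts.items()}


def tag(words, bigram, unigram=None, default=None):
    """Tag words left to right; fall back to unigram/default if given."""
    tagged, prev = [], START
    for word in words:
        t = bigram.get((prev, word))        # None if the context is unseen
        if t is None and unigram is not None:
            t = unigram.get(word, default)  # the "backoff" step
        tagged.append((word, t))
        prev = t                            # a None here poisons the next context
    return tagged


# Toy training data (assumption: one short sentence stands in for a corpus).
train = [[("the", "AT"), ("jury", "NN"), ("had", "HVD"),
          ("been", "BEN"), ("charged", "VBN")]]
bigram = train_bigram(train)
unigram = train_unigram(train)

changed = ["the", "jury", "had", "was", "charged"]
no_backoff = tag(changed, bigram)
with_backoff = tag(changed, bigram, unigram, default="NN")
print(no_backoff)
print(with_backoff)
```

Without the fallback, 'was' gets None and so does 'charged', because the context (None, 'charged') never occurred in training. With the fallback, 'was' receives the default 'NN' and 'charged' is recovered from the unigram table.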

>>> from nltk import DefaultTagger, UnigramTagger, BigramTagger
>>> from nltk.corpus import brown
>>> text = brown.tagged_sents(categories='news')[:500]
>>> t0 = DefaultTagger('NN')
>>> t1 = UnigramTagger(text, backoff=t0)
>>> t2 = BigramTagger(text, backoff=t1)

>>> test_sent = brown.sents()[502]
>>> test_sent
[u'Noting', u'that', u'Plainfield', u'last', u'year', u'had', u'lost', u'the', u'Mack', u'Truck', u'Co.', u'plant', u',', u'he', u'said', u'industry', u'will', u'not', u'come', u'into', u'this', u'state', u'until', u'there', u'is', u'tax', u'reform', u'.']
>>> t2.tag(test_sent)
[(u'Noting', u'VBG'), (u'that', u'CS'), (u'Plainfield', u'NP-HL'), (u'last', u'AP'), (u'year', u'NN'), (u'had', u'HVD'), (u'lost', u'VBD'), (u'the', u'AT'), (u'Mack', 'NN'), (u'Truck', 'NN'), (u'Co.', u'NN-TL'), (u'plant', 'NN'), (u',', u','), (u'he', u'PPS'), (u'said', u'VBD'), (u'industry', 'NN'), (u'will', u'MD'), (u'not', u'*'), (u'come', u'VB'), (u'into', u'IN'), (u'this', u'DT'), (u'state', u'NN'), (u'until', 'NN'), (u'there', u'EX'), (u'is', u'BEZ'), (u'tax', 'NN'), (u'reform', 'NN'), (u'.', u'.')]

And to prove that it works on the example from your question ;P

>>> test_sent = brown.sents()[2]
>>> test_sent
[u'The', u'September-October', u'term', u'jury', u'had', u'been', u'charged', u'by', u'Fulton', u'Superior', u'Court', u'Judge', u'Durwood', u'Pye', u'to', u'investigate', u'reports', u'of', u'possible', u'``', u'irregularities', u"''", u'in', u'the', u'hard-fought', u'primary', u'which', u'was', u'won', u'by', u'Mayor-nominate', u'Ivan', u'Allen', u'Jr.', u'.']
>>> t2.tag(test_sent)
[(u'The', u'AT'), (u'September-October', u'NP'), (u'term', 'NN'), (u'jury', u'NN'), (u'had', u'HVD'), (u'been', u'BEN'), (u'charged', u'VBN'), (u'by', u'IN'), (u'Fulton', u'NP-TL'), (u'Superior', u'JJ-TL'), (u'Court', u'NN-TL'), (u'Judge', u'NN-TL'), (u'Durwood', u'NP'), (u'Pye', u'NP'), (u'to', u'TO'), (u'investigate', u'VB'), (u'reports', u'NNS'), (u'of', u'IN'), (u'possible', u'JJ'), (u'``', u'``'), (u'irregularities', u'NNS'), (u"''", u"''"), (u'in', u'IN'), (u'the', u'AT'), (u'hard-fought', u'JJ'), (u'primary', 'NN'), (u'which', u'WDT'), (u'was', u'BEDZ'), (u'won', u'VBN'), (u'by', u'IN'), (u'Mayor-nominate', u'NN-TL'), (u'Ivan', u'NP'), (u'Allen', u'NP'), (u'Jr.', u'NP'), (u'.', u'.')]
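The run above uses the unmodified sentence, while the question was about a sentence with one word swapped. The following sketch, assuming NLTK is installed, applies the same backoff chain to a modified sentence. To stay runnable without downloading the Brown corpus, it trains on a single hand-written sentence; that training data and the variable names are made up for illustration, and results on real corpus data will differ.

```python
from nltk.tag import BigramTagger, DefaultTagger, UnigramTagger

# Hypothetical toy training data standing in for brown.tagged_sents(...).
train = [[("the", "AT"), ("jury", "NN"), ("had", "HVD"),
          ("been", "BEN"), ("charged", "VBN")]]

# Bigram tagger on its own, as in the question.
plain = BigramTagger(train)

# Backoff chain as in the answer: bigram -> unigram -> default tag.
t0 = DefaultTagger("NN")
t1 = UnigramTagger(train, backoff=t0)
t2 = BigramTagger(train, backoff=t1)

original = ["the", "jury", "had", "been", "charged"]
changed = list(original)  # work on a copy so the original stays intact
changed[3] = "was"        # a word the taggers never saw in training

plain_tags = plain.tag(changed)
backoff_tags = t2.tag(changed)
print(plain_tags)    # tags stop at the unseen word
print(backoff_tags)  # every word receives a tag
```

Note that the default tag 'NN' is only a guess for the unseen word; the point is that the words after it are tagged again instead of being left as None.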

At some point, you may also run into the issue described in "Python NLTK pos_tag not returning the correct part-of-speech tag".

Source: a similar question on Stack Overflow, https://stackoverflow.com/questions/39167671/
