I'm getting started with NLTK and want to tag a Dutch sentence, but I'm having trouble specifying the corpus.
from nltk.tag import pos_tag
from nltk.tokenize import word_tokenize
from nltk.corpus import alpino
pos_tag(word_tokenize("Python is een goede data science taal."), tagset = 'alpino')
gives:
[('Python', 'UNK'),
('is', 'UNK'),
('een', 'UNK'),
('goede', 'UNK'),
('data', 'UNK'),
('science', 'UNK'),
('taal', 'UNK'),
('.', 'UNK')]
Clearly I'm not specifying the corpus correctly. I have downloaded the alpino corpus. Can anyone help me figure out how to specify it properly?
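The default nltk.pos_tag is trained on English text. As far as I can tell, its tagset argument only maps the output tags onto another tagset (for example 'universal') and does not select a training corpus, which is why every token comes back as UNK. To roll your own Dutch tagger, you have to train a new tagger on the alpino corpus.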
But note that the model will be:
An example with UnigramTagger and BigramTagger:
>>> from nltk.corpus import alpino as alp
>>> from nltk.tag import UnigramTagger, BigramTagger
>>> training_corpus = alp.tagged_sents()
>>> unitagger = UnigramTagger(training_corpus)
>>> bitagger = BigramTagger(training_corpus, backoff=unitagger)
>>> pos_tag = bitagger.tag
>>> sent = 'NLTK is een goeda taal voor NLP'.split()
>>> pos_tag(sent)
[('NLTK', None), ('is', u'verb'), ('een', u'det'), ('goeda', None), ('taal', u'noun'), ('voor', u'prep'), ('NLP', None)]
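The None tags above come from words the n-gram taggers never saw in training. One common way around that is to end the backoff chain with a DefaultTagger. Here is a minimal sketch, assuming a tiny made-up training set in place of alp.tagged_sents() (so it runs without downloading the corpus) and assuming 'noun' is a sensible guess for unknown words; both are illustrative choices, not something from the alpino documentation.

```python
from nltk.tag import DefaultTagger, UnigramTagger, BigramTagger

# Hypothetical toy data standing in for alp.tagged_sents().
toy_training = [
    [("Python", "noun"), ("is", "verb"), ("een", "det"), ("taal", "noun")],
]

# Backoff chain: bigram -> unigram -> fixed default tag.
default_tagger = DefaultTagger("noun")  # assumed fallback tag for unseen words
uni = UnigramTagger(toy_training, backoff=default_tagger)
bi = BigramTagger(toy_training, backoff=uni)

# 'NLTK' is not in the training data, so it falls through to the default tag.
tagged = bi.tag("NLTK is een taal".split())
print(tagged)
```

With the real corpus you would pass alp.tagged_sents() instead of toy_training; the chain itself stays the same.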
Using PerceptronTagger:
>>> from nltk.tag import PerceptronTagger
>>> from nltk.corpus import alpino as alp
>>> training_corpus = list(alp.tagged_sents())
>>> tagger = PerceptronTagger(load=False)  # start from an empty model, not the pretrained English one
>>> tagger.train(training_corpus)
>>> sent = 'NLTK is een goeda taal voor het leren over NLP'.split()
>>> tagger.tag(sent)
[('NLTK', u'noun'), ('is', u'verb'), ('een', u'det'), ('goeda', u'adj'), ('taal', u'noun'), ('voor', u'prep'), ('het', u'det'), ('leren', u'noun'), ('over', u'prep'), ('NLP', u'noun')]
As @WasiAhmed pointed out, here is another good example:
https://github.com/evanmiltenburg/Dutch-tagger
As @evanmiltenburg says on GitHub, try to use a faster tagger in production.
EDITED
To evaluate the tagger, you can hold out a test_set like this:
>>> from nltk.tag import PerceptronTagger
>>> from nltk.corpus import alpino as alp
>>> alp_tagged_sents = list(alp.tagged_sents())
>>> len(alp_tagged_sents)
7136
>>> last_train_sent = int(len(alp_tagged_sents) / 10 * 9)
>>> train_set = alp_tagged_sents[:last_train_sent]
>>> test_set = alp_tagged_sents[last_train_sent:]
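The slice above takes the last 10% of the corpus in file order. If the corpus happens to be ordered by topic or source, a shuffled split may give a more representative test set. A minimal sketch, assuming a hypothetical helper split_tagged_sents (not part of NLTK) and dummy data in place of the real sentences:

```python
import random

def split_tagged_sents(tagged_sents, train_fraction=0.9, seed=0):
    """Shuffle a copy of the sentences and split it into (train, test)."""
    sents = list(tagged_sents)          # copy, so the caller's list is untouched
    random.Random(seed).shuffle(sents)  # fixed seed keeps the split reproducible
    cut = int(len(sents) * train_fraction)
    return sents[:cut], sents[cut:]

# Dummy stand-in for list(alp.tagged_sents()).
dummy_sents = [[("w%d" % i, "noun")] for i in range(20)]
train_part, test_part = split_tagged_sents(dummy_sents, train_fraction=0.75)
print(len(train_part), len(test_part))
```

The resulting train_part and test_part can be passed to .train() and .evaluate() in the same way as train_set and test_set.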
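Then use the tagger.evaluate() function to get the accuracy. The input to .evaluate() is the same as the input to the .train() function, i.e. a list of sentences, where each sentence is a list of ('word', 'tag') tuples: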
>>> tagger = PerceptronTagger(load=False)
>>> tagger.train(train_set)
>>> tagger.evaluate(test_set)
0.927672285043738
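To make the number above concrete: as far as I understand it, the score is token-level accuracy, i.e. the fraction of tokens in the test set whose predicted tag equals the gold tag. Below is a rough pure-Python sketch of that computation. The helper token_accuracy and the toy lexicon tagger are made up for illustration and are not NLTK functions.

```python
def token_accuracy(tag_fn, gold_sents):
    """Fraction of tokens for which tag_fn predicts the gold tag."""
    correct = 0
    total = 0
    for gold in gold_sents:
        words = [word for word, _ in gold]  # strip the gold tags
        predicted = tag_fn(words)           # list of (word, tag) pairs
        for (_, pred_tag), (_, gold_tag) in zip(predicted, gold):
            correct += int(pred_tag == gold_tag)
            total += 1
    return correct / total if total else 0.0

# Toy lookup tagger: unknown words get None, like a tagger without backoff.
toy_lexicon = {"is": "verb", "een": "det", "taal": "noun"}

def toy_tag(words):
    return [(word, toy_lexicon.get(word)) for word in words]

toy_gold = [[("Python", "noun"), ("is", "verb"), ("een", "det"), ("taal", "noun")]]
toy_score = token_accuracy(toy_tag, toy_gold)
print(toy_score)
```

With a trained tagger you could pass tagger.tag as tag_fn and test_set as gold_sents. Also, I believe newer NLTK releases deprecate .evaluate() in favor of .accuracy(), so check which one your installed version expects.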