nlp - Output results in CoNLL format (POS tagging, Stanford POS tagger)

Tags: nlp stanford-nlp pos-tagger output-formatting outputformat

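I am trying to use the Stanford POS tagger, and I would like to know whether it can parse English text (POS tags alone would actually be enough) and output the result in CoNLL format. Is there such an option?

I am using the full 3.2.0 release of the Stanford POS tagger.

Many thanks.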

Best answer

By CoNLL format, I assume you mean the CoNLL-2000 chunking task format:

   He        PRP  B-NP
   reckons   VBZ  B-VP
   the       DT   B-NP
   current   JJ   I-NP
   account   NN   I-NP
   deficit   NN   I-NP
   will      MD   B-VP
   narrow    VB   I-VP
   to        TO   B-PP
   only      RB   B-NP
   #         #    I-NP
   1.8       CD   I-NP
   billion   CD   I-NP
   in        IN   B-PP
   September NNP  B-NP
   .         .    O

The CoNLL chunking task format has three columns:

token (i.e. the word)
POS tag
BIO (begin, inside, outside) chunk/phrase tag
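
To make the three-column layout concrete, here is a minimal Python sketch that reads text in this format into (token, POS, chunk) rows. The function name `read_conll2000` is made up for illustration, and the sketch assumes whitespace-separated columns with a blank line between sentences.

```python
from typing import List, Tuple

Row = Tuple[str, str, str]  # (token, POS tag, BIO chunk tag)


def read_conll2000(text: str) -> List[List[Row]]:
    """Split CoNLL-2000 style text into sentences of (token, pos, chunk) rows."""
    sentences: List[List[Row]] = []
    current: List[Row] = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:                 # a blank line ends the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if len(fields) != 3:
            raise ValueError(f"expected 3 columns, got {len(fields)}: {line!r}")
        current.append((fields[0], fields[1], fields[2]))
    if current:                        # last sentence may lack a trailing blank line
        sentences.append(current)
    return sentences


sample = """He        PRP  B-NP
reckons   VBZ  B-VP
the       DT   B-NP
.         .    O
"""
for token, pos, chunk in read_conll2000(sample)[0]:
    print(token, pos, chunk)
```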

Unfortunately, the Stanford MaxEnt tagger gives you only the token and POS information, not the BIO chunk information.

java -cp stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/left3words-wsj-0-18.tagger -textFile short.txt -outputFormat tsv 2> /dev/null

With the command above, the Stanford POS tagger already gives you tab-separated output; it just lacks the third column (see http://nlp.stanford.edu/software/pos-tagger-faq.shtml):

   He        PRP
   reckons   VBZ
   the       DT
   ...
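
If POS tags alone are enough and a downstream tool merely expects three columns, one option is to pad the tagger output with a placeholder. The sketch below assumes the tagger output is one `token<TAB>tag` pair per line with blank lines between sentences; the function name `tagger_tsv_to_conll` and the `_` placeholder are illustrative choices, not something the tagger provides, and the placeholder is not a real chunk tag.

```python
def tagger_tsv_to_conll(tsv_text: str, placeholder: str = "_") -> str:
    """Append a placeholder third column to token<TAB>tag lines."""
    out = []
    for line in tsv_text.splitlines():
        if not line.strip():
            out.append("")             # keep blank lines as sentence separators
            continue
        token, tag = line.split("\t")  # raises ValueError on malformed lines
        out.append(f"{token}\t{tag}\t{placeholder}")
    return "\n".join(out)


print(tagger_tsv_to_conll("He\tPRP\nreckons\tVBZ\nthe\tDT"))
```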

To get the BIO column, you need either:

a statistical chunker, or
a full parser

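See http://www-nlp.stanford.edu/links/statnlp.html for a list of chunkers and parsers. If you want to stick with Stanford tools, I suggest the Stanford Parser (http://nlp.stanford.edu/software/lex-parser.shtml), but it gives you a bracketed parse, so you will have to do some post-processing to get it into CoNLL-2000 format.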
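
As a rough idea of what that post-processing could look like, the Python sketch below walks a Penn-style bracketed parse and derives BIO tags with a simple heuristic: each token takes the label of the phrase directly above its POS node, and a chunk continues while the phrase that held the previous token is still open and has the same label. This is an approximation, not the official CoNLL-2000 conversion, so its output will differ from the gold data in many cases. The function name `bracketed_to_bio` and the `CHUNK_TYPES` set are assumptions made for illustration.

```python
import re
from typing import List, Optional, Tuple

# Phrase labels treated as chunk types (an assumption, loosely following CoNLL-2000).
CHUNK_TYPES = {"NP", "VP", "PP", "ADJP", "ADVP", "SBAR", "PRT", "CONJP", "INTJ", "LST", "UCP"}


def bracketed_to_bio(parse: str) -> List[Tuple[str, str, str]]:
    """Return (token, pos, bio) rows from a bracketed parse using a rough heuristic."""
    toks = re.findall(r"\(|\)|[^\s()]+", parse)
    rows: List[Tuple[str, str, str]] = []
    stack: List[Tuple[str, int]] = []   # open phrase nodes as (base label, node id)
    next_id = 0
    prev_node: Optional[int] = None     # id of the phrase that held the previous token
    i = 0
    while i < len(toks):
        if toks[i] == "(":
            nxt = toks[i + 1] if i + 1 < len(toks) else ")"
            if nxt in ("(", ")"):       # unlabeled node such as "( (S ...))"
                stack.append(("", next_id))
                next_id += 1
                i += 1
                continue
            is_leaf = (i + 3 < len(toks)
                       and toks[i + 2] not in ("(", ")")
                       and toks[i + 3] == ")")
            if is_leaf:                 # "(POS word)"
                pos, word = nxt, toks[i + 2]
                base, node_id = stack[-1] if stack else ("", -1)
                if base in CHUNK_TYPES:
                    open_nodes = {nid: lab for lab, nid in stack}
                    inside = open_nodes.get(prev_node) == base
                    rows.append((word, pos, ("I-" if inside else "B-") + base))
                    prev_node = node_id
                else:
                    rows.append((word, pos, "O"))
                    prev_node = None
                i += 4
            else:                       # phrase node; drop function tags like NP-SBJ
                stack.append((nxt.split("-")[0], next_id))
                next_id += 1
                i += 2
        else:
            if toks[i] == ")" and stack:
                stack.pop()
            i += 1
    return rows


tree = ("(ROOT (S (NP (PRP He)) (VP (VBZ reckons) (SBAR (S (NP (DT the) (NN deficit)) "
        "(VP (MD will) (VP (VB narrow)))))) (. .)))")
for word, pos, bio in bracketed_to_bio(tree):
    print(f"{word}\t{pos}\t{bio}")
```

The heuristic keeps nested verb phrases such as "will narrow" in one chunk, but it can also merge or split chunks where the CoNLL-2000 data would not, so treat the result as a starting point rather than a faithful conversion.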

Regarding nlp - Output results in CoNLL format (POS tagging, Stanford POS tagger), we found a similar question on Stack Overflow: https://stackoverflow.com/questions/18948712/
