I downloaded the CoreNLP server from here and followed these instructions. When I include entitymentions as an annotator:
wget --post-data 'Mark Ronson played a concert in New York.' 'localhost:9000/?properties={"tokenize.whitespace": "true", "annotators": "tokenize,ssplit,pos,entitymentions", "outputFormat": "json"}'
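the returned JSON looks like the output below. Although ner is added per token, there is no list of mentions. Any idea why?
(It's worth mentioning that corenlp.run doesn't seem to return them either; the highlighting there appears to be the result of post-processing.)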
{
"sentences": [
{
"index": 0,
"parse": "SENTENCE_SKIPPED_OR_UNPARSABLE",
"tokens": [
{
"index": 1,
"word": "Mark",
"originalText": "Mark",
"lemma": "Mark",
"characterOffsetBegin": 0,
"characterOffsetEnd": 4,
"pos": "NNP",
"ner": "PERSON"
},
{
"index": 2,
"word": "Ronson",
"originalText": "Ronson",
"lemma": "Ronson",
"characterOffsetBegin": 5,
"characterOffsetEnd": 11,
"pos": "NNP",
"ner": "PERSON"
},
{
"index": 3,
"word": "played",
"originalText": "played",
"lemma": "play",
"characterOffsetBegin": 12,
"characterOffsetEnd": 18,
"pos": "VBD",
"ner": "O"
},
{
"index": 4,
"word": "a",
"originalText": "a",
"lemma": "a",
"characterOffsetBegin": 19,
"characterOffsetEnd": 20,
"pos": "DT",
"ner": "O"
},
{
"index": 5,
"word": "concert",
"originalText": "concert",
"lemma": "concert",
"characterOffsetBegin": 21,
"characterOffsetEnd": 28,
"pos": "NN",
"ner": "O"
},
{
"index": 6,
"word": "in",
"originalText": "in",
"lemma": "in",
"characterOffsetBegin": 29,
"characterOffsetEnd": 31,
"pos": "IN",
"ner": "O"
},
{
"index": 7,
"word": "New",
"originalText": "New",
"lemma": "New",
"characterOffsetBegin": 32,
"characterOffsetEnd": 35,
"pos": "NNP",
"ner": "LOCATION"
},
{
"index": 8,
"word": "York.",
"originalText": "York.",
"lemma": "York.",
"characterOffsetBegin": 36,
"characterOffsetEnd": 41,
"pos": "NNP",
"ner": "LOCATION"
}
]
}
]
}
Best answer
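For better or worse, we currently don't output entity mentions in our outputters. The recommended workaround is to post-process the data the same way the entity mentions annotator does: consecutive spans of tokens with the same NER tag are treated as one entity mention. I believe all of the annotations on the entity mention object are also attached to its component tokens.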
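For illustration, here is a minimal Python sketch of that post-processing step, applied to one sentence of the server's JSON response. The function name entity_mentions and the keys of the returned dictionaries (text, tokenBegin, tokenEnd) are my own assumptions, not part of CoreNLP's output format. Joining words with a single space is also a simplification, and two different entities of the same type that sit directly next to each other would be merged into one mention.

```python
from itertools import groupby


def entity_mentions(sentence):
    """Group consecutive tokens that share a non-"O" NER tag into mentions.

    `sentence` is one element of the "sentences" list in the JSON response.
    The keys of the returned dicts are illustrative, not a CoreNLP format.
    """
    mentions = []
    for tag, group in groupby(sentence["tokens"], key=lambda t: t["ner"]):
        tokens = list(group)
        if tag == "O":  # "O" marks tokens outside any entity
            continue
        mentions.append({
            "text": " ".join(t["word"] for t in tokens),
            "ner": tag,
            # Token "index" is 1-based in the response; use a 0-based,
            # end-exclusive token span here.
            "tokenBegin": tokens[0]["index"] - 1,
            "tokenEnd": tokens[-1]["index"],
            "characterOffsetBegin": tokens[0]["characterOffsetBegin"],
            "characterOffsetEnd": tokens[-1]["characterOffsetEnd"],
        })
    return mentions


# Abbreviated copy of the sentence from the question (only the fields used).
sentence = {
    "index": 0,
    "tokens": [
        {"index": 1, "word": "Mark", "characterOffsetBegin": 0, "characterOffsetEnd": 4, "ner": "PERSON"},
        {"index": 2, "word": "Ronson", "characterOffsetBegin": 5, "characterOffsetEnd": 11, "ner": "PERSON"},
        {"index": 3, "word": "played", "characterOffsetBegin": 12, "characterOffsetEnd": 18, "ner": "O"},
        {"index": 4, "word": "a", "characterOffsetBegin": 19, "characterOffsetEnd": 20, "ner": "O"},
        {"index": 5, "word": "concert", "characterOffsetBegin": 21, "characterOffsetEnd": 28, "ner": "O"},
        {"index": 6, "word": "in", "characterOffsetBegin": 29, "characterOffsetEnd": 31, "ner": "O"},
        {"index": 7, "word": "New", "characterOffsetBegin": 32, "characterOffsetEnd": 35, "ner": "LOCATION"},
        {"index": 8, "word": "York.", "characterOffsetBegin": 36, "characterOffsetEnd": 41, "ner": "LOCATION"},
    ],
}

mentions = entity_mentions(sentence)
for m in mentions:
    print(m["ner"], m["text"], m["characterOffsetBegin"], m["characterOffsetEnd"])
```

Note that the last mention comes out as "York." with the period attached, because the request sets tokenize.whitespace to true and so the tokenizer never splits the period off.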
Source: "stanford-nlp - CoreNLP server doesn't return entity mentions" on Stack Overflow: https://stackoverflow.com/questions/35582020/