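stanford-nlp - Having both NER and RegexNER tags in StanfordCoreNLPServer output?

I am using StanfordCoreNLPServer to extract some information from text (e.g. surfaces, street names).

Streets are given by a specifically trained NER model, and surfaces by a simple regular expression applied through RegexNER.

Each of them works fine on its own, but when they are used together only the NER result appears in the output, under the ner tag. Why is there no regexner tag? Is there a way to get the RegexNER result as well?

For information: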

StanfordCoreNLP v3.6.0

URL used:

'http://127.0.0.1:9000/'
'?properties={"annotators":"tokenize,ssplit,pos,ner,regexner", '
'"pos.model":"edu/stanford/nlp/models/pos-tagger/french/french.tagger",'
'"tokenize.language":"fr",'
'"ner.model":"ner-model.ser.gz", ' # custom NER model with STREET labels
'"regexner.mapping":"rules.tsv", ' # SURFACE label
'"outputFormat": "json"}'
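
The same request URL can also be assembled programmatically, which avoids hand-escaping the JSON inside the query string. Below is a minimal Python 3 sketch, assuming the server listens on 127.0.0.1:9000 as above. The helper name build_corenlp_url is made up for illustration, and the property values are copied from the URL in the question. The text to annotate would then be sent as the POST body of a request to this URL.

```python
import json
from urllib.parse import urlencode

def build_corenlp_url(base_url, properties):
    """Return base_url with the properties dict JSON-encoded in the query string."""
    # build_corenlp_url is a hypothetical helper, not part of any CoreNLP client.
    return base_url + "?" + urlencode({"properties": json.dumps(properties)})

properties = {
    "annotators": "tokenize,ssplit,pos,ner,regexner",  # regexner listed after ner
    "pos.model": "edu/stanford/nlp/models/pos-tagger/french/french.tagger",
    "tokenize.language": "fr",
    "ner.model": "ner-model.ser.gz",   # custom NER model with STREET labels
    "regexner.mapping": "rules.tsv",   # SURFACE label
    "outputFormat": "json",
}

url = build_corenlp_url("http://127.0.0.1:9000/", properties)
print(url)
```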

As suggested here, the regexner annotator comes after ner, but still...

Current output (extract):

{u'index': 4, u'word': u'dans', u'lemma': u'dans', u'pos': u'P', u'characterOffsetEnd': 12, u'characterOffsetBegin': 8, u'originalText': u'dans', u'ner': u'O'}
{u'index': 5, u'word': u'la', u'lemma': u'la', u'pos': u'DET', u'characterOffsetEnd': 15, u'characterOffsetBegin': 13, u'originalText': u'la', u'ner': u'O'}
{u'index': 6, u'word': u'rue', u'lemma': u'rue', u'pos': u'NC', u'characterOffsetEnd': 19, u'characterOffsetBegin': 16, u'originalText': u'rue', u'ner': u'STREET'}
{u'index': 7, u'word': u'du', u'lemma': u'du', u'pos': u'P', u'characterOffsetEnd': 22, u'characterOffsetBegin': 20, u'originalText': u'du', u'ner': u'STREET'}
[...]
{u'index': 43, u'word': u'165', u'lemma': u'165', u'normalizedNER': u'165.0', u'pos': u'DET', u'characterOffsetEnd': 196, u'characterOffsetBegin': 193, u'originalText': u'165', u'ner': u'NUMBER'}
{u'index': 44, u'word': u'm', u'lemma': u'm', u'pos': u'NC', u'characterOffsetEnd': 198, u'characterOffsetBegin': 197, u'originalText': u'm', u'ner': u'O'}
{u'index': 45, u'word': u'2', u'lemma': u'2', u'normalizedNER': u'2.0', u'pos': u'ADJ', u'characterOffsetEnd': 199, u'characterOffsetBegin': 198, u'originalText': u'2', u'ner': u'NUMBER'}

Expected output: I would like the last 3 items to be labelled SURFACE, i.e. the RegexNER result.

Let me know if more details are needed.

Best answer

Here is what the RegexNER documentation says about this:

RegexNER will not overwrite an existing entity assignment, unless you give it permission in a third tab-separated column, which contains a comma-separated list of entity types that can be overwritten. Only the non-entity O label can always be overwritten, but you can specify extra entity tags which can always be overwritten as well.

Bachelor of (Arts|Laws|Science|Engineering|Divinity)	DEGREE
Lalor	LOCATION	PERSON
Labor	ORGANIZATION
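
The rule quoted above can be modelled in a few lines of Python. This is a deliberately simplified, per-token sketch of the documented behaviour, not CoreNLP's implementation; the function name resulting_tag is hypothetical.

```python
def resulting_tag(existing_tag, rule_tag, overwritable=()):
    """Tag a token ends up with when a RegexNER rule matches it (simplified model)."""
    # "O" may always be overwritten; any other tag only if the rule lists it
    # in its third, comma-separated column.
    if existing_tag == "O" or existing_tag in overwritable:
        return rule_tag
    return existing_tag

# Rule "Lalor<TAB>LOCATION<TAB>PERSON": PERSON is listed as overwritable.
print(resulting_tag("PERSON", "LOCATION", overwritable=["PERSON"]))  # LOCATION
# Rule "Labor<TAB>ORGANIZATION": no third column, so only "O" can be replaced.
print(resulting_tag("O", "ORGANIZATION"))       # ORGANIZATION
print(resulting_tag("PERSON", "ORGANIZATION"))  # PERSON
```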

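I'm not sure what exactly your mapping file looks like, but if it only maps entities to labels, then the original NER will tag your entities as NUMBER and RegexNER will not be able to overwrite them. If you explicitly state in the mapping file that NUMBER entities may be overwritten by SURFACE, then it should work.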
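
A mapping rule of the kind suggested here could be generated as follows. This is only a sketch: the asker's actual rules.tsv is not shown, so the token pattern [0-9]+ m 2 is a hypothetical stand-in for the surface regex. Each space-separated piece of the pattern is matched against one token. The part that matters is the third column, NUMBER, which gives RegexNER permission to overwrite that tag.

```python
# Hypothetical rule: (token pattern, label to assign, labels that may be overwritten).
rules = [("[0-9]+ m 2", "SURFACE", "NUMBER")]

# RegexNER mapping files are tab-separated, one rule per line.
with open("rules.tsv", "w", encoding="utf-8") as f:
    for rule in rules:
        f.write("\t".join(rule) + "\n")

with open("rules.tsv", encoding="utf-8") as f:
    print(repr(f.read()))  # '[0-9]+ m 2\tSURFACE\tNUMBER\n'
```

With several overwritable types, the third column would hold them as a comma-separated list, as the documentation excerpt describes.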

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/37883035/
