python - Chunking Stanford Named Entity Recognizer (NER) outputs from NLTK format

I am using NER in NLTK to find persons, locations, and organizations in sentences. I am able to produce results like this:

[(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]

Is it possible to chunk things together using it? What I want is something like this:

u'Remaking'/ u'O', u'The'/u'O', (u'Republican', u'Party')/u'ORGANIZATION'

Thanks!

Best answer

It looks long, but it does the job:

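ner_output = [(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]
# prev_tag must exist before the loop; otherwise a sentence that starts with an entity raises NameError
chunked, prev_tag = [], ""
for word_pos in ner_output:
    word, pos = word_pos
    # Same entity tag as the previous token: extend the last tuple instead of starting a new one
    if pos in ['PERSON', 'ORGANIZATION', 'LOCATION'] and pos == prev_tag:
        chunked[-1] += word_pos
    else:
        chunked.append(word_pos)
    prev_tag = pos

# Extended tuples hold (word, tag, word, tag, ...): join the words and keep the last tag
clean_chunked = [tuple([" ".join(wordpos[::2]), wordpos[-1]]) if len(wordpos)!=2 else wordpos for wordpos in chunked]

print(clean_chunked)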

[Output]:

[(u'Remaking', u'O'), (u'The', u'O'), (u'Republican Party', u'ORGANIZATION')]

More details:

The first for loop "with memory" produces something like this:

[(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION', u'Party', u'ORGANIZATION')]
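
The flattened tuple comes from using `+=` on a tuple, which concatenates rather than nests. A minimal illustration, reusing the name `chunked` from the loop and the example tokens only:

```python
# Start with the first ORGANIZATION token already appended.
chunked = [('Republican', 'ORGANIZATION')]

# Tuples are immutable, so += builds a new, longer tuple and stores it back.
chunked[-1] += ('Party', 'ORGANIZATION')

print(chunked)
# [('Republican', 'ORGANIZATION', 'Party', 'ORGANIZATION')]
```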

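You'll notice that every named entity ends up with more than 2 items in its tuple. What you want is the words joined into a single element, i.e. 'Republican Party' from (u'Republican', u'ORGANIZATION', u'Party', u'ORGANIZATION'), so you take the even-indexed elements like this: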

>>> x = [0,1,2,3,4,5,6]
>>> x[::2]
[0, 2, 4, 6]
>>> x[1::2]
[1, 3, 5]

Then you also notice that the last element in the NE tuple is the tag you want, so you do:

>>> x = (u'Republican', u'ORGANIZATION', u'Party', u'ORGANIZATION')
>>> x[::2]
(u'Republican', u'Party')
>>> x[-1]
u'ORGANIZATION'
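
Putting the two slices together gives the final (words, tag) pair for one flattened tuple. A small sketch; `collapse` is a hypothetical helper name used only for illustration and does not appear in the answer's code:

```python
def collapse(flat):
    """Turn (word, tag, word, tag, ...) into ('word word ...', tag)."""
    words = flat[::2]   # even positions hold the words
    tag = flat[-1]      # every tag in the tuple is the same, so take the last
    return (" ".join(words), tag)

x = ('Republican', 'ORGANIZATION', 'Party', 'ORGANIZATION')
print(collapse(x))
# ('Republican Party', 'ORGANIZATION')
```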

It's a bit ad hoc and verbose, but I hope it helps. And here it is as a function, Blessed Christmas:

ner_output = [(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]


def rechunk(ner_output):
    # prev_tag must exist before the loop; otherwise a sentence that starts with an entity raises NameError
    chunked, prev_tag = [], ""
    for word_pos in ner_output:
        word, pos = word_pos
        # Same entity tag as the previous token: extend the last tuple
        if pos in ['PERSON', 'ORGANIZATION', 'LOCATION'] and pos == prev_tag:
            chunked[-1] += word_pos
        else:
            chunked.append(word_pos)
        prev_tag = pos

    # Extended tuples hold (word, tag, word, tag, ...): join the words and keep the last tag
    clean_chunked = [tuple([" ".join(wordpos[::2]), wordpos[-1]])
                    if len(wordpos)!=2 else wordpos for wordpos in chunked]

    return clean_chunked


print(rechunk(ner_output))
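
For comparison, the same grouping can be sketched with `itertools.groupby` from the standard library. This is an alternative, not the answer's method; the names `rechunk_groupby` and `ENTITY_TAGS` are made up for this example. Like the loop version, it assumes that adjacent tokens sharing an entity tag belong to one entity, so two different people tagged PERSON back to back would be merged.

```python
from itertools import groupby

# Tags treated as entities, matching the list used in the answer.
ENTITY_TAGS = {'PERSON', 'ORGANIZATION', 'LOCATION'}

def rechunk_groupby(ner_output):
    chunked = []
    # groupby yields runs of consecutive (word, tag) pairs that share a tag.
    for tag, group in groupby(ner_output, key=lambda word_tag: word_tag[1]):
        words = [word for word, _ in group]
        if tag in ENTITY_TAGS:
            # One entity run becomes a single (joined words, tag) pair.
            chunked.append((" ".join(words), tag))
        else:
            # Non-entity tokens stay as individual pairs.
            chunked.extend((word, tag) for word in words)
    return chunked

ner_output = [('Remaking', 'O'), ('The', 'O'),
              ('Republican', 'ORGANIZATION'), ('Party', 'ORGANIZATION')]
print(rechunk_groupby(ner_output))
# [('Remaking', 'O'), ('The', 'O'), ('Republican Party', 'ORGANIZATION')]
```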

Regarding python - Chunking Stanford Named Entity Recognizer (NER) outputs from NLTK format, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/27629130/
