I am using NER in NLTK to find persons, locations, and organizations in sentences. I am able to produce results like this:
[(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]
Is it possible to chunk things together using this output? What I want is something like this:
u'Remaking'/ u'O', u'The'/u'O', (u'Republican', u'Party')/u'ORGANIZATION'
Thanks!
Best answer
It looks long, but it does the job:
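ner_output = [(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]
# prev_tag remembers the tag of the previous token; it must exist before the loop.
chunked, prev_tag = [], ""
for word_pos in ner_output:
    word, pos = word_pos
    # Extend the previous chunk when this token continues the same entity tag.
    if pos in ['PERSON', 'ORGANIZATION', 'LOCATION'] and pos == prev_tag:
        chunked[-1] += word_pos
    else:
        chunked.append(word_pos)
    prev_tag = pos
# Merged chunks look like (w1, tag, w2, tag, ...): join the words, keep the last tag.
clean_chunked = [tuple([" ".join(wordpos[::2]), wordpos[-1]]) if len(wordpos) != 2 else wordpos for wordpos in chunked]
print(clean_chunked)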
[out]:
[(u'Remaking', u'O'), (u'The', u'O'), (u'Republican Party', u'ORGANIZATION')]
More details:
The first for-loop "with memory" produces something like this:
[(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION', u'Party', u'ORGANIZATION')]
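You'll notice that every merged named entity has more than 2 items in its tuple. What you want are the words, i.e. 'Republican' and 'Party' in
(u'Republican', u'ORGANIZATION', u'Party', u'ORGANIZATION')
so you take the elements at even indices, like this: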
>>> x = [0,1,2,3,4,5,6]
>>> x[::2]
[0, 2, 4, 6]
>>> x[1::2]
[1, 3, 5]
Then you'll also notice that the last element in the NE tuple is the tag you want, so you would do:
>>> x = (u'Republican', u'ORGANIZATION', u'Party', u'ORGANIZATION')
>>> x[::2]
(u'Republican', u'Party')
>>> x[-1]
u'ORGANIZATION'
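Putting the two slices together, collapsing one merged tuple could be sketched as below. The helper name `collapse` is hypothetical and not part of the original answer; it assumes the tuple alternates word, tag, word, tag and that all tags in it are the same.

```python
# Hypothetical helper for illustration only.
def collapse(merged):
    """Turn (w1, tag, w2, tag, ...) into ('w1 w2 ...', tag)."""
    words = merged[::2]   # even positions hold the words
    tag = merged[-1]      # the last item is the shared tag
    return (" ".join(words), tag)

merged = (u'Republican', u'ORGANIZATION', u'Party', u'ORGANIZATION')
collapsed = collapse(merged)
print(collapsed)
```

An untouched 2-item tuple such as (u'The', u'O') passes through this helper unchanged, since joining a single word returns that word.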
It's a bit ad hoc and verbose, but I hope it helps. Here it is as a function, Blessed Christmas:
ner_output = [(u'Remaking', u'O'), (u'The', u'O'), (u'Republican', u'ORGANIZATION'), (u'Party', u'ORGANIZATION')]

def rechunk(ner_output):
    # prev_tag remembers the tag of the previous token.
    chunked, prev_tag = [], ""
    for word_pos in ner_output:
        word, pos = word_pos
        # Extend the previous chunk when this token continues the same entity tag.
        if pos in ['PERSON', 'ORGANIZATION', 'LOCATION'] and pos == prev_tag:
            chunked[-1] += word_pos
        else:
            chunked.append(word_pos)
        prev_tag = pos
    # Join the words of merged chunks and keep their tag; leave single tokens as they are.
    clean_chunked = [tuple([" ".join(wordpos[::2]), wordpos[-1]])
                     if len(wordpos) != 2 else wordpos for wordpos in chunked]
    return clean_chunked

print(rechunk(ner_output))
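As an alternative to the manual "memory" loop, the same grouping could be sketched with `itertools.groupby` from the standard library. This is a different technique from the one in the answer, shown only for comparison; the names `rechunk_groupby` and `ENTITY_TAGS` are made up for this example, and the tag set is assumed to be the same three tags used above.

```python
from itertools import groupby

# Assumed tag set, mirroring the list used in the answer's loop.
ENTITY_TAGS = {'PERSON', 'ORGANIZATION', 'LOCATION'}

def rechunk_groupby(ner_output, entity_tags=ENTITY_TAGS):
    """Merge runs of consecutive tokens that share the same entity tag."""
    chunked = []
    # groupby yields one group per run of consecutive identical tags.
    for tag, group in groupby(ner_output, key=lambda pair: pair[1]):
        words = [word for word, _ in group]
        if tag in entity_tags:
            # One entry per entity run, with the words joined by spaces.
            chunked.append((" ".join(words), tag))
        else:
            # Non-entity tokens (e.g. 'O') stay as separate entries.
            chunked.extend((word, tag) for word in words)
    return chunked

sample = [('Remaking', 'O'), ('The', 'O'),
          ('Republican', 'ORGANIZATION'), ('Party', 'ORGANIZATION')]
result = rechunk_groupby(sample)
print(result)
```

Like the loop version, this merges any adjacent tokens with the same tag, so two distinct entities of the same type that sit next to each other would be joined into one.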
Source: "python - Chunking Stanford Named Entity Recognizer (NER) outputs from NLTK format", a question on Stack Overflow: https://stackoverflow.com/questions/27629130/