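I have a news article and I want to run NER on it with DeepPavlov. The entities use the BIO tagging scheme: "B" marks the beginning of an entity, "I" stands for "inside" and is used for every word of an entity except the first, and "O" means the token is not part of any entity. The NER code is as follows: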
def listOfTuples(list1, list2):
    return list(map(lambda x, y: (x, y), list1, list2))

ner_result = []
for x in split:
    for y in split[0]:
        news_ner = ner_model([str(y)])
        teks = news_ner[0][0]
        tag = news_ner[1][0]
        ner_result.extend(listOfTuples(teks, tag))

print([i for i in ner_result if i[1] != 'O'])
The NER result looks like this:
[('KOMPAScom', 'B-ORG'), ('Kompascom', 'I-ORG'), ('IFCN', 'B-ORG'), ('-', 'I-ORG'), ('International', 'I-ORG'), ('Fact', 'I-ORG'), ('-', 'I-ORG'), ('Checking', 'I-ORG'), ('Network', 'I-ORG'), ('Kompascom', 'B-ORG'), ('49', 'B-CARDINAL'), ('IFCN', 'B-ORG'), ('Kompascom', 'B-ORG'), ('Redaksi', 'B-ORG'), ('Kompascom', 'I-ORG'), ('Wisnu', 'B-PERSON'), ('Nugroho', 'I-PERSON'), ('Jakarta', 'B-GPE'), ('Rabu', 'B-DATE'), ('17', 'I-DATE'), ('/', 'I-DATE'), ('10', 'I-DATE'), ('/', 'I-DATE'), ('2018', 'I-DATE'), ('KOMPAScom', 'B-ORG'), ('Redaksi', 'B-ORG'), ('Kompascom', 'I-ORG'), ('Wisnu', 'B-PERSON'), ('Nugroho', 'I-PERSON'), ('Kompascom', 'B-ORG'), ('Bentara', 'I-ORG'), ('Budaya', 'I-ORG'), ('Jakarta', 'I-ORG'), ('Palmerah', 'I-ORG')]
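I want to remove the B and I prefixes from the tags and merge the text of the tokens that belong to the same entity, so that the output looks like this: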
[('KOMPAScom Kompascom', 'ORG'), ('IFCN - International Fact - Checking Network', 'ORG'), ('Kompascom', 'ORG'), ('49', 'CARDINAL'), ('IFCN', 'ORG'), ('Kompascom', 'ORG'), ('Redaksi Kompascom', 'ORG'), ('Wisnu Nugroho', 'PERSON'), ('Jakarta', 'GPE'), ('Rabu 17/10/2018', 'DATE'), ('KOMPAScom', 'ORG'), ('Redaksi Kompascom', 'ORG'), ('Wisnu Nugroho', 'PERSON'), ('Kompascom Bentara Budaya Jakarta Palmerah', 'ORG')]
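Do you have any ideas?
Best answer
You can simply iterate over the tagged tokens and join the tokens that belong to the same entity. It is not elegant, but it works. Something like this: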
def collapse(ner_result):
    # List with the result
    collapsed_result = []

    # Buffer for tokens belonging to the most recent entity
    current_entity_tokens = []
    current_entity = None

    # Iterate over the tagged tokens
    for token, tag in ner_result:
        if tag == "O":
            continue
        # If an entity span starts ...
        if tag.startswith("B-"):
            # ... if we have a previous entity in the buffer, store it in the result list
            if current_entity is not None:
                collapsed_result.append(
                    (" ".join(current_entity_tokens), current_entity))
            current_entity = tag[2:]
            # The new entity has so far only one token
            current_entity_tokens = [token]
        # If the entity continues ...
        # (the None check avoids a TypeError when the input starts with an "I-" tag)
        elif current_entity is not None and tag == "I-" + current_entity:
            # Just add the token to the buffer
            current_entity_tokens.append(token)
        else:
            raise ValueError("Invalid tag order.")

    # The last entity is still in the buffer, so add it to the result
    # ... but only if there was any entity at all
    if current_entity is not None:
        collapsed_result.append(
            (" ".join(current_entity_tokens), current_entity))
    return collapsed_result
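For comparison, here is a minimal sketch of a more lenient variant, run on a few tokens taken from the output in the question. The name `collapse_lenient` is hypothetical and not part of DeepPavlov. It assumes that a stray "I-" tag (one that does not continue the current entity) should simply start a new entity instead of raising an error, which may or may not be what you want for your data.

```python
def collapse_lenient(tagged_tokens):
    """Merge BIO-tagged (token, tag) pairs into (text, label) pairs.

    Assumption: an "I-" tag that does not continue the current entity
    starts a new entity instead of raising an error.
    """
    entities = []  # each item is [list_of_tokens, label]
    for token, tag in tagged_tokens:
        if tag == "O":
            continue
        # "B-ORG" -> prefix "B", label "ORG"
        prefix, _, label = tag.partition("-")
        continues = (
            prefix == "I" and bool(entities) and entities[-1][1] == label
        )
        if continues:
            entities[-1][0].append(token)
        else:
            entities.append([[token], label])
    return [(" ".join(tokens), label) for tokens, label in entities]


# A few tagged tokens copied from the NER output in the question
sample = [
    ("Redaksi", "B-ORG"), ("Kompascom", "I-ORG"),
    ("Wisnu", "B-PERSON"), ("Nugroho", "I-PERSON"),
    ("Jakarta", "B-GPE"),
    ("Rabu", "B-DATE"), ("17", "I-DATE"), ("/", "I-DATE"),
    ("10", "I-DATE"), ("/", "I-DATE"), ("2018", "I-DATE"),
]
result = collapse_lenient(sample)
print(result)
```

Note that joining tokens with a single space gives 'Rabu 17 / 10 / 2018' for the date, not 'Rabu 17/10/2018' as in the desired output. Restoring the original spacing would need extra handling, for example the character offsets of the tokens, if they are available.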
Source: the Stack Overflow question on removing the B and I tags in NER (python): https://stackoverflow.com/questions/59638928/