python - How to extract chunks from a BIO-chunked sentence?

Tags: python list nlp text-parsing text-chunking

Given an input sentence with BIO chunk tags:

[('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'), ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'), ('?', 'O')]

I need to extract the relevant phrases. For example, to extract 'NP' I need the tuple spans that start with B-NP and continue with I-NP.

[out]:

[('What', '0'), ('the airspeed', '2-3'), ('an unladen swallow', '5-6-7')]

(Note: the numbers in the extracted tuples are the token indices.)

I tried extracting them with the following code:

def extract_chunks(tagged_sent, chunk_type):
    current_chunk = []
    current_chunk_position = []
    for idx, word_pos in enumerate(tagged_sent):
        word, pos = word_pos
        if '-'+chunk_type in pos: # Append the word to the current_chunk.
            current_chunk.append((word))
            current_chunk_position.append((idx))
        else:
            if current_chunk: # Flush the full chunk when out of an NP.
                _chunk_str = ' '.join(current_chunk) 
                _chunk_pos_str = '-'.join(map(str, current_chunk_position))
                yield _chunk_str, _chunk_pos_str 
                current_chunk = []
                current_chunk_position = []
    if current_chunk: # Flush the last chunk.
        yield ' '.join(current_chunk), '-'.join(map(str, current_chunk_position))


tagged_sent = [('What', 'B-NP'), ('is', 'B-VP'), ('the', 'B-NP'), ('airspeed', 'I-NP'), ('of', 'B-PP'), ('an', 'B-NP'), ('unladen', 'I-NP'), ('swallow', 'I-NP'), ('?', 'O')]
print (list(extract_chunks(tagged_sent, chunk_type='NP')))

But when I have adjacent chunks of the same type:

tagged_sent = [('The', 'B-NP'), ('Mitsubishi', 'I-NP'),  ('Electric', 'I-NP'), ('Company', 'I-NP'), ('Managing', 'B-NP'), ('Director', 'I-NP'), ('ate', 'B-VP'), ('ramen', 'B-NP')]

print (list(extract_chunks(tagged_sent, chunk_type='NP')))

it outputs this:

[('The Mitsubishi Electric Company Managing Director', '0-1-2-3-4-5'), ('ramen', '7')]

instead of the desired:

[('The Mitsubishi Electric Company', '0-1-2-3'), ('Managing Director', '4-5'), ('ramen', '7')]

How can I fix this in the code above?

Apart from fixing the code above, is there a better way to extract the desired chunks of a specific chunk_type?

Best answer

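Try this. It extracts the chunks of the requested chunk_type (NP by default) together with the indices of their words, and starts a new chunk at every B- tag.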

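def extract_chunks(tagged_sent, chunk_type='NP'):
    out_sen = []
    in_chunk = False  # True while the previous token belonged to a chunk_type chunk.
    for idx, (word, bio) in enumerate(tagged_sent):
        # Split 'B-NP' into boundary 'B' and tag 'NP'; a bare 'O' has no boundary.
        boundary, tag = bio.split("-", 1) if "-" in bio else ('', 'O')
        if tag != chunk_type:
            in_chunk = False
            continue
        if boundary == "I" and in_chunk:
            # Continue the chunk that is currently open.
            out_sen[-1][0] += " " + word
            out_sen[-1][-1] += "-" + str(idx)
        else:
            # 'B' (or a stray 'I' with no open chunk) starts a new chunk.
            out_sen.append([word, str(idx)])
        in_chunk = True
    return out_sen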

Demo:

>>> tagged_sent = [('The', 'B-NP'), ('Mitsubishi', 'I-NP'),  ('Electric', 'I-NP'), ('Company', 'I-NP'), ('Managing', 'B-NP'), ('Director', 'I-NP'), ('ate', 'B-VP'), ('ramen', 'B-NP')]
>>> output_sent = extract_chunks(tagged_sent)
>>> print(list(map(tuple, output_sent)))
[('The Mitsubishi Electric Company', '0-1-2-3'), ('Managing Director', '4-5'), ('ramen', '7')]
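
The question also asks how the original generator could be repaired rather than replaced. A minimal sketch of one way to do that follows, under the assumption that every tag is either 'O' or has the form 'B-XX' / 'I-XX'. The name extract_chunks_gen is only illustrative and is not part of the original post. The one behavioural change is that an open chunk is flushed when a new 'B-' tag arrives, not only when the chunk type changes.

```python
def extract_chunks_gen(tagged_sent, chunk_type):
    """Yield (chunk text, 'i-j-k' token positions) for each chunk of chunk_type."""
    words, positions = [], []
    for idx, (word, tag) in enumerate(tagged_sent):
        # 'B-NP' -> ('B', '-', 'NP'); a bare 'O' -> ('O', '', '').
        boundary, _, label = tag.partition('-')
        in_chunk = label == chunk_type and boundary in ('B', 'I')
        # Close the open chunk when we leave the chunk type OR meet a new 'B-'.
        if words and (not in_chunk or boundary == 'B'):
            yield ' '.join(words), '-'.join(map(str, positions))
            words, positions = [], []
        if in_chunk:
            words.append(word)
            positions.append(idx)
    if words:  # Flush a chunk that runs to the end of the sentence.
        yield ' '.join(words), '-'.join(map(str, positions))


tagged_sent = [('The', 'B-NP'), ('Mitsubishi', 'I-NP'), ('Electric', 'I-NP'),
               ('Company', 'I-NP'), ('Managing', 'B-NP'), ('Director', 'I-NP'),
               ('ate', 'B-VP'), ('ramen', 'B-NP')]
print(list(extract_chunks_gen(tagged_sent, chunk_type='NP')))
```

Because it stays a generator, chunks are produced lazily, which may matter when the same routine is mapped over many sentences.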

Regarding "python - How to extract chunks from a BIO-chunked sentence?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/32333312/
