python - NLTK named entity recognition on a column in a dataset

Tags: python nlp nltk named-entity-recognition

Thanks to the code by "alvas" from Named Entity Recognition with Regular Expression: NLTK, here is an example:

from nltk import ne_chunk, pos_tag
from nltk.tokenize import word_tokenize
from nltk.tree import Tree

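def get_continuous_chunks(text):
    # Tokenize, POS-tag, then chunk; named entities come back as Tree nodes.
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    continuous_chunk = []
    current_chunk = []

    for i in chunked:
        if isinstance(i, Tree):
            # Adjacent entity subtrees are merged into one name.
            current_chunk.append(" ".join(token for token, pos in i.leaves()))
        elif current_chunk:
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            # Reset even when the entity was a duplicate.
            current_chunk = []

    # Flush an entity that ends the text.
    if current_chunk:
        named_entity = " ".join(current_chunk)
        if named_entity not in continuous_chunk:
            continuous_chunk.append(named_entity)

    return continuous_chunk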

txt = 'The new GOP era in Washington got off to a messy start Tuesday as House Republicans,under pressure from President-elect Donald Trump.'
print(get_continuous_chunks(txt))

The output is:

['GOP', 'Washington', 'House Republicans', 'Donald Trump']

I replaced this text with txt = df['content'][38] from my dataset and got this result:

['Ina', 'Tori K.', 'Martin Cuilla', 'Phillip K', 'John J Lavorato']

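The dataset has many rows and a column named "content". My question is: how can I use this code to extract the names from every row of that column and store them in another column, in the corresponding row?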

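# Second attempt, using the Stanford NER tagger.
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize

# stanford_classifier and stanford_ner_path are not defined in the question;
# they must point to the Stanford NER model file and the NER jar.
st = StanfordNERTagger(stanford_classifier, stanford_ner_path, encoding='utf-8')
# Problem: text is the whole column (a pandas Series), but word_tokenize
# expects a single string.
text = df['content']
tokenized_text = word_tokenize(text)
classified_text = st.tag(tokenized_text)
print(classified_text)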

Best answer

Try apply:

df['ne'] = df['content'].apply(get_continuous_chunks)
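
To see what apply does here without needing the NLTK models, the following is a minimal sketch under stated assumptions: extract_capitalized_runs is a hypothetical stand-in for get_continuous_chunks (a naive regex, not a real entity recognizer), and the two-row DataFrame is made up for illustration.

```python
import re

import pandas as pd


def extract_capitalized_runs(text):
    """Hypothetical stand-in for get_continuous_chunks.

    Returns runs of capitalized words. It is NOT a named-entity
    recognizer; it only mimics the "string in, list of names out" shape.
    """
    return re.findall(r"[A-Z][a-z]+(?: [A-Z][a-z]+)*", text)


# Made-up data; the real dataset only needs a 'content' column of strings.
df = pd.DataFrame({
    "content": [
        "meeting with Donald Trump in Washington",
        "no names here",
    ]
})

# apply calls the function once per row and aligns the results by index,
# so each list lands in the same row as the text it came from.
df["ne"] = df["content"].apply(extract_capitalized_runs)
print(df)
```

If the column contains missing values, the function would receive NaN instead of a string; filling them first, for example with df['content'].fillna(''), is one way to avoid that.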

For the code in the second example, create a function and apply it the same way:

def my_st(text):
    tokenized_text = word_tokenize(text)
    return st.tag(tokenized_text)

df['st'] = df['content'].apply(my_st)
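
Note that st.tag returns a list of (token, tag) tuples rather than a list of names, so the 'st' column would still need a post-processing step to get names like those in the first example. A minimal sketch, assuming only PERSON tags are wanted; merge_tagged_entities is a hypothetical helper, not part of NLTK, and the sample list is hand-written rather than real tagger output:

```python
from itertools import groupby


def merge_tagged_entities(tagged, label="PERSON"):
    """Join consecutive tokens that carry `label` into single strings."""
    entities = []
    # groupby yields one group per run of identical consecutive tags.
    for tag, group in groupby(tagged, key=lambda pair: pair[1]):
        if tag == label:
            entities.append(" ".join(token for token, _ in group))
    return entities


# Hand-written sample in the (token, tag) shape that st.tag returns;
# it is illustrative, not real tagger output.
sample = [
    ("Donald", "PERSON"), ("Trump", "PERSON"), ("visited", "O"),
    ("Washington", "LOCATION"), ("with", "O"),
    ("Paul", "PERSON"), ("Ryan", "PERSON"),
]

print(merge_tagged_entities(sample))
```

This could then be applied to the tagged column, for example df['st'].apply(merge_tagged_entities). One limitation of this approach: two different people mentioned back to back with no token between them would be merged into a single name.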

Regarding python - NLTK named entity recognition on a column in a dataset, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/41456250/
