Thanks to the code by "alvas" here, Named Entity Recognition with Regular Expression: NLTK, as an example:
from nltk import ne_chunk, pos_tag
from nltk.tokenize import word_tokenize
from nltk.tree import Tree

def get_continuous_chunks(text):
    # Tokenize, POS-tag, then run NLTK's named-entity chunker
    chunked = ne_chunk(pos_tag(word_tokenize(text)))
    prev = None
    continuous_chunk = []
    current_chunk = []
    for i in chunked:
        if isinstance(i, Tree):
            # A subtree is a named entity; collect its tokens
            current_chunk.append(" ".join([token for token, pos in i.leaves()]))
        elif current_chunk:
            # A non-entity token ends the current run of entity subtrees
            named_entity = " ".join(current_chunk)
            if named_entity not in continuous_chunk:
                continuous_chunk.append(named_entity)
            # Reset even when the entity was a duplicate
            current_chunk = []
        else:
            continue
    return continuous_chunk

txt = 'The new GOP era in Washington got off to a messy start Tuesday as House Republicans,under pressure from President-elect Donald Trump.'
print(get_continuous_chunks(txt))
The output is:
['GOP', 'Washington', 'House Republicans', 'Donald Trump']
I replaced this text with: txt = df['content'][38]
and got this result from my dataset:
['Ina', 'Tori K.', 'Martin Cuilla', 'Phillip K', 'John J Lavorato']
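The dataset has many rows and a column named "content". My question is: how can I use this code to extract the names from every row of that column and store them in another column, in the corresponding row?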
import os
from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
from nltk.tree import Tree

# stanford_classifier (model file) and stanford_ner_path (NER jar) are assumed to be defined elsewhere
st = StanfordNERTagger(stanford_classifier, stanford_ner_path, encoding='utf-8')
text = df['content']
tokenized_text = word_tokenize(text)
classified_text = st.tag(tokenized_text)
print(classified_text)
Best answer
Try apply:
df['ne'] = df['content'].apply(get_continuous_chunks)
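One caveat with apply: word_tokenize expects a string, so a row whose content is missing (NaN or None) would raise an error partway through the column. Below is a minimal sketch of a guard, assuming an empty list is an acceptable result for such rows. The names skip_non_text and fake_extractor are hypothetical; fake_extractor is only a crude stand-in for get_continuous_chunks so that the snippet runs without the NLTK models.

```python
import functools

def skip_non_text(extractor):
    """Wrap an extractor so that non-string inputs yield an empty list."""
    @functools.wraps(extractor)
    def wrapper(value):
        if not isinstance(value, str):
            return []
        return extractor(value)
    return wrapper

def fake_extractor(text):
    # Hypothetical stand-in for get_continuous_chunks: keeps capitalized words
    return [word for word in text.split() if word.istitle()]

safe_extractor = skip_non_text(fake_extractor)

# Simulated column values, including two kinds of missing data
rows = ["Call Donald Trump today", float("nan"), None]
results = [safe_extractor(row) for row in rows]
print(results)
```

With the DataFrame from the question, the same wrapper could be used as df['ne'] = df['content'].apply(skip_non_text(get_continuous_chunks)).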
For the code in the second example, create a function and apply it the same way:
def my_st(text):
    tokenized_text = word_tokenize(text)
    return st.tag(tokenized_text)

df['st'] = df['content'].apply(my_st)
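Note that st.tag returns a list of (token, tag) tuples, so each cell of df['st'] holds tagged tokens, not names. A possible follow-up step is to join consecutive tokens that carry the same tag. The sketch below assumes the tagger emits labels such as PERSON, LOCATION and O, as the 3-class English model does; the question does not say which classifier is used. The function group_entities is a hypothetical helper, and the sample list is hand-written for illustration, not real tagger output.

```python
from itertools import groupby

def group_entities(tagged_tokens, wanted_tag="PERSON"):
    """Join runs of consecutive tokens tagged wanted_tag into single strings."""
    entities = []
    # groupby splits the list into runs of adjacent pairs with the same tag
    for tag, run in groupby(tagged_tokens, key=lambda pair: pair[1]):
        if tag == wanted_tag:
            entities.append(" ".join(token for token, _ in run))
    return entities

# Hand-written sample in the (token, tag) shape that st.tag returns
sample = [
    ("President-elect", "O"), ("Donald", "PERSON"), ("Trump", "PERSON"),
    ("met", "O"), ("Paul", "PERSON"), ("Ryan", "PERSON"),
    ("in", "O"), ("Washington", "LOCATION"), (".", "O"),
]

names = group_entities(sample)
places = group_entities(sample, wanted_tag="LOCATION")
print(names, places)
```

Applied to the column from the answer, this would look like df['names'] = df['st'].apply(group_entities). One limitation of this approach: two different people mentioned back to back with no token between them would be merged into one string.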
Regarding python - NLTK named entity recognition on a column in a dataset, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/41456250/