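I found a code block that is very useful in my project, but I can't get it to build a DataFrame in the same format as its printed output (2 columns).
The code block and the desired output: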
import nltk
import pandas as pd
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
# Step Two: Load Data
sentence = "Martin Luther King Jr. (born Michael King Jr.; January 15, 1929 – April 4, 1968) was an American Baptist minister and activist who became the most visible spokesman and leader in the American civil rights movement from 1955 until his assassination in 1968. King advanced civil rights through nonviolence and civil disobedience, inspired by his Christian beliefs and the nonviolent activism of Mahatma Gandhi. He was the son of early civil rights activist and minister Martin Luther King Sr."
# Step Three: Tokenise, find parts of speech and chunk words
for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))
Clean output with the label in one column and the entity in the other:
PERSON Martin
PERSON Luther King
PERSON Michael King
ORGANIZATION American
GPE American
GPE Christian
PERSON Mahatma Gandhi
PERSON Martin Luther
I tried something like this, but the result is not as clean.
df = []  # a plain list here, despite the name
for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            df.append(chunk)
Output:
[Tree('PERSON', [('Martin', 'NNP')]),
Tree('PERSON', [('Luther', 'NNP'), ('King', 'NNP')]),
Tree('PERSON', [('Michael', 'NNP'), ('King', 'NNP')]),
Tree('ORGANIZATION', [('American', 'JJ')]),
Tree('GPE', [('American', 'NNP')]),
Tree('GPE', [('Christian', 'JJ')]),
Tree('PERSON', [('Mahatma', 'NNP'), ('Gandhi', 'NNP')]),
Tree('PERSON', [('Martin', 'NNP'), ('Luther', 'NNP')])]
Is there a simple way to turn the printed format into a df with just 2 columns?
Best answer
Build a nested list and convert it to a DataFrame:
L = []
for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            # one [label, entity] row per named-entity chunk
            L.append([chunk.label(), ' '.join(c[0] for c in chunk)])
df = pd.DataFrame(L, columns=['a','b'])
print(df)
              a               b
0        PERSON          Martin
1        PERSON     Luther King
2        PERSON    Michael King
3  ORGANIZATION        American
4           GPE        American
5           GPE       Christian
6        PERSON  Mahatma Gandhi
7        PERSON   Martin Luther
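To see why the `hasattr(chunk, 'label')` check is enough, it helps to look at the shape of what `nltk.ne_chunk` returns: a tree whose children are either plain `(word, tag)` tuples or nested subtrees for named entities. The sketch below builds such a tree by hand, so it runs without downloading any NLTK models. The tree contents are made up for illustration and are not real chunker output, and the column names `label` and `entity` are my own choice rather than anything from the answer.

```python
import pandas as pd
from nltk import Tree

# Hand-built stand-in for the result of nltk.ne_chunk on one sentence
# (an assumption for illustration; real output depends on the NLTK models).
chunked = Tree('S', [
    Tree('PERSON', [('Mahatma', 'NNP'), ('Gandhi', 'NNP')]),
    ('was', 'VBD'),
    ('inspired', 'VBN'),
    Tree('GPE', [('Christian', 'JJ')]),
])

rows = []
for chunk in chunked:
    # Named-entity subtrees have .label(); plain (word, tag) tuples do not.
    if hasattr(chunk, 'label'):
        rows.append([chunk.label(), ' '.join(word for word, tag in chunk)])

# Each inner list becomes one row; descriptive column names are optional.
df = pd.DataFrame(rows, columns=['label', 'entity'])
print(df)
```

The two plain tuples are skipped and only the two entity subtrees end up as rows.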
The list comprehension version is:
L = [[chunk.label(), ' '.join(c[0] for c in chunk)]
     for sent in nltk.sent_tokenize(sentence)
     for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
     if hasattr(chunk, 'label')]
df = pd.DataFrame(L, columns=['a','b'])
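If the extraction is needed in more than one place, one option is to wrap it in a small generator. This is only a sketch: `entity_rows` is a hypothetical helper name, the sample trees are hand-built rather than real chunker output, and `isinstance(chunk, Tree)` is used here as an alternative to the `hasattr` check from the answer.

```python
import pandas as pd
from nltk import Tree

def entity_rows(chunked_sentences):
    """Yield (label, entity) for every named-entity subtree (hypothetical helper)."""
    for tree in chunked_sentences:
        for chunk in tree:
            # Only entity subtrees are Tree instances; other children are tuples.
            if isinstance(chunk, Tree):
                yield chunk.label(), ' '.join(word for word, _ in chunk)

# Hand-built stand-ins for two chunked sentences (illustrative data only).
sample = [
    Tree('S', [
        Tree('PERSON', [('Michael', 'NNP'), ('King', 'NNP')]),
        ('was', 'VBD'),
        Tree('GPE', [('American', 'NNP')]),
    ]),
    Tree('S', [
        ('He', 'PRP'),
        ('admired', 'VBD'),
        Tree('PERSON', [('Mahatma', 'NNP'), ('Gandhi', 'NNP')]),
    ]),
]

df = pd.DataFrame(list(entity_rows(sample)), columns=['label', 'entity'])
print(df)
```

With real text, the argument would be the chunked sentences themselves, for example `(nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(s))) for s in nltk.sent_tokenize(sentence))`.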
Regarding python - How to convert this oddly formatted looping print function into a DataFrame with similar output?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/70677140/