python - How do I convert this oddly formatted loop-and-print code into a DataFrame with similar output?

I found a code block that is useful for my project, but I can't get it to build a DataFrame in the same two-column format that it prints.

The code block and the desired output:

import nltk
import pandas as pd
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
 
# Step Two: Load Data
 
sentence = "Martin Luther King Jr. (born Michael King Jr.; January 15, 1929 – April 4, 1968) was an American Baptist minister and activist who became the most visible spokesman and leader in the American civil rights movement from 1955 until his assassination in 1968. King advanced civil rights through nonviolence and civil disobedience, inspired by his Christian beliefs and the nonviolent activism of Mahatma Gandhi. He was the son of early civil rights activist and minister Martin Luther King Sr."

# Step Three: Tokenise, find parts of speech and chunk words 

for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        # Named-entity chunks are Tree objects with a label(); plain tokens are (word, tag) tuples
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))

Clean output, with the label in one column and the entity in the other:

PERSON Martin
PERSON Luther King
PERSON Michael King
ORGANIZATION American
GPE American
GPE Christian
PERSON Mahatma Gandhi
PERSON Martin Luther

I tried something like this, but the result is not as clean.

df = []  # assumed to be a plain list, since the output below is a list of Tree objects
for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            df.append(chunk)

Output:

[Tree('PERSON', [('Martin', 'NNP')]),
 Tree('PERSON', [('Luther', 'NNP'), ('King', 'NNP')]),
 Tree('PERSON', [('Michael', 'NNP'), ('King', 'NNP')]),
 Tree('ORGANIZATION', [('American', 'JJ')]),
 Tree('GPE', [('American', 'NNP')]),
 Tree('GPE', [('Christian', 'JJ')]),
 Tree('PERSON', [('Mahatma', 'NNP'), ('Gandhi', 'NNP')]),
 Tree('PERSON', [('Martin', 'NNP'), ('Luther', 'NNP')])]

Is there a simple way to turn the printed format into a DataFrame with just two columns?

Best answer

Build a nested list, one [label, entity] pair per chunk, and convert it to a DataFrame:

L = []
for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            # One row per entity: the label, then the chunk's words joined by spaces
            L.append([chunk.label(), ' '.join(c[0] for c in chunk)])

df = pd.DataFrame(L, columns=['a','b'])
print(df)
              a               b
0        PERSON          Martin
1        PERSON     Luther King
2        PERSON    Michael King
3  ORGANIZATION        American
4           GPE        American
5           GPE       Christian
6        PERSON  Mahatma Gandhi
7        PERSON   Martin Luther
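
The loop only relies on each named-entity chunk having a `label()` method and yielding `(word, tag)` pairs, so the row-building step can be pulled into a small helper and checked without downloading any NLTK models. The following is only a sketch under assumptions: `chunks_to_rows` and `FakeChunk` are hypothetical names that are not part of NLTK, and the column names `label` and `entity` are my own choice (the answer uses `a` and `b`).

```python
import pandas as pd


class FakeChunk(list):
    """Hypothetical stand-in for an NLTK Tree: a list of (word, tag) pairs with a label."""

    def __init__(self, label, pairs):
        super().__init__(pairs)
        self._label = label

    def label(self):
        return self._label


def chunks_to_rows(chunks):
    """Return [label, entity text] for every item that has a label attribute."""
    return [
        [chunk.label(), " ".join(word for word, _tag in chunk)]
        for chunk in chunks
        if hasattr(chunk, "label")
    ]


# Mixed input in the shape ne_chunk yields: labelled chunks plus plain (word, tag) tuples.
sample = [
    FakeChunk("PERSON", [("Mahatma", "NNP"), ("Gandhi", "NNP")]),
    ("was", "VBD"),
    FakeChunk("GPE", [("American", "NNP")]),
]

rows = chunks_to_rows(sample)
df_sample = pd.DataFrame(rows, columns=["label", "entity"])
print(df_sample)
```

With real data, the same helper could be called once per sentence on the result of `nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))`, extending a single list of rows before building the DataFrame.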

The same thing as a list comprehension:

L = [[chunk.label(), ' '.join(c[0] for c in chunk)]
     for sent in nltk.sent_tokenize(sentence)
     for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
     if hasattr(chunk, 'label')]

df = pd.DataFrame(L, columns=['a','b'])
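
If the goal is text that looks like the original `print` loop, with no index and no header row, `DataFrame.to_string` accepts `index=False` and `header=False`. This is a small sketch rather than part of the answer: the rows are copied from the output shown in the question, and the column names are an assumption.

```python
import pandas as pd

# A few rows copied from the output shown in the question; column names are an assumption.
rows = [
    ["PERSON", "Martin"],
    ["PERSON", "Luther King"],
    ["GPE", "Christian"],
]
df_demo = pd.DataFrame(rows, columns=["label", "entity"])

# Render the frame as text without the index column and without the header row.
text = df_demo.to_string(index=False, header=False)
print(text)
```

pandas pads the columns to align them, so the spacing will differ from the `print` loop even though each line carries the same label and entity.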

A similar question on this topic can be found on Stack Overflow: https://stackoverflow.com/questions/70677140/
