python - How do I convert this oddly formatted looping print function into a DataFrame with similar output?

Tags: python pandas dataframe

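I found a code block that is useful for my project, but I can't get it to build a DataFrame in the same two-column format as the printed output.

The code block and the desired output: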

import nltk
import pandas as pd
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')
nltk.download('maxent_ne_chunker')
nltk.download('words')
 
# Step Two: Load Data
 
sentence = "Martin Luther King Jr. (born Michael King Jr.; January 15, 1929 – April 4, 1968) was an American Baptist minister and activist who became the most visible spokesman and leader in the American civil rights movement from 1955 until his assassination in 1968. King advanced civil rights through nonviolence and civil disobedience, inspired by his Christian beliefs and the nonviolent activism of Mahatma Gandhi. He was the son of early civil rights activist and minister Martin Luther King Sr."

# Step Three: Tokenise, find parts of speech and chunk words 

for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            print(chunk.label(), ' '.join(c[0] for c in chunk))

Clean output, with the label in one column and the entity in the other:

PERSON Martin
PERSON Luther King
PERSON Michael King
ORGANIZATION American
GPE American
GPE Christian
PERSON Mahatma Gandhi
PERSON Martin Luther

I tried something like the following, but the result is not as clean.

df = []  # a plain list, despite the name
for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            df.append(chunk)

Output:

[Tree('PERSON', [('Martin', 'NNP')]),
 Tree('PERSON', [('Luther', 'NNP'), ('King', 'NNP')]),
 Tree('PERSON', [('Michael', 'NNP'), ('King', 'NNP')]),
 Tree('ORGANIZATION', [('American', 'JJ')]),
 Tree('GPE', [('American', 'NNP')]),
 Tree('GPE', [('Christian', 'JJ')]),
 Tree('PERSON', [('Mahatma', 'NNP'), ('Gandhi', 'NNP')]),
 Tree('PERSON', [('Martin', 'NNP'), ('Luther', 'NNP')])]

Is there a simple way to turn the printed format into a DataFrame with just two columns?

Best answer

Build a nested list of [label, entity] pairs and convert it to a DataFrame:

L = []
for sent in nltk.sent_tokenize(sentence):
    for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent))):
        if hasattr(chunk, 'label'):
            # one row per entity: [label, words joined by spaces]
            L.append([chunk.label(), ' '.join(c[0] for c in chunk)])

df = pd.DataFrame(L, columns=['a','b'])
print(df)
              a               b
0        PERSON          Martin
1        PERSON     Luther King
2        PERSON    Michael King
3  ORGANIZATION        American
4           GPE        American
5           GPE       Christian
6        PERSON  Mahatma Gandhi
7        PERSON   Martin Luther
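
The `hasattr(chunk, 'label')` check matters because `nltk.ne_chunk` returns a sentence tree whose children are either plain `(word, tag)` tuples or `Tree` subtrees for recognized entities. The sketch below builds such a tree by hand, so it runs without the downloaded models, and uses `isinstance(chunk, Tree)` in place of `hasattr`. The sample sentence, the helper name `iter_entities`, and the column names `label` and `entity` are illustrative assumptions, not part of the original answer.

```python
import pandas as pd
from nltk.tree import Tree

# Hand-built stand-in for the result of nltk.ne_chunk on one sentence:
# entity chunks are Tree objects, every other token is a (word, tag) tuple.
sentence_tree = Tree('S', [
    Tree('PERSON', [('Mahatma', 'NNP'), ('Gandhi', 'NNP')]),
    ('was', 'VBD'),
    ('born', 'VBN'),
    ('in', 'IN'),
    Tree('GPE', [('India', 'NNP')]),
    ('.', '.'),
])

def iter_entities(tree):
    """Yield (label, text) for each entity subtree directly under `tree`."""
    for chunk in tree:
        if isinstance(chunk, Tree):  # plain (word, tag) tuples are skipped
            yield chunk.label(), ' '.join(word for word, _tag in chunk)

rows = list(iter_entities(sentence_tree))
df_entities = pd.DataFrame(rows, columns=['label', 'entity'])
print(df_entities)
```

Keeping the extraction in a generator separates "which chunks are entities" from "how the table is built", so the same rows could feed a DataFrame, a CSV writer, or a plain list.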

The same solution as a list comprehension:

L = [[chunk.label(), ' '.join(c[0] for c in chunk)]
     for sent in nltk.sent_tokenize(sentence)
     for chunk in nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)))
     if hasattr(chunk, 'label')]

df = pd.DataFrame(L, columns=['a','b'])
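
If the `Tree` objects have already been collected in a list, as in the question's attempt, the chunker does not need to run again. A minimal sketch, assuming the list looks like the output shown in the question (only the first three trees are copied here); the names `trees` and `df_from_trees` and the column names are illustrative choices:

```python
import pandas as pd
from nltk.tree import Tree

# First three trees copied from the output shown in the question.
trees = [
    Tree('PERSON', [('Martin', 'NNP')]),
    Tree('PERSON', [('Luther', 'NNP'), ('King', 'NNP')]),
    Tree('PERSON', [('Michael', 'NNP'), ('King', 'NNP')]),
]

# Tree.leaves() returns the (word, tag) tuples under each entity chunk;
# only the words are kept and joined into a single string per row.
df_from_trees = pd.DataFrame(
    [(t.label(), ' '.join(word for word, _tag in t.leaves())) for t in trees],
    columns=['label', 'entity'],
)
print(df_from_trees)
```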

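Regarding "python - How do I convert this oddly formatted looping print function into a DataFrame with similar output?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/70677140/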
