python - Aggregate strings from consecutive non-NaN cells in a pandas column, but not the whole column

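I am working on an NLP problem in which I have to analyze oddly formatted Excel files.

One column contains text, and each document spans several cells. The documents themselves are separated by empty cells. I want to predict the scores in the other columns from the text data.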

This is what it looks like

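I have imported the sheet into a pandas DataFrame, and now I am trying to aggregate the cells that belong to each document while keeping the scores.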

This is the goal state

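I started with nested loops, but I feel that is far more complicated than necessary.

How would you approach this? Each document covers a different number of cells, and documents are separated by a varying number of empty cells. To complicate things further, the score in the right-hand column is sometimes on the same row as the first cell of its document and sometimes on the same row as the last cell.

Thanks a lot for your help! There must be a simple solution.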

Best answer

Here is a simple example of how this can work:

import pandas as pd

# Set up a DataFrame with sample data; None marks an empty cell.
df = pd.DataFrame({'Document': ['This is ', 'first', None, 'This is ', 'second', None, 'this ', 'is ', 'third'],
                   'Score': [None, 1, None, None, 2, None, None, 3, None]})

rows = []  # one dict per completed document
doc = ''
score = None
for index, row in df.iterrows():
    if pd.notnull(row['Score']):
        # Any non-NaN value within the current document is its score.
        score = row['Score']
    if pd.notnull(row['Document']):
        # Keep building the doc string while the cell is not empty.
        doc += row['Document']
    elif doc:
        # An empty cell ends the current document; repeated empty cells are skipped.
        rows.append({'Document': doc, 'Score': score})
        doc, score = '', None

if doc:
    # The last document has no trailing empty cell, so save it here.
    rows.append({'Document': doc, 'Score': score})

# DataFrame.append was removed in pandas 2.0, so build the result from a list.
result_df = pd.DataFrame(rows, columns=['Document', 'Score'])

Output (result_df):

         Document  Score
0   This is first    1.0
1  This is second    2.0
2   this is third    3.0
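If the sheet is large, the same grouping can probably be done without an explicit loop. The following is only a sketch under a few assumptions: empty cells arrive as NaN after loading the sheet, each document has at most one score somewhere among its own rows, and the names `has_text` and `doc_id` are helpers made up for this illustration. The sample frame is also illustrative; it uses two consecutive empty cells and puts the score on the first row for one document and on a later row for the others.

```python
import numpy as np
import pandas as pd

# Illustrative sample: a run of two empty cells, and scores on different rows.
df = pd.DataFrame({
    "Document": ["This is ", "first", np.nan, np.nan,
                 "This is ", "second", np.nan,
                 "this ", "is ", "third"],
    "Score": [1, np.nan, np.nan, np.nan,
              np.nan, 2, np.nan,
              np.nan, 3, np.nan],
})

# True for rows that carry text, False for the empty separator rows.
has_text = df["Document"].notna()

# A document starts where a text cell follows an empty cell (or the top of
# the column); the running sum of those starts gives each row a document id.
doc_id = (has_text & ~has_text.shift(fill_value=False)).cumsum().rename("doc_id")

result_df = (
    df[has_text]
    .groupby(doc_id[has_text])
    # Concatenate the text cells; "first" picks the first non-null score.
    .agg({"Document": "".join, "Score": "first"})
    .reset_index(drop=True)
)
print(result_df)
```

Because `"first"` skips missing values inside each group, it does not matter whether the score sits next to the first or the last cell of a document. A score placed on a separator row would be dropped by the `df[has_text]` filter, so that case would need extra handling.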

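This question, "python - Aggregate strings from consecutive non-NaN cells in a pandas column, but not the whole column", comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/54110103/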
