python - Merge columns from different rows into a single row, grouped by a specific column

Tags: python python-3.x pandas dataframe pandas-groupby

I have the following DataFrame:

df1 = pd.DataFrame(
    {   
        "day":     ["monday", "monday","Tuesday" ],
        "column0": ["xx",      "xx",     ""],
        "column1": ["yy",      "aa",    "bb"],
        "column2": ["cc",      "cc",    "cc"],
        "column3": ["cc",      "",      "aa"]})


    day    column0  column1 column2 column3
0   monday  xx       yy       cc      cc
1   monday  xx       aa       cc    
2   Tuesday          bb       cc      aa

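I want to group by day and concatenate the columns of each group's rows into one wide row, keeping the original row labels in an index column.

Expected result 1: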

df1 = pd.DataFrame(
    {   
        "day":     ["monday", "Tuesday" ],
        "index":   ["0,1",          "2" ],
        "column0": ["xx",             ""],
        "column1": ["yy",           "bb"],
        "column2": ["cc",           "cc"],
        "column3": ["cc",           "aa"],
        "column4": ["xx",             ""],
        "column5": ["aa",             ""],
        "column6": ["cc",             ""]})

    day   index column0 column1 column2 column3 column4 column5 column6
0   monday  0,1   xx       yy     cc      cc      xx      aa    cc
1   Tuesday 2              bb     cc      aa            

Finally, I want to remove values that repeat within each row and put NaN in the blank cells.

Final result:

df1 = pd.DataFrame(
    {   
        "day":     ["monday", "Tuesday" ],
        "index":   ["0,1",          "2" ],
        "column0": ["xx",          "NAN"],
        "column1": ["yy",           "bb"],
        "column2": ["cc",           "cc"],
        "column3": ["NAN",          "aa"],
        "column4": ["aa",          "NAN"]})

    day   index column0 column1 column2  column3    column4
0   monday  0,1   xx      yy          cc    NAN       aa
1   Tuesday 2    NAN      bb          cc    aa        NAN

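Any ideas?

Best answer

You can use numpy to flatten each grouped sub-DataFrame, collect the flattened rows in a list, and build a new DataFrame from that list.

Finally, you can replace "" and None with NaN, drop the columns that are entirely NaN, and rename the columns: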

import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    {   
        "day":     ["monday", "monday","Tuesday" ],
        "column0": ["xx",      "xx",     ""],
        "column1": ["yy",      "aa",    "bb"],
        "column2": ["cc",      "cc",    "cc"],
        "column3": ["cc",      "",      "aa"]})

arr_list = []
for d, sub_df in df1.groupby("day"):
  arr = list(np.array(sub_df.iloc[:,1:]).flatten())
  arr = [d, list(sub_df.index)] + arr
  arr_list.append(arr)

df = pd.DataFrame(arr_list)
df = df.replace('',np.nan).fillna(value=np.nan).dropna(axis=1, how='all')
df.columns = ["day", "index"] + [f"column{i}" for i in range(len(df.columns)-2)]
print(df)

Output:

       day   index column0 column1 column2 column3 column4 column5 column6
0  Tuesday     [2]     NaN      bb      cc      aa     NaN     NaN     NaN
1   monday  [0, 1]      xx      yy      cc      cc      xx      aa      cc
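
To make the flattening step concrete, here is a small sketch of what happens to the "monday" group on its own. It rebuilds that group by hand (the name monday_rows is only illustrative) and uses to_numpy() in place of np.array(...); both yield a 2-D array that flatten() reads row by row.

```python
import pandas as pd

# The two "monday" rows from df1, rebuilt by hand for illustration.
monday_rows = pd.DataFrame(
    {
        "day": ["monday", "monday"],
        "column0": ["xx", "xx"],
        "column1": ["yy", "aa"],
        "column2": ["cc", "cc"],
        "column3": ["cc", ""],
    }
)

# Skip the first column (day), then flatten row by row:
# every value of row 0 comes first, followed by every value of row 1.
flat = monday_rows.iloc[:, 1:].to_numpy().flatten().tolist()
print(flat)  # ['xx', 'yy', 'cc', 'cc', 'xx', 'aa', 'cc', '']
```

This is why the "monday" row ends up with eight value columns before the all-NaN ones are dropped, while "Tuesday" only fills the first four.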
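
Edit: If you want to remove the duplicates within each row, do it right after flattening the array.

You can also keep the original group order by passing sort=False to groupby: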

import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    {   
        "day":     ["monday", "monday","Tuesday" ],
        "column0": ["xx",      "xx",     ""],
        "column1": ["yy",      "aa",    "bb"],
        "column2": ["cc",      "cc",    "cc"],
        "column3": ["cc",      "",      "aa"]})

arr_list = []
for d, sub_df in df1.groupby("day", sort=False):
  # flattening the grouped dataframe ([:,1:] => all rows, all column except the first one: day)
  arr = list(np.array(sub_df.iloc[:,1:]).flatten())
  # removing duplicates for this row:
  arr_unique = []
  for x in arr:
    if x not in arr_unique:
      arr_unique.append(x)
    else: # appending NaN to keep dataframe form
      arr_unique.append(np.nan)
  # re-appending day and adding the indexes of the grouped rows:
  arr = [d, list(sub_df.index)] + arr_unique
  arr_list.append(arr)

df = pd.DataFrame(arr_list)
# replacing '' with NaN and dropping NaN columns:
df = df.replace('',np.nan).fillna(value=np.nan).dropna(axis=1, how='all')
# renaming columns, the first two are 'day' and 'index' the rest is generated: columnX where X goes from 0 to the nb of column minus 2 (since we already named two columns)
df.columns = ["day", "index"] + [f"column{i}" for i in range(len(df.columns)-2)]
print(df)

Output:

       day   index column0 column1 column2 column3 column4
0   monday  [0, 1]      xx      yy      cc     NaN      aa
1  Tuesday     [2]     NaN      bb      cc      aa     NaN
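
The question asked for the index as a string such as "0,1" rather than a list. A possible variation, offered only as a sketch: the helper name merge_rows_by_key is hypothetical, and it treats empty strings as missing while deduplicating instead of replacing them afterwards.

```python
import numpy as np
import pandas as pd


def merge_rows_by_key(df, key="day"):
    """Hypothetical helper: build one wide row per group of `key`."""
    rows = []
    for name, sub_df in df.groupby(key, sort=False):
        # Flatten every column except the key, row by row.
        values = sub_df.drop(columns=key).to_numpy().flatten().tolist()
        seen = set()
        deduped = []
        for value in values:
            if value == "" or value in seen:
                # NaN placeholder keeps the remaining values in position.
                deduped.append(np.nan)
            else:
                seen.add(value)
                deduped.append(value)
        # Join the original row labels into a string such as "0,1".
        index_label = ",".join(str(i) for i in sub_df.index)
        rows.append([name, index_label] + deduped)

    # Shorter rows are padded with missing values; drop all-missing columns.
    out = pd.DataFrame(rows).dropna(axis=1, how="all")
    out.columns = [key, "index"] + [f"column{i}" for i in range(out.shape[1] - 2)]
    return out


df1 = pd.DataFrame(
    {
        "day": ["monday", "monday", "Tuesday"],
        "column0": ["xx", "xx", ""],
        "column1": ["yy", "aa", "bb"],
        "column2": ["cc", "cc", "cc"],
        "column3": ["cc", "", "aa"],
    }
)

result = merge_rows_by_key(df1)
print(result)
```

With the sample data this should give the same five value columns as the output above, with "0,1" and "2" in the index column. Tracking seen values in a set avoids rescanning the list for every element, which only matters for wide groups.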

Regarding "python - Merge columns from different rows into a single row, grouped by a specific column", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/70202668/
