I have the following DataFrame:
df1 = pd.DataFrame(
    {
        "day": ["monday", "monday", "Tuesday"],
        "column0": ["xx", "xx", ""],
        "column1": ["yy", "aa", "bb"],
        "column2": ["cc", "cc", "cc"],
        "column3": ["cc", "", "aa"],
    }
)
       day column0 column1 column2 column3
0   monday      xx      yy      cc      cc
1   monday      xx      aa      cc
2  Tuesday              bb      cc      aa
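I want to group by day and concatenate the columns of each group's rows into a single row, keeping the original row numbers in an index column.
Expected result 1: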
df1 = pd.DataFrame(
    {
        "day": ["monday", "Tuesday"],
        "index": ["0,1", "2"],
        "column0": ["xx", ""],
        "column1": ["yy", "bb"],
        "column2": ["cc", "cc"],
        "column3": ["cc", "aa"],
        "column4": ["xx", ""],
        "column5": ["aa", ""],
        "column6": ["cc", ""],
    }
)
       day index column0 column1 column2 column3 column4 column5 column6
0   monday   0,1      xx      yy      cc      cc      xx      aa      cc
1  Tuesday     2              bb      cc      aa
Finally, I want to remove the repeated values within each row and put NAN in the blank cells.
Final result:
df1 = pd.DataFrame(
    {
        "day": ["monday", "Tuesday"],
        "index": ["0,1", "2"],
        "column0": ["xx", "NAN"],
        "column1": ["yy", "bb"],
        "column2": ["cc", "cc"],
        "column3": ["NAN", "aa"],
        "column4": ["aa", "NAN"],
    }
)
       day index column0 column1 column2 column3 column4
0   monday   0,1      xx      yy      cc     NAN      aa
1  Tuesday     2     NAN      bb      cc      aa     NAN
Any ideas?
Best answer
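You can use numpy to flatten each grouped DataFrame, store the flattened rows in a list, and build a DataFrame from that list.
Finally, you can replace "" and None with NaN, drop the columns that contain only NaN, and rename the columns: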
import pandas as pd
import numpy as np
df1 = pd.DataFrame(
    {
        "day": ["monday", "monday", "Tuesday"],
        "column0": ["xx", "xx", ""],
        "column1": ["yy", "aa", "bb"],
        "column2": ["cc", "cc", "cc"],
        "column3": ["cc", "", "aa"],
    }
)
arr_list = []
for d, sub_df in df1.groupby("day"):
    arr = list(np.array(sub_df.iloc[:,1:]).flatten())
    arr = [d, list(sub_df.index)] + arr
    arr_list.append(arr)
df = pd.DataFrame(arr_list)
df = df.replace('',np.nan).fillna(value=np.nan).dropna(axis=1, how='all')
df.columns = ["day", "index"] + [f"column{i}" for i in range(len(df.columns)-2)]
print(df)
Output:
       day  index column0 column1 column2 column3 column4 column5 column6
0  Tuesday    [2]     NaN      bb      cc      aa     NaN     NaN     NaN
1   monday [0, 1]      xx      yy      cc      cc      xx      aa      cc
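To see why the last flattened column disappears, here is a minimal sketch of the padding and dropping steps in isolation. The rows are made up for illustration and are not the data above: pd.DataFrame pads shorter lists with missing values, and dropna(axis=1, how='all') then removes every column that holds no real value.

```python
import pandas as pd

# Two rows of different length; the values are illustrative only.
rows = [["a", "b", "c", ""], ["d", ""]]

# The shorter row is padded with missing values up to the longest row.
padded = pd.DataFrame(rows)

# Blank out empty strings (mask puts NaN wherever the condition is True).
cleaned = padded.mask(padded == "")

# Keep only the columns that still contain at least one real value.
result = cleaned.dropna(axis=1, how="all")

print(padded.shape)
print(result.shape)
```

The fourth column holds only an empty string and a padded missing value, so it is the one that gets dropped.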
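Edit: if you want to remove the duplicates within each row, do it right after flattening the array.
You can also keep the original order of the groups by passing sort=False to groupby: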
import pandas as pd
import numpy as np
df1 = pd.DataFrame(
    {
        "day": ["monday", "monday", "Tuesday"],
        "column0": ["xx", "xx", ""],
        "column1": ["yy", "aa", "bb"],
        "column2": ["cc", "cc", "cc"],
        "column3": ["cc", "", "aa"],
    }
)
arr_list = []
for d, sub_df in df1.groupby("day", sort=False):
    # flattening the grouped dataframe ([:,1:] => all rows, all columns except the first one: day)
    arr = list(np.array(sub_df.iloc[:,1:]).flatten())
    # removing duplicates for this row:
    arr_unique = []
    for x in arr:
        if x not in arr_unique:
            arr_unique.append(x)
        else: # appending NaN to keep dataframe form
            arr_unique.append(np.nan)
    # re-appending day and adding the indexes of the grouped rows:
    arr = [d, list(sub_df.index)] + arr_unique
    arr_list.append(arr)
df = pd.DataFrame(arr_list)
# replacing '' with NaN and dropping NaN columns:
df = df.replace('',np.nan).fillna(value=np.nan).dropna(axis=1, how='all')
# renaming columns, the first two are 'day' and 'index' the rest is generated: columnX where X goes from 0 to the nb of column minus 2 (since we already named two columns)
df.columns = ["day", "index"] + [f"column{i}" for i in range(len(df.columns)-2)]
print(df)
Output:
       day  index column0 column1 column2 column3 column4
0   monday [0, 1]      xx      yy      cc     NaN      aa
1  Tuesday    [2]     NaN      bb      cc      aa     NaN
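The question asked for the index column as a string such as "0,1" rather than a list. One possible way to get that is to wrap the same steps in a function and join the row labels. This is only a sketch: merge_rows_by_group is a hypothetical helper name, not a pandas API, and it assumes the cell values are hashable (strings here). It also pads the rows with NaN explicitly instead of relying on fillna.

```python
import numpy as np
import pandas as pd


def merge_rows_by_group(df, key):
    """Hypothetical helper: one output row per group, repeated values blanked."""
    rows = []
    for name, sub_df in df.groupby(key, sort=False):
        # Flatten every column except the key, row by row.
        values = sub_df.drop(columns=key).to_numpy().ravel().tolist()
        seen = set()
        deduped = []
        for v in values:
            if v == "" or v in seen:
                # NaN placeholder keeps the remaining values in position.
                deduped.append(np.nan)
            else:
                seen.add(v)
                deduped.append(v)
        # Join the original row labels into a string such as "0,1".
        index_str = ",".join(str(i) for i in sub_df.index)
        rows.append([name, index_str] + deduped)

    # Pad shorter rows with NaN so every row has the same length.
    width = max(len(r) for r in rows)
    rows = [r + [np.nan] * (width - len(r)) for r in rows]

    out = pd.DataFrame(rows).dropna(axis=1, how="all")
    out.columns = [key, "index"] + [f"column{i}" for i in range(out.shape[1] - 2)]
    return out


df1 = pd.DataFrame(
    {
        "day": ["monday", "monday", "Tuesday"],
        "column0": ["xx", "xx", ""],
        "column1": ["yy", "aa", "bb"],
        "column2": ["cc", "cc", "cc"],
        "column3": ["cc", "", "aa"],
    }
)
result = merge_rows_by_group(df1, "day")
print(result)
```

With the sample data, the index column holds "0,1" for monday and "2" for Tuesday, and the remaining columns match the output shown above.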
Regarding python - merging the columns of different rows into one row per group of a specific column, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/70202668/