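This is an extension of my previous question. I have a source dataframe with three columns: Customer, Date and Item. I want to add a new column, Item History, holding an array of all the items that customer has in earlier rows (as defined by Date). Where a customer has made multiple purchases on the same date, neither row's item should be listed in the other row's item history.
So, given this sample data: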
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Customer': ['Bert', 'Bert', 'Bert', 'Bert', 'Bert', 'Ernie', 'Ernie', 'Ernie', 'Ernie', 'Steven', 'Steven'],
    'Date': ['01/01/2019', '15/01/2019', '20/01/2019', '20/01/2019', '22/01/2019', '01/01/2019', '15/01/2019', '20/01/2019', '22/01/2019', '01/01/2019', '15/01/2019'],
    'Item': ['Bread', 'Cheese', 'Apples', 'Pears', 'Toothbrush', 'Toys', 'Shellfish', 'Dog', 'Yoghurt', 'Toilet', 'Dominos'],
})
Customer Date Item
Bert 01/01/2019 Bread
Bert 15/01/2019 Cheese
Bert 20/01/2019 Apples
Bert 20/01/2019 Pears
Bert 22/01/2019 Toothbrush
Ernie 01/01/2019 Toys
Ernie 15/01/2019 Shellfish
Ernie 20/01/2019 Dog
Ernie 22/01/2019 Yoghurt
Steven 01/01/2019 Toilet
Steven 15/01/2019 Dominos
The output I'd like to see is:
Customer Date Item Item History
Bert 01/01/2019 Bread NaN
Bert 15/01/2019 Cheese [Bread]
Bert 20/01/2019 Apples [Bread, Cheese]
Bert 20/01/2019 Pears [Bread, Cheese]
Bert 22/01/2019 Toothbrush [Bread, Cheese, Apples, Pears]
Ernie 01/01/2019 Toys NaN
Ernie 15/01/2019 Shellfish [Toys]
Ernie 20/01/2019 Dog [Toys, Shellfish]
Ernie 22/01/2019 Yoghurt [Toys, Shellfish, Dog]
Steven 01/01/2019 Toilet NaN
Steven 15/01/2019 Dominos [Toilet]
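Note that for Bert's two purchases on 20/01/2019, neither row's history contains the other row's item. For his purchase on 22/01/2019, both items from 20/01/2019 are included.
The answer to the previous question was a list comprehension of the form: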
df['Item History'] = [x.Item[:i].tolist() for j, x in df.groupby('Customer')
for i in range(len(x))]
df.loc[~df['Item History'].astype(bool), 'Item History']= np.nan
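But clearly the "i" in x.Item[:i] needs to be the position of the last row whose Date differs from the current row's. Any advice on achieving this would be greatly appreciated.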
Best answer
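The idea is to use DataFrame.duplicated to flag the repeated (Customer, Date) rows within each group, replace the history on those rows with NaN, and then forward fill the missing values.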
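To see what the mask and the forward fill do in isolation, here is a minimal sketch on a hypothetical four-row frame. The names `demo`, `is_repeat`, `position` and `filled` are illustrative only and are not part of the answer; row positions stand in for the per-row histories.

```python
import pandas as pd

# Hypothetical miniature of the sample data: one customer, two rows on the same date.
demo = pd.DataFrame({
    'Customer': ['Bert', 'Bert', 'Bert', 'Bert'],
    'Date': ['15/01/2019', '20/01/2019', '20/01/2019', '22/01/2019'],
})

# True for every row that repeats an earlier (Customer, Date) pair.
is_repeat = demo.duplicated(['Customer', 'Date'])
print(is_repeat.tolist())  # [False, False, True, False]

# Row positions stand in for the per-row histories here.
position = pd.Series([0, 1, 2, 3])

# mask() blanks the repeated rows; ffill() copies the value from the row above.
filled = position.mask(is_repeat).ffill()
print(filled.tolist())  # [0.0, 1.0, 1.0, 3.0]
```

The third row ends up with the same value as the second, which is how a same-day purchase inherits the history of the first row for that date.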
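The first value in each group is always an empty list, and the first row of a group is never flagged as a duplicate, so the forward fill never carries a value over from another customer and there is no need to fill per group: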
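# Build the running history per customer (all items from the rows above).
df['Item History'] = [x.Item[:i].tolist() for j, x in df.groupby('Customer')
                                          for i in range(len(x))]
# Blank the repeated (Customer, Date) rows, then copy the history from the first row of that date.
df['Item History'] = df['Item History'].mask(df.duplicated(['Customer','Date'])).ffill()
# Replace empty lists with NaN.
df.loc[~df['Item History'].astype(bool), 'Item History'] = np.nan
print(df)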
Customer Date Item Item History
0 Bert 01/01/2019 Bread NaN
1 Bert 15/01/2019 Cheese [Bread]
2 Bert 20/01/2019 Apples [Bread, Cheese]
3 Bert 20/01/2019 Pears [Bread, Cheese]
4 Bert 22/01/2019 Toothbrush [Bread, Cheese, Apples, Pears]
5 Ernie 01/01/2019 Toys NaN
6 Ernie 15/01/2019 Shellfish [Toys]
7 Ernie 20/01/2019 Dog [Toys, Shellfish]
8 Ernie 22/01/2019 Yoghurt [Toys, Shellfish, Dog]
9 Steven 01/01/2019 Toilet NaN
10 Steven 15/01/2019 Dominos [Toilet]
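The approach above relies on the rows already being grouped by customer and ordered by date, as they are in the sample data. If that cannot be guaranteed, one possible alternative is to parse the dates and build the histories explicitly. This is only a sketch and not part of the answer: `add_item_history` and the temporary `_date` column are hypothetical names, and it assumes the dates are DD/MM/YYYY strings and that the frame has a sortable index such as the default RangeIndex.

```python
import numpy as np
import pandas as pd


def add_item_history(df):
    """Hypothetical helper: add an 'Item History' column of items bought on earlier dates."""
    out = df.copy()
    # Parse the day-first strings so ordering is chronological rather than alphabetical.
    out['_date'] = pd.to_datetime(out['Date'], format='%d/%m/%Y')
    out = out.sort_values(['Customer', '_date'])

    histories = []
    for _, purchases in out.groupby('Customer'):
        seen = []  # items this customer bought on strictly earlier dates
        for _, same_day in purchases.groupby('_date'):
            # Every row of one date gets the same history, without same-day items.
            histories.extend(list(seen) if seen else np.nan for _ in range(len(same_day)))
            seen.extend(same_day['Item'])
    out['Item History'] = histories
    # Restore the caller's row order (assumes a sortable index such as the default RangeIndex).
    return out.drop(columns='_date').sort_index()


# Deliberately unordered input: Bert's earliest purchase is on the second row.
sample = pd.DataFrame({
    'Customer': ['Bert', 'Bert', 'Bert', 'Ernie'],
    'Date': ['20/01/2019', '01/01/2019', '20/01/2019', '01/01/2019'],
    'Item': ['Apples', 'Bread', 'Pears', 'Toys'],
})
print(add_item_history(sample))
```

The explicit loops are slower than the mask and forward fill on large frames, but they make the "strictly earlier date" rule visible and do not depend on the input order.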
Regarding python - concatenating values from earlier rows with different dates in a Pandas dataframe, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/58007296/