I have a DataFrame like this:
import pandas as pd

df = pd.DataFrame({'id':[1,2,3,4,5,6,7],
                   'vote':[5,4,5,1,10,1,9],
                   'doggo': [None,"doggo",None,None,"doggo",None,None],
                   'floofer': ["floofer",None,None,"floofer",None,None,None],
                   'pupper': [None,None,"pupper",None,None,None,None],
                   'puppo':[None,None,None,None,None,None,"puppo"]})
I want to merge the last 4 columns to produce:
df = pd.DataFrame({'id':[1,2,3,4,5,6,7],
'vote':[5,4,5,1,10,1,9],
'categories': ["floofer","doggo","pupper","floofer","doggo",None,"puppo"]})
Any guidance is appreciated.
Best answer
Solution if each row has at most one non-None value across the category columns:
cols = ['doggo','floofer','pupper','puppo']
cols1 = df.columns.difference(cols)
df2 = df[cols1].join(df[cols].ffill(axis=1).iloc[:, -1].rename('Categories'))
print (df2)
   id  vote Categories
0   1     5    floofer
1   2     4      doggo
2   3     5     pupper
3   4     1    floofer
4   5    10      doggo
5   6     1       None
6   7     9      puppo
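Explanation:
First select only the category columns and forward fill the missing values along each row, so the expected value ends up in the last column: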
print (df[cols].ffill(axis=1))
   doggo  floofer   pupper    puppo
0   None  floofer  floofer  floofer
1  doggo    doggo    doggo    doggo
2   None     None   pupper   pupper
3   None  floofer  floofer  floofer
4  doggo    doggo    doggo    doggo
5   None     None     None     None
6   None     None     None    puppo
Then select the last column by position:
print (df[cols].ffill(axis=1).iloc[:, -1])
0    floofer
1      doggo
2     pupper
3    floofer
4      doggo
5       None
6      puppo
Name: puppo, dtype: object
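The forward-fill step can be tried in isolation. The following is a minimal sketch on a small made-up frame (the data and the names `last_value` and `first_value` are only for illustration). It also shows what happens if a row holds more than one value: forward filling keeps the last non-missing value, while backward filling and taking the first column keeps the first one.

```python
import pandas as pd

# Hypothetical sample data, smaller than the frame in the question
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'doggo': [None, 'doggo', None, 'doggo'],
    'floofer': ['floofer', None, None, 'floofer'],
})
cols = ['doggo', 'floofer']

# Forward fill along each row: the last column then holds
# the last non-missing value of that row
last_value = df[cols].ffill(axis=1).iloc[:, -1]

# Backward fill along each row: the first column then holds
# the first non-missing value of that row
first_value = df[cols].bfill(axis=1).iloc[:, 0]

print(last_value)
print(first_value)
```

When every row has at most one value, both variants give the same result, so the choice only matters for rows like the last one here.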
Solution if a row can hold multiple values, where the output is built from the names of the category columns:
df = pd.DataFrame({'id':[1,2,3,4,5,6,7],
'vote':[5,4,5,1,10,1,9],
'doggo': [None,"doggo1",None,"doggo2","doggo3",None,None],
'floofer': ["floofer1",None,None,"floofer2",None,None,None],
'pupper': [None,None,"pupper1",None,None,None,None],
'puppo':["puppo1",None,None,None,None,None,"puppo2"]})
print (df)
   id  vote   doggo   floofer   pupper   puppo
0   1     5    None  floofer1     None  puppo1
1   2     4  doggo1      None     None    None
2   3     5    None      None  pupper1    None
3   4     1  doggo2  floofer2     None    None
4   5    10  doggo3      None     None    None
5   6     1    None      None     None    None
6   7     9    None      None     None  puppo2
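import numpy as np

# Boolean mask of non-missing cells, combined with the column names via dot
s = (df[cols].notnull()
             .dot(pd.Index(cols) + ', ')
             .str.strip(', ')
             .rename('Categories')
             .replace('', np.nan)
)
# Store the result in df2 so that df keeps its category columns
df2 = df[cols1].join(s)
print (df2)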
   id  vote      Categories
0   1     5  floofer, puppo
1   2     4           doggo
2   3     5          pupper
3   4     1  doggo, floofer
4   5    10           doggo
5   6     1             NaN
6   7     9           puppo
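Why the `dot` call yields joined names may not be obvious, so here is a small sketch on a hypothetical boolean frame (`flags`, `labels` and `joined` are illustrative names, not part of the answer above). In Python, `True * 'doggo, '` is `'doggo, '` and `False * 'doggo, '` is the empty string, so the row-wise dot product concatenates the names of the columns that are True.

```python
import numpy as np
import pandas as pd

cols = ['doggo', 'floofer']

# Hypothetical mask: True where the original cell was not missing
flags = pd.DataFrame({'doggo': [True, False, True],
                      'floofer': [True, False, False]})

# Each column name gets a trailing separator: 'doggo, ', 'floofer, '
labels = pd.Index(cols) + ', '

# Row-wise sum of (bool * string) concatenates the selected names
joined = flags.dot(labels)

# Remove the trailing separator and mark empty rows as missing
cleaned = joined.str.strip(', ').replace('', np.nan)
print(cleaned)
```

Note that `str.strip(', ')` removes any leading or trailing commas and spaces, which is fine here because the column names contain neither at their ends.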
Another solution, where the output is built from the cell values rather than the column names:
s = pd.Series(df[cols].add(', ').fillna('').values.sum(axis=1),
index=df.index, name='Categories').str.strip(', ')
df2 = df[cols1].join(s)
print (df2)
   id  vote        Categories
0   1     5  floofer1, puppo1
1   2     4            doggo1
2   3     5           pupper1
3   4     1  doggo2, floofer2
4   5    10            doggo3
5   6     1
6   7     9            puppo2
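The last output leaves an empty string for rows without any category. If a missing value is preferred there, one possible alternative is a row-wise `apply` that joins only the values that are present. This is a sketch rather than part of the answer above; the helper `join_present` and the sample data are made up for illustration, and it is likely slower than the vectorized versions on large frames.

```python
import pandas as pd

# Hypothetical sample data with zero, one, or two values per row
df = pd.DataFrame({
    'id': [1, 2, 3],
    'doggo': [None, 'doggo1', None],
    'puppo': ['puppo1', 'puppo2', None],
})
cols = ['doggo', 'puppo']

def join_present(row):
    """Join the non-missing values of a row, or return None if there are none."""
    values = row.dropna()
    return ', '.join(values) if len(values) else None

categories = df[cols].apply(join_present, axis=1).rename('Categories')

# drop(columns=...) keeps the remaining columns in their original order,
# whereas Index.difference returns them sorted
result = df.drop(columns=cols).join(categories)
print(result)
```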
Regarding python - combining different columns, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/53578054/