I have a somewhat confusing operation that I'm trying to perform efficiently on a dataset of the following general form:
id,date,ind_1,ind_2,ind_3,ind_4
1,2014-01-01,ind_1,NaN,NaN,NaN
2,2014-01-02,ind_1,NaN,ind_3,NaN
3,2014-01-03,ind_1,ind_2,ind_3,NaN
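I'm trying to figure out how to create a new column, "ind_all", populated with whichever "ind" column is non-null. That part is easy; I can use .idxmax(). The tricky part is that a row can have more than one "ind" value, which means I need to create a new record for each additional value. The example above should end up looking like this: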
id,date,ind_1,ind_2,ind_3,ind_4,ind_all
1,2014-01-01,ind_1,NaN,NaN,NaN,ind_1
2,2014-01-02,ind_1,NaN,ind_3,NaN,ind_1
2,2014-01-02,ind_1,NaN,ind_3,NaN,ind_3
3,2014-01-03,ind_1,ind_2,ind_3,NaN,ind_1
3,2014-01-03,ind_1,ind_2,ind_3,NaN,ind_2
3,2014-01-03,ind_1,ind_2,ind_3,NaN,ind_3
As always, any tips or tricks are much appreciated!
Best answer
There is a merge-based solution that uses melt/stack to build the right-hand side (RHS).
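# Build the RHS: one (id, ind_all) row per non-null indicator value
v = (df.drop(columns='date')
       .melt('id')
       .drop(columns='variable')
       .dropna()
       .rename(columns={'value': 'ind_all'})
)
# Merge on the shared 'id' column
df.merge(v)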
id date ind_1 ind_2 ind_3 ind_4 ind_all
0 1 2014-01-01 ind_1 NaN NaN NaN ind_1
1 2 2014-01-02 ind_1 NaN ind_3 NaN ind_1
2 2 2014-01-02 ind_1 NaN ind_3 NaN ind_3
3 3 2014-01-03 ind_1 ind_2 ind_3 NaN ind_1
4 3 2014-01-03 ind_1 ind_2 ind_3 NaN ind_2
5 3 2014-01-03 ind_1 ind_2 ind_3 NaN ind_3
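For anyone who wants to try the melt-based approach end to end, here is a minimal, self-contained sketch. The sample frame is rebuilt by hand from the question's data, and the names `df`, `v`, and `out` are just illustrative; it assumes a reasonably recent pandas where `drop(columns=...)` is available.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question (assumption: dates kept as plain strings)
df = pd.DataFrame({
    'id': [1, 2, 3],
    'date': ['2014-01-01', '2014-01-02', '2014-01-03'],
    'ind_1': ['ind_1', 'ind_1', 'ind_1'],
    'ind_2': [np.nan, np.nan, 'ind_2'],
    'ind_3': [np.nan, 'ind_3', 'ind_3'],
    'ind_4': [np.nan, np.nan, np.nan],
})

# Long format: one row per (id, indicator column), then keep only non-null values
v = (df.drop(columns='date')
       .melt(id_vars='id')
       .drop(columns='variable')
       .dropna()
       .rename(columns={'value': 'ind_all'}))

# Inner merge on 'id' repeats each original row once per non-null indicator
out = df.merge(v)
print(out)
```

Because `v` only shares the `id` column with `df`, `merge` picks that column as the join key by default; passing `on='id'` explicitly would make the intent clearer.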
Alternatively,
df.merge(df.drop(columns='date')
           .set_index('id')
           .stack()
           .dropna()  # no-op where stack() already drops NaN
           .reset_index(level=1, drop=True)
           .to_frame('ind_all'),
         left_on='id',
         right_index=True
)
id date ind_1 ind_2 ind_3 ind_4 ind_all
0 1 2014-01-01 ind_1 NaN NaN NaN ind_1
1 2 2014-01-02 ind_1 NaN ind_3 NaN ind_1
1 2 2014-01-02 ind_1 NaN ind_3 NaN ind_3
2 3 2014-01-03 ind_1 ind_2 ind_3 NaN ind_1
2 3 2014-01-03 ind_1 ind_2 ind_3 NaN ind_2
2 3 2014-01-03 ind_1 ind_2 ind_3 NaN ind_3
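Note that the stack-based variant keeps the left frame's index, so duplicated rows share an index label (0, 1, 1, 2, 2, 2 in the output above). A minimal sketch of the same idea with the index renumbered follows; the sample frame is rebuilt by hand from the question, the names `rhs` and `out` are illustrative, and the extra `dropna()` is a defensive assumption for pandas versions whose `stack()` may keep NaN entries.

```python
import numpy as np
import pandas as pd

# Sample data reconstructed from the question
df = pd.DataFrame({
    'id': [1, 2, 3],
    'date': ['2014-01-01', '2014-01-02', '2014-01-03'],
    'ind_1': ['ind_1', 'ind_1', 'ind_1'],
    'ind_2': [np.nan, np.nan, 'ind_2'],
    'ind_3': [np.nan, 'ind_3', 'ind_3'],
    'ind_4': [np.nan, np.nan, np.nan],
})

# RHS indexed by id: one entry per non-null indicator value
rhs = (df.drop(columns='date')
         .set_index('id')
         .stack()
         .dropna()                          # defensive: harmless if NaN is already dropped
         .reset_index(level=1, drop=True)   # drop the column-name level, keep 'id'
         .to_frame('ind_all'))

# Join the left 'id' column against the RHS index, then renumber the rows
out = (df.merge(rhs, left_on='id', right_index=True)
         .reset_index(drop=True))
print(out)
```

Whether to call `reset_index(drop=True)` depends on whether the original row labels are needed later, for example to trace each expanded row back to its source row.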
This is based on a similar Stack Overflow question, "python - Fill NaN column with other column values, duplicating new rows": https://stackoverflow.com/questions/50993025/