python - Create multiple new columns based on previous sets of columns (more efficiently)

Tags: python pandas dataframe

For my dataset, I want to create some new columns. Each of these columns is a ratio based on two other columns. Here is an example of what I mean:


import pandas as pd
col1=[0,0,0,0,2,4,6,0,0,0,100,200,300,400]
col2=[0,0,0,0,4,6,8,0,0,0,200,900,400, 500]

d = {'Unit': [1, 1, 1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 6, 6], 
 'Year': [2014, 2015, 2016, 2017, 2015, 2016, 2017, 2017, 2014, 2015, 2014, 2015, 2016, 2017], 'col1' : col1, 'col2' : col2 }
df = pd.DataFrame(data=d)

new_df = df.groupby(['Unit', 'Year']).sum()

new_df['col1/col2'] = (new_df.groupby(level=0, group_keys=False)
                  .apply(lambda x: x.col1/x.col2.shift())
                 )

           col1  col2      col1/col2
Unit Year                      
1    2014     0     0       NaN
     2015     0     0       NaN
     2016     0     0       NaN
     2017     0     0       NaN
2    2015     2     4       NaN
     2016     4     6  1.000000
     2017     6     8  1.000000
3    2017     0     0       NaN
4    2014     0     0       NaN
5    2015     0     0       NaN
6    2014   100   200       NaN
     2015   200   900  1.000000
     2016   300   400  0.333333
     2017   400   500  1.000000

However, this is a heavily simplified df. In reality I have columns 1 through 50. What I am doing now feels very inefficient:

col1=[0,0,0,0,2,4,6,0,0,0,100,200,300,400]
col2=[0,0,0,0,4,6,8,0,0,0,200,900,400, 500]
col3=[0,0,0,0,4,6,8,0,0,0,200,900,400, 500]
col4=[0,0,0,0,4,6,8,0,0,0,200,900,400, 500]
col5=[0,0,0,0,4,6,8,0,0,0,200,900,400, 500]
col6=[0,0,0,0,4,6,8,0,0,0,200,900,400, 500]

# data in all cols is the same, just for example.

d = {'Unit': [1, 1, 1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 6, 6], 
 'Year': [2014, 2015, 2016, 2017, 2015, 2016, 2017, 2017, 2014, 2015, 2014, 2015, 2016, 2017], 'col1' : col1, 'col2' : col2, 'col3' : col3, 'col4' : col4, 'col5' : col5, 'col6' : col6}
df = pd.DataFrame(data=d)

new_df = df.groupby(['Unit', 'Year']).sum()

new_df['col1/col2'] = (new_df.groupby(level=0, group_keys=False)
                  .apply(lambda x: x.col1/x.col2.shift())
                 )
new_df['col3/col4'] = (new_df.groupby(level=0, group_keys=False)
                  .apply(lambda x: x.col3/x.col4.shift())
                 )
new_df['col5/col6'] = (new_df.groupby(level=0, group_keys=False)
                  .apply(lambda x: x.col5/x.col6.shift())
                 )


I repeat this new-column step 25 times. Can this be done more efficiently?

Thanks in advance,

Best answer

The idea is to apply DataFrameGroupBy.shift to all columns in the list cols2, and then divide the DataFrame filtered by the list cols1 by the result:

import pandas as pd

col1=[0,0,0,0,2,4,6,0,0,0,100,200,300,400]
col2=[0,0,0,0,4,6,8,0,0,0,200,900,400, 500]


d = {'Unit': [1, 1, 1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 6, 6], 
 'Year': [2014, 2015, 2016, 2017, 2015, 2016, 2017, 2017, 2014, 2015, 2014, 2015, 2016, 2017], 
 'col1' : col1, 'col2' : col2 , 
 'col3' : col1, 'col4' : col2 , 
 'col5' : col1, 'col6' : col2 }
df = pd.DataFrame(data=d)

new_df = df.groupby(['Unit', 'Year']).sum()

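cols1 = ['col1','col3','col5']   # numerators
cols2 = ['col2','col4','col6']   # denominators
# shift the denominators one row down within each Unit; .values drops the
# column labels so the division is positional instead of label-aligned
new_df = new_df[cols1] / new_df.groupby(level=0)[cols2].shift().values
new_df.columns = [f'{a}/{b}' for a, b in zip(cols1, cols2)]
print(new_df)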
           col1/col2  col3/col4  col5/col6
Unit Year                                 
1    2014        NaN        NaN        NaN
     2015        NaN        NaN        NaN
     2016        NaN        NaN        NaN
     2017        NaN        NaN        NaN
2    2015        NaN        NaN        NaN
     2016   1.000000   1.000000   1.000000
     2017   1.000000   1.000000   1.000000
3    2017        NaN        NaN        NaN
4    2014        NaN        NaN        NaN
5    2015        NaN        NaN        NaN
6    2014        NaN        NaN        NaN
     2015   1.000000   1.000000   1.000000
     2016   0.333333   0.333333   0.333333
     2017   1.000000   1.000000   1.000000
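
With 50 columns, typing out cols1 and cols2 by hand becomes tedious. The following is a minimal sketch of how the two lists could be generated instead, assuming the columns really are named col1 ... colN and that odd-numbered columns are always the numerators and even-numbered ones the denominators (the question does not state this explicitly). The names n_cols, numerators, denominators, summed, ratios and result are illustrative only. The sketch also joins the ratios back onto the summed frame, in case the original columns should be kept.

```python
import pandas as pd

col1 = [0, 0, 0, 0, 2, 4, 6, 0, 0, 0, 100, 200, 300, 400]
col2 = [0, 0, 0, 0, 4, 6, 8, 0, 0, 0, 200, 900, 400, 500]

data = {
    'Unit': [1, 1, 1, 1, 2, 2, 2, 3, 4, 5, 6, 6, 6, 6],
    'Year': [2014, 2015, 2016, 2017, 2015, 2016, 2017, 2017,
             2014, 2015, 2014, 2015, 2016, 2017],
}

# Assumption: columns are named col1..colN; 6 here, 50 in the question.
n_cols = 6
for i in range(1, n_cols + 1, 2):
    data[f'col{i}'] = col1       # odd columns reuse the sample numerator data
    data[f'col{i + 1}'] = col2   # even columns reuse the sample denominator data

df = pd.DataFrame(data)
summed = df.groupby(['Unit', 'Year']).sum()

# Build the two column lists instead of writing them out by hand.
numerators = [f'col{i}' for i in range(1, n_cols + 1, 2)]
denominators = [f'col{i}' for i in range(2, n_cols + 1, 2)]

# Shift every denominator column by one row within each Unit.
shifted = summed.groupby(level='Unit')[denominators].shift()

# Convert to a plain array so the division is positional; dividing two
# DataFrames directly would align on column labels, which differ here.
ratios = summed[numerators] / shifted.to_numpy()
ratios.columns = [f'{a}/{b}' for a, b in zip(numerators, denominators)]

# Optional: keep the original columns next to the new ratio columns.
result = summed.join(ratios)
print(result)
```

If the real column names do not follow a numeric pattern, the two lists could be built in any other way; the only requirement of this approach is that they have the same length and are in matching order.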

Regarding "python - Create multiple new columns based on previous sets of columns (more efficiently)", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/56750945/
