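python - How to efficiently sum the values in a row of a Pandas DataFrame

Tags: python performance pandas numpy dataframe

I have a pandas DataFrame with 1.5 million rows and 8 columns. I want to combine several columns into a new column. I know how to do this, but I would like to know which approach is faster and more efficient. My code is reproduced here: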

import pandas as pd
import numpy as np
df = pd.DataFrame(columns=['A','B','C'], data=[[1,2,3],[4,5,6],[7,8,9]])

This is what I want to achieve:

df['D']=0.5*df['A']+0.3*df['B']+0.2*df['C']

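The alternative is to use pandas' apply function:

df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)

I would like to know which method takes less time when there are 1.5 million rows and 8 columns to combine.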

Best answer

The first approach is faster because it is vectorized:

df = pd.DataFrame(columns=['A','B','C'], data=[[1,2,3],[4,5,6],[7,8,9]])
print(df)

# Repeat the 3 sample rows 10000 times
# [30000 rows x 3 columns]
df = pd.concat([df]*10000).reset_index(drop=True)

# Vectorized: arithmetic is applied to whole columns at once
df['D1'] = 0.5*df['A'] + 0.3*df['B'] + 0.2*df['C']
# Similar timings with the mul function
# df['D1'] = df['A'].mul(0.5) + df['B'].mul(0.3) + df['C'].mul(0.2)

# Row-wise: the lambda is called once per row
df['D'] = df.apply(lambda row: 0.5*row['A'] + 0.3*row['B'] + 0.2*row['C'], axis=1)

print(df)

In [54]: %timeit df['D2']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
The slowest run took 10.84 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 950 µs per loop

In [55]: %timeit df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
The slowest run took 4.76 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 1.2 ms per loop

In [56]: %timeit df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
1 loop, best of 3: 928 ms per loop
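
The %timeit lines above are IPython magics. As a rough sketch of how the same comparison could be reproduced in a plain Python script, the standard-library timeit module can be used instead. The function names vectorized and row_wise and the smaller 3000-row frame are assumptions made for this illustration, not part of the original answer, and absolute timings will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Small illustrative frame (3000 rows); the answer above used 30000 rows.
df = pd.DataFrame(columns=['A', 'B', 'C'], data=[[1, 2, 3], [4, 5, 6], [7, 8, 9]])
df = pd.concat([df] * 1000).reset_index(drop=True)


def vectorized():
    # Column-wise arithmetic: each operation runs over a whole column.
    return 0.5 * df['A'] + 0.3 * df['B'] + 0.2 * df['C']


def row_wise():
    # apply with axis=1 calls the lambda once per row.
    return df.apply(lambda row: 0.5 * row['A'] + 0.3 * row['B'] + 0.2 * row['C'], axis=1)


# Take the best of 3 repeats, as %timeit does, and report seconds per call.
t_vec = min(timeit.repeat(vectorized, number=10, repeat=3)) / 10
t_apply = min(timeit.repeat(row_wise, number=1, repeat=3))

print(f"vectorized: {t_vec * 1e3:.3f} ms per call")
print(f"apply:      {t_apply * 1e3:.3f} ms per call")
```

Both functions return the same values, so only the elapsed time differs between them.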

Another test on a DataFrame with 1.5M rows shows that the apply method is very slow:

#[1500000 rows x 6 columns]
df = pd.concat([df]*500000).reset_index(drop=True)

In [62]: %timeit df['D2']=df['A'].mul(0.5)+df['B'].mul(0.3)+df['C'].mul(0.2)
10 loops, best of 3: 34.8 ms per loop

In [63]: %timeit df['D1']=0.5*df['A']+0.3*df['B']+0.2*df['C']
10 loops, best of 3: 31.5 ms per loop

In [64]: %timeit df['D']=df.apply(lambda row: 0.5*row['A']+0.3*row['B']+0.2*row['C'], axis=1)
1 loop, best of 3: 47.3 s per loop
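
When more columns have to be combined (the question mentions 8), writing out every term by hand becomes tedious. One possible way to keep the computation vectorized is to hold the weights in a Series and take a dot product. This is a sketch rather than part of the original answer: the weights Series and the column names D_dot and D_np are assumptions introduced here, and no timings are claimed for it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(columns=['A', 'B', 'C'], data=[[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# Hypothetical weights: one entry per column to combine.
# For 8 columns, this Series would simply have 8 entries.
weights = pd.Series({'A': 0.5, 'B': 0.3, 'C': 0.2})

# DataFrame.dot matches the Series index against the column labels.
df['D_dot'] = df[weights.index].dot(weights)

# The same weighted sum on the underlying NumPy arrays.
df['D_np'] = df[weights.index].to_numpy() @ weights.to_numpy()

# Reference result using the explicit formula from the question.
expected = 0.5 * df['A'] + 0.3 * df['B'] + 0.2 * df['C']
print(np.allclose(df['D_dot'], expected), np.allclose(df['D_np'], expected))
```

Selecting the columns with weights.index keeps the column order and the weight order in step, which matters for the NumPy variant because it works by position rather than by label.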

For more on python - How to efficiently sum the values in a row of a Pandas DataFrame, see the similar question on Stack Overflow: https://stackoverflow.com/questions/39445111/
