python-3.x - Applying a function that uses 2 columns in pandas

Consider the following "exampleDF".

name  age  sex
a     21   male
b     13   female
c     56   female
d     12   male
e     45   nan
f     10   female

I want to create a new column from age and sex: if age < 15, newColumn is 'child'; otherwise it equals sex.

I have tried this:

exampleDF['newColumn'] = exampleDF[['age','sex']].apply(lambda age,sex: 'child' if age < 15 else sex)

But I get the error: missing 1 required positional argument: 'sex'

Please help me understand what I am doing wrong.

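Best answer

I think it is better to use mask: where the boolean mask is True, the new column gets the string child; otherwise it keeps the value from the sex column: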

print (exampleDF['age'] < 15)
0    False
1     True
2    False
3     True
4    False
5     True
Name: age, dtype: bool


exampleDF['newColumn'] = exampleDF['sex'].mask(exampleDF['age'] < 15, 'child')
print (exampleDF)
  name  age     sex newColumn
0    a   21    male      male
1    b   13  female     child
2    c   56  female    female
3    d   12    male     child
4    e   45     NaN       NaN
5    f   10  female     child
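
For anyone who wants to run the mask approach end to end, here is a minimal self-contained sketch. The code that builds exampleDF is an assumption (the question only shows the table, not how the frame was created), and is_child is just an illustrative variable name.

```python
import numpy as np
import pandas as pd

# Rebuild the sample frame from the question (assumed construction);
# the missing sex for row "e" is stored as NaN.
exampleDF = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "age": [21, 13, 56, 12, 45, 10],
    "sex": ["male", "female", "female", "male", np.nan, "female"],
})

# Boolean Series: True for rows where age is below 15.
is_child = exampleDF["age"] < 15

# Series.mask replaces the values where the condition is True
# and leaves every other value (including NaN) unchanged.
exampleDF["newColumn"] = exampleDF["sex"].mask(is_child, "child")

print(exampleDF)
```

Series.where is the mirror image of mask: exampleDF["sex"].where(~is_child, "child") should give the same column, since where keeps values where the condition is True.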

The main advantage of this solution is that it is faster:

#small 6 rows df
In [63]: %timeit exampleDF['sex'].mask(exampleDF['age'] < 15, 'child')
1000 loops, best of 3: 517 µs per loop

In [64]: %timeit exampleDF[['age','sex']].apply(lambda x: 'child' if x['age'] < 15 else x['sex'],axis=1)
1000 loops, best of 3: 867 µs per loop

#bigger 6k df
exampleDF = pd.concat([exampleDF]*1000).reset_index(drop=True)

In [66]: %timeit exampleDF['sex'].mask(exampleDF['age'] < 15, 'child')
The slowest run took 5.41 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 589 µs per loop

In [67]: %timeit exampleDF[['age','sex']].apply(lambda x: 'child' if x['age'] < 15 else x['sex'],axis=1)
10 loops, best of 3: 104 ms per loop

#bigger 60k df (built from the original 6-row exampleDF) - apply is very slow
exampleDF = pd.concat([exampleDF]*10000).reset_index(drop=True)

In [69]: %timeit exampleDF['sex'].mask(exampleDF['age'] < 15, 'child')
1000 loops, best of 3: 1.23 ms per loop

In [70]: %timeit exampleDF[['age','sex']].apply(lambda x: 'child' if x['age'] < 15 else x['sex'],axis=1)
1 loop, best of 3: 1.03 s per loop
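
The apply call used in these timings also shows how to repair the attempt from the question. Without axis=1, DataFrame.apply calls the function once per column and passes it a single Series, so a lambda that expects two arguments (age, sex) fails with the missing-argument error. With axis=1 the function receives one row at a time and can read both fields from it. A minimal sketch follows; the hand-built exampleDF and the helper name label_row are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Sample frame rebuilt from the table in the question (assumed construction).
exampleDF = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f"],
    "age": [21, 13, 56, 12, 45, 10],
    "sex": ["male", "female", "female", "male", np.nan, "female"],
})


def label_row(row):
    # row is a Series holding one row of the frame; index it by column name.
    return "child" if row["age"] < 15 else row["sex"]


# axis=1 makes apply pass rows (not columns) to the function.
exampleDF["newColumn"] = exampleDF[["age", "sex"]].apply(label_row, axis=1)

print(exampleDF)
```

This produces the same column as mask, but as the timings above suggest it gets much slower as the frame grows, so it is probably best kept for row logic that cannot be expressed with vectorized operations.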

Regarding python-3.x - applying a function that uses 2 columns in pandas, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/43462944/
