Consider the following DataFrame, exampleDF.
name age sex
a 21 male
b 13 female
c 56 female
d 12 male
e 45 nan
f 10 female
I want to create a new column from age and sex: if age < 15, newColumn should be child; otherwise it should equal sex.
I tried this:
exampleDF['newColumn'] = exampleDF[['age','sex']].apply(lambda age,sex: 'child' if age < 15 else sex)
But I get the error: missing 1 required positional argument: 'sex'
Please help me understand what I am doing wrong.
Best answer
I think it is better to use mask: where the boolean mask is True, the new column gets the string child; otherwise it takes the value from the sex column:
print (exampleDF['age'] < 15)
0 False
1 True
2 False
3 True
4 False
5 True
Name: age, dtype: bool
exampleDF['newColumn'] = exampleDF['sex'].mask(exampleDF['age'] < 15, 'child')
print (exampleDF)
name age sex newColumn
0 a 21 male male
1 b 13 female child
2 c 56 female female
3 d 12 male child
4 e 45 NaN NaN
5 f 10 female child
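For reference, the steps above can be reproduced as a self-contained sketch. The DataFrame construction is an assumption reconstructed from the table in the question, and the variable names is_child, via_mask and via_where are only illustrative. Series.where is shown alongside mask because it is the same operation with the condition inverted:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question (values assumed from the table).
exampleDF = pd.DataFrame({
    "name": list("abcdef"),
    "age": [21, 13, 56, 12, 45, 10],
    "sex": ["male", "female", "female", "male", np.nan, "female"],
})

# Boolean Series: True for rows where age is below 15.
is_child = exampleDF["age"] < 15

# mask replaces values where the condition is True and keeps the rest.
via_mask = exampleDF["sex"].mask(is_child, "child")

# where keeps values where the condition is True and replaces the rest,
# so the condition has to be inverted to get the same result.
via_where = exampleDF["sex"].where(~is_child, "child")

exampleDF["newColumn"] = via_mask
print(exampleDF)
```

Note that the missing value in sex for row e stays missing, because that row's age is not below 15 and the original value is kept.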
The main advantage of this solution is that it is faster:
#small 6 rows df
In [63]: %timeit exampleDF['sex'].mask(exampleDF['age'] < 15, 'child')
1000 loops, best of 3: 517 µs per loop
In [64]: %timeit exampleDF[['age','sex']].apply(lambda x: 'child' if x['age'] < 15 else x['sex'],axis=1)
1000 loops, best of 3: 867 µs per loop
#bigger 6k df
exampleDF = pd.concat([exampleDF]*1000).reset_index(drop=True)
In [66]: %timeit exampleDF['sex'].mask(exampleDF['age'] < 15, 'child')
The slowest run took 5.41 times longer than the fastest. This could mean that an intermediate result is being cached.
1000 loops, best of 3: 589 µs per loop
In [67]: %timeit exampleDF[['age','sex']].apply(lambda x: 'child' if x['age'] < 15 else x['sex'],axis=1)
10 loops, best of 3: 104 ms per loop
#bigger 60k df (built from the original 6 rows df) - apply very slow
exampleDF = pd.concat([exampleDF]*10000).reset_index(drop=True)
In [69]: %timeit exampleDF['sex'].mask(exampleDF['age'] < 15, 'child')
1000 loops, best of 3: 1.23 ms per loop
In [70]: %timeit exampleDF[['age','sex']].apply(lambda x: 'child' if x['age'] < 15 else x['sex'],axis=1)
1 loop, best of 3: 1.03 s per loop
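The apply calls timed above also show why the attempt in the question fails. DataFrame.apply always passes a single argument to the function: a whole column by default, or a whole row when axis=1 is given. A lambda with two parameters therefore never receives its second argument, which is the error about 'sex'. A minimal sketch of the working row-wise version follows; the DataFrame construction is assumed from the table in the question and the helper name label is hypothetical:

```python
import numpy as np
import pandas as pd

# Rebuild the example frame from the question (values assumed from the table).
exampleDF = pd.DataFrame({
    "name": list("abcdef"),
    "age": [21, 13, 56, 12, 45, 10],
    "sex": ["male", "female", "female", "male", np.nan, "female"],
})

def label(row):
    # row is a Series holding one row; fields are looked up by column name.
    return "child" if row["age"] < 15 else row["sex"]

# axis=1 makes apply call the function once per row instead of once per column.
exampleDF["newColumn"] = exampleDF[["age", "sex"]].apply(label, axis=1)
print(exampleDF)
```

This gives the same column as mask, but it runs a Python function for every row, which is consistent with the gap in the timings growing as the frame gets larger.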
Source: the Stack Overflow question "python-3.x - apply a function using 2 columns in pandas": https://stackoverflow.com/questions/43462944/