python - Iterate over a Pandas DataFrame, apply conditions, and add a column

Tags: python pandas

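I have purchase data and want to label each purchase with a new column that indicates the time of day at which it was made. To do this, I use the hour from each purchase's timestamp column.

The labels should work like this: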

 hour 4 - 7 => 'morning'
 hour 8 - 11 => 'before midday'
 ...

I have already extracted the hour from the timestamp. I now have a DataFrame with 50 million records that looks like this.

    user_id  timestamp              hour
0   11       2015-08-21 06:42:44    6
1   11       2015-08-20 13:38:58    13
2   11       2015-08-20 13:37:47    13
3   11       2015-08-21 06:59:05    6
4   11       2015-08-20 13:15:21    13

My current approach is to use .iterrows() six times, each with a different condition:

for index, row in basket_times[(basket_times['hour']  >= 4) & (basket_times['hour'] < 8)].iterrows():
    basket_times['periode'] = 'morning'

Then:

for index, row in basket_times[(basket_times['hour']  >= 8) & (basket_times['hour'] < 12)].iterrows():
    basket_times['periode'] = 'before midday'

And so on.

However, just one of the 6 loops over the 50 million records already takes about an hour. Is there a better way?

Best answer

You can try loc with a boolean mask. I modified the df for testing:

print(basket_times)
   user_id           timestamp  hour
0       11 2015-08-21 06:42:44     6
1       11 2015-08-20 13:38:58    13
2       11 2015-08-20 09:37:47     9
3       11 2015-08-21 06:59:05     6
4       11 2015-08-20 13:15:21    13

# create boolean masks (the question only specifies the first two ranges;
# the after-midday range is assumed here for testing)
morning = (basket_times['hour'] >= 4) & (basket_times['hour'] < 8)
beforemidday = (basket_times['hour'] >= 8) & (basket_times['hour'] < 12)
aftermidday = (basket_times['hour'] >= 12) & (basket_times['hour'] < 16)
print(morning)
0     True
1    False
2    False
3     True
4    False
Name: hour, dtype: bool

print(beforemidday)
0    False
1    False
2     True
3    False
4    False
Name: hour, dtype: bool
print(aftermidday)
0    False
1     True
2    False
3    False
4     True
Name: hour, dtype: bool
basket_times.loc[morning, 'periode'] = 'morning'
basket_times.loc[beforemidday, 'periode'] = 'before midday'
basket_times.loc[aftermidday, 'periode'] = 'after midday'
print(basket_times)
   user_id           timestamp  hour        periode
0       11 2015-08-21 06:42:44     6        morning
1       11 2015-08-20 13:38:58    13   after midday
2       11 2015-08-20 09:37:47     9  before midday
3       11 2015-08-21 06:59:05     6        morning
4       11 2015-08-20 13:15:21    13   after midday
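
The separate loc assignments can also be collapsed into a single call. The following is only a sketch using numpy.select, assuming just the two ranges stated in the question; the small sample frame and the 'unlabelled' default are hypothetical placeholders, not part of the original data:

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the 'hour' column matters here.
basket_times = pd.DataFrame({"user_id": [11] * 5, "hour": [6, 13, 9, 6, 13]})

hour = basket_times["hour"]

# One boolean mask per label, in the same order as the labels below.
conditions = [
    (hour >= 4) & (hour < 8),
    (hour >= 8) & (hour < 12),
]
labels = ["morning", "before midday"]

# Rows matching none of the masks get the placeholder default (an assumption).
basket_times["periode"] = np.select(conditions, labels, default="unlabelled")
print(basket_times)
```

When a row matches more than one mask, np.select uses the first match, so the order of the conditions matters if the ranges ever overlap.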

Timings - len(df) = 500k:

In [87]: %timeit a(df)
10 loops, best of 3: 34 ms per loop

In [88]: %timeit b(df1)
1 loops, best of 3: 490 ms per loop

Test code:

import pandas as pd
import io

temp=u"""user_id;timestamp;hour
11;2015-08-21 06:42:44;6
11;2015-08-20 10:38:58;10
11;2015-08-20 09:37:47;9
11;2015-08-21 06:59:05;6
11;2015-08-20 10:15:21;10"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", index_col=None, parse_dates=[1])
df = pd.concat([df]*100000).reset_index(drop=True)
print(df.shape)
#(500000, 3)
df1 = df.copy()

def a(basket_times):
    morning = (basket_times['hour']  >= 4) & (basket_times['hour'] < 8)
    beforemidday = (basket_times['hour']  >= 8) & (basket_times['hour'] < 12)
    basket_times.loc[morning, 'periode'] = 'morning'
    basket_times.loc[beforemidday, 'periode'] = 'before midday'
    return basket_times

def b(basket_times):
    def get_periode(hour):
        if 4 <= hour <= 7:
            return 'morning'
        elif 8 <= hour <= 11:
            return 'before midday'

    basket_times['periode'] = basket_times['hour'].map(get_periode)
    return basket_times

print(a(df))
print(b(df1))
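
Because the labels cover contiguous hour ranges, pandas.cut may be another option worth timing against a and b. This is a sketch under the assumption that only the two ranges from the question are known; the sample frame is hypothetical, and further bin edges and labels would have to be appended for the remaining periods:

```python
import pandas as pd

# Hypothetical sample data; only the 'hour' column matters here.
basket_times = pd.DataFrame({"user_id": [11] * 5, "hour": [6, 13, 9, 6, 13]})

# With right=False the bins are half-open: [4, 8) and [8, 12).
bins = [4, 8, 12]
labels = ["morning", "before midday"]

# Hours outside every bin (13 in this sample) become NaN.
basket_times["periode"] = pd.cut(
    basket_times["hour"], bins=bins, labels=labels, right=False
)
print(basket_times)
```

The resulting column has a categorical dtype rather than plain strings, which can matter for memory use on a frame with tens of millions of rows.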

This post, python - Iterate over a Pandas DataFrame, apply conditions, and add a column, is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/35804933/
