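I have purchase data and want to label each purchase with a new column that says at what time of day it was made. For this I use the hour from each purchase's timestamp column.
The labels should work like this: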
hour 4 - 7 => 'morning'
hour 8 - 11 => 'before midday'
...
I have already extracted the hour from the timestamp. I now have a DataFrame with 50 million records that looks like this:
user_id timestamp hour
0 11 2015-08-21 06:42:44 6
1 11 2015-08-20 13:38:58 13
2 11 2015-08-20 13:37:47 13
3 11 2015-08-21 06:59:05 6
4 11 2015-08-20 13:15:21 13
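For readers who want to reproduce the frame above, the hour column can be derived from the timestamp with the `.dt` accessor. This is a minimal sketch, assuming the timestamp column has been parsed as datetimes; the rows are copied from the sample above.

```python
import pandas as pd

# Rebuild the five sample rows shown above.
basket_times = pd.DataFrame({
    "user_id": [11, 11, 11, 11, 11],
    "timestamp": pd.to_datetime([
        "2015-08-21 06:42:44",
        "2015-08-20 13:38:58",
        "2015-08-20 13:37:47",
        "2015-08-21 06:59:05",
        "2015-08-20 13:15:21",
    ]),
})

# Extract the hour of day (0-23) from each timestamp in one vectorized step.
basket_times["hour"] = basket_times["timestamp"].dt.hour
print(basket_times)
```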
My current approach is to use .iterrows() six times, each with a different condition:
for index, row in basket_times[(basket_times['hour'] >= 4) & (basket_times['hour'] < 8)].iterrows():
    basket_times['periode'] = 'morning'
Then:
for index, row in basket_times[(basket_times['hour'] >= 8) & (basket_times['hour'] < 12)].iterrows():
    basket_times['periode'] = 'before midday'
And so on.
However, just one of the six loops over the 50 million records already takes about an hour. Is there a better way?
Best answer
You can try loc with a boolean mask. I changed df for testing:
print(basket_times)
user_id timestamp hour
0 11 2015-08-21 06:42:44 6
1 11 2015-08-20 13:38:58 13
2 11 2015-08-20 09:37:47 9
3 11 2015-08-21 06:59:05 6
4 11 2015-08-20 13:15:21 13
#create boolean masks
morning = (basket_times['hour'] >= 4) & (basket_times['hour'] < 8)
# upper bound 12 so that hour 11 is 'before midday', as the question specifies (8 - 11)
beforemidday = (basket_times['hour'] >= 8) & (basket_times['hour'] < 12)
aftermidday = (basket_times['hour'] >= 12) & (basket_times['hour'] < 15)
print(morning)
0 True
1 False
2 False
3 True
4 False
Name: hour, dtype: bool
print(beforemidday)
0 False
1 False
2 True
3 False
4 False
Name: hour, dtype: bool
print(aftermidday)
0 False
1 True
2 False
3 False
4 True
Name: hour, dtype: bool
basket_times.loc[morning, 'periode'] = 'morning'
basket_times.loc[beforemidday, 'periode'] = 'before midday'
basket_times.loc[aftermidday, 'periode'] = 'after midday'
print(basket_times)
user_id timestamp hour periode
0 11 2015-08-21 06:42:44 6 morning
1 11 2015-08-20 13:38:58 13 after midday
2 11 2015-08-20 09:37:47 9 before midday
3 11 2015-08-21 06:59:05 6 morning
4 11 2015-08-20 13:15:21 13 after midday
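As an alternative that is not part of the original answer, the same labelling can be written as a single call to `numpy.select`, which takes a list of conditions and a matching list of values. The sketch below assumes the question's inclusive ranges (4 to 7, 8 to 11); the 12 to 14 range for 'after midday' and the 'other' fallback label are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Hours from the test frame above.
basket_times = pd.DataFrame({"hour": [6, 13, 9, 6, 13]})

# Series.between is inclusive on both ends by default.
conditions = [
    basket_times["hour"].between(4, 7),
    basket_times["hour"].between(8, 11),
    basket_times["hour"].between(12, 14),  # assumed range for 'after midday'
]
labels = ["morning", "before midday", "after midday"]

# Rows matching no condition get the (hypothetical) fallback label 'other'.
basket_times["periode"] = np.select(conditions, labels, default="other")
print(basket_times)
```

Giving `default` a string keeps all the values in the result the same type as the labels.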
Timings, len(df) = 500k:
In [87]: %timeit a(df)
10 loops, best of 3: 34 ms per loop
In [88]: %timeit b(df1)
1 loops, best of 3: 490 ms per loop
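`%timeit` is an IPython magic. Outside IPython, a comparable measurement can be taken with the standard-library `timeit` module. This is only a sketch: the function name `label_with_masks` and the 5,000-row frame are assumptions for illustration, and the numbers it prints will differ from the timings above.

```python
import timeit

import pandas as pd

# Small stand-in frame; the answer's benchmark used 500k rows.
df = pd.DataFrame({"hour": [6, 10, 9, 6, 10] * 1000})

def label_with_masks(frame):
    """Label rows with boolean masks and .loc, as in the answer."""
    morning = (frame["hour"] >= 4) & (frame["hour"] < 8)
    before_midday = (frame["hour"] >= 8) & (frame["hour"] < 12)
    frame.loc[morning, "periode"] = "morning"
    frame.loc[before_midday, "periode"] = "before midday"
    return frame

runs = 10
elapsed = timeit.timeit(lambda: label_with_masks(df), number=runs)
print(f"{elapsed / runs * 1000:.2f} ms per call")
```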
Test code:
import pandas as pd
import io

temp = """user_id;timestamp;hour
11;2015-08-21 06:42:44;6
11;2015-08-20 10:38:58;10
11;2015-08-20 09:37:47;9
11;2015-08-21 06:59:05;6
11;2015-08-20 10:15:21;10"""
# after testing, replace io.StringIO(temp) with the filename
df = pd.read_csv(io.StringIO(temp), sep=";", index_col=None, parse_dates=[1])
# repeat the 5 sample rows 100000 times to get 500k rows
df = pd.concat([df]*100000).reset_index(drop=True)
print(df.shape)
#(500000, 3)
df1 = df.copy()

def a(basket_times):
    # vectorized: boolean masks + loc
    morning = (basket_times['hour'] >= 4) & (basket_times['hour'] < 8)
    # < 12 so that hour 11 is labelled the same way as in b()
    beforemidday = (basket_times['hour'] >= 8) & (basket_times['hour'] < 12)
    basket_times.loc[morning, 'periode'] = 'morning'
    basket_times.loc[beforemidday, 'periode'] = 'before midday'
    return basket_times

def b(basket_times):
    # element-wise: map a Python function over the hour column
    def get_periode(hour):
        if 4 <= hour <= 7:
            return 'morning'
        elif 8 <= hour <= 11:
            return 'before midday'
    basket_times['periode'] = basket_times['hour'].map(get_periode)
    return basket_times

print(a(df))
print(b(df1))
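Since the labels are contiguous hour ranges, `pandas.cut` is another option worth trying. It is not part of the original answer and was not included in the timings above. A minimal sketch, assuming half-open bins [4, 8) and [8, 12) to match the question's ranges:

```python
import pandas as pd

# Hours from the test data above.
basket_times = pd.DataFrame({"hour": [6, 10, 9, 6, 10]})

# right=False makes each bin include its left edge and exclude its right edge,
# so [4, 8) covers hours 4-7 and [8, 12) covers hours 8-11.
basket_times["periode"] = pd.cut(
    basket_times["hour"],
    bins=[4, 8, 12],
    right=False,
    labels=["morning", "before midday"],
)
print(basket_times)
```

The result is a categorical column, and hours outside all bins become NaN. Adding more periods only requires extending `bins` and `labels`.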
Regarding python - Iterating over a pandas DataFrame, using conditions and adding a column, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/35804933/