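python - Optimizing code to find the median of the past 30 days of values for each row in a DataFrame

Tags: python pandas optimization dataframe time-series

I am looking for faster code that achieves the same result: for each row, compute the median of all data from the past 30 days. If that window contains fewer than 5 data points, return np.nan.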

import pandas as pd
import numpy as np
import datetime

def findPastVar(df, var='var' ,window=30, method='median'):
    # window= # of past days    
    def findPastVar_apply(row):
        pastVar = df[var].loc[(df['timestamp'] - row['timestamp'] < datetime.timedelta(days=0)) & (df['timestamp'] - row['timestamp'] > datetime.timedelta(days=-window))]
        if len(pastVar) < 5:
            return(np.nan)            
        if method == 'median':
            return(np.median(pastVar.values))
    df['past{}d_{}_median'.format(window,var)] = df.apply(findPastVar_apply,axis=1)
    return(df)


df = pd.DataFrame()
df['timestamp'] = pd.date_range('1/1/2011', periods=100, freq='D')
# pd.date_range already yields datetime64 values, so no astype() conversion is needed
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))

The data looks like this. In my real data there are time gaps, and a single day may contain several data points.

In [47]: df.head()
Out[47]: 
             timestamp       var
0  2011-01-01 00:00:00 -0.670695
1  2011-01-02 00:00:00  0.315148
2  2011-01-03 00:00:00 -0.717432
3  2011-01-04 00:00:00  2.904063
4  2011-01-05 00:00:00 -1.092813

Expected output:

In [55]: df.head(10)
Out[55]: 
             timestamp       var  past30d_var_median
0  2011-01-01 00:00:00 -0.670695                 NaN
1  2011-01-02 00:00:00  0.315148                 NaN
2  2011-01-03 00:00:00 -0.717432                 NaN
3  2011-01-04 00:00:00  2.904063                 NaN
4  2011-01-05 00:00:00 -1.092813                 NaN
5  2011-01-06 00:00:00 -2.676784           -0.670695
6  2011-01-07 00:00:00 -0.353425           -0.694063
7  2011-01-08 00:00:00 -0.223442           -0.670695
8  2011-01-09 00:00:00  0.162126           -0.512060
9  2011-01-10 00:00:00  0.633801           -0.353425

However, this is how long my current code takes:

In [49]: %timeit findPastVar(df)
1 loop, best of 3: 755 ms per loop

I need to run this on a large DataFrame from time to time, so I would like to optimize the code.

Any suggestions or comments are welcome.

Best answer

Time-aware rolling is a new feature in pandas 0.19. It copes with missing data, such as gaps in the timestamps.

Code:

print(df.rolling('30d', on='timestamp', min_periods=5)['var'].median())
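To make the behaviour of a time-based window concrete, here is a minimal sketch on a tiny hand-made frame. The data, the 3-day window, and min_periods=2 are illustrative assumptions, not values from the question. The point is that the window is measured in elapsed time rather than in rows, so a gap in the timestamps simply leaves fewer rows in the window.

```python
import pandas as pd

# Illustrative data: three consecutive days, a one-week gap, then two more days.
df = pd.DataFrame({
    'timestamp': pd.to_datetime(
        ['2011-01-01', '2011-01-02', '2011-01-03', '2011-01-10', '2011-01-11']),
    'var': [1.0, 2.0, 3.0, 4.0, 5.0],
})

# '3D' is a time offset, so each window covers (t - 3 days, t], current row included.
# min_periods=2 yields NaN whenever fewer than two observations fall in the window.
result = df.rolling('3D', on='timestamp', min_periods=2)['var'].median()

print(result.tolist())  # [nan, 1.5, 2.0, nan, 4.5]
```

The row for 2011-01-10 is NaN because, after the gap, it is the only observation in its own 3-day window.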

Test code:

import numpy as np
import pandas as pd

df = pd.DataFrame()
# date_range already yields datetime64 values, so no astype() conversion is needed
df['timestamp'] = pd.date_range('1/1/2011', periods=60, freq='D')
df['var'] = pd.Series(np.random.randn(len(df['timestamp'])))

# duplicate one sample (assign through df.loc to avoid chained assignment)
df.loc[50, 'timestamp'] = df.loc[51, 'timestamp']

# drop some data
df = df.drop(range(15, 50))

df['median'] = df.rolling(
    '30d', on='timestamp', min_periods=5)['var'].median()

Result:

              timestamp       var    median
0   2011-01-01 00:00:00 -0.639901       NaN
1   2011-01-02 00:00:00 -1.212541       NaN
2   2011-01-03 00:00:00  1.015730       NaN
3   2011-01-04 00:00:00 -0.203701       NaN
4   2011-01-05 00:00:00  0.319618 -0.203701
5   2011-01-06 00:00:00  1.272088  0.057958
6   2011-01-07 00:00:00  0.688965  0.319618
7   2011-01-08 00:00:00 -1.028438  0.057958
8   2011-01-09 00:00:00  1.418207  0.319618
9   2011-01-10 00:00:00  0.303839  0.311728
10  2011-01-11 00:00:00 -1.939277  0.303839
11  2011-01-12 00:00:00  1.052173  0.311728
12  2011-01-13 00:00:00  0.710270  0.319618
13  2011-01-14 00:00:00  1.080713  0.504291
14  2011-01-15 00:00:00  1.192859  0.688965
50  2011-02-21 00:00:00 -1.126879       NaN
51  2011-02-21 00:00:00  0.213635       NaN
52  2011-02-22 00:00:00 -1.357243       NaN
53  2011-02-23 00:00:00 -1.993216       NaN
54  2011-02-24 00:00:00  1.082374 -1.126879
55  2011-02-25 00:00:00  0.124840 -0.501019
56  2011-02-26 00:00:00 -0.136822 -0.136822
57  2011-02-27 00:00:00 -0.744386 -0.440604
58  2011-02-28 00:00:00 -1.960251 -0.744386
59  2011-03-01 00:00:00  0.041767 -0.440604
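One difference from the question is worth noting. In the result above each window includes the current row (row 4 already has a median), whereas the question's expected output uses only strictly earlier rows, so its first value appears at row 5. The sketch below is one possible way to reproduce the question's window, assuming a pandas version in which rolling accepts closed='neither' for time-based windows. The helper names past_median_rolling and past_median_bruteforce, the min_obs parameter, and the seeded random data are hypothetical and exist only for this check. The timestamps here are unique; behaviour with duplicate timestamps should be verified separately on real data.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    'timestamp': pd.date_range('2011-01-01', periods=100, freq='D'),
    'var': rng.standard_normal(100),
})

def past_median_rolling(df, var='var', window=30, min_obs=5):
    """Median of `var` over the open interval (t - window days, t)."""
    roll = df.rolling(f'{window}D', on='timestamp',
                      min_periods=min_obs, closed='neither')
    return roll[var].median()

def past_median_bruteforce(df, var='var', window=30, min_obs=5):
    """Row-by-row reference using the same filter as the question's apply version."""
    out = []
    for t in df['timestamp']:
        delta = df['timestamp'] - t
        mask = (delta < pd.Timedelta(0)) & (delta > pd.Timedelta(days=-window))
        past = df.loc[mask, var]
        out.append(np.median(past.to_numpy()) if len(past) >= min_obs else np.nan)
    return pd.Series(out, index=df.index)

fast = past_median_rolling(df)
slow = past_median_bruteforce(df)

# Both approaches should agree, including where the NaN values fall.
print(np.allclose(fast.to_numpy(), slow.to_numpy(), equal_nan=True))
```

With closed='neither' both window endpoints are excluded, which mirrors the strict inequalities in the question's filter. If the observation exactly 30 days back should count, closed='left' would be the variant to try.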

Regarding "python - Optimizing code to find the median of the past 30 days of values for each row in a DataFrame", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/43969723/
