python - Pandas: rolling correlation against a fixed patch for pattern matching

Tags: python numpy pandas pattern-matching correlation

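Happy New Year.

I am looking for a way to compute the correlation between a rolling window and a fixed window (a "patch") with pandas. The end goal is pattern matching.

From what I have read in the docs, and hopefully I am missing something, corr() and corrwith() do not let you lock one of the Series/DataFrames in place.

The best crummy solution I have come up with so far is listed below. When it runs on 50K rows with a patch of 30 samples, the processing time enters Ctrl+C territory.

I would greatly appreciate suggestions and alternatives. Thank you.

Please run the code below and it will be clear what I am trying to do: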

import numpy as np
import pandas as pd
from pandas import Series
from pandas import DataFrame

# Create test DataFrame df and a patch to be found.
n = 10
rng = pd.date_range('1/1/2000 00:00:00', periods=n, freq='5min')
df = DataFrame(np.random.rand(n, 1), columns=['a'], index=rng)

n = 4
rng = pd.date_range('1/1/2000 00:10:00', periods=n, freq='5min')
patch = DataFrame(np.arange(n), columns=['a'], index=rng)

print()
print('    *** Start corr example ***')
# To avoid the automatic alignment between df and patch,
# I need to reset the index.
patch.reset_index(inplace=True, drop=True)
# Cannot do:
#    df.reset_index(inplace=True, drop=True)

df['corr'] = np.nan

for i in range(df.shape[0]):
    window = df[i : i+patch.shape[0]]
    # If the slice has only two rows, I have a line between two points.
    # When I corr with two points in patch, I start getting
    # misleading values like 1 or -1.
    if window.shape[0] != patch.shape[0]:
        break
    else:
        # I need to reset_index for the window,
        # which is less efficient than doing it outside the
        # for loop, where the patch has its reset_index done.
        # If I did the df.reset_index up there,
        # I would still have automatic realignment, but
        # by index.
        window.reset_index(inplace=True, drop=True)

        # On top of the obvious inefficiency
        # of this method, I cannot just corrwith()
        # between specific columns in the dataframe;
        # corrwith() runs for all.
        # Alternatively I could create a new DataFrame
        # only with the needed columns:
        #     df_col = DataFrame(df.a)
        #     patch_col = DataFrame(patch.a)
        # Alternatively I could join the patch to
        # the df and shift it.
        corr = window.corrwith(patch)

        print()
        print('===========================')
        print('window:')
        print(window)
        print('---------------------------')
        print('patch:')
        print(patch)
        print('---------------------------')
        print('Corr for this window')
        print(corr)
        print('============================')

        # Assign by position; chained assignment (df['corr'][i] = ...)
        # is unreliable in current pandas.
        df.iloc[i, df.columns.get_loc('corr')] = corr['a']

print()
print('    *** End corr example ***')
print(" Please inspect var 'df'")
print()

Best answer

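Clearly, the heavy use of reset_index signals that we are fighting pandas' indexing and automatic alignment. Oh, how much easier things would be if we could just forget about the index! Indeed, that is what NumPy is for. (Generally speaking, use pandas when you need alignment or grouping by index, and use NumPy when doing computations on N-dimensional arrays.)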
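
To make the alignment issue concrete, here is a minimal sketch (the two Series and their index labels are made up for illustration). When the labels do not overlap, Series.corr has nothing to compare after alignment, whereas working on the underlying NumPy arrays compares the values purely by position:

```python
import numpy as np
import pandas as pd

# Two hypothetical Series with the same length but disjoint index labels.
a = pd.Series([1.0, 2.0, 3.0, 4.0], index=[0, 1, 2, 3])
b = pd.Series([2.0, 4.0, 6.0, 8.0], index=[10, 11, 12, 13])

# pandas aligns on the index first; no labels are shared, so the result is NaN.
aligned = a.corr(b)

# NumPy ignores the index and compares element by element.
positional = np.corrcoef(a.values, b.values)[0, 1]

print(aligned, positional)
```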

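Using NumPy will also make the computation much faster, because we can remove the for-loop and perform everything that was done inside it as a single calculation on a NumPy array of rolling windows.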

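We can look inside pandas/core/frame.py's DataFrame.corrwith to see how the calculation is done, then translate it into the corresponding code on NumPy arrays, adjusting as needed, since we want to do the calculation on the whole array of rolling windows at once rather than on one window at a time, while keeping the patch constant. (Note: the pandas corrwith method handles NaNs. To keep the code simpler, I assume there are no NaNs in the input.)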
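
As a sanity check on the arithmetic, the following sketch computes the Pearson correlation for a single window by hand (de-mean both sides, sum the products, divide by (m - 1) times the product of the sample standard deviations) and compares it with np.corrcoef. The window values are random and purely illustrative; this is not the pandas source code, only the same formula written out:

```python
import numpy as np

rng = np.random.default_rng(0)
window = rng.random(4)      # one illustrative window of data
patch = np.arange(4.0)      # the fixed pattern to match against
m = len(patch)

# De-mean both sides.
wdem = window - window.mean()
pdem = patch - patch.mean()

# Numerator: sum of products of deviations.
num = (wdem * pdem).sum()
# Denominator: (m - 1) * sample std of window * sample std of patch.
dom = (m - 1) * np.sqrt(window.var(ddof=1) * patch.var(ddof=1))

r_manual = num / dom
r_numpy = np.corrcoef(window, patch)[0, 1]
print(r_manual, r_numpy)
```

The vectorized version applies exactly these steps to every row of a 2-D array of windows by passing axis=1 to the reductions.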

import numpy as np
import pandas as pd
from pandas import Series
from pandas import DataFrame
import numpy.lib.stride_tricks as stride
np.random.seed(1)

n = 10
rng = pd.date_range('1/1/2000 00:00:00', periods=n, freq='5min')
df = DataFrame(np.random.rand(n, 1), columns=['a'], index=rng)

m = 4
rng = pd.date_range('1/1/2000 00:10:00', periods=m, freq='5min')
patch = DataFrame(np.arange(m), columns=['a'], index=rng)

def orig(df, patch):
    patch.reset_index(inplace=True, drop=True)

    df['corr'] = np.nan

    for i in range(df.shape[0]):
        window = df[i : i+patch.shape[0]]
        if window.shape[0] != patch.shape[0] :
            break
        else:
            window.reset_index(inplace=True, drop=True)
            corr = window.corrwith(patch)

            # Assign by position instead of chained assignment.
            df.iloc[i, df.columns.get_loc('corr')] = corr['a']

    return df

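def using_numpy(df, patch):
    left = df['a'].values
    right = patch['a'].values
    # Take the sizes from the inputs rather than from the globals n and m.
    n, m = len(left), len(right)

    # Build an (n-m+1, m) view whose rows are the rolling windows (no copy).
    itemsize = left.itemsize
    left = stride.as_strided(left, shape=(n-m+1, m), strides=(itemsize, itemsize))

    # De-mean each window and the patch.
    ldem = left - left.mean(axis=1)[:, None]
    rdem = right - right.mean()

    # Pearson correlation of every window with the patch.
    num = (ldem * rdem).sum(axis=1)
    dom = (m - 1) * np.sqrt(left.var(axis=1, ddof=1) * right.var(ddof=1))
    correl = num/dom

    # .ix was removed in pandas 1.0; assign by position instead.
    df['corr'] = np.nan
    df.iloc[:len(correl), df.columns.get_loc('corr')] = correl
    return df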

expected = orig(df.copy(), patch.copy())
result = using_numpy(df.copy(), patch.copy())

print(expected)
print(result)

This confirms that the values generated by orig and using_numpy are the same:

assert np.allclose(expected['corr'].dropna(), result['corr'].dropna())

Technical note:

To create the array full of rolling windows in a memory-friendly way, I used a striding trick I learned here.
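
For readers on NumPy 1.20 or later, sliding_window_view builds the same kind of window array without hand-computed strides. This is an alternative to the as_strided call used in this answer, not part of it; the array x below is a made-up example. The sketch checks that both approaches yield identical windows and that no data is copied:

```python
import numpy as np
from numpy.lib.stride_tricks import as_strided, sliding_window_view

x = np.arange(10, dtype=float)   # illustrative 1-D data
m = 4                            # window length

# Manual striding: each row starts one element after the previous one.
step = x.strides[0]
manual = as_strided(x, shape=(len(x) - m + 1, m), strides=(step, step))

# Same result; the view is read-only by default and bounds are checked for us.
safe = sliding_window_view(x, m)

print(safe.shape)
print(np.array_equal(manual, safe), np.shares_memory(safe, x))
```

as_strided performs no bounds checking, so a wrong shape or stride can read memory outside the array; sliding_window_view avoids that risk.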


Here is a benchmark, using n, m = 1000, 4 (lots of rows and a tiny patch, to generate lots of windows):

In [77]: %timeit orig(df.copy(), patch.copy())
1 loops, best of 3: 3.56 s per loop

In [78]: %timeit using_numpy(df.copy(), patch.copy())
1000 loops, best of 3: 1.35 ms per loop

-- a 2600x speedup.

Source: the original question on Stack Overflow: https://stackoverflow.com/questions/27733482/