python - Why does this Python logical indexing use so much memory?

I have a one-dimensional time series of length 1e8 (100,000,000 elements). Here is a link to the data I am using on Dropbox. (The file is 382 MB.)

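Update

According to memory profiling, the error occurs on the line

data[absolute(data-dc) < m*std(data)] = dc

More specifically, the absolute(data-dc) operation exhausts all the memory. data is as described above, and dc is a constant. Perhaps this is a subtle syntax error?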
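To make the cost of that expression concrete, here is a minimal sketch of the temporaries it creates. It assumes `data` holds int16 samples (as in the script's `fromfile` call) and uses a tiny made-up array in place of the real file. Every intermediate in `absolute(data-dc) < m*std(data)` is a full-length array, and the subtraction promotes int16 to float64.

```python
import numpy as np

# Tiny stand-in for the int16 samples read from the file (values are made up).
data = np.array([0, 5, -3, 120, 7, 6], dtype=np.int16)
dc = np.median(data)              # np.median returns a float64 scalar

diff = data - dc                  # int16 array minus float64 scalar: float64 array
dist = np.absolute(diff)          # a second float64 array of the same length
mask = dist < 3 * np.std(data)    # boolean array, one byte per element

# Bytes per element for the input and for each temporary.
print(data.itemsize, diff.itemsize, dist.itemsize, mask.itemsize)
```

At 1e8 elements that is roughly 200 MB for the int16 input, 800 MB for each float64 temporary, and 100 MB for the mask. These temporaries may not account for the full 4 GB reported, so the figures are only a lower bound on what the line needs.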

I want to remove outliers and artifacts from it and replace those values with the median. I tried to do this with the following functions.

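 from sys import argv

 from numpy import *
 from scipy.io import savemat
 from scipy.signal import butter, lfilter
 from scipy.stats import scoreatpercentile

 def reject_outliers(data, dc, m=3):
     # Zero-filled samples are set to the median first.
     data[data == 0] = dc
     # NOTE: as written, "<" replaces the samples that lie within m standard
     # deviations of dc; replacing the outliers instead would need ">".
     data[absolute(data - dc) < m*std(data)] = dc
     return data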

 def butter_bandpass(lowcut,highcut,fs,order=8):
    nyq = 0.5*fs
    low = lowcut/nyq
    high = highcut/nyq

    b,a= butter(order, [low, high], btype='band')
    return b,a

 def butter_bandpass_filter(data,lowcut,highcut,fs,order=8):
    b,a = butter_bandpass(lowcut,highcut,fs,order=order)
    return lfilter(b,a,data) 

 OFFSET = 432  # byte offset in the file at which the int16 samples start
 filename = argv[1]
 outname = argv[2]

 print('Opening ' + filename)
 with open(filename, 'rb') as stream:
     stream.seek(OFFSET)
     data = fromfile(stream, dtype='int16')
 print('Removing Artifacts, accounting for zero-filling')
 dc = median(data)
 data = reject_outliers(data, dc)

 threshold = scoreatpercentile(absolute(data), 85)
 print('Filtering and Detrending')
 data = butter_bandpass_filter(data, 300, 7000, 20000)
 savemat(outname + '.mat', mdict={'data': data})

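Calling this on a single file uses 4 GB of RAM and 3 GB of virtual memory. I am sure the culprit is the second line of this function, because I stepped through the script I wrote and it always hangs at this part. I can even watch (in Finder on OS X) the available disk space drop sharply every second.

The time series is not long enough to explain this. What is wrong with the second line of reject_outliers?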

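Best answer

I just generated 100,000,000 random floats and did the same indexing you describe. Total memory usage stayed well under 1 GB. What else is in your code that you haven't told us about? Try running your code through the excellent memory_profiler.

Edit: added the code and the memory_profiler output: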

from numpy.random import uniform
import numpy

@profile
def go(m=3):
    data = uniform(size=100000000)
    dc = numpy.median(data)
    data[numpy.absolute(data-dc) < m*numpy.std(data)] = dc
    return data

if __name__ == '__main__':
    go()

Output:

Filename: example.py

Line #    Mem usage    Increment   Line Contents
================================================
     3                             @profile
     4     15.89 MB      0.00 MB   def go(m=3):
     5    778.84 MB    762.95 MB    data = uniform(size=100000000)
     6    778.91 MB      0.06 MB    dc = numpy.median(data)
     7    874.34 MB     95.44 MB    data[numpy.absolute(data-dc) < m*numpy.std(data)] = dc
     8    874.34 MB      0.00 MB    return data

As you can see, 100M floats don't take up that much memory.
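If the full-length temporaries are still too large on a given machine, one option is to apply the same replacement in fixed-size chunks so that only chunk-sized temporaries exist at any time. The following is a sketch under stated assumptions, not the method used in the question or the answer: `replace_chunked` and `chunk_size` are hypothetical names, the comparison mirrors the original line (including its `<`), and the standard deviation is accumulated from running sums, so it can differ from `numpy.std` in the last few digits.

```python
import numpy as np

def replace_chunked(data, dc, m=3, chunk_size=1_000_000):
    """Chunked form of: data[np.absolute(data - dc) < m * np.std(data)] = dc"""
    n = data.size

    # Pass 1: accumulate sum and sum of squares in float64, one chunk at a time.
    total = 0.0
    total_sq = 0.0
    for start in range(0, n, chunk_size):
        block = data[start:start + chunk_size].astype(np.float64)
        total += block.sum()
        total_sq += np.dot(block, block)
    mean = total / n
    std = np.sqrt(max(total_sq / n - mean * mean, 0.0))
    threshold = m * std

    # Pass 2: build the mask and assign per chunk; slices are views, so the
    # assignment modifies `data` in place.
    for start in range(0, n, chunk_size):
        view = data[start:start + chunk_size]
        mask = np.absolute(view - dc) < threshold
        view[mask] = dc
    return data

# Usage on a small made-up int16 array.
samples = np.array([0, 5, -3, 120, 7, 6, -90, 4], dtype=np.int16)
cleaned = replace_chunked(samples.copy(), np.median(samples), m=1, chunk_size=3)
```

The threshold is computed before any sample is modified, which matches the one-line version, where the right-hand side is evaluated in full before the assignment. Assigning the float median into an int16 array truncates it, exactly as the original line does.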

The original question on Stack Overflow: https://stackoverflow.com/questions/14688510/
