I have a 10GB CSV file (it does not fit in RAM) in the following format:
Col1,Col2,Col3,Col4
1,2,3,4
34,256,348,
12,,3,4
So we have columns and missing values, and I want to compute the means of columns 2 and 3. With plain Python, I would do something like this:
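def means(rng):
    s, e = rng
    with open("data.csv") as fd:
        # The first line holds the column titles
        titles = next(fd).strip().split(",")
        print("Means for", ",".join(titles[s:e]))
        ret = [0] * (e - s)
        rows = 0
        for line in fd:
            rows += 1
            vals = line.strip().split(",")[s:e]
            for i, v in enumerate(vals):
                try:
                    ret[i] += int(v)
                except ValueError:
                    pass  # missing value: adds nothing to the sum
    # Divide by the total number of rows, so a missing value counts as 0
    return [float(total) / rows for total in ret]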
But I suspect there is a faster way to do this with numpy (I am still a novice).
Best answer
Pandas is your best friend:
from pandas import read_csv

# Load 10000 rows at a time; you can play with this number to get better
# performance on your machine
my_data = read_csv("data.csv", chunksize=10000)
total = 0
count = 0
for chunk in my_data:
    # If you want to exclude NAs from the average, remove the next line
    chunk = chunk.fillna(0.0)
    total += chunk.sum(skipna=True)
    count += chunk.count()
avg = total / count
col1_avg = avg["Col1"]
# ... etc. ...
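Since the question only asks about columns 2 and 3, the same chunked idea can be narrowed to those columns. The following is a minimal sketch rather than a definitive implementation: the helper name chunked_means and its skip_missing flag are made up for illustration, and the sample rows from the question stand in for the real data.csv via an in-memory buffer.

```python
import io

import pandas as pd


def chunked_means(source, columns, chunksize=10000, skip_missing=True):
    """Return per-column means, reading `source` one chunk at a time.

    `chunked_means` and `skip_missing` are hypothetical names used only
    for this sketch; they are not part of pandas.
    """
    total = pd.Series(0.0, index=columns)
    count = pd.Series(0, index=columns)
    # usecols makes pandas parse only the requested columns
    for chunk in pd.read_csv(source, usecols=columns, chunksize=chunksize):
        chunk = chunk[columns]
        total += chunk.sum()        # sum() skips missing cells by default
        if skip_missing:
            count += chunk.count()  # count() counts only non-missing cells
        else:
            count += len(chunk)     # every row counts; a missing cell acts as 0
    return total / count


# The sample rows from the question, used in place of the 10GB file
sample = io.StringIO("Col1,Col2,Col3,Col4\n1,2,3,4\n34,256,348,\n12,,3,4\n")
print(chunked_means(sample, ["Col2", "Col3"], chunksize=2))
```

On the sample rows this gives 129.0 for Col2 and 118.0 for Col3 when missing cells are skipped. With skip_missing=False, Col2 becomes 86.0, which matches dividing by the total row count as the plain Python version in the question does.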
Regarding python - calculating means from a CSV with Python's numpy, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/25771916/