python - Calculating means from a CSV with Python's numpy

I have a 10GB file (it doesn't fit in RAM) in this format:

Col1,Col2,Col3,Col4
1,2,3,4
34,256,348,
12,,3,4

So we have columns and missing values, and I want to calculate the means of columns 2 and 3. With plain Python, I would do something like this:

def means(rng):
    s, e = rng

    with open("data.csv") as fd:
        title = next(fd)
        titles = title.strip().split(',')
        print("Means for", ",".join(titles[s:e]))

        ret = [0] * (e - s)
        rows = 0
        for line in fd:
            rows += 1
            vals = line.strip().split(",")[s:e]
            for i, v in enumerate(vals):
                try:
                    ret[i] += int(v)
                except ValueError:
                    # Empty (missing) field: adds nothing to the sum
                    pass

        # Missing values still count as rows in the denominator
        return [total / rows for total in ret]

But I suspect there is a faster way to do this with numpy (I'm still a novice at it).

Best answer

Pandas is your best friend:

import pandas as pd

# Load 10000 rows at a time; you can play with this number to get better
# performance on your machine
my_data = pd.read_csv("data.csv", chunksize=10000)

total = 0
count = 0

for chunk in my_data:
    # If you want to exclude NAs from the average, remove the next line
    chunk = chunk.fillna(0.0)

    total += chunk.sum(skipna=True)
    count += chunk.count()

avg = total / count

col1_avg = avg["Col1"]
# ... etc. ...
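
Since the question only asks about columns 2 and 3, the same chunked approach can be narrowed with the `usecols` parameter of `read_csv`, which avoids parsing the other columns. The following is a minimal sketch, not part of the original answer: the function name `chunked_means`, the `skip_missing` flag, and the in-memory `SAMPLE` string (standing in for the real 10GB file) are hypothetical names introduced for illustration.

```python
import io

import pandas as pd

# Tiny stand-in for the real file, using the rows from the question
SAMPLE = "Col1,Col2,Col3,Col4\n1,2,3,4\n34,256,348,\n12,,3,4\n"


def chunked_means(source, columns, chunksize=10000, skip_missing=True):
    """Mean of the selected columns, reading `source` chunk by chunk.

    skip_missing=True leaves empty cells out of the average;
    skip_missing=False counts them as 0, like the fillna(0.0) line above.
    """
    total = pd.Series(0.0, index=columns)
    count = pd.Series(0, index=columns)

    # usecols makes pandas parse only the columns we need
    for chunk in pd.read_csv(source, usecols=columns, chunksize=chunksize):
        chunk = chunk[columns]
        if not skip_missing:
            chunk = chunk.fillna(0.0)
        total += chunk.sum()    # NaN cells are skipped by default
        count += chunk.count()  # number of non-NaN cells per column

    return total / count


means = chunked_means(io.StringIO(SAMPLE), ["Col2", "Col3"], chunksize=2)
print(means)  # Col2 -> 129.0, Col3 -> 118.0
```

For the real file, pass the path (for example `"data.csv"`) instead of the `io.StringIO` object and a larger `chunksize`. With `skip_missing=False` the empty `Col2` cell counts as a zero, so the `Col2` mean for the sample drops to 86.0.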

Regarding "python - Calculating means from a CSV with Python's numpy", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/25771916/
