python - Reading random rows of a large csv file, python, pandas

Could you help me? I am having trouble reading random rows from a large csv file using pandas 0.18.1 and Python 2.7.10 on Windows (8 GB RAM).

In "Read a small random sample from a big CSV file into a Python data frame" I saw an approach, but on my PC it consumes a lot of memory. This is the relevant part of the code:

import random as rnd
import pandas as pd

n = 100  # number of rows considered
s = 10   # number of rows to keep
skip = sorted(rnd.sample(xrange(1, n), n - s))  # skip n-s random rows from *.csv
data = pd.read_csv(path, usecols=['Col1', 'Col2'],
                   dtype={'Col1': 'int32', 'Col2': 'int32'}, skiprows=skip)

So if I want to take random rows from the file considering not just 100 rows but 100 000, it becomes hard, whereas taking rows from the file non-randomly works almost fine:

skip = xrange(1, 100001)  # skip the first 100000 data rows, keep the header (row 0)
data = pd.read_csv(path, usecols=['Col1', 'Col2'],
                   dtype={'Col1': 'int32', 'Col2': 'int32'}, skiprows=skip, nrows=10000)

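So the question is: how can I read a large number of random rows from a large csv file with pandas, given that I cannot read the entire csv file, even in chunks, and I am interested specifically in random rows? Thanks.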

Best answer

If memory is the biggest issue, one possible solution is to read the file in chunks and randomly select rows from each chunk:

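import random
import pandas as pd

n = 100       # total rows considered (not used by this approach)
s = 10        # number of rows to sample
factor = 1    # should be an integer; a larger factor gives smaller chunks
chunksize = max(1, s // factor)

reader = pd.read_csv(path, usecols=['Col1', 'Col2'],
                     dtype={'Col1': 'int32', 'Col2': 'int32'},
                     chunksize=chunksize)

out = []
tot = 0
for df in reader:
    # draw a random number of rows, never more than the chunk actually holds
    nsample = min(random.randint(1, chunksize), len(df))
    tot += nsample
    if tot > s:
        # trim the last draw so that exactly s rows are collected
        nsample = s - (tot - nsample)
    out.append(df.sample(nsample))
    if tot >= s:
        break

data = pd.concat(out)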

You can use factor to control the size of the chunks.
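
Because the loop above stops as soon as s rows have been collected, its sample comes only from the first chunks of the file. If rows from anywhere in the file are wanted, one alternative (not part of the answer above) is reservoir sampling, which reads the file once and keeps only s rows in memory at any time. The following is a minimal sketch, assuming Python 3 and a CSV with a single header line; the function name reservoir_sample_csv and the demo data are hypothetical:

```python
import csv
import io
import random


def reservoir_sample_csv(lines, k, seed=None):
    """Return (header, rows) with up to k data rows chosen uniformly at random.

    `lines` is any iterable of CSV text lines, such as an open file object.
    The file is read once and at most k rows are held in memory.
    """
    rng = random.Random(seed)
    reader = csv.reader(lines)
    header = next(reader)              # assumes the first line is the header
    reservoir = []
    for i, row in enumerate(reader):
        if i < k:
            reservoir.append(row)      # fill the reservoir with the first k rows
        else:
            j = rng.randint(0, i)      # inclusive on both ends
            if j < k:
                reservoir[j] = row     # keep this row with probability k / (i + 1)
    return header, reservoir


# Demo on an in-memory CSV standing in for a large file (made-up data).
demo = io.StringIO(
    "Col1,Col2\n" + "".join("%d,%d\n" % (i, i * 2) for i in range(1000))
)
header, rows = reservoir_sample_csv(demo, k=10, seed=0)
```

The returned rows are lists of strings. If a DataFrame is needed, something like pd.DataFrame(rows, columns=header) followed by astype should convert them; for a real file, pass the opened file object (for example open(path, newline='')) instead of the in-memory demo.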

Source question on Stack Overflow: https://stackoverflow.com/questions/38233719/
