python-3.x - Sorting data from a large text file and converting it into an array

I have a text file that contains some data.

#this is a sample file
# data can be used for practice
total number = 5

t=1
dx= 10 10
dy= 10 10
dz= 10 10

1 0.1 0.2 0.3
2 0.3 0.4 0.1
3 0.5 0.6 0.9
4 0.9 0.7 0.6
5 0.4 0.2 0.1

t=2
dx= 10 10
dy= 10 10
dz= 10 10

1 0.11 0.25 0.32
2 0.31 0.44 0.12
3 0.51 0.63 0.92
4 0.92 0.72 0.63
5 0.43 0.21 0.14

t=3
dx= 10 10
dy= 10 10
dz= 10 10

1 0.21 0.15 0.32
2 0.41 0.34 0.12
3 0.21 0.43 0.92
4 0.12 0.62 0.63
5 0.33 0.51 0.14

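My goal is to read the file, find the rows whose first column is 1 or 5, and store them as multidimensional arrays. For example, for 1 the result is a1=[[0.1, 0.2, 0.3],[0.11, 0.25, 0.32],[0.21, 0.15, 0.32]], and for 5 it is a5=[[0.4, 0.2, 0.1],[0.43, 0.21, 0.14],[0.33, 0.51, 0.14]].

This is the code I wrote: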

import numpy as np
with open("position.txt","r") as data:
    lines = data.read().split(sep='\n')
    a1 = []
    a5 = []
    for line in lines:

        if(line.startswith('1')):
            a1.append(list(map(float, line.split()[1:])))
        elif (line.startswith('5')):
            a5.append(list(map(float, line.split()[1:])))
a1=np.array(a1)
a5=np.array(a5)

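My code works perfectly with the sample file shown above, but my real file is very large (2 GB). Processing it with my code raises a memory error. How can I solve this? My workstation has 96 GB of RAM.

Best answer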

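There are several things to improve:

Do not load the whole text file into memory (this saves 2 GB).

Use numpy arrays rather than lists to store the numeric data.

Use single-precision floats instead of doubles.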

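To do this, you need to estimate the size of the arrays. A 2 GB input appears to hold about 16 million records. With 32-bit floats, two arrays and three values per row, that takes roughly 16e6 * 2 * 3 * 4 = 384 MB of memory. By the same estimate, a 500 GB input would need about 100 GB (assuming the same 120-byte record size).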
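The estimate can be reproduced with a short calculation. This is only a sketch: the function name `estimate_array_bytes` and its defaults are illustrative, and it assumes one 120-byte record per time step that contributes one row of three float32 values to each of the two arrays.

```python
def estimate_array_bytes(file_bytes, record_bytes=120, arrays=2, cols=3, itemsize=4):
    """Rough number of bytes needed for the output arrays.

    record_bytes is the assumed size of one time-step block in the text file;
    each block yields one row of `cols` values in each of the `arrays` arrays.
    """
    records = file_bytes // record_bytes
    return records * arrays * cols * itemsize


two_gb = 2 * 10**9
print(estimate_array_bytes(two_gb) / 1e6, "MB")
```

If the real record size differs from the sample, measuring a few blocks of the actual file gives a better value for `record_bytes`.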

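import numpy as np

nmax = int(20e+6)  # upper bound on rows per array, with a bit of safety margin

# preallocate float32 arrays instead of growing Python lists
a1 = np.zeros((nmax, 3), dtype=np.float32)
a5 = np.zeros((nmax, 3), dtype=np.float32)
n1 = n5 = 0  # number of rows filled so far

with open("position.txt", "r") as data:
    for line in data:  # iterate line by line; the file is never held in memory as a whole
        if '0' <= line[0] <= '9':  # data rows start with a digit
            values = np.fromstring(line, dtype=np.float32, sep=' ')
            if values[0] == 1:
                a1[n1] = values[1:]
                n1 += 1
            elif values[0] == 5:
                a5[n5] = values[1:]
                n5 += 1

# trim to the rows actually filled (no memory is released)
a1 = a1[:n1]
a5 = a5[:n5]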

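Note that comparing floats with == is generally not recommended, but for values[0] == 1 we know the value is a small integer, which a float represents exactly.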
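A quick check makes the point concrete. This is a small illustration rather than part of the loading code; the variable names are made up for the example.

```python
import numpy as np

# Small integers stored as float32 compare equal to the integers they came from.
ids = np.array([1, 5, 1000], dtype=np.float32)
small_ok = bool(ids[0] == 1) and bool(ids[1] == 5)

# float32 has a 24-bit significand, so 2**24 + 1 cannot be stored exactly
# and is rounded to a neighbouring representable value.
rounded = int(np.float32(2**24 + 1))

# Typical decimal fractions are not exact, which is why == is usually avoided.
fraction_ok = (0.1 + 0.2 == 0.3)

print(small_ok, rounded == 2**24 + 1, fraction_ok)
```

Row identifiers in the file stay far below 2**24 only if the particle count does; for larger identifiers, parsing the first field as an integer would be the safer choice.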

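If you need to save memory (for example, if you want to run several Python processes in parallel), you can initialise the arrays as disk-mapped arrays, like this: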

a1 = np.memmap('data_1.bin', dtype=np.float32, mode='w+', shape=(nmax, 3))
a5 = np.memmap('data_5.bin', dtype=np.float32, mode='w+', shape=(nmax, 3))

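With memmap, the files contain no metadata about the data type or array shape (nor any human-readable description). I suggest converting the data to npz format in a separate job; do not run these jobs in parallel, because they load the whole array into memory.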

n = 3  # number of rows actually written (3 for the sample file)
a1m = np.memmap('data_1.bin', dtype=np.float32, shape=(n, 3))
a5m = np.memmap('data_5.bin', dtype=np.float32, shape=(n, 3))
np.savez('data.npz', a1=a1m, a5=a5m, info='This is test data from SO')

You can load them like this:

data = np.load('data.npz')
a1 = data['a1']
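Put together, the memmap-to-npz workflow might look like the following sketch. It uses a temporary directory and a tiny `nmax` so that it can run anywhere; the paths, the three sample rows, and the variable names are assumptions made for the demonstration.

```python
import os
import tempfile

import numpy as np

nmax = 10                                  # preallocated rows (tiny, for the demo)
workdir = tempfile.mkdtemp()               # scratch directory, not cleaned up here
raw_path = os.path.join(workdir, "data_1.bin")

# Job 1: write rows into a disk-backed array and remember how many were filled.
a1 = np.memmap(raw_path, dtype=np.float32, mode="w+", shape=(nmax, 3))
rows = [[0.1, 0.2, 0.3], [0.11, 0.25, 0.32], [0.21, 0.15, 0.32]]
n1 = 0
for row in rows:
    a1[n1] = row
    n1 += 1
a1.flush()                                 # push pending writes to the file
del a1

# Job 2: reopen only the filled rows; dtype and shape must be supplied again
# because the raw file stores neither.
a1m = np.memmap(raw_path, dtype=np.float32, mode="r", shape=(n1, 3))
npz_path = os.path.join(workdir, "data.npz")
np.savez(npz_path, a1=a1m)

with np.load(npz_path) as data:
    a1_loaded = data["a1"]

print(a1_loaded.shape, a1_loaded.dtype)
```

The row count `n1` has to be carried from the first job to the second (for instance in a small text file), since the raw binary file is still `nmax` rows long.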

Depending on the trade-off between disk space, processing time and memory, you can compress the data.

import zlib
zlib.Z_DEFAULT_COMPRESSION = 3 # faster for lower values
np.savez_compressed('data.npz', a1=a1m, a5=a5m, info='...')

If float32 offers more precision than you need, you can truncate the binary representation for better compression.
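One way to truncate the binary representation is to zero the low bits of the float32 significand, which leaves long runs of zero bits for the compressor. The sketch below is an assumption about what such truncation could look like, not the method the answer links to; `truncate_mantissa` and `keep_bits` are hypothetical names.

```python
import numpy as np


def truncate_mantissa(a, keep_bits):
    """Return a float32 copy of `a` keeping only `keep_bits` of the 23 fraction bits."""
    if not 0 <= keep_bits <= 23:
        raise ValueError("keep_bits must be between 0 and 23")
    drop = 23 - keep_bits
    mask = np.uint32((0xFFFFFFFF << drop) & 0xFFFFFFFF)   # ones above the dropped bits
    bits = np.ascontiguousarray(a, dtype=np.float32).view(np.uint32)
    return (bits & mask).view(np.float32)                 # new array; input untouched


x = np.array([0.1, 0.2, 0.3, 0.43, 0.92], dtype=np.float32)
t = truncate_mantissa(x, keep_bits=11)
print(t)
```

Keeping 11 fraction bits bounds the relative error by 2**-11 (about 0.05%), which may or may not be acceptable depending on how the positions are used later.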

If you prefer memory-mapped files, you can save in npy format:

np.save('data_1.npy', a1m)
a1 = np.load('data_1.npy', mmap_mode='r+')

However, you then cannot use compression, and you end up with many files that carry no metadata (other than array size and data type).
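The difference from a raw memmap file is that the npy header stores the shape and dtype, so neither has to be repeated when reopening the file. A minimal sketch, with a temporary path and sample values chosen only for illustration:

```python
import os
import tempfile

import numpy as np

path = os.path.join(tempfile.mkdtemp(), "data_5.npy")   # scratch location for the demo
a5 = np.array([[0.4, 0.2, 0.1], [0.43, 0.21, 0.14]], dtype=np.float32)
np.save(path, a5)

# No dtype or shape arguments: both are read from the npy header.
a5_view = np.load(path, mmap_mode="r")
print(type(a5_view).__name__, a5_view.dtype, a5_view.shape)
```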

Regarding "python-3.x - Sorting data from a large text file and converting it into an array", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/62309988/
