python - Transpose a large array without loading it into memory

I have a large gzipped file of 0s and 1s (5000 columns × 1M rows):

0 1 1 0 0 0 1 1 1....(×5000)
0 0 0 1 0 1 1 0 0
....(×1M)

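I want to transpose it, but numpy and similar approaches load the whole table into RAM, and I only have 6GB available.

For this reason, I want a method that writes each transposed row to an open file instead of holding it in RAM. I came up with the following code: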

import gzip

with open("output.txt", "w") as out:

    with gzip.open("file.txt", "rt") as file:

        number_of_columns = len(file.readline().split())

        # iterate over number of columns (~5000)
        for column in range(number_of_columns):

            # in each iteration, go to the top line to start again
            file.seek(0)

            # initiate list storing the ith column's elements that will form the transposed column
            transposed_column = []

            # iterate over lines (~1M), storing the ith element in the list
            for line in file:
                transposed_column.append(line.split()[column])

            # write the transposed column as a line to an existing file and back again
            out.write(" ".join(transposed_column) + "\n")

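However, this is very slow. Can anyone suggest another solution? Is there any way to append a list as a column (instead of as a row) to an existing open file? (pseudocode):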

with open("output.txt", w) as out:
    with gzip.open("file.txt", rt) as file:
        for line in file:
            transposed_line = line.transpose()
            out.write(transposed_line, as.column)

Update

user7813790's answer led me to this code:

import numpy as np
import random


# create example array and write to file

with open("array.txt", "w") as out:

    num_columns = 8
    num_lines = 24

    for i in range(num_lines):
        line = []
        for column in range(num_columns):
            line.append(str(random.choice([0,1])))
        out.write(" ".join(line) + "\n")


# iterate over chunks of dimensions num_columns×num_columns, transpose them, and append to file

with open("array.txt", "r") as array:

    with open("transposed_array.txt", "w") as out:

        for chunk_start in range(0, num_lines, num_columns):

            # get chunk and transpose
            chunk = np.genfromtxt(array, max_rows=num_columns, dtype=int).T
            # write out chunk
            out.seek(chunk_start+num_columns, 0)
            np.savetxt(out, chunk, fmt="%s", delimiter=' ', newline='\n')

It takes a matrix like this:

0 0 0 1 1 0 0 0
0 1 1 0 1 1 0 1
0 1 1 0 1 1 0 0
1 0 0 0 0 1 0 1
1 1 0 0 0 1 0 1
0 0 1 1 0 0 1 0
0 0 1 1 1 1 1 0
1 1 1 1 1 0 1 1
0 1 1 0 1 1 1 0
1 1 0 1 1 0 0 0
1 1 0 1 1 0 1 1
1 0 0 1 1 0 1 0
0 1 0 1 0 1 0 0
0 0 1 0 0 1 0 0
1 1 1 0 0 1 1 1
1 0 0 0 0 0 0 0
0 1 1 1 1 1 1 1
1 1 1 1 0 1 0 1
1 0 1 1 1 0 0 0
0 1 0 1 1 1 1 1
1 1 1 1 1 1 0 1
0 0 1 1 0 1 1 1
0 1 1 0 1 1 0 1
0 0 1 0 1 1 0 1

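and iterates over 2D blocks whose two dimensions both equal the number of columns (8 in this case), transposing them and appending them to the output file.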

First block transposed:

[[0 0 0 1 1 0 0 1]
 [0 1 1 0 1 0 0 1]
 [0 1 1 0 0 1 1 1]
 [1 0 0 0 0 1 1 1]
 [1 1 1 0 0 0 1 1]
 [0 1 1 1 1 0 1 0]
 [0 0 0 0 0 1 1 1]
 [0 1 0 1 1 0 0 1]]

Second block transposed:

[[0 1 1 1 0 0 1 1]
 [1 1 1 0 1 0 1 0]
 [1 0 0 0 0 1 1 0]
 [0 1 1 1 1 0 0 0]
 [1 1 1 1 0 0 0 0]
 [1 0 0 0 1 1 1 0]
 [1 0 1 1 0 0 1 0]
 [0 0 1 0 0 0 1 0]]

And so on.

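I am trying to use out.seek() to append each new block as columns to the output file. As far as I understand, seek() takes the offset from the beginning of the file (i.e. the column) as its first argument, and 0 as the second argument means starting from the first line again. So I guessed that the following line would do the trick: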

out.seek(chunk_start+num_columns, 0)

However, it does not continue at that offset on the following lines. It also adds n = num_columns spaces at the beginning of the first line. Output:

    0 0 0 1 0 1 1 1 0 1 1 0 1 0 0 0
1 1 0 1 1 0 1 0
1 1 1 0 1 1 1 1
1 1 1 1 1 1 0 0
1 0 1 1 1 0 1 1
1 1 0 1 1 1 1 1
1 0 0 1 0 1 0 0
1 1 0 1 1 1 1 1

Any insights on how to use seek() properly for this task? That is, to generate this:

0 0 0 1 1 0 0 1 0 1 1 1 0 0 1 1 0 1 1 0 1 0 0 0
0 1 1 0 1 0 0 1 1 1 1 0 1 0 1 0 1 1 0 1 1 0 1 0
0 1 1 0 0 1 1 1 1 0 0 0 0 1 1 0 1 1 1 0 1 1 1 1
1 0 0 0 0 1 1 1 0 1 1 1 1 0 0 0 1 1 1 1 1 1 0 0
1 1 1 0 0 0 1 1 1 1 1 1 0 0 0 0 1 0 1 1 1 0 1 1
0 1 1 1 1 0 1 0 1 0 0 0 1 1 1 0 1 1 0 1 1 1 1 1
0 0 0 0 0 1 1 1 1 0 1 1 0 0 1 0 1 0 0 1 0 1 0 0
0 1 0 1 1 0 0 1 0 0 1 0 0 0 1 0 1 1 0 1 1 1 1 1

Note that this is just a dummy test matrix; the actual matrix is 5008 columns × >1M rows.

Update 2

I have figured out how to make it work, and it can also use blocks of any dimensions.

import numpy as np
import random


# create example array and write to file

num_columns = 4
num_lines = 8

with open("array.txt", "w") as out:
    for i in range(num_lines):
        line = []
        for column in range(num_columns):
            line.append(str(random.choice([0,1])))
        out.write(" ".join(line) + "\n")


# iterate over chunks of dimensions num_columns×chunk_length, transpose them, and append to file

chunk_length = 7

with open("array.txt", "r") as array:

    with open("transposed_array.txt", "w") as out:

        for chunk_start in range(0, num_lines, chunk_length):

            # get chunk and transpose
            chunk = np.genfromtxt(array, max_rows=chunk_length, dtype=str).T

            # write out chunk
            empty_line = 2 * (num_lines - (chunk_length + chunk_start))

            for i, line in enumerate(chunk):
                new_pos = 2 * num_lines * i + 2 * chunk_start
                out.seek(new_pos)
                out.write(f"{' '.join(line)}{' ' * (empty_line)}"'\n')

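In this case, it takes an array of 8 rows × 4 columns (num_lines = 8, num_columns = 4) and transposes it using blocks of 4 columns × 7 rows, so the first block will be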

1 0 0 1 0 1 0
1 0 1 1 0 1 1
0 1 1 1 0 0 1
1 0 0 0 1 0 0

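This block is written to the file and removed from memory. The second block (the one remaining row, since chunk_length = 7 and num_lines = 8) is then added to the file in the same way, so the final result is: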

1 0 0 1 0 1 0 0
1 0 1 1 0 1 1 1
0 1 1 1 0 0 1 1
1 0 0 0 1 0 0 1

Best answer

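In your working but slow solution, you are reading the input file 5,000 times. That will not be fast, but the only easy way to minimize the reads is to read everything into memory.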

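You could try a compromise where you read, say, fifty columns at a time into memory (~50MB) and write them to the file as rows. This way you would read the file "only" 100 times. Try a few different combinations to find a performance/memory trade-off you are happy with.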

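You would do this with three nested loops:

1. Loop over the number of blocks (100 in this case)
2. Loop over the lines of the input file
3. Loop over the number of columns in the block (50 here)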

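In the innermost loop, you collect the column values into a two-dimensional array that holds one output row per column of the block; each iteration of the middle loop (one input line) adds one value to each of those rows. In the outermost loop, you clear the array before entering the inner loops and write it to the file as rows afterwards. For each iteration of loop 1, you will have written fifty rows of a million columns.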
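
A minimal sketch of the three loops described above, not a tuned implementation. The function name transpose_in_blocks and the block_size parameter are made up for illustration. It assumes every value is a single character (0 or 1) separated by whitespace, which lets each collected column be stored as a bytearray at roughly one byte per value.

```python
import gzip


def transpose_in_blocks(src_path, dst_path, block_size=50):
    """Transpose a gzipped 0/1 text matrix, block_size columns per pass."""
    # Read only the first line to learn how many columns there are.
    with gzip.open(src_path, "rt") as src:
        num_columns = len(src.readline().split())

    with open(dst_path, "w") as out:
        # Loop 1: one full pass over the input per block of columns.
        for start in range(0, num_columns, block_size):
            stop = min(start + block_size, num_columns)
            # Fresh ("cleared") storage: one output row per column in the block.
            rows = [bytearray() for _ in range(stop - start)]
            with gzip.open(src_path, "rt") as src:
                # Loop 2: every line of the input file.
                for line in src:
                    values = line.split()
                    if not values:
                        continue  # skip blank lines
                    # Loop 3: every column in the current block.
                    for row, value in zip(rows, values[start:stop]):
                        # Assumption: each value is one character ("0" or "1").
                        row.append(ord(value))
            # Each collected input column becomes one line of the output.
            for row in rows:
                out.write(" ".join(row.decode("ascii")) + "\n")
```

With block_size=50 and 5,000 columns this makes 100 passes over the input, as estimated above; raising block_size trades more memory for fewer passes.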

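You cannot really insert into the middle of an ordinary file without loading the whole target file into memory; you would have to move the trailing bytes forward by hand. However, since you know the exact file size, you could pre-allocate it and seek to the right position for every byte you write. Doing 5 billion seeks is probably not very fast either... If your ones and zeroes are fairly evenly distributed, you could initialize the file with all zeroes and then write only the ones (or the other way around) to halve the number of seeks.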
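
The pre-allocation idea could look roughly like this. It is a sketch under assumptions: transpose_with_seek is a hypothetical name, num_rows and num_columns must be known in advance, and every value is a single 0 or 1, so each output cell has a fixed width of two bytes (the digit plus a space or newline).

```python
import gzip


def transpose_with_seek(src_path, dst_path, num_rows, num_columns):
    """Pre-fill the output with zeroes, then seek and write only the ones."""
    # Each output line holds num_rows digits, each followed by one byte
    # (a space, or a newline after the last digit).
    line_width = 2 * num_rows

    # Pre-allocate the output: num_columns lines consisting only of "0".
    with open(dst_path, "wb") as out:
        template = b" ".join([b"0"] * num_rows) + b"\n"
        for _ in range(num_columns):
            out.write(template)

    # Single pass over the input, overwriting a byte for every "1".
    with gzip.open(src_path, "rt") as src, open(dst_path, "r+b") as out:
        for r, line in enumerate(src):
            for c, value in enumerate(line.split()):
                if value == "1":
                    # Input cell (r, c) lands at output row c, column r.
                    out.seek(c * line_width + 2 * r)
                    out.write(b"1")
```

Input cell (r, c) is written at byte offset c * 2 * num_rows + 2 * r of the output, the same offset arithmetic used in Update 2 of the question. The input is read only once, but there is one seek per 1 in the matrix, so this may still be slow for the full-size file.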

Edit: added details on how the chunking could be implemented.

Source: a similar question on Stack Overflow, "python - Transpose a large array without loading it into memory": https://stackoverflow.com/questions/56753237/
