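I have a file containing a matrix of digits [0-9] with no delimiters, of shape (N, M). N is about 50k and M is about 50k.
For example, a small version of the matrix file is mat.txt: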
0012230012000
0012230002300
0012230004200
I am currently using the following code, but I am not satisfied with its speed.
import numpy as np

def read_int_mat(path):
    """
    Read a matrix of integers with digits [0-9] and no delimiter.
    """
    with open(path) as f:
        # One Python int() call per character, then one array per line.
        mat = np.array(
            [np.array([int(c) for c in line.strip()]) for line in f.readlines()],
            dtype=np.int8,
        )
    return mat
Edit: here is a mini benchmark
import numpy as np

def read_int_mat(path):
    """
    Read a matrix of integers with digits [0-9] and no delimiter.
    """
    with open(path) as f:
        mat = np.array(
            [np.array([int(c) for c in line.strip()]) for line in f.readlines()],
            dtype=np.int8,
        )
    return mat

%timeit read_int_mat("mat.txt")
%timeit np.genfromtxt("mat.txt", delimiter=1, dtype="int8")
print(read_int_mat("mat.txt"))
print(np.genfromtxt("mat.txt", delimiter=1, dtype="int8"))
The output is:
61.6 µs ± 1.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
327 µs ± 4.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
[[0 0 1 2 2 3 0 0 1 2 0 0 0]
[0 0 1 2 2 3 0 0 0 2 3 0 0]
[0 0 1 2 2 3 0 0 0 4 2 0 0]]
[[0 0 1 2 2 3 0 0 1 2 0 0 0]
[0 0 1 2 2 3 0 0 0 2 3 0 0]
[0 0 1 2 2 3 0 0 0 4 2 0 0]]
Is there anything I can try to speed this up further? Would Cython help? Many thanks.
Best answer
You can use np.genfromtxt, for example:
File (13 columns):
0012230012000
0012230002300
0012230004200
Then:
x = np.genfromtxt("file.txt", delimiter=1, dtype="int8")
print(x)
Prints:
[[0 0 1 2 2 3 0 0 1 2 0 0 0]
[0 0 1 2 2 3 0 0 0 2 3 0 0]
[0 0 1 2 2 3 0 0 0 4 2 0 0]]
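Why this works: when `delimiter` is an integer, np.genfromtxt treats it as a fixed field width rather than a separator character, so `delimiter=1` splits each line into one-character fields. The following is a small sketch of that behaviour; the in-memory text and the variable names are made up for illustration and are not part of the original answer.

```python
import io

import numpy as np

# Illustrative in-memory "file" with two rows of four digits each.
text = "0012\n3456\n"

# Field width 1: every character becomes its own column.
one_char = np.genfromtxt(io.StringIO(text), delimiter=1, dtype="int8")

# Field width 2: characters are grouped in pairs ("00", "12", ...).
two_char = np.genfromtxt(io.StringIO(text), delimiter=2, dtype="int8")

print(one_char)
print(two_char)
```

Each field is still parsed and converted individually, which is convenient but may not be the fastest route for very large files.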
Edit: a version with np.fromiter that opens the file in binary mode:
def read_npfromiter(path):
    with open(path, "rb") as f:
        return np.array(
            [np.fromiter((chr(c) for c in l.strip()), dtype="int8") for l in f],
        )
Benchmark on a file of shape (168, 9360):
from timeit import timeit

import numpy as np


def read_int_mat(path):
    """
    Read a matrix of integers with digits [0-9] and no delimiter.
    """
    with open(path, "r") as f:
        mat = np.array(
            [
                np.array([int(c) for c in line.strip()])
                for line in f.readlines()
            ],
            dtype=np.int8,
        )
    return mat


def read_npfromiter(path):
    with open(path, "rb") as f:
        return np.array(
            [np.fromiter((chr(c) for c in l.strip()), dtype="int8") for l in f],
        )


def f1(f):
    return np.genfromtxt(
        f, delimiter=1, dtype="int8", autostrip=False, encoding="ascii"
    )


def f2(f):
    return read_int_mat(f)


def f3(f):
    return read_npfromiter(f)


# Each function is timed once (number=1); timeit reports seconds.
t1 = timeit(lambda: f1("file.txt"), number=1)
t2 = timeit(lambda: f2("file.txt"), number=1)
t3 = timeit(lambda: f3("file.txt"), number=1)
print(t1)
print(t2)
print(t3)
Results (in seconds, for f1, f2 and f3 respectively):
1.0680423599551432
0.28135157003998756
0.19099885696778074
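Both versions above still loop over every character in Python. One alternative worth trying, which is not from the original answer and was not part of the benchmark above, is to read the whole file as bytes and let NumPy do the conversion in bulk. The sketch below assumes the file is pure ASCII, that every line has the same length, and that the file fits in memory; `read_digits_frombuffer` is a hypothetical helper name.

```python
import numpy as np


def read_digits_frombuffer(path):
    """Hypothetical helper: read an undelimited digit matrix in bulk.

    Assumes ASCII digits and equal-length lines; reshape raises
    ValueError if the lines do not all have the same length.
    """
    with open(path, "rb") as f:
        raw = f.read()

    # Normalise line endings so every row is "digits + one newline byte".
    raw = raw.replace(b"\r", b"")
    if not raw.endswith(b"\n"):
        raw += b"\n"

    n_cols = raw.index(b"\n")  # length of the first line
    buf = np.frombuffer(raw, dtype=np.uint8)  # read-only view of the bytes

    # Reshape to (rows, n_cols + 1) and drop the newline column.
    digits = buf.reshape(-1, n_cols + 1)[:, :n_cols]

    # ASCII "0" is 48, so subtracting it maps b"0".."9" to 0..9.
    return (digits - ord("0")).astype(np.int8)
```

For example, `read_digits_frombuffer("mat.txt")` should give the same 3 x 13 array as the functions above. For a 50k x 50k file the raw bytes alone take roughly 2.5 GB and the conversion makes further copies, so it may be worth timing this on a smaller file first.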
About python - efficiently reading a matrix of digits with no delimiter in Python: a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/67289780/