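I have a numpy array a of shape (n, 3) filled with integers from 0 to m (exclusive, going by the example below, where the values 0 through 6 produce 7 result rows). Both m and n can be rather large. It is known that every integer from 0 to m appears sometimes only once, but mostly exactly twice somewhere in a. There are no duplicate indices within a row.
I would now like to construct the "reverse" index, i.e., two arrays b_row and b_col of shape (m, 2), where row row_idx of each array contains the (one or two) row/column indices at which the value row_idx appears in a.
This works: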
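import numpy
a = numpy.array([
    [0, 1, 2],
    [0, 1, 3],
    [2, 3, 4],
    [4, 5, 6],
    # ...
])
print(a)
# -1 marks an unused slot (the value occurs only once in a)
b_row = -numpy.ones((7, 2), dtype=int)
b_col = -numpy.ones((7, 2), dtype=int)
# count[v] = number of occurrences of v seen so far
count = numpy.zeros(7, dtype=int)
for k, row in enumerate(a):
    i = count[row]  # next free slot for each of the three values in this row
    b_row[row, i] = k
    b_col[row, i] = [0, 1, 2]
    count[row] += 1  # safe because a row never contains the same value twice
print(b_row)
print(b_col)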
Output:
[[0 1 2]
 [0 1 3]
 [2 3 4]
 [4 5 6]]
[[ 0  1]
 [ 0  1]
 [ 0  2]
 [ 1  2]
 [ 2  3]
 [ 3 -1]
 [ 3 -1]]
[[ 0  0]
 [ 1  1]
 [ 2  0]
 [ 2  1]
 [ 2  0]
 [ 1 -1]
 [ 2 -1]]
However, it is slow because of the explicit loop over a.
Any hints on how to speed this up?
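As a side note, the assumptions stated above (values are valid indices below m, each value occurs at most twice, no value repeats within a row) can be checked cheaply before building the index. This is only a minimal sketch; `check_assumptions` is a hypothetical helper name that does not appear in the original post, and it assumes a is a non-empty integer array:

```python
import numpy as np

def check_assumptions(a, m):
    """Hypothetical helper: verify the input invariants stated in the question."""
    a = np.asarray(a)
    # All values must be valid indices into a result array of length m.
    if a.min() < 0 or a.max() >= m:
        return False
    # bincount counts how often each value occurs; at most 2 is allowed.
    counts = np.bincount(a.ravel(), minlength=m)
    if counts.max() > 2:
        return False
    # Sort each row and compare neighbours to detect duplicates within a row.
    s = np.sort(a, axis=1)
    return bool((s[:, 1:] != s[:, :-1]).all())

a = np.array([[0, 1, 2], [0, 1, 3], [2, 3, 4], [4, 5, 6]])
print(check_assumptions(a, 7))
```

Both the loop above and any vectorized replacement rely on these invariants, so a check like this is mainly useful while testing with generated data.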
Best answer
Here is a solution:
import numpy as np
m = 7
a = np.array([
    [0, 1, 2],
    [0, 1, 3],
    [2, 3, 4],
    [4, 5, 6],
    # ...
])
print('a:')
print(a)
a_flat = a.flatten()  # Or a.ravel() if you can modify the original array
# First occurrence (flat index) of every distinct value
v1, idx1 = np.unique(a_flat, return_index=True)
# Mask the first occurrences so that only second occurrences remain
a_flat[idx1] = -1
v2, idx2 = np.unique(a_flat, return_index=True)
# Drop the leading entry, which belongs to the -1 marker
v2, idx2 = v2[1:], idx2[1:]
# Convert flat indices back to (row, column) pairs
rows1, cols1 = np.unravel_index(idx1, a.shape)
rows2, cols2 = np.unravel_index(idx2, a.shape)
b_row = -np.ones((m, 2), dtype=int)
b_col = -np.ones((m, 2), dtype=int)
b_row[v1, 0] = rows1
b_col[v1, 0] = cols1
b_row[v2, 1] = rows2
b_col[v2, 1] = cols2
print('b_row:')
print(b_row)
print('b_col:')
print(b_col)
Output:
a:
[[0 1 2]
 [0 1 3]
 [2 3 4]
 [4 5 6]]
b_row:
[[ 0  1]
 [ 0  1]
 [ 0  2]
 [ 1  2]
 [ 2  3]
 [ 3 -1]
 [ 3 -1]]
b_col:
[[ 0  0]
 [ 1  1]
 [ 2  0]
 [ 2  1]
 [ 2  0]
 [ 1 -1]
 [ 2 -1]]
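To make the two-pass np.unique trick easier to follow, here is a small walkthrough on a two-row array. It is only an illustration of the steps used in the solution; the values in the comments were worked out by hand for this particular input:

```python
import numpy as np

a = np.array([[0, 1, 2], [0, 1, 3]])
flat = a.flatten()  # a copy: [0 1 2 0 1 3]

# Pass 1: flat position of the first occurrence of every distinct value.
v1, idx1 = np.unique(flat, return_index=True)
# v1 = [0 1 2 3], idx1 = [0 1 2 5]

# Mask the first occurrences so that only second occurrences remain.
flat[idx1] = -1  # flat is now [-1 -1 -1 0 1 -1]

# Pass 2: the smallest unique value is the -1 marker, so drop it.
v2, idx2 = np.unique(flat, return_index=True)
v2, idx2 = v2[1:], idx2[1:]
# v2 = [0 1], idx2 = [3 4]

# Convert flat positions back into (row, column) pairs.
rows2, cols2 = np.unravel_index(idx2, a.shape)
# rows2 = [1 1], cols2 = [0 1]
print(v2, rows2, cols2)
```

The values 2 and 3 occur only once, so they never show up in the second pass and their second slot keeps the -1 placeholder.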
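EDIT:
A small benchmark in IPython for comparison. As pointed out by @eozd, the algorithmic complexity is in principle higher because np.unique runs in O(n log n) (it sorts its input), but for practical sizes the vectorized solution still seems to be much faster: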
import numpy as np
def method_orig(a, m):
    b_row = -np.ones((m, 2), dtype=int)
    b_col = -np.ones((m, 2), dtype=int)
    count = np.zeros(m, dtype=int)
    for k, row in enumerate(a):
        i = count[row]
        b_row[row, i] = k
        b_col[row, i] = [0, 1, 2]
        count[row] += 1
    return b_row, b_col
def method_jdehesa(a, m):
    a_flat = a.flatten()  # Or a.ravel() if you can modify the original array
    v1, idx1 = np.unique(a_flat, return_index=True)
    a_flat[idx1] = -1
    v2, idx2 = np.unique(a_flat, return_index=True)
    v2, idx2 = v2[1:], idx2[1:]
    rows1, cols1 = np.unravel_index(idx1, a.shape)
    rows2, cols2 = np.unravel_index(idx2, a.shape)
    b_row = -np.ones((m, 2), dtype=int)
    b_col = -np.ones((m, 2), dtype=int)
    b_row[v1, 0] = rows1
    b_col[v1, 0] = cols1
    b_row[v2, 1] = rows2
    b_col[v2, 1] = cols2
    return b_row, b_col
n = 100000
c = 3
m = 200000
# Generate random input
# This does not respect "no doubled indices in row" but is good enough for testing
np.random.seed(100)
a = np.random.permutation(np.concatenate([np.arange(m), np.arange(m)]))[:(n * c)].reshape((n, c))
%timeit method_orig(a, m)
# 3.22 s ± 1.3 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit method_jdehesa(a, m)
# 108 ms ± 764 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
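Outside IPython, a rough equivalent can be run with the standard-library timeit module. The sketch below also checks that both methods return the same arrays. Unlike the benchmark above, it builds its input from two independent permutations, so every value occurs exactly twice and never twice in the same row; that construction, the sizes, and the seed are assumptions chosen for illustration, not part of the original answer:

```python
import timeit
import numpy as np

def method_orig(a, m):
    b_row = -np.ones((m, 2), dtype=int)
    b_col = -np.ones((m, 2), dtype=int)
    count = np.zeros(m, dtype=int)
    for k, row in enumerate(a):
        i = count[row]
        b_row[row, i] = k
        b_col[row, i] = [0, 1, 2]
        count[row] += 1
    return b_row, b_col

def method_jdehesa(a, m):
    a_flat = a.flatten()
    v1, idx1 = np.unique(a_flat, return_index=True)
    a_flat[idx1] = -1
    v2, idx2 = np.unique(a_flat, return_index=True)
    v2, idx2 = v2[1:], idx2[1:]
    rows1, cols1 = np.unravel_index(idx1, a.shape)
    rows2, cols2 = np.unravel_index(idx2, a.shape)
    b_row = -np.ones((m, 2), dtype=int)
    b_col = -np.ones((m, 2), dtype=int)
    b_row[v1, 0] = rows1
    b_col[v1, 0] = cols1
    b_row[v2, 1] = rows2
    b_col[v2, 1] = cols2
    return b_row, b_col

# Two stacked blocks, each a permutation of 0..m-1 cut into rows of 3:
# every value occurs once per block, so rows never contain duplicates.
rng = np.random.default_rng(0)
k = 2000
m = 3 * k
a = np.vstack([rng.permutation(m).reshape(k, 3),
               rng.permutation(m).reshape(k, 3)])

r_loop, c_loop = method_orig(a, m)
r_vec, c_vec = method_jdehesa(a, m)
same = np.array_equal(r_loop, r_vec) and np.array_equal(c_loop, c_vec)
print("results identical:", same)

# Average over a few runs; absolute numbers depend on the machine.
t_loop = timeit.timeit(lambda: method_orig(a, m), number=3) / 3
t_vec = timeit.timeit(lambda: method_jdehesa(a, m), number=3) / 3
print(f"loop: {t_loop:.4f} s, vectorized: {t_vec:.4f} s")
```

With input that does contain a repeated value within a row, the two methods can disagree, because the loop's `count[row] += 1` increments a repeated index only once.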
Regarding python - speeding up index "revert", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/50389518/