python - Fill an array based on sparse information

I have the following sparse structure that describes an underlying dense array A:

import numpy as np

a = np.array([1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1])
b = np.array([1, 5, 2, 3])

a contains a 1 wherever A changes value, and b contains the new value at each of those changes. That is, my example a, b produce the following array:

A = np.array([1, 1, 1, 1, 1, 5, 5, 2, 2, 2, 3])
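
To make the encoding concrete, here is a minimal sketch of how a and b could be derived from a dense 1D array A. The helper name encode_changes is hypothetical and not part of the question; it assumes that every run of equal values in A corresponds to exactly one entry in b.

```python
import numpy as np

def encode_changes(A):
    """Hypothetical helper: derive the change mask a and the values b from dense A."""
    A = np.asarray(A)
    # The first element always starts a run; later elements start a run
    # when they differ from their left neighbour.
    change = np.concatenate(([True], A[1:] != A[:-1]))
    a = change.astype(int)   # 1 where A changes value, 0 elsewhere
    b = A[change]            # the new value at each change
    return a, b

A = np.array([1, 1, 1, 1, 1, 5, 5, 2, 2, 2, 3])
a_enc, b_enc = encode_changes(A)
```

With the A above, this reproduces the a and b given in the question.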

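Given the sparse information, how can I recover A efficiently? I am particularly interested in solutions that extend to the case where b is n-dimensional.

In 2D, we would have the same a, but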

bb = np.array([[1, 5, 2, 2], [2, -1, 0, 1]])

which yields

AA = np.array([[1, 1, 1, 1, 1, 5, 5, 2, 2, 2, 2], [2, 2, 2, 2, 2, -1, -1, 0, 0, 0, 1]])

Best answer

This is quite simple with cumsum. Use cumsum to get the interval indices, then index into the data array with them.

So, for 1D data -

idx = a.cumsum(-1)-1
out = b[idx]

For 2D data -

out = bb[np.arange(bb.shape[0])[:,None],idx]

For generic n-dim data, simply use np.take to index along the last axis, which covers the general n-dim case, like so -

np.take(b_ndarray,idx,axis=-1)
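
The three variants above can be wrapped in one function. The following is a minimal sketch, not the answerer's own code; the name fill_from_sparse is hypothetical. It assumes a[0] == 1, because otherwise the leading indices would be -1 and would silently wrap around to the last entry of b.

```python
import numpy as np

def fill_from_sparse(a, b):
    """Hypothetical helper: expand b along its last axis using the change mask a."""
    a = np.asarray(a)
    b = np.asarray(b)
    # Running count of changes, shifted to 0-based positions into b.
    idx = np.cumsum(a) - 1
    # Index along the last axis; works for 1D, 2D and n-dim b alike.
    return np.take(b, idx, axis=-1)

a = np.array([1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1])
b = np.array([1, 5, 2, 3])
bb = np.array([[1, 5, 2, 2], [2, -1, 0, 1]])

out_1d = fill_from_sparse(a, b)    # shape (11,)
out_2d = fill_from_sparse(a, bb)   # shape (2, 11)
```

For 2D input this gives the same result as the explicit indexing bb[np.arange(bb.shape[0])[:,None],idx], without having to build one index array per leading axis.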

Sample run

In [80]: a  # sparse array that defines the intervals/indices
Out[80]: array([1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1])

In [81]: b # 1D data array
Out[81]: array([1, 5, 2, 3])

In [82]: bb  # 2D data array
Out[82]: 
array([[ 1,  5,  2,  2],
       [ 2, -1,  0,  1]])

In [93]: idx = a.cumsum(-1)-1 # Get the intervaled indices

In [94]: idx
Out[94]: array([0, 0, 0, 0, 0, 1, 1, 2, 2, 2, 3])

In [84]: np.take(b,idx,axis=-1) # output for 1D data
Out[84]: array([1, 1, 1, 1, 1, 5, 5, 2, 2, 2, 3])

In [85]: np.take(bb,idx,axis=-1)  # output for 2D data
Out[85]: 
array([[ 1,  1,  1,  1,  1,  5,  5,  2,  2,  2,  2],
       [ 2,  2,  2,  2,  2, -1, -1,  0,  0,  0,  1]])

Let's also test some random 3D data -

In [89]: bbb = np.random.randint(-4,5,(2,3,4))

In [90]: bbb
Out[90]: 
array([[[-1,  0,  0,  4],
        [ 0, -1,  3,  1],
        [ 1, -4, -3,  1]],

       [[-1, -4,  1, -4],
        [-3, -2,  0, -2],
        [-4, -1, -2, -4]]])

In [91]: np.take(bbb,idx,axis=-1)
Out[91]: 
array([[[-1, -1, -1, -1, -1,  0,  0,  0,  0,  0,  4],
        [ 0,  0,  0,  0,  0, -1, -1,  3,  3,  3,  1],
        [ 1,  1,  1,  1,  1, -4, -4, -3, -3, -3,  1]],

       [[-1, -1, -1, -1, -1, -4, -4,  1,  1,  1, -4],
        [-3, -3, -3, -3, -3, -2, -2,  0,  0,  0, -2],
        [-4, -4, -4, -4, -4, -1, -1, -2, -2, -2, -4]]])

Runtime test

Other approaches -

def diff_repeat_1d(a, b): # @Kasramvd's soln for 1D
    inds = np.concatenate((np.where(a)[0], [a.size]))
    durations = np.diff(inds)
    return np.repeat(b, durations)

def diff_repeat_2d(a, b): # @Kasramvd's soln for 2D
    inds = np.concatenate((np.where(a)[0], [a.size]))
    durations = np.diff(inds)
    return np.repeat(b, durations, axis=1)
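
As a sanity check before timing, the repeat-based approach and the cumsum-based approach should give identical results. The sketch below uses a hypothetical n-dim generalization named diff_repeat_nd, which is not from either answer, and compares it with np.take on random 3D data. It assumes a[0] == 1 and that the last axis of b has as many entries as a has ones.

```python
import numpy as np

def diff_repeat_nd(a, b):
    """Hypothetical n-dim variant of the repeat approach, acting on the last axis."""
    # Start position of each run, plus the total length as a closing sentinel.
    inds = np.concatenate((np.flatnonzero(a), [a.size]))
    durations = np.diff(inds)          # length of each run
    return np.repeat(b, durations, axis=-1)

a = np.array([1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1])
rng = np.random.default_rng(0)
bbb = rng.integers(-4, 5, size=(2, 3, 4))   # random 3D data, 4 values per row

via_repeat = diff_repeat_nd(a, bbb)
via_take = np.take(bbb, a.cumsum() - 1, axis=-1)
same = np.array_equal(via_repeat, via_take)
```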

Timings for 1D data -

In [199]: a = np.array([1, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1])
     ...: b = np.array([1, 5, 2, 3])
     ...: 

In [200]: a = np.tile(a,100000)
     ...: b = np.tile(b,100000)
     ...: 

In [201]: %timeit diff_repeat_1d(a, b) # @Kasramvd's soln
100 loops, best of 3: 8.42 ms per loop

In [202]: %timeit np.take(b,a.cumsum()-1,axis=-1)
100 loops, best of 3: 4.53 ms per loop

Timings for 2D data -

In [203]: bb = np.array([[1, 5, 2, 2], [2, -1, 0, 1]])

In [204]: bb = np.tile(bb,100000)

In [206]: %timeit diff_repeat_2d(a, bb) # @Kasramvd's soln
100 loops, best of 3: 12.1 ms per loop

In [207]: %timeit np.take(bb,a.cumsum()-1,axis=-1)
100 loops, best of 3: 5.58 ms per loop

Regarding python - fill an array based on sparse information, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/46682285/
