我有一个角度数组,我想将它们分组到数组中,它们之间的最大差异为 2 度。
例如:输入:
angles = np.array([[1],[2],[3],[4],[4],[5],[10]])
输出
('group', 1)
[[1]
[2]
[3]]
('group', 2)
[[4]
[4]
[5]]
('group', 3)
[[10]]
numpy.diff获取下一个元素与当前元素的差异,我需要下一个元素与组中第一个元素的差异
itertools.groupby将不在可定义范围内的元素分组
numpy.digitize按预定义范围而不是数组元素指定的范围对元素进行分组。 (也许我可以通过获取角度的唯一值,根据它们的差异对它们进行分组并将其用作预定义范围来使用它?)
.
我的方法可行,但似乎效率极低且非 Pythonic: (我正在使用 expand_dims 和 vstack 因为我正在使用一维数组(不仅仅是角度)但是我已经减少了它们以简化这个问题)
angles = np.array([[1],[2],[3],[4],[4],[5],[10]])
groupedangles = []
idx1 = 0
diffAngleMax = 2
while(idx1 < len(angles)):
angleA = angles[idx1]
group = np.expand_dims(angleA, axis=0)
for idx2 in xrange(idx1+1,len(angles)):
angleB = angles[idx2]
diffAngle = angleB - angleA
if abs(diffAngle) <= diffAngleMax:
group = np.vstack((group,angleB))
else:
idx1 = idx2
groupedangles.append(group)
break
if idx2 == len(angles) - 1:
if idx1 == idx2:
angleA = angles[idx1]
group = np.expand_dims(angleA, axis=0)
groupedangles.append(group)
break
for idx, x in enumerate(groupedangles):
print('group', idx+1)
print(x)
什么是更好更快的方法?
最佳答案
更新 这是一些 Cython 处理
In [1]: import cython
In [2]: %load_ext Cython
In [3]: %%cython
...: import numpy as np
...: cimport numpy as np
...: def cluster(np.ndarray array, np.float64_t maxdiff):
...: cdef np.ndarray[np.float64_t, ndim=1] flat = np.sort(array.flatten())
...: cdef list breakpoints = []
...: cdef np.float64_t seed = flat[0]
...: cdef np.int64_t int = 0
...: for i in range(0, len(flat)):
...: if (flat[i] - seed) > maxdiff:
...: breakpoints.append(i)
...: seed = flat[i]
...: return np.split(array, breakpoints)
...:
稀疏度检验
In [4]: angles = np.random.choice(np.arange(5000), 500).astype(np.float64)[:, None]
In [5]: %timeit cluster(angles, 2)
422 µs ± 12.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
重复测试
In [6]: angles = np.random.choice(np.arange(500), 1500).astype(np.float64)[:, None]
In [7]: %timeit cluster(angles, 2)
263 µs ± 14.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
两项测试均显示出显着的改进。该算法现在对输入进行排序,并对排序后的数组进行单次运行,使其稳定 O(N*log(N))。
预更新
这是种子聚类的变体。它不需要排序
def cluster(array, maxdiff):
tmp = array.copy()
groups = []
while len(tmp):
# select seed
seed = tmp.min()
mask = (tmp - seed) <= maxdiff
groups.append(tmp[mask, None])
tmp = tmp[~mask]
return groups
例子:
In [27]: cluster(angles, 2)
Out[27]:
[array([[1],
[2],
[3]]), array([[4],
[4],
[5]]), array([[10]])]
500、1000 和 1500 角度的基准:
In [4]: angles = np.random.choice(np.arange(500), 500)[:, None]
In [5]: %timeit cluster(angles, 2)
1.25 ms ± 60.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [6]: angles = np.random.choice(np.arange(500), 1000)[:, None]
In [7]: %timeit cluster(angles, 2)
1.46 ms ± 37 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [8]: angles = np.random.choice(np.arange(500), 1500)[:, None]
In [9]: %timeit cluster(angles, 2)
1.99 ms ± 72.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
虽然该算法在最坏情况下为 O(N^2),在最佳情况下为 O(N),但上述基准清楚地显示了近线性时间增长,因为实际运行时间取决于数据结构:稀疏性和重复率。在大多数真实情况下,您不会遇到最坏的情况。
一些稀疏性基准
In [4]: angles = np.random.choice(np.arange(500), 500)[:, None]
In [5]: %timeit cluster(angles, 2)
1.06 ms ± 27.9 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [6]: angles = np.random.choice(np.arange(1000), 500)[:, None]
In [7]: %timeit cluster(angles, 2)
1.79 ms ± 117 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [8]: angles = np.random.choice(np.arange(1500), 500)[:, None]
In [9]: %timeit cluster(angles, 2)
2.16 ms ± 90.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
In [10]: angles = np.random.choice(np.arange(5000), 500)[:, None]
In [11]: %timeit cluster(angles, 2)
3.21 ms ± 139 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
关于python - Numpy 按元素之间的差异范围分组,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/48276631/