python - Why is np.unique slower without index return in some cases?

I noticed that np.unique can be slower in some cases when True is not passed for the return_index parameter.

import numpy as np

a = np.ones(shape=(1000, 50), dtype=int)
a[:, -7:] = [10000, -4750, -4750, 95, 95, 95, 95]
arr = np.cumsum(a.ravel())

%timeit np.unique(arr)
%timeit np.unique(arr, return_index=True)
%timeit np.unique(arr, return_index=True, return_inverse=True)
%timeit np.unique(arr, return_index=True, return_inverse=True, return_counts=True)

1.14 ms ± 22.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
711 µs ± 6.78 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
955 µs ± 19.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
1.3 ms ± 143 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

This usually does not happen with other kinds of data. What is going on here?

Best answer

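The difference arises because the data in your example is already sorted. np.unique sorts its input, and when return_index=True it uses the stable mergesort algorithm. When mergesort is applied to data that is already sorted, it only needs a single pass over the data, so it is very fast.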

For example, in the following, arr is an array of non-decreasing values:

In [10]: arr = np.random.randint(0, 3, size=50000).cumsum()

In [11]: arr
Out[11]: array([    1,     3,     4, ..., 49892, 49892, 49894])

The default sort algorithm takes almost 8 times as long as mergesort:

In [12]: %timeit np.sort(arr)
386 µs ± 6.23 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [13]: %timeit np.sort(arr, kind='mergesort')
49.5 µs ± 708 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
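Outside IPython, the same comparison can be run with the standard timeit module. The script below is a minimal sketch: the helper time_sort and the shuffled copy are additions for illustration, not part of the original answer. The shuffled copy is there to check whether the gap depends on the input being sorted. No timings are shown because they vary by machine and NumPy version.

```python
import timeit

import numpy as np

rng = np.random.default_rng(0)

# Non-decreasing data, built the same way as in the session above.
sorted_arr = rng.integers(0, 3, size=50_000).cumsum()
# Same values in random order (illustrative addition).
shuffled_arr = rng.permutation(sorted_arr)


def time_sort(arr, kind=None, number=100):
    """Return the average seconds per call of np.sort(arr, kind=kind)."""
    total = timeit.timeit(lambda: np.sort(arr, kind=kind), number=number)
    return total / number


for label, arr in [("sorted", sorted_arr), ("shuffled", shuffled_arr)]:
    default_t = time_sort(arr)                  # default sort algorithm
    stable_t = time_sort(arr, kind="mergesort")  # stable sort
    print(f"{label:>8}: default {default_t * 1e6:.1f} µs, "
          f"mergesort {stable_t * 1e6:.1f} µs")
```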

You can see the code that does the actual work of finding the unique values here: https://github.com/numpy/numpy/blob/6ff787b93d46cca6d31c370cfd9543ed573a98fc/numpy/lib/arraysetops.py#L320-L361
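The two code paths can be illustrated with a simplified sketch, modeled loosely on the linked function. The name unique_sketch is hypothetical, and the code leaves out everything except the sorting choice and the duplicate mask (no return_inverse, return_counts, or axis handling), so it should not be read as NumPy's actual implementation.

```python
import numpy as np


def unique_sketch(ar, return_index=False):
    """Simplified sort-based unique; an illustration, not NumPy's code."""
    ar = np.asarray(ar).flatten()  # flatten() copies, so the input is untouched

    if return_index:
        # A stable sort keeps equal values in their original order, so the
        # first element of each run is the first occurrence in the input.
        perm = ar.argsort(kind="mergesort")
        aux = ar[perm]
    else:
        # No indices are needed, so a plain in-place default sort is used.
        ar.sort()
        aux = ar

    # True wherever a sorted element differs from its predecessor.
    mask = np.empty(aux.shape, dtype=bool)
    mask[:1] = True
    mask[1:] = aux[1:] != aux[:-1]

    if return_index:
        return aux[mask], perm[mask]
    return aux[mask]


values = np.array([5, 2, 5, 1, 2])
print(unique_sketch(values))
print(unique_sketch(values, return_index=True))
```

The only difference between the two branches is the sort call, which is where the timing gap on already sorted input comes from.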

This question about why np.unique is slower without index return in some cases comes from Stack Overflow: https://stackoverflow.com/questions/71260832/
