algorithm - Selection algorithm runtime

Tags: algorithm, sorting, data-retrieval

I am trying to figure out the best way to compute a top-k query over some aggregation of data, say an array. I used to think the best approach was to run through the array while maintaining a heap or balanced binary tree of size k, and to use that structure to compute the top-k values. Now I have come across selection algorithms that supposedly run even faster. I understand how the selection algorithm works and how to implement it, but I am a bit confused about how it runs in O(n). It feels like you would have to be very lucky for it to be O(n): if you keep picking a random pivot and partitioning around it, you could well end up essentially sorting almost the whole array before you stumble upon your k-th index. Are there any optimizations, such as not picking the pivot randomly? Or is the heap/tree approach I maintain good enough for most cases?

Best Answer

What you are describing is quickselect, also known as Hoare's selection algorithm.

It does indeed have O(n) average-case performance, but its worst-case performance is O(n²).

Like quicksort, the quickselect has good average performance, but is sensitive to the pivot that is chosen. If good pivots are chosen, meaning ones that consistently decrease the search set by a given fraction, then the search set decreases in size exponentially and by induction (or summing the geometric series) one sees that performance is linear, as each step is linear and the overall time is a constant times this (depending on how quickly the search set reduces). However, if bad pivots are consistently chosen, such as decreasing by only a single element each time, then worst-case performance is quadratic: O(n²).
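To make the partition-and-recurse-into-one-side idea concrete, here is a minimal quickselect sketch with a random pivot. It is an illustration only, not a reference implementation; the function and variable names are made up for this example.

```cpp
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Minimal quickselect sketch: after it returns, v[k] is the element that
// would sit at index k if v were fully sorted. Average O(n), worst case O(n^2).
int quickselect(std::vector<int>& v, std::size_t k) {
    static std::mt19937 rng{std::random_device{}()};
    std::size_t lo = 0, hi = v.size();               // half-open range [lo, hi)
    while (hi - lo > 1) {
        // Random pivot: makes the quadratic worst case extremely unlikely.
        std::size_t p = std::uniform_int_distribution<std::size_t>(lo, hi - 1)(rng);
        int pivot = v[p];
        std::swap(v[p], v[hi - 1]);                  // park the pivot at the end
        std::size_t store = lo;                      // Lomuto-style partition
        for (std::size_t i = lo; i + 1 < hi; ++i)
            if (v[i] < pivot) std::swap(v[i], v[store++]);
        std::swap(v[store], v[hi - 1]);              // pivot lands at its final index
        if (k == store) return v[k];
        if (k < store) hi = store;                   // only the left part can contain k
        else           lo = store + 1;               // only the right part can contain k
    }
    return v[k];
}
```

The key point is that, unlike quicksort, only one side of each partition is ever visited again, which is where the expected linear time comes from.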

Regarding the choice of pivot:

The easiest solution is to choose a random pivot, which yields almost certain linear time. Deterministically, one can use median-of-3 pivot strategy (as in quicksort), which yields linear performance on partially sorted data, as is common in the real world. However, contrived sequences can still cause worst-case complexity; David Musser describes a "median-of-3 killer" sequence that allows an attack against that strategy, which was one motivation for his introselect algorithm.

One can assure linear performance even in the worst case by using a more sophisticated pivot strategy; this is done in the median of medians algorithm. However, the overhead of computing the pivot is high, and thus this is generally not used in practice. One can combine basic quickselect with median of medians as fallback to get both fast average case performance and linear worst-case performance; this is done in introselect.

(Quoted from Wikipedia)
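In practice you rarely have to implement any of this yourself. For example, C++'s std::nth_element performs exactly this kind of selection with linear average-case complexity, and at least some standard-library implementations (e.g. GNU libstdc++) use an introselect-style strategy as described above. A minimal top-k usage sketch, where the top_k helper name is just for this example:

```cpp
#include <algorithm>
#include <cstddef>
#include <functional>
#include <vector>

// Sketch: return the k largest values of v (in no particular order) using
// std::nth_element, which partitions around the k-th position in average
// linear time.
std::vector<int> top_k(std::vector<int> v, std::size_t k) {
    if (k > v.size()) k = v.size();
    // With std::greater, everything before v.begin() + k is >= everything after it.
    std::nth_element(v.begin(), v.begin() + k, v.end(), std::greater<int>());
    v.resize(k);   // keep only the k largest; sort them afterwards if order matters
    return v;
}
```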

So you are very likely to get O(n) performance with random pivots, but if k is small and n is large, or if you are simply unlucky, the O(n log k) solution using a heap or BST of size k may outperform it.
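For comparison, here is a minimal sketch of the O(n log k) heap approach mentioned in the question, using a min-heap of size k; again illustrative only, with made-up names.

```cpp
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Sketch: scan the data once, keeping the k largest elements seen so far in a
// min-heap of size k. Each element costs at most O(log k) heap work, so the
// whole pass is O(n log k).
std::vector<int> top_k_heap(const std::vector<int>& data, std::size_t k) {
    if (k == 0) return {};
    // std::greater turns priority_queue into a min-heap: the smallest of the
    // current top-k sits on top and is the first candidate for eviction.
    std::priority_queue<int, std::vector<int>, std::greater<int>> heap;
    for (int x : data) {
        if (heap.size() < k) {
            heap.push(x);
        } else if (x > heap.top()) {   // x beats the weakest of the current top-k
            heap.pop();
            heap.push(x);
        }
    }
    std::vector<int> result;
    while (!heap.empty()) {            // pops come out in ascending order
        result.push_back(heap.top());
        heap.pop();
    }
    return result;
}
```

Note that the results come out in ascending order; reverse them if you want the largest value first.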

We cannot tell you for certain which one will be faster - it depends on (1) the exact implementation, (2) the machine it runs on, (3) the exact sizes of n and k, and finally (4) the actual data. The O(n log k) solution should be good enough for most purposes.

Regarding "algorithm - Selection algorithm runtime", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/21174938/
