c++ - 为什么当数组有重复值时快速排序算法持续时间会增加?

标签 c++ performance time-complexity quicksort mergesort

我正在尝试使用 std::chrono 时间计算并使用某个范围 [A, B] 内随机生成的整数数组来测量合并排序和快速排序函数的持续时间,数组的大小从 5000 到100,000 个整数。

我的代码的目标是证明当在快速排序中选择(枢轴)的方法得到改进时,快速排序函数最终比合并排序花费更少的时间来处理数组,我选择枢轴的方式正在使用随机索引方法来最小化复杂度为 (n^2) 的概率,但是在我将在下面描述的某些情况下,快速排序最终会比合并排序花费更多的时间,我想知道为什么会这样发生。

案例一: 数组中数字的范围较小,增加了数组中出现重复数字的概率。

案例二: 当我使用像 clion 这样的本地 IDE 时,快速排序功能比合并排序花费更多的时间,但是像 IDEONE.com 这样的在线编译器在两种排序算法中给出了相似的结果(即使生成的整数范围很小)

这是我在上述情况下得到的结果(第一行数字是归并排序结果,第二行是快速排序结果):

1-clion 结果数字范围窄 (-100, 600) clion results narrow range of numbers (-100, 600)

具有广泛数字(INT_MIN、INT_MAX)的 2-clion 结果 clion results with a wide range of numbers (INT_MIN, INT_MAX)

3-IDEONE 结果的数字范围很窄 (-100, 600) IDEONE results with a narrow range of numbers (-100, 600)

4- IDEONE 结果具有广泛的数字(INT_MIN、INT_MAX) IDEONE results with a wide range of numbers (INT_MIN, INT_MAX)

#include <bits/stdc++.h>
#include <chrono>
#include <random>

using namespace std;

mt19937 gen(chrono::steady_clock::now().time_since_epoch().count());
int* generateArray(int size)
{
    int* arr = new int[size];
    uniform_int_distribution<> distribution(INT_MIN, INT_MAX);
    for (int i=0; i < size; ++i)
    {
        arr[i] = distribution(gen);
    }
    return arr;
}
void merge(int* leftArr, int nL, int* rightArr, int nR, int* mainArr)
{
    int i=0, j=0, k=0;
    while (i < nL && j < nR)
    {
        if (leftArr[i] < rightArr[j]) { mainArr[k++] = leftArr[i++]; }
        else { mainArr[k++] = rightArr[j++]; }
    }
    while (i < nL){ mainArr[k++] = leftArr[i++]; }
    while (j < nR){ mainArr[k++] = rightArr[j++]; }
}
void mergeSort (int* mainArray, int arrayLength)
{
    if (arrayLength < 2) { return; }
    int mid = arrayLength/2;
    int* leftArray = new int[mid];
    int* rightArray = new int[arrayLength - mid];
    for (int i=0; i<mid; ++i) {leftArray[i] = mainArray[i];}
    for (int i = mid; i<arrayLength; ++i) {rightArray[i - mid] = mainArray[i];}
    mergeSort(leftArray, mid);
    mergeSort(rightArray, arrayLength-mid);
    merge(leftArray, mid, rightArray, arrayLength-mid, mainArray);
    delete[] leftArray;
    delete[] rightArray;
}


int partition (int* arr, int left, int right)
{
    uniform_int_distribution<> distribution(left, right);
    int idx = distribution(gen);
    swap(arr[right], arr[idx]);
    int pivot = arr[right];
    int partitionIndex = left;
    for (int i = left; i < right; ++i)
    {
        if (arr[i] <= pivot)
        {
            swap(arr[i], arr[partitionIndex]);
            partitionIndex++;
        }
    }
    swap(arr[right], arr[partitionIndex]);
    return partitionIndex;
}
void quickSort (int* arr, int left, int right)
{
    if(left < right)
    {
        int partitionIndex = partition(arr, left, right);
        quickSort(arr, left, partitionIndex-1);
        quickSort(arr, partitionIndex+1, right);
    }
}

int main()
{
    vector <long long> mergeDuration;
    vector <long long> quickDuration;
    for (int i = 5000; i<= 100000; i += 5000)
    {
        int* arr = generateArray(i);
        auto startTime = chrono::high_resolution_clock::now();
        quickSort(arr, 0, i - 1);
        auto endTime = chrono::high_resolution_clock::now();
        long long duration = chrono::duration_cast<chrono::milliseconds>(endTime - startTime).count();
        quickDuration.push_back(duration);
        delete[] arr;
    }
    for (int i = 5000; i <= 100000; i += 5000 )
    {
        int* arr = generateArray(i);
        auto startTime = chrono::high_resolution_clock::now();
        mergeSort(arr, i);
        auto endTime = chrono::high_resolution_clock::now();
        long long duration = chrono::duration_cast<chrono::milliseconds>(endTime - startTime).count();
        mergeDuration.push_back(duration);
        delete[] arr;
    }
    for (int i = 0; i<mergeDuration.size(); ++i)
    {
        cout << mergeDuration[i] << " ";
    }
    cout  << endl;
    for (int i = 0; i<quickDuration.size(); ++i)
    {
        cout << quickDuration[i] << " ";
    }
}

最佳答案

众所周知,当输入集包含大量重复项时,快速排序会表现不佳。解决方案是使用三向分区,如 Wikipedia 中所述。 :

Repeated elements

With a partitioning algorithm such as the ones described above (even with one that chooses good pivot values), quicksort exhibits poor performance for inputs that contain many repeated elements. The problem is clearly apparent when all the input elements are equal: at each recursion, the left partition is empty (no input values are less than the pivot), and the right partition has only decreased by one element (the pivot is removed). Consequently, the algorithm takes quadratic time to sort an array of equal values.

To solve this problem (sometimes called the Dutch national flag problem), an alternative linear-time partition routine can be used that separates the values into three groups: values less than the pivot, values equal to the pivot, and values greater than the pivot. ... The values equal to the pivot are already sorted, so only the less-than and greater-than partitions need to be recursively sorted. In pseudocode, the quicksort algorithm becomes

algorithm quicksort(A, lo, hi) is
    if lo < hi then
        p := pivot(A, lo, hi)
        left, right := partition(A, p, lo, hi)  // note: multiple return values
        quicksort(A, lo, left - 1)
        quicksort(A, right + 1, hi)

The partition algorithm returns indices to the first ('leftmost') and to the last ('rightmost') item of the middle partition. Every item of the partition is equal to p and is therefore sorted. Consequently, the items of the partition need not be included in the recursive calls to quicksort.

以下修改后的 quickSort 给出了更好的结果:

pair<int,int> partition(int* arr, int left, int right)
{
    int idx = left + (right - left) / 2;
    int pivot = arr[idx]; // to be improved to median-of-three
    int i = left, j = left, b = right - 1;
    while (j <= b) {
        auto x = arr[j];
        if (x < pivot) {
            swap(arr[i], arr[j]);
            i++;
            j++;
        } else if (x > pivot) {
            swap(arr[j], arr[b]);
            b--;
        } else {
            j++;
        }
    }
    return { i, j };
}
void quickSort(int* arr, int left, int right)
{
    if (left < right)
    {
        pair<int, int> part = partition(arr, left, right);
        quickSort(arr, left, part.first);
        quickSort(arr, part.second, right);
    }
}

输出:

0 1 2 3 4 5 6 7 8 9 11 11 12 13 14 15 16 19 18 19
0 0 0 1 0 1 1 1 1 1 2 3 2 2 2 2 3 3 3 3

0 1 2 3 4 5 6 6 8 8 9 12 11 12 13 14 16 17 18 19
0 0 1 1 1 2 3 3 3 4 4 4 5 6 5 6 7 7 8 8

因此,包含大量重复项的运行现在要快得多。

关于c++ - 为什么当数组有重复值时快速排序算法持续时间会增加?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/55415856/

相关文章:

algorithm - 为什么在平均情况下链排序 O(n sqrt n)?

c++ - C++ 中的嵌套可变参数函数

c++ - 使用 C++ 迭代器引用无效内存位置

c++ - steady_clock 的纪元是相对于操作系统启动的时间吗?还是过程本身?

c++ - 使用operator=和initializer_list找不到继承

javascript - 提高 html canvas mousemove 图像蒙版的性能

algorithm - 证明如果 g(n) 是 o(f(n)),则 f(n) + g(n) 是 Theta(f(n))

java - 锁定缓存键而不锁定整个缓存

java - 仅在首次访问时设置值——最佳实践、(微)性能?

algorithm - 使用 Trie 的时间复杂度为 O(m) 的单词搜索 - m 是单词的大小