c++ - 基数排序算法说明

标签 c++ sorting radix-sort

我是新手,我正在寻找C++中的基数排序实现,我发现了这一点
代码在这里。

void countSort(string a[], int size, size_t k)
{
    string *b = NULL; int *c = NULL;
    b = new string[size];
    c = new int[257];

    for (int i = 0; i <257; i++){
        c[i] = 0;   
    }

    for (int j = 0; j <size; j++){   
        c[k < a[j].size() ? (int)(unsigned char)a[j][k] + 1 : 0]++;
        //a[j] is a string
    }

    for (int f = 1; f <257; f++){
        c[f] += c[f - 1];
    }

    for (int r = size - 1; r >= 0; r--){
        b[c[k < a[r].size() ? (int)(unsigned char)a[r][k] + 1 : 0] - 1] = a[r];
        c[k < a[r].size() ? (int)(unsigned char)a[r][k] + 1 : 0]--;
    }

    for (int l = 0; l < size; l++){
        a[l] = b[l];
    }

    // avold memory leak
    delete[] b;
    delete[] c;
}
void radixSort(string b[], int r)
{
    size_t max = getMax(b, r);
    for (size_t digit = max; digit > 0; digit--){ 
        countSort(b, r, digit - 1);
    }
}
所以我的问题是这些行是做什么的:
c[k < a[j].size() ? (int)(unsigned char)a[j][k] + 1 : 0]++;
b[c[k < a[r].size() ? (int)(unsigned char)a[r][k] + 1 : 0] - 1] = a[r];
c[k < a[r].size() ? (int)(unsigned char)a[r][k] + 1 : 0]--;
是MSD还是LSD基数排序?
谢谢。

最佳答案

这是不必要的紧凑形式的整洁示例,因此很难阅读代码。
要对其进行分析有助于将其分开:

// what a mess...
c[k < a[j].size() ? (int)(unsigned char)a[j][k] + 1 : 0]++;
首先取出c的订阅参数:
// determine index for c
const int iC = k < a[j].size() ? (int)(unsigned char)a[j][k] + 1 : 0;
// post-increment c (as it is it could become a pre-increment as well)
c[iC]++;
索引计算包含一个条件:
// determine index for c
const int iC
  // check whether k is (not) exceeding the size of a
  = k < a[j].size()
  // then 
  ? (int)(unsigned char)a[j][k] + 1
  // else
  : 0;
数组astd::string的数组,其中std::string本身包含char数组。因此,a[j][k]产生单个charchar可以是有符号的或无符号的-留给编译器。因此,(unsigned char)a[j][k]不会更改该char的位,而是将它们解释为无符号数字。然后(int)(unsigned char)a[j][k]将其提升为int
请注意,如果当前编译器已签名(int)a[j][k],则此名称可能与char不同,因为在这种情况下,将保留值的可能符号。 (这称为sign extension。)因此,整个过程仅负责将当前字符转换为(正)索引并最终加1。

实际上,我打算将其余内容留给读者练习,但是后来我看到了:
b[c[k < a[r].size() ? (int)(unsigned char)a[r][k] + 1 : 0] - 1] = a[r];
像上面一样将其分离,结果为:
const int iC = k < a[r].size() ? (int)(unsigned char)a[r][k] + 1 : 0;
const int iB = c[iC - 1]; // What?
b[iB] = a[r];
考虑到iC可能会导致0(虽然我没有检查整个代码是否完全可行),所以iC - 1可能会导致-1。因此,可以访问c[-1]
如果例如c,其中的指针指向更大的数组,但不在数组的开头。因此,负索引将访问有效存储。这里似乎不是这样:
c = new int[257];
而且我看不到对c的任何其他分配。
这一切看起来并不值得信赖。最好情况下,该条件过于悲观,并且永远不会分配0。

我非常确定我可以证明,如果紧凑的代码无助于更轻松地发现其中的潜在问题,则可以提高可读性。
那么,非紧凑代码会更慢吗?
根据我的经验,其惊人的优化功能并不适用于现代编译器。
我曾经读过一篇有关优化和Static single assignment form的文章。
同样,当我调试C++代码(肯定不包含任何名为$$的变量)时,我会不时在Visual Studios调试器监视窗口中看到所有有趣的$$变量。
因此,我相信编译器也会在内部做类似的事情。 –明确地执行此操作以提高可读性应该不会对性能产生最小的影响。
如果我真的有疑问,我仍然可以检查汇编器输出。
(例如Compiler Explorer是一个好地方。)

顺便说一句。 c = new int[257];
为什么不使用int c[257];
257个int值并不太多,我担心会立即超过堆栈大小。
更不用说,数组,尤其是分配了new的数组,实际上是糟糕的C++风格,需要U.B.。好像std::vector尚未被发明…

当我还是学生的时候,我就以某种方式错过了有关Radix排序的类(class)(尽管我必须承认,我还没有在日常业务中错过这一知识)。
因此,出于好奇,我浏览了Wikipedia,并重新实现了那里的描述。
旨在提供(希望更好)替换问题中发现和公开的OP。
因此,我实现了
  • 根据en.wikipedia.org: Radix sort – History上的描述进行幼稚的方法
  • ,然后OP显示我在de.wikipedia.org: Countingsort – Algorithmus上发现的方法(带有计数排序)。
  • #include <iostream>
    #include <sstream>
    #include <string>
    #include <vector>
    
    /* helper to find max. length in data strings
     */
    size_t maxLength(const std::vector<std::string> &data)
    {
      size_t lenMax = 0;
      for (const std::string &value : data) {
        if (lenMax < value.size()) lenMax = value.size();
      }
      return lenMax;
    }
    
    /* a naive implementation of radix sort
     * like described in https://en.wikipedia.org/wiki/Radix_sort
     */
    void radixSort(std::vector<std::string> &data)
    {
      /* A char has 8 bits - which encode (unsigned) the numbers of [0, 255].
       * Hence, 256 buckets are used for sorting.
       */
      std::vector<std::string> buckets[256];
      // determine max. length of input data:
      const size_t len = maxLength(data);
      /* iterate over data for according to max. length
       */
      for (size_t i = len; i--;) { // i-- -> check for 0 and post-decrement
        // sort data into buckets according to the current "digit":
        for (std::string &value : data) {
          /* digits after end of string are considered as '\0'
           * because 0 is the usual end-marker of C strings
           * and the least possible value of an unsigned char.
           * This shall ensure that an string goes before a longer
           * string with same prefix.
           */
          const unsigned char digit = i < value.size() ? value[i] : '\0';
          // move current string into the corresponding bucket
          buckets[digit].push_back(std::move(value));
        }
        // store buckets back into data (preserving current order)
        data.clear();
        for (std::vector<std::string> &bucket : buckets) {
          // append bucket to the data
          data.insert(data.end(),
            std::make_move_iterator(bucket.begin()),
            std::make_move_iterator(bucket.end()));
          bucket.clear();
        }
      }
    }
    
    /* counting sort as helper for the not so naive radix sort
     */
    void countSort(std::vector<std::string> &data, size_t i)
    {
      /* There are 256 possible values for an unsigned char
       * (which may have a value in [0, 255]).
       */
      size_t counts[256] = { 0 }; // initialize all counters with 0.
      // count how often a certain charater appears at the place i
      for (const std::string &value : data) {
        /* digits after end of string are considered as '\0'
         * because 0 is the usual end-marker of C strings
         * and the least possible value of an unsigned char.
         * This shall ensure that an string goes before a longer
         * string with same prefix.
         */
        const unsigned char digit = i < value.size() ? value[i] : '\0';
        // count the resp. bucket counter
        ++counts[digit];
      }
      // turn counts of digits into offsets in data
      size_t total = 0;
      for (size_t &count : counts) {
    #if 0 // could be compact (and, maybe, confusing):
        total = count += total; // as C++ assignment is right-associative
    #else // but is the same as:
        count += total; // add previous total sum to count
        total = count; // remember new total
    #endif // 0
      }
      // an auxiliary buffer to sort the input data into.
      std::vector<std::string> buffer(data.size());
      /* Move input into aux. buffer
       * while using the bucket offsets (the former counts)
       * for addressing of new positions.
       * This is done backwards intentionally as the offsets
       * are decremented from end to begin of partitions.
       */
      for (size_t j = data.size(); j--;) { // j-- -> check for 0 and post-decrement
        std::string &value = data[j];
        // see comment for digit above...
        const unsigned char digit = i < value.size() ? value[i] : '\0';
        /* decrement offset and use as index
         * Arrays (and vectors) in C++ are 0-based.
         * Hence, this is adjusted respectively (compared to the source of algorithm).
         */
        const size_t k = --counts[digit];
        // move input element into auxiliary buffer at the determined offset
        buffer[k] = std::move(value);
      }
      /* That's it.
       * Move aux. buffer back into data.
       */
      data = std::move(buffer);
    }
    
    /* radix sort using count sort internally
     */
    void radixCountSort(std::vector<std::string> &data)
    {
      // determine max. length of input data:
      const size_t len = maxLength(data);
      /* iterate over data according to max. length
       */
      for (size_t i = len; i--;) { // i-- -> check for 0 and post-decrement
        countSort(data, i);
      }
    }
    
    /* output of vector with strings
     */
    std::ostream& operator<<(std::ostream &out, const std::vector<std::string> &data)
    {
      const char *sep = " ";
      for (const std::string &value : data) {
        out << sep << '"' << value << '"';
        sep = ", ";
      }
      return out;
    }
    
    /* do a test for certain data
     */
    void test(const std::vector<std::string> &data)
    {
      std::cout << "Data: {" << data << " }\n";
      std::vector<std::string> data1 = data;
      radixSort(data1);
      std::cout << "Radix Sorted:       {" << data1 << " }\n";
      std::vector<std::string> data2 = data;
      radixCountSort(data2);
      std::cout << "Radix Count Sorted: {" << data2 << " }\n";
    }
    
    /* helper to turn a text into a vector of strings
     * (by separating at white spaces)
     */
    std::vector<std::string> tokenize(const char *text)
    {
      std::istringstream in(text);
      std::vector<std::string> tokens;
      for (std::string token; in >> token;) tokens.push_back(token);
      return tokens;
    }
    
    /* main program
     */
    int main()
    {
      // do some tests:
      test({ "Hi", "He", "Hello", "World", "Wide", "Web" });
      test({ });
      test(
        tokenize(
          "Radix sort dates back as far as 1887 to the work of Herman Hollerith on tabulating machines.\n"
          "Radix sorting algorithms came into common use as a way to sort punched cards as early as 1923.\n"
          "The first memory-efficient computer algorithm was developed in 1954 at MIT by Harold H. Seward.\n"
          "Computerized radix sorts had previously been dismissed as impractical "
          "because of the perceived need for variable allocation of buckets of unknown size.\n"
          "Seward's innovation was to use a linear scan to determine the required bucket sizes and offsets beforehand, "
          "allowing for a single static allocation of auxiliary memory.\n"
          "The linear scan is closely related to Seward's other algorithm - counting sort."));
    }
    
    输出:
    Data: { "Hi", "He", "Hello", "World", "Wide", "Web" }
    Radix Sorted:       { "He", "Hello", "Hi", "Web", "Wide", "World" }
    Radix Count Sorted: { "He", "Hello", "Hi", "Web", "Wide", "World" }
    Data: { }
    Radix Sorted:       { }
    Radix Count Sorted: { }
    Data: { "Radix", "sort", "dates", "back", "as", "far", "as", "1887", "to", "the", "work", "of", "Herman", "Hollerith", "on", "tabulating", "machines.", "Radix", "sorting", "algorithms", "came", "into", "common", "use", "as", "a", "way", "to", "sort", "punched", "cards", "as", "early", "as", "1923.", "The", "first", "memory-efficient", "computer", "algorithm", "was", "developed", "in", "1954", "at", "MIT", "by", "Harold", "H.", "Seward.", "Computerized", "radix", "sorts", "had", "previously", "been", "dismissed", "as", "impractical", "because", "of", "the", "perceived", "need", "for", "variable", "allocation", "of", "buckets", "of", "unknown", "size.", "Seward's", "innovation", "was", "to", "use", "a", "linear", "scan", "to", "determine", "the", "required", "bucket", "sizes", "and", "offsets", "beforehand,", "allowing", "for", "a", "single", "static", "allocation", "of", "auxiliary", "memory.", "The", "linear", "scan", "is", "closely", "related", "to", "Seward's", "other", "algorithm", "-", "counting", "sort." }
    Radix Sorted:       { "-", "1887", "1923.", "1954", "Computerized", "H.", "Harold", "Herman", "Hollerith", "MIT", "Radix", "Radix", "Seward's", "Seward's", "Seward.", "The", "The", "a", "a", "a", "algorithm", "algorithm", "algorithms", "allocation", "allocation", "allowing", "and", "as", "as", "as", "as", "as", "as", "at", "auxiliary", "back", "because", "been", "beforehand,", "bucket", "buckets", "by", "came", "cards", "closely", "common", "computer", "counting", "dates", "determine", "developed", "dismissed", "early", "far", "first", "for", "for", "had", "impractical", "in", "innovation", "into", "is", "linear", "linear", "machines.", "memory-efficient", "memory.", "need", "of", "of", "of", "of", "of", "offsets", "on", "other", "perceived", "previously", "punched", "radix", "related", "required", "scan", "scan", "single", "size.", "sizes", "sort", "sort", "sort.", "sorting", "sorts", "static", "tabulating", "the", "the", "the", "to", "to", "to", "to", "to", "unknown", "use", "use", "variable", "was", "was", "way", "work" }
    Radix Count Sorted: { "-", "1887", "1923.", "1954", "Computerized", "H.", "Harold", "Herman", "Hollerith", "MIT", "Radix", "Radix", "Seward's", "Seward's", "Seward.", "The", "The", "a", "a", "a", "algorithm", "algorithm", "algorithms", "allocation", "allocation", "allowing", "and", "as", "as", "as", "as", "as", "as", "at", "auxiliary", "back", "because", "been", "beforehand,", "bucket", "buckets", "by", "came", "cards", "closely", "common", "computer", "counting", "dates", "determine", "developed", "dismissed", "early", "far", "first", "for", "for", "had", "impractical", "in", "innovation", "into", "is", "linear", "linear", "machines.", "memory-efficient", "memory.", "need", "of", "of", "of", "of", "of", "offsets", "on", "other", "perceived", "previously", "punched", "radix", "related", "required", "scan", "scan", "single", "size.", "sizes", "sort", "sort", "sort.", "sorting", "sorts", "static", "tabulating", "the", "the", "the", "to", "to", "to", "to", "to", "unknown", "use", "use", "variable", "was", "was", "way", "work" }
    
    Live Demo on coliru
    请注意,字符串是按照解释字符的数值进行排序的。
    如果相反打算使用英语词典排序,则必须修改数字到存储桶的映射。因此,字符值的顺序可能会更改,并且会将相应的大写和小写字符映射到同一存储桶。
    经常复制字符串(或其他容器)会浪费时间和空间,在某种程度上,我充其量会阻止生产代码。
    move semantics是一种降低CPU压力的选项,同时保持代码非常干净并与后面的算法相当。
    这是我试图(据我所知)在示例代码中考虑的内容。

    关于c++ - 基数排序算法说明,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/64148671/

    相关文章:

    php - 使用 PHP 将数据排序为多维分层数组

    algorithm - "In-place"MSD 基数排序、堆栈空间和 Stack Overflow 的

    algorithm - 最重要的对比最低有效基数排序

    c++ - 执行 RSA 解密时密文无效

    c++ - MSVC 2015 SFINAE 未能找到合适的成员函数

    c++ - Qt3d/C++ - 如何使用 frameGraphe 来实现轮廓?

    用于设置核心亲和性的 C++ 风格?

    sorting - javamail imap如何通过接收日期desc获取邮件订单

    sorting - 在 SAS 中,有没有办法使用除一个之外的所有变量进行排序?

    c++ - 如何优化间接基数排序? (又名如何优化不可预测的内存访问模式)