c++ - 奇怪的算法性能

对于上下文，我编写了此算法来获取任何字符串的唯一子字符串的数量。它为字符串建立后缀树，该字符串对包含的节点进行计数，并将其作为答案返回。我想解决的问题需要O(n)算法，因此，这个问题仅与代码的行为方式有关，而与代码的执行情况无关。

struct node{
    char value = ' ';
    vector<node*> children;
    ~node()
    {
        for (node* child: children)
        {
            delete child;
        }
    }
};

int numberOfUniqueSubstrings(string aString, node*& root)
{
    root = new node();
    int substrings = 0;
    for (int i = 0; i < aString.size(); ++i)
    {
        string tmp = aString.substr(i, aString.size());
        node* currentNode = root;
        char indexToNext = 0;
        for (int j = 0; j < currentNode->children.size(); ++j)
        {
            if (currentNode->children[j]->value == tmp[indexToNext])
            {
                currentNode = currentNode->children[j];
                j = -1;
                indexToNext++;
            }
        }
        for (int j = indexToNext; j < tmp.size(); ++j)
        {
            node* theNewNode = new node;
            theNewNode->value = tmp[j];
            currentNode->children.push_back(theNewNode);
            currentNode = theNewNode;
            substrings++;
        }
    }
    return substrings;
}

我决定对这个算法进行基准测试，为此，我只需循环遍历一个大字符串，并在每次迭代中获取一个更大的子字符串，然后调用numberOfUniqueSusbstrings来测量结束所需的时间。

我将其绘制为 Octave ，这就是我得到的(x是字符串大小，y是时间(以微秒为单位))

Performance plot, x is string length and y is time in microseconds

我首先以为问题出在输入字符串中，但这只是我从书中得到的字母数字字符串(其他任何文本的行为也一样奇怪)。

还尝试对具有相同参数的函数的多次调用求平均，结果几乎相同。

这是使用g++ problem.cpp -std=c++14 -O3编译的，但似乎在-O2和-O0上也是如此。

编辑:
在@interjay回答之后，我尝试做的只是将函数保留为:

int numberOfUniqueSubstrings(string aString, node*& root)
{
    root = new node();
    int substrings = 0;
    for (int i = 0; i < aString.size(); ++i)
    {
        node* currentNode = root;
        char indexToNext = i;
        for (int j = 0; j < currentNode->children.size(); ++j)
        {
            if (currentNode->children[j]->value == aString[indexToNext])
            {
                currentNode = currentNode->children[j];
                j = -1;
                indexToNext++;
            }
        }
        for (int j = indexToNext; j < aString.size(); ++j)
        {
            node* theNewNode = new node;
            theNewNode->value = aString[j];
            currentNode->children.push_back(theNewNode);
            currentNode = theNewNode;
            substrings++;
        }
    }
    return substrings;
}

确实确实使它更快。但对于我绘制此图也并不奇怪:

x = 1000发生了某些事情，我不知道这可能是什么。

另一个好的度量图:

我现在运行gprof来处理大小为999的字符串:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
100.15      0.02     0.02      974    20.56    20.56  node::~node()
  0.00      0.02     0.00   498688     0.00     0.00  void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&)
  0.00      0.02     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z7imprimePK4node
  0.00      0.02     0.00        1     0.00     0.00  numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&)
^L
            Call graph


granularity: each sample hit covers 2 byte(s) for 49.93% of 0.02 seconds

index % time    self  children    called     name
                               54285             node::~node() [1]
                0.02    0.00     974/974         test(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [2]
[1]    100.0    0.02    0.00     974+54285   node::~node() [1]
                               54285             node::~node() [1]
-----------------------------------------------
                                                 <spontaneous>
[2]    100.0    0.00    0.02                 test(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [2]
                0.02    0.00     974/974         node::~node() [1]
                0.00    0.00       1/1           numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [12]
-----------------------------------------------
                0.00    0.00  498688/498688      numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [12]
[10]     0.0    0.00    0.00  498688         void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&) [10]
-----------------------------------------------
                0.00    0.00       1/1           __libc_csu_init [21]
[11]     0.0    0.00    0.00       1         _GLOBAL__sub_I__Z7imprimePK4node [11]
-----------------------------------------------
                0.00    0.00       1/1           test(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [2]
[12]     0.0    0.00    0.00       1         numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [12]
                0.00    0.00  498688/498688      void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&) [10]
-----------------------------------------------

对于大小为1001的字符串:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  us/call  us/call  name    
100.15      0.02     0.02      974    20.56    20.56  node::~node()
  0.00      0.02     0.00   498688     0.00     0.00  void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&)
  0.00      0.02     0.00        1     0.00     0.00  _GLOBAL__sub_I__Z7imprimePK4node
  0.00      0.02     0.00        1     0.00     0.00  numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&)


            Call graph


granularity: each sample hit covers 2 byte(s) for 49.93% of 0.02 seconds

index % time    self  children    called     name
                               54285             node::~node() [1]
                0.02    0.00     974/974         test(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [2]
[1]    100.0    0.02    0.00     974+54285   node::~node() [1]
                               54285             node::~node() [1]
-----------------------------------------------
                                                 <spontaneous>
[2]    100.0    0.00    0.02                 test(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [2]
                0.02    0.00     974/974         node::~node() [1]
                0.00    0.00       1/1           numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [12]
-----------------------------------------------
                0.00    0.00  498688/498688      numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [12]
[10]     0.0    0.00    0.00  498688         void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&) [10]
-----------------------------------------------
                0.00    0.00       1/1           __libc_csu_init [21]
[11]     0.0    0.00    0.00       1         _GLOBAL__sub_I__Z7imprimePK4node [11]
-----------------------------------------------
                0.00    0.00       1/1           test(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) [2]
[12]     0.0    0.00    0.00       1         numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [12]
                0.00    0.00  498688/498688      void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&) [10]
-----------------------------------------------


Index by function name

  [11] _GLOBAL__sub_I__Z7imprimePK4node [1] node::~node()
  [12] numberOfUniqueSubstrings(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, node*&) [10] void std::vector<node*, std::allocator<node*> >::_M_emplace_back_aux<node* const&>(node* const&)

但是，运行探查器似乎可以消除影响，并且两种情况下的时间几乎相同。

最佳答案

大多数人的工作假设似乎是将某种魔数(Magic Number)硬编码到库中，从而导致性能在999-1000左右发生相变(LSerni除外，后者预先观察到可能存在多个魔数(Magic Number))。

我将尝试系统地探讨此问题以及下面的其他一些假设(此答案结尾处提供了源代码)。

然后，我运行我的代码以查看是否可以在使用G++ 6.2.0-5ubuntu2作为-O3优化的编译器的情况下，在我的Intel(R)i5 CPU M480，Linux 4.8.0-34-通用计算机上复制您的结果。

果然，从999-1000(还有另一个接近1600的下降)有一个神奇的下降:

请注意，我的trans-1000数据集不如您的数据集整洁:这可能是因为我在计算机的后台运行其他一些东西，而您的测试环境却更安静。

我的下一个问题是:这个魔术1000数在环境之间稳定吗？

因此，我尝试使用G++ 4.9.2在Intel®Xeon®CPU E5-2680 v3，Linux 2.6.32-642.6.1.el6.x86_64计算机上运行代码。而且，毫不奇怪，魔术数是不同的，发生在975-976:

这告诉我们，如果有一个魔数(Magic Number)，它会在版本之间进行更改。由于某些原因，这降低了我对魔数(Magic Number)理论的信心。 (a)它改变了。 (b)1000 + 24字节的开销是魔术的不错选择。 975 + 49字节则少一些。 (c)第一个环境在较慢的处理器上具有更好的软件，但是第一个环境显示了我认为较差的性能:等到1000年才能加快速度。这似乎是一种回归。

我尝试了另一种测试:使用不同的随机输入数据运行程序。得出以下结果:

上图中的重点是999-1000下降不是很特殊。它看起来像之前的许多下降:速度缓慢降低，然后急剧提高。还值得注意的是，以前的许多液滴未对齐。

这向我暗示这是一种与输入有关的行为，并且运行之间存在相关性。因此，我想知道如果我通过随机化运行顺序来减少运行之间的相关性会发生什么。这给了:

999-1000左右仍在发生某些事情:

让我们进一步放大:

在使用较早版本软件的较快计算机上运行此命令可获得类似结果:

放大:

由于随机化考虑了不同长度的字符串的顺序实际上消除了运行之间的缓慢累积(上述相关性)，因此这表明您看到的现象需要某种全局状态。因此，C++字符串/vector 不能作为解释。因此，必须使用malloc，“OS”或体系结构约束。

请注意，当长度顺序是随机的时，在某个时候代码运行得较慢而不是更快。在我看来，这与超出某种高速缓存大小是一致的，但是信号中的噪声以及本文中的第一幅图也暗示可能存在内存碎片。因此，我决定在每次运行前重新启动程序，以确保获得新的堆。结果是:

现在我们看到不再有中断或跳跃。这表明缓存大小不是问题，而是观察到的行为与程序的整体内存使用有关。

另一个反对缓存效果的论点如下。这两台机器都具有32kB和256kB的L1和L2缓存，因此它们的缓存性能应该相似。我的慢速计算机具有3,072kB的L3缓存。如果您假设每个分配的页面大小为4kB，则1000个节点会分配4,000kB的空间，这接近缓存的大小。但是，该快速计算机具有30,720kB L3高速缓存，并在975处显示中断。如果该现象是一种高速缓存效应，则您希望该中断(如果有的话)稍后出现。因此，我很确定缓存在这里不起作用。

唯一剩下的元凶是malloc。

为什么会这样呢？我不知道。但是，作为程序员，我不在乎如下。

对此可能有一个解释，但是它的层次太深，无法更改或真正担心。我可以采取一些异国情调的方法来修复它，但这需要考虑其黑暗的下腹部的某个地方发生了什么。除非确实需要，否则我们专门使用诸如C++之类的高级语言来避免弄乱这些细节。

我的结果表明，在这种情况下我们不必这样做。 (a)最后一张图告诉我们，任何独立的代码运行都可能表现出接近最佳的行为；(b)随机化顺序运行可以提高性能，并且(c)效率损失约为百分之一。一秒钟，这是完全可以接受的，除非您要处理大量数据。

源代码如下。请注意，该代码将您版本的char indexToNext更改为int indexToNext，从而修复了可能的整数溢出问题。测试interjay's suggestion避免复制字符串实际上会导致性能下降。

#include <string>
#include <chrono>
#include <cstdlib>
#include <iostream>
#include <vector>
#include <time.h>
#include <algorithm>

struct profiler
{
  std::string name;
  std::chrono::high_resolution_clock::time_point p;
  profiler(std::string const &n) :
      name(n), p(std::chrono::high_resolution_clock::now()) { }
  ~profiler()
  {
      using dura = std::chrono::duration<double>;
      auto d = std::chrono::high_resolution_clock::now() - p;
      std::cout //<< name << ": "
          << std::chrono::duration_cast<dura>(d).count()
          << std::endl;
  }
};

#define PROFILE_BLOCK(pbn) profiler _pfinstance(pbn)

struct node {
  char value = ' ';
  std::vector<node*> children;
  ~node(){
    for (node* child: children)
      delete child;
  }
};

int numberOfUniqueSubstrings(const std::string aString, node*& root)
{
    root = new node();
    int substrings = 0;
    for (int i = 0; i < aString.size(); ++i)
    {
        node* currentNode = root;
        int indexToNext = i;
        for (int j = 0; j < currentNode->children.size(); ++j)
        {
            if (currentNode->children[j]->value == aString[indexToNext])
            {
                currentNode = currentNode->children[j];
                j = -1;
                indexToNext++;
            }
        }
        for (int j = indexToNext; j < aString.size(); ++j)
        {
            node* theNewNode  = new node;
            theNewNode->value = aString[j];
            currentNode->children.push_back(theNewNode);
            currentNode = theNewNode;
            substrings++;
        }
    }
    return substrings;
}


int main(int argc, char **argv){
  const int MAX_LEN = 1300;

  if(argc==1){
    std::cerr<<"Syntax: "<<argv[0]<<"<SEED> [LENGTH]"<<std::endl;
    std::cerr<<"Seed of -1 implies all lengths should be explore and input randomized from time."<<std::endl;
    std::cerr<<"Positive seed sets the seed and explores a single input of LENGTH"<<std::endl;
    return -1;
  }

  int seed = std::stoi(argv[1]);

  if(seed==-1)
    srand(time(NULL));
  else
    srand(seed);

  //Generate a random string of the appropriate length
  std::string a;
  for(int fill=0;fill<MAX_LEN;fill++)
      a.push_back('a'+rand()%26);

  //Generate a list of lengths of strings to experiment with
  std::vector<int> lengths_to_try;
  if(seed==-1){
    for(int i=1;i<MAX_LEN;i++)
      lengths_to_try.push_back(i);
  } else {  
    lengths_to_try.push_back(std::stoi(argv[2]));
  }

  //Enable this line to randomly sort the strings
  std::random_shuffle(lengths_to_try.begin(),lengths_to_try.end());

  for(auto len: lengths_to_try){
    std::string test(a.begin(),a.begin()+len);

    std::cout<<len<<" ";
    {
      PROFILE_BLOCK("Some time");
      node *n;
      int c = numberOfUniqueSubstrings(test,n);
      delete n;
    }
  }
}

substr是“常量”

OP的原始代码包括以下内容:

for (int i = 0; i < aString.size(); ++i)
{
  string tmp = aString.substr(i, aString.size());

此处的substr操作占用字符串长度的O(n)时间。在an answer below中，有人认为此O(n)操作导致OP原始代码的性能较差。

我不同意这种评估。由于缓存和SIMD操作，CPU最多可以以64个字节(或更多个字节)的块形式读取和复制数据。因此，内存分配的成本可以控制复制字符串的成本。因此，对于OP的输入大小，substr操作的行为更像是一个昂贵的常量，而不是附加的循环。

这可以通过使用以下代码编译代码进行测试来证明。 g++ temp.cpp -O3 --std=c++14 -g并进行分析，例如sudo operf ./a.out -1。生成的时间使用配置文件如下所示:

25.24%  a.out    a.out                [.] _ZN4nodeD2Ev        #Node destruction                                                                           
24.77%  a.out    libc-2.24.so         [.] _int_malloc                                                                                    
13.93%  a.out    libc-2.24.so         [.] malloc_consolidate                                                                            
11.06%  a.out    libc-2.24.so         [.] _int_free                                                                                      
 7.39%  a.out    libc-2.24.so         [.] malloc                                                                                        
 5.62%  a.out    libc-2.24.so         [.] free                                                                                          
 3.92%  a.out    a.out                [.] _ZNSt6vectorIP4nodeSaIS1_EE19_M_emplace_back_auxIJRKS1_EEEvDpOT_                              
 2.68%  a.out    a.out                [.]
 8.07%  OTHER STUFF

从中可以明显看出，内存管理主导着运行时间。

关于c++ - 奇怪的算法性能，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/41795109/

c++ - 奇怪的算法性能

上一篇：java - 使用 Java Spring Controller 从表单读取值

下一篇：java - 用Java模拟cURL