c++ - 哈希表的性能，为什么C++最慢？

对 hash 进行了简单的性能测试，似乎 C++ 版本比 perl 版本和 golang 版本都慢。

perl 版本花费了大约 200 毫秒，
C++ 版本耗时 280 毫秒。
golang 版本耗时 56 毫秒。

在配备 Core(TM) i7-2670QM CPU @ 2.20GHz、Ubuntu 14.04.3LTS 的 PC 上，

有什么想法吗？

perl 版本

use Time::HiRes qw( usleep ualarm gettimeofday tv_interval nanosleep
                      clock_gettime clock_getres clock_nanosleep clock
                      stat );
sub getTS {
    my ($seconds, $microseconds) = gettimeofday;
    return $seconds + (0.0+ $microseconds)/1000000.0;
}
my %mymap;
$mymap{"U.S."} = "Washington";
$mymap{"U.K."} = "London";
$mymap{"France"} = "Paris";
$mymap{"Russia"} = "Moscow";
$mymap{"China"} = "Beijing";
$mymap{"Germany"} = "Berlin";
$mymap{"Japan"} = "Tokyo";
$mymap{"China"} = "Beijing";
$mymap{"Italy"} = "Rome";
$mymap{"Spain"} = "Madrad";
$x = "";
$start = getTS();
for ($i=0; $i<1000000; $i++) {
    $x = $mymap{"China"};
}
printf "took %f sec\n", getTS() - $start;

C++ 版本

#include <iostream>
#include <string>
#include <unordered_map>
#include <sys/time.h>

double getTS() {
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec + tv.tv_usec/1000000.0;
}
using namespace std;
int main () {
  std::unordered_map<std::string,std::string> mymap;

  // populating container:
    mymap["U.S."] = "Washington";
    mymap["U.K."] = "London";
    mymap["France"] = "Paris";
    mymap["Russia"] = "Moscow";
    mymap["China"] = "Beijing";
    mymap["Germany"] = "Berlin";
    mymap["Japan"] = "Tokyo";
    mymap["China"] = "Beijing";
    mymap["Italy"] = "Rome";
    mymap["Spain"] = "Madrad";  

  double start = getTS();
  string x;
  for (int i=0; i<1000000; i++) {
      mymap["China"];
  }
  printf("took %f sec\n", getTS() - start);
  return 0;
}

Go 语言版本

package main

import "fmt"
import "time"

func main() {
    var x string
    mymap := make(map[string]string)
    mymap["U.S."] = "Washington";
    mymap["U.K."] = "London";
    mymap["France"] = "Paris";
    mymap["Russia"] = "Moscow";
    mymap["China"] = "Beijing";
    mymap["Germany"] = "Berlin";
    mymap["Japan"] = "Tokyo";
    mymap["China"] = "Beijing";
    mymap["Italy"] = "Rome";
    mymap["Spain"] = "Madrad";
    t0 := time.Now()
    sum := 1
    for sum < 1000000 {
        x = mymap["China"]
        sum += 1
    }
    t1 := time.Now()
    fmt.Printf("The call took %v to run.\n", t1.Sub(t0))
    fmt.Println(x)
}

更新 1

为了改进 C++ 版本，更改了 x = mymap["China"];至 mymap["China"]; ，但性能差别很小。

更新 2

我在没有任何优化的情况下编译时得到了原始结果:g++ -std=c++11 unorderedMap.cc .使用“-O2”优化，只花费一半左右的时间(150ms)

更新 3

删除可能的char*至 string构造函数调用，我创建了一个字符串常量。时间减少到大约 220 毫秒(编译时没有优化)。感谢@neil-kirk 的建议，经过优化(-O2 标志)，时间约为 80 毫秒。

  double start = getTS();
  string x = "China";
  for (int i=0; i<1000000; i++) {
      mymap[x];
  }

更新 4

感谢 @steffen-ullrich 指出 perl 版本存在语法错误。我改变了它。性能数约为150ms。

更新 5

看来执行指令的数量很重要。使用命令 valgrind --tool=cachegrind <cmd>

Go 版本

$ valgrind --tool=cachegrind  ./te1
==2103== Cachegrind, a cache and branch-prediction profiler
==2103== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==2103== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==2103== Command: ./te1
==2103== 
--2103-- warning: L3 cache found, using its data for the LL simulation.
The call took 1.647099s to run.
Beijing
==2103== 
==2103== I   refs:      255,763,381
==2103== I1  misses:          3,709
==2103== LLi misses:          2,743
==2103== I1  miss rate:        0.00%
==2103== LLi miss rate:        0.00%
==2103== 
==2103== D   refs:      109,437,132  (77,838,331 rd   + 31,598,801 wr)
==2103== D1  misses:        352,474  (   254,714 rd   +     97,760 wr)
==2103== LLd misses:        149,260  (    96,250 rd   +     53,010 wr)
==2103== D1  miss rate:         0.3% (       0.3%     +        0.3%  )
==2103== LLd miss rate:         0.1% (       0.1%     +        0.1%  )
==2103== 
==2103== LL refs:           356,183  (   258,423 rd   +     97,760 wr)
==2103== LL misses:         152,003  (    98,993 rd   +     53,010 wr)
==2103== LL miss rate:          0.0% (       0.0%     +        0.1%  )

针对C++优化版(无优化标志)

$ valgrind --tool=cachegrind ./a.out
==2180== Cachegrind, a cache and branch-prediction profiler
==2180== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==2180== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==2180== Command: ./a.out
==2180== 
--2180-- warning: L3 cache found, using its data for the LL simulation.
took 64.657681 sec
==2180== 
==2180== I   refs:      5,281,474,482
==2180== I1  misses:            1,710
==2180== LLi misses:            1,651
==2180== I1  miss rate:          0.00%
==2180== LLi miss rate:          0.00%
==2180== 
==2180== D   refs:      3,170,495,683  (1,840,363,429 rd   + 1,330,132,254 wr)
==2180== D1  misses:           12,055  (       10,374 rd   +         1,681 wr)
==2180== LLd misses:            7,383  (        6,132 rd   +         1,251 wr)
==2180== D1  miss rate:           0.0% (          0.0%     +           0.0%  )
==2180== LLd miss rate:           0.0% (          0.0%     +           0.0%  )
==2180== 
==2180== LL refs:              13,765  (       12,084 rd   +         1,681 wr)
==2180== LL misses:             9,034  (        7,783 rd   +         1,251 wr)
==2180== LL miss rate:            0.0% (          0.0%     +           0.0%  )

C++优化版

$ valgrind --tool=cachegrind ./a.out
==2157== Cachegrind, a cache and branch-prediction profiler
==2157== Copyright (C) 2002-2013, and GNU GPL'd, by Nicholas Nethercote et al.
==2157== Using Valgrind-3.10.0.SVN and LibVEX; rerun with -h for copyright info
==2157== Command: ./a.out
==2157== 
--2157-- warning: L3 cache found, using its data for the LL simulation.
took 9.419447 sec
==2157== 
==2157== I   refs:      1,451,459,660
==2157== I1  misses:            1,599
==2157== LLi misses:            1,549
==2157== I1  miss rate:          0.00%
==2157== LLi miss rate:          0.00%
==2157== 
==2157== D   refs:        430,486,197  (340,358,108 rd   + 90,128,089 wr)
==2157== D1  misses:           12,008  (     10,337 rd   +      1,671 wr)
==2157== LLd misses:            7,372  (      6,120 rd   +      1,252 wr)
==2157== D1  miss rate:           0.0% (        0.0%     +        0.0%  )
==2157== LLd miss rate:           0.0% (        0.0%     +        0.0%  )
==2157== 
==2157== LL refs:              13,607  (     11,936 rd   +      1,671 wr)
==2157== LL misses:             8,921  (      7,669 rd   +      1,252 wr)
==2157== LL miss rate:            0.0% (        0.0%     +        0.0%  )

最佳答案

来自您的 Perl 代码(在您尝试修复它之前):

@mymap = ();
$mymap["U.S."] = "Washington";
$mymap["U.K."] = "London";

这不是 map 而是数组。 HashMap 的语法是:

  %mymap;
  $mymap{"U.S."} = ....

因此，您有效地做的是创建一个数组而不是 HashMap ，并始终访问元素 0。请在 Perl 中始终使用 use strict; 和 use warnings;，即使是简单的带有警告的语法检查也会向您显示问题所在:

perl -cw x.pl
Argument "U.S." isn't numeric in array element at x.pl line 9.
Argument "U.K." isn't numeric in array element at x.pl line 10.

除此之外，基准测试的主要部分实际上没有做任何有用的事情(分配一个变量并且从不使用它)并且一些编译器可以检测到它并简单地优化它。

如果您检查 Perl 程序生成的代码，您会看到:

$ perl -MO=Deparse x.pl
@mymap = ();
$mymap[0] = 'Washington';
$mymap[0] = 'London';
...
for ($i = 0; $i < 1000000; ++$i) {
    $x = $mymap[0];
}

那是它在编译时检测到问题并用对数组索引 0 的访问替换它。

因此，无论何时进行基准测试，您都需要:

检查您的程序是否确实按照预期进行。
检查编译器是否没有优化某些东西，也没有在编译时执行其他语言在运行时执行的东西。任何类型的没有结果或可预测结果的语句都容易进行此类优化。
确认您确实衡量了您打算衡量的内容，而不是其他内容。即使对程序进行微小的更改也会影响运行时间，因为需要内存分配，而这些更改可能与您打算测量的内容无关。在您的基准测试中，您一次又一次地测量对同一散列元素的访问，而不访问其间的其他元素。这是一项可以很好地使用处理器缓存的事件，但可能无法反射(reflect)真实世界的使用情况。

而且，使用简单的计时器并不是一个现实的基准。系统上还有其他进程，有调度程序，有缓存垃圾……对于今天的 CPU，它在很大程度上取决于系统上的负载，因为 CPU 可能会以比其他基准测试更低的功耗模式运行一个基准测试，即使用不同的 CPU 时钟。例如，同一“基准测试”的多次运行在我的系统上的测量时间在 100 毫秒到 150 毫秒之间变化。

基准是谎言，像你这样的微观基准更是如此。

关于c++ - 哈希表的性能，为什么C++最慢？，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/33950565/

c++ - 哈希表的性能，为什么C++最慢？

上一篇：json - 如何在 Golang 中合并两个结构？

下一篇：unit-testing - Go 单元测试中的自定义命令行标志