optimization - Why is memset slow?

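My CPU's specs say it should get 5.336GB/s of memory bandwidth. To test this, I wrote a simple program that runs memset (or memcpy) over a large array and reports the timing. I'm seeing 3.8GB/s for memset and 1.9GB/s for memcpy. http://en.wikipedia.org/wiki/Intel_Core_(microarchitecture) says my Q9400 should get 5.336GB/s. What's going on?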

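I have tried replacing memset or memcpy with an assignment loop. I have searched around to try to understand memory alignment. I have tried different compiler flags. I have spent an embarrassing number of hours on this. Thanks for any help you can offer!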

I'm using Ubuntu 12.04 with libc-dev version 2.15-0ubuntu10.5 and kernel 3.8.0-37-generic.

Code:

#include <stdio.h>
#include <time.h>
#include <string.h>
#include <stdlib.h>

#define numBytes ((long)(1024*1024*1024))
#define numTransfers ((long)(8))

int main(int argc,char**argv){
    if(argc!=3){
        printf("Usage: %s BLOCK_SIZE_IN_BYTES NUMBER_OF_BLOCKS_TO_TRANSFER\n",argv[0]);
        return -1;
    }
    char*__restrict__ source=(char*)malloc(numBytes);
    char*__restrict__ dest=(char*)malloc(numBytes);
    struct timespec start,end;
    long totalTimeMs;
    int i;

    clock_gettime(CLOCK_MONOTONIC_RAW,&start);
    for(i=0;i<numTransfers;++i)
        memset(source,0,numBytes);
    clock_gettime(CLOCK_MONOTONIC_RAW,&end);
    totalTimeMs=(end.tv_nsec-start.tv_nsec)*.000001+1000*(end.tv_sec-start.tv_sec);
    printf("memset %ld bytes %ld times (%.2fGB total) in %ldms (%.3fGB/s). ",numBytes,numTransfers,numBytes/1024.0/1024/1024*numTransfers,totalTimeMs,numBytes/1024.0/1024/1024*1000*numTransfers/totalTimeMs);

    clock_gettime(CLOCK_MONOTONIC_RAW,&start);
    for(i=0;i<numTransfers;++i)
        memcpy( dest, source, numBytes);
    clock_gettime(CLOCK_MONOTONIC_RAW,&end);
    totalTimeMs=(end.tv_nsec-start.tv_nsec)*.000001+1000*(end.tv_sec-start.tv_sec);
    printf("memcpy %ld bytes %ld times (%.2fGB total) in %ldms (%.3fGB/s).\n",numBytes,numTransfers,numBytes/1024.0/1024/1024*numTransfers,totalTimeMs,numBytes/1024.0/1024/1024*1000*numTransfers/totalTimeMs);

    free(source);
    free(dest);

    return EXIT_SUCCESS;
}

Compile commands:

gcc -O3 -DNDEBUG -o memcpyStackOverflowNoParameters.c.o -c memcpyStackOverflowNoParameters.c
gcc -O3 -DNDEBUG memcpyStackOverflowNoParameters.c.o -o memcpy -rdynamic -lrt

Sample output:

memset 1073741824 bytes 8 times (8.00GB total) in 2214ms (3.880GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4466ms (1.923GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2218ms (3.873GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4557ms (1.885GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2222ms (3.866GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4433ms (1.938GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2216ms (3.876GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4521ms (1.900GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2217ms (3.875GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4520ms (1.900GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2218ms (3.873GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4430ms (1.939GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2226ms (3.859GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4444ms (1.933GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2225ms (3.861GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4485ms (1.915GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2620ms (3.279GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4855ms (1.769GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2535ms (3.389GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4870ms (1.764GB/s).
memset 1073741824 bytes 8 times (8.00GB total) in 2423ms (3.545GB/s). memcpy 1073741824 bytes 8 times (8.00GB total) in 4905ms (1.751GB/s).

My hardware, according to lshw:

  product: OptiPlex 960 ()
  vendor: Winbond Electronics
  width: 64 bits
*-core
     description: Motherboard
     product: 0Y958C
     vendor: Winbond Electronics
   *-firmware
        description: BIOS
        capabilities: pci pnp apm upgrade shadowing escd cdboot bootselect edd int13floppytoshiba int13floppy720 int5printscreen int9keyboard int14serial int17printer acpi usb biosbootspecification netboot
   *-cpu
        product: Intel(R) Core(TM)2 Quad CPU    Q9400  @ 2.66GHz
        physical id: 400
        size: 2666MHz
        width: 64 bits
        clock: 1333MHz
        capabilities: x86-64 fpu fpu_exception wp vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx constant_tsc arch_perfmon pebs bts rep_good nopl aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm sse4_1 xsave lahf_lm dtherm tpr_shadow vnmi flexpriority
        configuration: cores=4 enabledcores=4 threads=4
      *-cache:0
           description: L1 cache
           physical id: 700
           size: 256KiB
           capacity: 256KiB
           capabilities: internal write-back unified
      *-cache:1
           description: L2 cache
           physical id: 701
           size: 6MiB
           capacity: 6MiB
           capabilities: internal varies unified
   *-memory
        description: System Memory
        physical id: 1000
        slot: System board or motherboard
        size: 4GiB
      *-bank:0
           description: DIMM DDR2 Synchronous 667 MHz (1.5 ns)
           product: CT51264AA667.M16FC
           vendor: 7F7F7F7F7F9B0000
           slot: DIMM_1
           size: 4GiB
           width: 64 bits
           clock: 667MHz (1.5ns)
      *-bank:1
           description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]
      *-bank:2
           description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]
      *-bank:3
           description: DIMM DDR2 Synchronous 667 MHz (1.5 ns) [empty]

Best answer

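Memory addresses are "virtualized": the addresses your program uses are translated to real (physical) addresses. This translation is what lets the system allocate what your program sees as contiguous memory from whatever parts are convenient at the time. Every general-purpose CPU does this. The translation requires table lookups, which themselves take memory accesses. The CPU has a cache for these translations, the "TLB" ("translation lookaside buffer"), but long runs of virtual addresses easily blow through it. So every 4KB (or 2MB on Linux systems that figure out what you're doing), the CPU stalls while it looks up where to really send the memory traffic. Those stalls can take quite a while. You might try running two copies of your benchmark at once: it seems plausible that their TLB misses wouldn't coincide, and you'd get total bandwidth closer to the rated capacity.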

(Edit: um, you probably want to replace the #defines with

size_t numBytes=atoi(argv[1]);
size_t numTransfers=atoi(argv[2]);

in main...)

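Edit: by the way, the bandwidth I saw from this test on my own box (and reported in the comments) was so far below my CPU's rated capacity that it got me investigating my own system. My box's manufacturer had put really poor memory in the slots. I have since replaced it with a well-known brand; reported throughput more than doubled, and my machine's performance improved very noticeably.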

Regarding optimization - Why is memset slow?, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/23374286/
