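C function to flush all cache lines that hold an array

I am trying to force a user application to flush, from all cache levels, every cache line that holds an array (created by the application itself).
After reading this post ( Cflush to invalidate cache line via C function ) and with a lot of guidance from @PeterCordes, I tried to come up with a C function to accomplish this.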

#include <x86intrin.h>
#include <stdint.h>
#include <stdlib.h>   // calloc, size_t

inline void flush_cache_range(uint64_t *ptr, size_t len){
    size_t i;
    // prevent any load or store to be scheduled across 
    // this point due to CPU Out of Order execution.
    _mm_mfence();
    for(i=0; i<len; i++)
        // flush the cache line that contains ptr+i from 
        // all cache levels
        _mm_clflushopt(ptr+i); 
    _mm_mfence();
}

int main(){
    size_t plen = 131072; // or much bigger
    uint64_t *p = calloc(plen,sizeof(uint64_t));
    for(size_t i=0; i<plen; i++){
        p[i] = i;
    }
    flush_cache_range(p,plen);
    // at this point, accessing any element of p should
    // cause a cache miss. As I access them, adjacent
    // elements and perhaps pre-fetched ones will come
    // along.
    (...)
    return 0;
}

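I am compiling with gcc -O3 -march=native source.c -o exec.bin and running on an AMD Zen2 processor with kernel 5.11.14 (Fedora 33).
I don't fully understand the differences between mfence / sfence / lfence, or when one or the other is sufficient, so I just used mfence because I believe it imposes the strongest restrictions (am I right?).
My question is: am I missing something in this implementation? Will it do what I imagine it does? (What I imagine is described in the comment after the call to flush_cache_range.)
Thanks.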

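Edit 1: flushing once per line, and removing the fences.
Following @PeterCordes' answer, I am making some adjustments:

First, the function receives a pointer to char and its size in chars. Because chars are 1 byte long, I can control how far to jump from one flush to the next.

Then, I need to confirm the cache line size. I can get that information with the cpuid program: cpuid -1 | grep -A12 -e "--- cache [0-9] ---" For L1i, L1d, L2 and L3 I get line size in bytes = 0x40 (64), so that is the number of bytes I have to skip after each flush.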
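As an alternative to reading the cpuid output by hand, the line size can also be queried at run time. The following is only a minimal sketch, assuming a glibc-based Linux system where sysconf accepts _SC_LEVEL1_DCACHE_LINESIZE; the helper name cache_line_size and the fallback value of 64 are assumptions made for illustration, not part of the original code.

```c
#include <stddef.h>
#include <unistd.h>

/* Return the L1 data cache line size in bytes.
   Falls back to 64 (an assumption that matches the cpuid output
   quoted above) when the C library cannot report a value. */
static size_t cache_line_size(void)
{
#ifdef _SC_LEVEL1_DCACHE_LINESIZE
    long n = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);  /* glibc extension */
    if (n > 0)
        return (size_t)n;
#endif
    return 64;
}
```

The result could replace the hard-coded 64 as the stride between flushes. sysconf may report 0 or -1 on some systems, which is why the sketch keeps a fallback.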

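Then I determine the pointer to the last char by computing ptr + len - 1.

And I loop over all the addresses, one per cache line, including the last one ( ptr_end ).

This is the updated version of the code: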

#include <stdio.h>
#include <stdlib.h>   // calloc
#include <x86intrin.h>
#include <stdint.h>

inline void flush_cache_range(char *ptr, size_t len);

void flush_cache_range(char *ptr, size_t len){
    const unsigned char cacheline = 64;
    char *ptr_end = ptr + len - 1;
    while(ptr <= ptr_end){
        _mm_clflushopt(ptr);
        ptr += cacheline;
    }
}

int main(){
    size_t i, sum=0, plen = 131072000; // or much bigger
    uint64_t *p = calloc(plen,sizeof(uint64_t));
    for(i=0; i<plen; i++){
        p[i] = i;
    }
    flush_cache_range((char*)p, sizeof(p[0])*plen);
    // there should be many cache misses now
    for(i=0; i<plen; i++){
        sum += p[i];
    }
    printf("sum is:%lu\n", sum);
    return 0;
}

Now when I compile and run it under perf: gcc -O3 -march=native src/source.c -o exec.bin && perf stat -e cache-misses,cache-references ./exec.bin I get:

sum is:8589934526464000
 Performance counter stats for './exec.bin':

         1,202,714      cache-misses:u # 1.570 % of all cache refs    
        76,612,476      cache-references:u                                          
       0.377100534 seconds time elapsed
       0.170473000 seconds user
       0.205574000 seconds sys

If I comment out the line that calls flush_cache_range, I get almost the same:

sum is:8589934526464000

 Performance counter stats for './exec.bin':
         1,211,462      cache-misses:u # 1.590 % of all cache refs    
        76,202,685      cache-references:u                                          
       0.356544645 seconds time elapsed
       0.160227000 seconds user
       0.195305000 seconds sys

What am I missing?

Edit 2: adding sfence, and fixing the loop limit

As suggested by @prl, I added the fence.

I changed ptr_end to point to the last byte of its cache line.

void flush_cache_range(char *ptr, size_t len){
    const unsigned char cacheline = 64;
    char *ptr_end = (char*)(((size_t)ptr + len - 1) | (cacheline - 1));

    while(ptr <= ptr_end){
        _mm_clflushopt(ptr);
        ptr += cacheline;
    }
    _mm_sfence();
}

I still get the same unexpected results in perf.

Best answer

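Yes, this looks correct but quite inefficient.
Your expectation of cache misses afterwards (mitigated by hardware prefetch) is justified. You could use perf stat to check, if you write some actual test code that uses the array later.

You run clflushopt on every separate uint64_t, but x86 cache lines are 64 bytes on every current CPU that supports clflushopt. So you're doing 8x as many flushes as needed, and repeated flushes of the same line can be quite slow on some CPUs. (Worse than flushing more lines that are hot in cache.)
See my answer on The right way to use function _mm_clflush to flush a large struct for the general case where the array starts at an unknown alignment relative to a cache line and the array size isn't a multiple of the line size. Run clflush or clflushopt once on each cache line that contains any part of the array / struct.
Flushing is idempotent apart from performance, so you could just loop in 64-byte increments and also flush the last byte of the array, but in that linked answer I came up with a cheap way to implement the loop logic so that it touches each line exactly once. For an array pointer + length, obviously use sizeof(ptr[0]) * len instead of sizeof(struct) like the linked answer used.

Code review: API
Flushing works on whole lines. Either take a char*, or a void* that you then cast to char* so you can increment it by the line size. Logically, you're giving the asm instruction a pointer, and it flushes just the one line containing that byte.

Memory barriers not needed before
mfence before flushing is pointless; clflushopt is ordered wrt. stores to the same cache line, so a store followed by clflushopt to the same line (in that order in asm) will happen in that order, flushing the newly stored data. The manual documents this (https://www.felixcloutier.com/x86/clflushopt is from Intel's; I assume AMD's manual documents the same semantics for it on their CPUs.)
I think/hope that C compilers treat _mm_clflushopt(p) as at least like a volatile access to the whole line containing p, and thus won't do compile-time reordering of stores to any C object that *p could alias. (And probably not loads either.) If not, you'd at most need asm("":::"memory"), a compile-time-only barrier. (Like atomic_signal_fence(memory_order_seq_cst), not atomic_thread_fence.)

I think it's also unnecessary to fence afterwards, if your loop is non-tiny and you only care about this thread getting cache misses. It's certainly not useful to use sfence, which doesn't order loads at all, unlike mfence or lfence.
The normal reason to use sfence after clflushopt is to guarantee that earlier stores have made it to persistent storage before any later stores, e.g. to make it possible to recover consistency after a crash. (In a system with Optane DC PM or other kinds of truly memory-mapped non-volatile RAM.) See for example this Q&A about clflushopt ordering and why sfence is sometimes needed.
That doesn't force later loads to miss: they're not ordered wrt. sfence and can thus execute early, before the sfence and before the clflushopt. mfence would prevent this. lfence (or about a ROB-size number of uops, e.g. 224 on Skylake) might, but waiting for clflushopt to retire from the out-of-order back end doesn't mean it has finished evicting the line. It probably works more like a store and has to go through the store buffer.

I tested this on my CPU, an i7-6700k Intel Skylake:

default rel
%ifdef __YASM_VER__
    CPU Conroe AMD
    CPU Skylake AMD
%else
%use smartalign
alignmode p6, 64
%endif

global _start
_start:
%if 1
    lea        rdi, [buf]
    lea        rsi, [bufsrc]
%endif

    mov     ebp, 10000000

    mov   [rdi], eax              ; dirty a couple BSS pages
    mov   [rdi+4096], edx
align 64
.loop:
    mov  eax, [rdi]
;    mov  eax, [rdi+8]
    clflushopt [rdi]              ; clflush vs. clflushopt doesn't make a difference here except in uops issued/executed
    sfence                        ; actually speeds things up
    ;mfence                       ; the load after this definitely misses.
    mov  eax, [rdi+16]
;    mov  eax, [rdi+24]

    add  rdi, 64                  ; next cache line
    and  rdi, -(1<<14)            ; wrap to 4 pages
    dec ebp
    jnz .loop
.end:

    xor edi,edi
    mov eax,231   ; __NR_exit_group  from /usr/include/asm/unistd_64.h
    syscall       ; sys_exit_group(0)

section .bss
align 4096
buf:    resb 4096*4096

bufsrc:  resb 4096
resb 100

t=testloop; asm-link -dn "$t".asm && taskset -c 3 perf stat --all-user -etask-clock,context-switches,cpu-migrations,page-faults,cycles,instructions,uops_issued.any,uops_executed.thread,mem_load_retired.l1_hit,mem_load_retired.l1_miss -r4 ./"$t"
+ nasm -felf64 -Worphan-labels testloop.asm
+ ld -o testloop testloop.o
...
0000000000401040 <_start.loop>:
  401040:       8b 07                   mov    eax,DWORD PTR [rdi]
  401042:       66 0f ae 3f             clflushopt BYTE PTR [rdi]
  401046:       0f ae f8                sfence
  401049:       8b 47 10                mov    eax,DWORD PTR [rdi+0x10]
  40104c:       48 83 c7 40             add    rdi,0x40
  401050:       48 81 e7 00 c0 ff ff    and    rdi,0xffffffffffffc000
  401057:       ff cd                   dec    ebp
  401059:       75 e5                   jne    401040 <_start.loop>
...

 Performance counter stats for './testloop' (4 runs):

            334.27 msec task-clock                #    0.999 CPUs utilized            ( +-  7.62% )
                 0      context-switches          #    0.000 K/sec
                 0      cpu-migrations            #    0.000 K/sec
                 3      page-faults               #    0.009 K/sec
     1,385,639,019      cycles                    #    4.145 GHz                      ( +-  7.68% )
        80,000,116      instructions              #    0.06  insn per cycle           ( +-  0.00% )
       100,271,634      uops_issued.any           #  299.968 M/sec                    ( +-  0.04% )
       100,257,154      uops_executed.thread      #  299.924 M/sec                    ( +-  0.04% )
            16,894      mem_load_retired.l1_hit   #    0.051 M/sec                    ( +- 17.24% )
         2,347,561      mem_load_retired.l1_miss  #    7.023 M/sec                    ( +- 14.76% )

            0.3346 +- 0.0255 seconds time elapsed  ( +-  7.62% )

That's with sfence, and it is surprisingly the fastest. The average run time is quite variable. Using clflush instead of clflushopt doesn't change the timing much, but takes more uops: 150,185,359 uops_issued.any (fused domain) and 110,219,059 uops_executed.thread (unfused domain).
With mfence it is the slowest, and every clflush results in two cache misses (one for the load right after it in the same iteration, the other in the next iteration when we come back around).

## With MFENCE
     3,765,471,466      cycles                    #    4.129 GHz                      ( +-  1.26% )
        80,000,292      instructions              #    0.02  insn per cycle           ( +-  0.00% )
       160,386,634      uops_issued.any           #  175.881 M/sec                    ( +-  0.03% )
       100,533,848      uops_executed.thread      #  110.246 M/sec                    ( +-  0.06% )
             7,005      mem_load_retired.l1_hit   #    0.008 M/sec                    ( +- 21.58% )
         9,966,476      mem_load_retired.l1_miss  #   10.929 M/sec                    ( +-  0.05% )

With no fence, it is still slower than with sfence. I don't know why. Maybe sfence stops the clflush operations from executing so quickly, giving loads in later iterations a chance to get ahead of them and read the cache line before clflushopt evicts it?

     2,047,314,028      cycles                    #    4.125 GHz                      ( +-  2.58% )
        70,000,166      instructions              #    0.03  insn per cycle           ( +-  0.00% )
        80,619,482      uops_issued.any           #  162.427 M/sec                    ( +-  0.05% )
        80,584,719      uops_executed.thread      #  162.357 M/sec                    ( +-  0.04% )
            66,198      mem_load_retired.l1_hit   #    0.133 M/sec                    ( +-  6.61% )
         4,814,405      mem_load_retired.l1_miss  #    9.700 M/sec                    ( +-  4.59% )

These experimental results are from Intel Skylake, not AMD.
(And older or newer Intel CPUs might differ in how they allow loads to reorder with clflushopt.)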
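To make the "once per cache line" advice concrete, here is a minimal sketch of a loop that rounds the start address down to a line boundary and visits each overlapping line exactly once. It is not the code from the linked answer. The function name flush_each_line_once and the LINE_SIZE constant (64, as reported by cpuid in the question) are assumptions for illustration. The sketch uses _mm_clflush (SSE2) rather than _mm_clflushopt so that it builds without -mclflushopt; with clflushopt you would swap the intrinsic and decide separately whether any fence is needed.

```c
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* _mm_clflush */
#endif

#define LINE_SIZE 64u    /* assumed cache line size in bytes */

/* Flush every cache line that overlaps [ptr, ptr + len), once each.
   Returns the number of lines visited. On targets without SSE2 the
   flush itself is skipped and only the line walk is performed. */
static size_t flush_each_line_once(const void *ptr, size_t len)
{
    if (len == 0)
        return 0;

    const uintptr_t mask = ~(uintptr_t)(LINE_SIZE - 1);
    uintptr_t first = (uintptr_t)ptr & mask;               /* line holding the first byte */
    uintptr_t last  = ((uintptr_t)ptr + len - 1) & mask;   /* line holding the last byte  */

    size_t lines = 0;
    for (uintptr_t a = first; ; a += LINE_SIZE) {
#if defined(__SSE2__)
        _mm_clflush((const void *)a);
#endif
        lines++;
        if (a == last)
            break;
    }
    return lines;
}
```

For the array in the question, the call would look like flush_each_line_once(p, sizeof(p[0]) * plen). Rounding the start down matters when the array does not begin on a line boundary: a 64-byte range starting at offset 1 overlaps two lines, not one.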



					

					
					
Regarding "C function to flush all cache lines that hold an array", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/68138772/
