Can I ask the kernel to populate (fault in) a range of anonymous pages?

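On Linux, using C, if I request a large amount of memory via malloc or a similar dynamic allocation mechanism, most of the pages backing the returned region will probably not yet be mapped into my process's address space.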

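Instead, a page fault occurs the first time I access each of the allocated pages, and the kernel then maps in an "anonymous" page (consisting entirely of zeros) and returns to user space.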

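For a large region (say 1 GiB) this is a large number of page faults (about 260 thousand with 4 KiB pages), and each fault incurs a user-to-kernel-to-user transition, which is especially slow on kernels with Spectre and Meltdown mitigations. For some uses, this page-fault time can dominate the actual work being done on the buffer.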
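The page count quoted above is a ceiling division of the region size by the page size. A minimal sketch follows; `pages_needed` is a hypothetical helper name, not something from the question.

```c
#include <stddef.h>

/* Hypothetical helper: number of pages needed to back `bytes` bytes,
 * i.e. bytes / page_size rounded up. Assumes page_size > 0. */
static size_t pages_needed(size_t bytes, size_t page_size) {
    return bytes / page_size + (bytes % page_size != 0);
}
```

For 1 GiB and 4 KiB pages this gives 262,144 pages, so roughly that many first-touch faults if every page is touched and no huge pages are involved.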

If I know I am going to use the entire buffer, is there some way to ask the kernel to populate an already-mapped region ahead of time?

If I were allocating my own memory with mmap, the way to do this would be MAP_POPULATE; but that does not work for regions received from malloc or new.
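For reference, the MAP_POPULATE route mentioned above might look like the following sketch. `alloc_populated` is a hypothetical helper name. Population is assumed to be best-effort: mmap can succeed even if some pages could not be pre-faulted.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE   /* exposes MAP_ANONYMOUS and MAP_POPULATE in glibc */
#endif
#include <stddef.h>
#include <sys/mman.h>

/* Hypothetical helper: map `size` bytes of zeroed anonymous memory and ask
 * the kernel to pre-fault it. Returns NULL on failure.
 * Release the region with munmap(ptr, size). */
static void *alloc_populated(size_t size) {
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}
```

This only helps when the code controls the allocation; it does nothing for a pointer that malloc has already returned.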

There is the madvise call, but its options seem to apply mostly to file-backed regions. For example, madvise(..., MADV_WILLNEED) looks promising; from the man page:

MADV_WILLNEED

Expect access in the near future. (Hence, it might be a good idea to read some pages ahead.)

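The obvious implication is that if the region is file-backed, this call might trigger asynchronous file read-ahead, or perhaps synchronous additional read-ahead on subsequent faults. The description does not make clear whether it does anything for anonymous pages, and based on my testing, it does not.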

Best answer

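This is a bit of a dirty hack, and it works best for privileged processes or on systems with a high RLIMIT_MEMLOCK, but... an mlock and munlock pair will achieve the effect you want.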

For example, given the following test program:

// compile with (e.g.): cc -O1 -Wall pagefaults.c -o pagefaults

#include <stdlib.h>
#include <stdio.h>
#include <err.h>
#include <sys/mman.h>

#define DEFAULT_SIZE        (40 * 1024 * 1024)
#define PG_SIZE     4096

void failcheck(int ret, const char* what) {
    if (ret) {
        err(EXIT_FAILURE, "%s failed", what);
    } else {
        printf("%s OK\n", what);
    }
}

int main(int argc, char **argv) {
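    size_t size = (argc == 2 ? strtoul(argv[1], NULL, 10) : DEFAULT_SIZE);
    char *mem = malloc(size);

    /* Bail out early rather than passing NULL to madvise/mlock. */
    if (!mem) {
        err(EXIT_FAILURE, "malloc failed");
    }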

    if (getenv("DO_MADVISE")) {
        failcheck(madvise(mem, size, MADV_WILLNEED), "madvise");
    }

    if (getenv("DO_MLOCK")) {
        failcheck(mlock(mem, size), "mlock");
        failcheck(munlock(mem, size), "munlock");
    }

    for (volatile char *p = mem; p < mem + size; p += PG_SIZE) {
        *p = 'z';
    }
    printf("size: %6.2f MiB, pages touched: %zu\npoitner value : %p\n",
            size / 1024. / 1024., size / PG_SIZE, mem);
}

Running it as root on a 1 GB region and counting page faults with perf gives:

$ perf stat ./pagefaults 1000000000
size: 953.67 MiB, pages touched: 244140
poitner value : 0x7f2fc2584010

 Performance counter stats for './pagefaults 1000000000':

        352.474676      task-clock (msec)         #    0.999 CPUs utilized          
                 2      context-switches          #    0.006 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
           244,189      page-faults               #    0.693 M/sec                  
       914,276,474      cycles                    #    2.594 GHz                    
       703,359,688      instructions              #    0.77  insn per cycle         
       117,710,381      branches                  #  333.954 M/sec                  
           447,022      branch-misses             #    0.38% of all branches        

       0.352814087 seconds time elapsed

However, if you run it prefixed with DO_MLOCK=1, you get:

$ sudo DO_MLOCK=1 perf stat ./pagefaults 1000000000
mlock OK
munlock OK
size: 953.67 MiB, pages touched: 244140
poitner value : 0x7f8047f6b010

 Performance counter stats for './pagefaults 1000000000':

        240.236189      task-clock (msec)         #    0.999 CPUs utilized          
                 0      context-switches          #    0.000 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
                49      page-faults               #    0.204 K/sec                  
       623,152,764      cycles                    #    2.594 GHz                    
       959,640,219      instructions              #    1.54  insn per cycle         
       150,713,144      branches                  #  627.354 M/sec                  
           484,400      branch-misses             #    0.32% of all branches        

       0.240538327 seconds time elapsed

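Note that the number of page faults dropped from 244,189 to 49, and the run is about 1.46x faster. The overwhelming majority of the time is still spent in the kernel, so this could probably be considerably faster if it were not necessary to call both mlock and munlock, and possibly also because the semantics of mlock go beyond what is required here.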
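The fault counts above come from perf. As a rough in-process alternative, the minor-fault counter can be read with getrusage before and after touching the buffer. This is a sketch, not part of the original test program, and `minor_faults` is a hypothetical helper name.

```c
#include <sys/resource.h>

/* Hypothetical helper: minor (no disk I/O) page faults taken by this
 * process so far, or -1 if getrusage fails.
 *
 * Typical use:
 *     long before = minor_faults();
 *     ... touch every page of the buffer ...
 *     long delta = minor_faults() - before;
 */
static long minor_faults(void) {
    struct rusage ru;
    if (getrusage(RUSAGE_SELF, &ru) != 0)
        return -1;
    return ru.ru_minflt;
}
```

The difference between the two readings should be close to the number of pages touched when the region was not pre-faulted, and close to zero when it was.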

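For an unprivileged process, you will probably hit RLIMIT_MEMLOCK if you try to do a large region all at once (on my Ubuntu system it is set to 64 KiB), but you can loop over the region, calling mlock(); munlock() on a smaller piece at a time.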
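The chunked loop described above could be sketched as follows. The helper names `chunk_for_limit` and `prefault_chunked` and the 64 MiB cap are assumptions made for illustration. The sketch also assumes the process holds no other locked memory and that the buffer lies inside a single mapping, so rounding its start down to a page boundary is safe.

```c
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <sys/resource.h>
#include <unistd.h>

/* Largest whole-page chunk that fits in `limit` bytes, capped at `cap`.
 * Returns 0 if not even one page fits. Assumes page > 0. */
static size_t chunk_for_limit(rlim_t limit, size_t page, size_t cap) {
    size_t chunk = (limit == RLIM_INFINITY || limit > cap) ? cap : (size_t)limit;
    return chunk - chunk % page;
}

/* Hypothetical helper: pre-fault [mem, mem + size) by locking and then
 * immediately unlocking it one chunk at a time, so no more than
 * RLIMIT_MEMLOCK bytes are locked at once.
 * Returns 0 on success, -1 on failure. */
static int prefault_chunked(void *mem, size_t size) {
    long page = sysconf(_SC_PAGESIZE);
    struct rlimit rl;
    if (page <= 0 || getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;

    /* 64 MiB cap is an arbitrary choice for the unlimited case. */
    size_t chunk = chunk_for_limit(rl.rlim_cur, (size_t)page, (size_t)64 << 20);
    if (chunk == 0)
        return -1;

    /* Start on a page boundary so a chunk never spans an extra page. */
    uintptr_t cur = (uintptr_t)mem & ~((uintptr_t)page - 1);
    uintptr_t end = (uintptr_t)mem + size;

    while (cur < end) {
        size_t len = (end - cur < chunk) ? (size_t)(end - cur) : chunk;
        if (mlock((void *)cur, len) != 0)
            return -1;
        if (munlock((void *)cur, len) != 0)
            return -1;
        cur += len;
    }
    return 0;
}
```

With a 64 KiB limit each iteration covers only 16 pages of 4 KiB, so a 1 GiB region needs on the order of 16 thousand mlock/munlock pairs. That is still far fewer kernel entries than one fault per page, though the actual speedup would need to be measured.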

The original question, "Can I ask the kernel to populate (fault in) a range of anonymous pages?", is on Stack Overflow: https://stackoverflow.com/questions/56411164/
