assembly - How to write self-modifying code that runs efficiently on modern x64 processors?

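I'm trying to speed up a variable-bitwidth integer compression scheme, and I'm interested in generating and executing assembly code on the fly. Currently a lot of time is spent on mispredicted indirect branches, and generating code based on the series of bitwidths as found seems to be the only way to avoid this penalty.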

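The general technique is referred to as "subroutine threading" (or "call threading", although that term has other definitions as well). The goal is to take advantage of the processor's efficient call/ret prediction so as to avoid stalls. The approach is well described here: http://webdocs.cs.ualberta.ca/~amaral/cascon/CDP05/slides/CDP05-berndl.pdf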

The generated code will simply be a series of calls followed by a return. If there were 5 "blocks" of widths [4,8,8,4,16], it would look like:

call $decode_4
call $decode_8
call $decode_8
call $decode_4
call $decode_16
ret
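As a rough illustration of how such a sequence could be emitted at runtime, here is a minimal sketch in C. The helper name emit_call_sequence and its parameters are hypothetical, not part of any library. It assumes each call is encoded as the 5-byte `call rel32` form (opcode 0xE8) and that `base` is the address at which the buffer will later execute; it only writes bytes and does not make them executable.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes n "call rel32" instructions followed by a
 * "ret" into buf. base is the address at which buf[0] will execute and
 * targets[i] is the address of the i-th decoder routine.
 * Returns the number of bytes written, or 0 if the buffer is too small
 * or a target is outside the +-2GB reach of a rel32 displacement. */
size_t emit_call_sequence(uint8_t *buf, size_t cap, uint64_t base,
                          const uint64_t *targets, size_t n)
{
    if (n > (SIZE_MAX - 1) / 5 || cap < n * 5 + 1)
        return 0;

    size_t off = 0;
    for (size_t i = 0; i < n; i++) {
        /* rel32 is measured from the end of the 5-byte call instruction. */
        uint64_t next = base + off + 5;
        int64_t disp = (int64_t)(targets[i] - next);
        if (disp < INT32_MIN || disp > INT32_MAX)
            return 0;

        uint32_t rel = (uint32_t)(int32_t)disp;
        buf[off] = 0xE8;                            /* call rel32 */
        for (int k = 0; k < 4; k++)                 /* little-endian imm32 */
            buf[off + 1 + k] = (uint8_t)(rel >> (8 * k));
        off += 5;
    }
    buf[off++] = 0xC3;                              /* ret */
    return off;
}
```

For the [4,8,8,4,16] example, targets would hold the addresses of decode_4, decode_8, decode_8, decode_4 and decode_16, and the result would be 26 bytes long.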

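In actual use, it will be a longer series of calls, long enough that each series will likely be unique and called only once. Generating and calling the code is well documented, both here and elsewhere. But I haven't found much discussion of efficiency beyond a simple "don't do it" or a well-considered "there be dragons". Even the Intel documentation speaks mostly in generalities: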

8.1.3 Handling Self- and Cross-Modifying Code

The act of a processor writing data into a currently executing code segment with the intent of executing that data as code is called self-modifying code. IA-32 processors exhibit model-specific behavior when executing self modified code, depending upon how far ahead of the current execution pointer the code has been modified. ... Self-modifying code will execute at a lower level of performance than non-self-modifying or normal code. The degree of the performance deterioration will depend upon the frequency of modification and specific characteristics of the code.

11.6 SELF-MODIFYING CODE

A write to a memory location in a code segment that is currently cached in the processor causes the associated cache line (or lines) to be invalidated. This check is based on the physical address of the instruction. In addition, the P6 family and Pentium processors check whether a write to a code segment may modify an instruction that has been prefetched for execution. If the write affects a prefetched instruction, the prefetch queue is invalidated. This latter check is based on the linear address of the instruction. For the Pentium 4 and Intel Xeon processors, a write or a snoop of an instruction in a code segment, where the target instruction is already decoded and resident in the trace cache, invalidates the entire trace cache. The latter behavior means that programs that self-modify code can cause severe degradation of performance when run on the Pentium 4 and Intel Xeon processors.

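While there is a performance counter to determine whether bad things are happening (C3 04 MACHINE_CLEARS.SMC: number of self-modifying-code machine clears detected), I'd like to know more specifics, particularly for Haswell. My impression is that as long as I can write the generated code far enough ahead of time that the instruction prefetch has not gotten there yet, and as long as I don't trigger the SMC detector by modifying code on the same page (quarter-page?) as anything currently being executed, then I should get good performance. But all the details seem extremely vague: how close is too close? How far is far enough?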

Trying to turn these into specific questions:

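What is the maximum distance ahead of the current instruction that any Haswell prefetcher ever runs?
What is the maximum distance behind the current instruction that the Haswell "trace cache" might contain?
What is the actual penalty in cycles for a MACHINE_CLEARS.SMC event on Haswell?
How can I run the generate/execute cycle in a predicted loop while preventing the prefetcher from eating its own tail?
How can I arrange the flow so that each piece of generated code is always "seen for the first time" and does not step on instructions that are already cached?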

Best answer

This doesn't have to be self-modifying code at all - it can be dynamically created code instead, i.e. runtime-generated "trampolines".

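Meaning you keep a (global) function pointer around that redirects to a writable/executable mapped section of memory - into which you then actively insert the function calls you wish to make.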
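One way this could look on a POSIX system is sketched below. The names g_trampoline and install_trampoline are illustrative only. Note one deliberate deviation from the description above: instead of a mapping that is writable and executable at the same time, this sketch writes the code first and then switches the pages to read+execute with mprotect, which some systems require. Patching the code later would mean switching the protection back, or using a PROT_READ | PROT_WRITE | PROT_EXEC mapping where that is permitted.

```c
#define _DEFAULT_SOURCE          /* for MAP_ANONYMOUS on glibc */
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>

typedef int (*generated_fn)(void);

/* Global function pointer that callers go through (illustrative name). */
generated_fn g_trampoline = NULL;

/* Copies len bytes of machine code into a fresh anonymous mapping, makes
 * the mapping executable, and points g_trampoline at it.
 * Returns 0 on success, -1 on failure. */
int install_trampoline(const uint8_t *code, size_t len)
{
    void *mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return -1;

    memcpy(mem, code, len);

    /* Flip from writable to executable once the code is in place. */
    if (mprotect(mem, len, PROT_READ | PROT_EXEC) != 0) {
        munmap(mem, len);
        return -1;
    }

    /* Data-to-function pointer conversion: fine on POSIX, not strict ISO C. */
    g_trampoline = (generated_fn)mem;
    return 0;
}
```

On x86-64, passing the six bytes B8 2A 00 00 00 C3 (mov $42, %eax; ret) and then calling g_trampoline() would be a simple smoke test.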

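The main difficulty with this is that call is IP-relative (as are most jmp), so you have to calculate the offset between the memory location of your trampoline and the "target functions". That in itself is simple enough - but in 64-bit code the relative displacement of call can only cover a range of +-2GB, which makes things more complex: to reach targets outside that range you'd need to call through a linkage table.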
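The range check itself can be sketched as follows. The function name call_rel32_reachable is hypothetical; the only assumption is the standard 5-byte `call rel32` encoding, whose displacement is measured from the address of the instruction that follows the call.

```c
#include <stdint.h>

#define CALL_REL32_LEN 5   /* 1 opcode byte (0xE8) + 4 displacement bytes */

/* Returns 1 if a "call rel32" placed at call_site can reach target
 * directly, 0 if the displacement does not fit in a signed 32-bit value
 * (in which case the call would have to go through a linkage-table slot). */
int call_rel32_reachable(uint64_t call_site, uint64_t target)
{
    uint64_t next = call_site + CALL_REL32_LEN;   /* address after the call */
    int64_t disp = (int64_t)(target - next);
    return disp >= INT32_MIN && disp <= INT32_MAX;
}
```

A generator could use this per call site: emit a direct call when the check passes, and otherwise emit a call to a nearby jump slot that holds the full 64-bit target address.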

So you'd essentially create code like this (/me severely UN*X biased, hence AT&T assembly, and some references to ELF-isms):

.Lstart_of_modifyable_section:
callq 0f
callq 1f
callq 2f
callq 3f
callq 4f
....
ret
.align 32
0:        jmpq tgt0
.align 32
1:        jmpq tgt1
.align 32
2:        jmpq tgt2
.align 32
3:        jmpq tgt3
.align 32
4:        jmpq tgt4
.align 32
...

This can be created at compile time (just make a writable text section), or dynamically at runtime.

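You then, at runtime, patch the jump targets. That's similar to how the .plt ELF section (PLT = procedure linkage table) works - except that there it's the dynamic linker which patches the jmp slots, while in your case you do that yourself.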

If you go for all-runtime, then a table like the above is easily created even from C/C++; start with data structures like:

#include <stdint.h>

/* One 5-byte "call rel32" instruction: opcode 0xE8 plus displacement. */
struct __attribute__((packed)) call_tbl_entry {
    uint8_t call_opcode;
    int32_t call_displacement;
};

/* One jump slot, padded out to 32 bytes. */
union jmp_tbl_entry {
    uint8_t cacheline[32];
    struct __attribute__((packed)) {
        uint8_t jmp_opcode[6];    // 64bit absolute jump: jmp *0(%rip) = FF 25 00 00 00 00
        uint64_t jmp_tgtaddress;  // 8-byte target address read by the jmp above
    } tbl;
};

struct mytbl {
    struct call_tbl_entry calltbl[NUM_CALL_SLOTS];
    uint8_t ret_opcode;
    union jmp_tbl_entry jmptbl[NUM_CALL_SLOTS];
};

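The only critical and somewhat system-dependent thing here is the "packed" nature of this, which the compiler has to be told about (i.e. it must not pad the call array), and that the jump table should be cacheline-aligned.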

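You need to set calltbl[i].call_displacement = (int32_t)((uint8_t *)&jmptbl[i] - (uint8_t *)&calltbl[i+1]), initialize the empty/unused jump table with memset(&jmptbl, 0xC3 /* RET */, sizeof(jmptbl)), and then fill in the jump opcode and target address fields as needed.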
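Put together, the initialization steps above might look like the following sketch. The function names mytbl_init and mytbl_set_target and the value NUM_CALL_SLOTS = 4 are assumptions made for illustration. The jump slot is assumed to use the `jmp *0(%rip)` encoding (FF 25 00 00 00 00) followed by the 8-byte target, and the packed attribute is the GCC/Clang spelling.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NUM_CALL_SLOTS 4            /* assumed value, for illustration only */

struct __attribute__((packed)) call_tbl_entry {
    uint8_t call_opcode;            /* 0xE8 = call rel32 */
    int32_t call_displacement;
};

union jmp_tbl_entry {
    uint8_t cacheline[32];
    struct __attribute__((packed)) {
        uint8_t  jmp_opcode[6];     /* FF 25 00 00 00 00 = jmp *0(%rip) */
        uint64_t jmp_tgtaddress;    /* absolute target read by the jmp */
    } tbl;
};

struct mytbl {
    struct call_tbl_entry calltbl[NUM_CALL_SLOTS];
    uint8_t ret_opcode;
    union jmp_tbl_entry jmptbl[NUM_CALL_SLOTS];
};

/* Wires call i to jump slot i and fills every slot with RET bytes, so a
 * call into a slot that has not been patched yet returns immediately. */
void mytbl_init(struct mytbl *t)
{
    memset(t->jmptbl, 0xC3 /* RET */, sizeof t->jmptbl);
    for (size_t i = 0; i < NUM_CALL_SLOTS; i++) {
        t->calltbl[i].call_opcode = 0xE8;
        /* Displacement is relative to the end of call i, i.e. the start
         * of call i+1 (or of ret_opcode for the last entry). */
        t->calltbl[i].call_displacement =
            (int32_t)((uint8_t *)&t->jmptbl[i] - (uint8_t *)&t->calltbl[i + 1]);
    }
    t->ret_opcode = 0xC3;
}

/* Patches slot i so that it jumps to the absolute address target. */
void mytbl_set_target(struct mytbl *t, size_t i, uint64_t target)
{
    static const uint8_t jmp_rip0[6] = {0xFF, 0x25, 0x00, 0x00, 0x00, 0x00};
    memcpy(t->jmptbl[i].tbl.jmp_opcode, jmp_rip0, sizeof jmp_rip0);
    t->jmptbl[i].tbl.jmp_tgtaddress = target;
}
```

With this exact layout the jump table starts at offset 5 * NUM_CALL_SLOTS + 1 inside the struct, so it is not 32-byte aligned on its own; real code would likely add padding before jmptbl and place the whole struct in aligned, executable memory.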

Regarding "assembly - How to write self-modifying code that runs efficiently on modern x64 processors?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/17738154/
