While trying to optimize the misaligned reads needed by my finite-difference code, I changed this unaligned load:
__m128 pm1 = _mm_loadu_ps(&H[k-1]);
into this aligned-read + shuffle code:
__m128 p0   = _mm_load_ps(&H[k]);
__m128 pm4  = _mm_load_ps(&H[k-4]);
__m128 pm1  = _mm_shuffle_ps(p0, p0, 0x90);   // move 3 floats to higher positions
__m128 tpm1 = _mm_shuffle_ps(pm4, pm4, 0x03); // get missing lowest float
pm1 = _mm_move_ss(pm1, tpm1);                 // pack lowest float with 3 others
where H is 16-byte aligned; there were similar changes for H[k+1] and H[k±3], plus a movlhps & movhlps trick for H[k±2] (here is the full code of the loop).
I found that on my Core i7-930 the optimization for reading H[k±3] did appear to be fruitful, while adding the next optimization, for ±1, slowed down my loop (by units of percent). Switching between the ±1 and ±3 optimizations didn't change the results.
At the same time, on a Core 2 Duo 6300 and on a Core 2 Quad, enabling both optimizations (for ±1 and ±3) boosted performance (by tens of percent), while on a Core i7-4765T both of them slowed it down (by units of percent).
On a Pentium 4, all attempts to optimize the misaligned reads, including the one with movlhps/movhlps, led to slowdowns.
Why is it so different for different CPUs? Is it because of the increase in code size, so that the loop might not fit into some instruction cache? Or is it because some CPUs are insensitive to misaligned reads while others are much more sensitive? Or could it be that operations such as shuffles are slow on some CPUs?
Best answer
Intel introduces a new microarchitecture roughly every two years. The number of execution units may change; instructions that previously could execute in only one execution unit may have 2 or 3 units available in newer processors. The latency of an instruction may also change, for example when a shuffle execution unit is added.
Intel goes into some detail in their Optimization Reference Manual (here's the link); I've copied the relevant sections below.
Section 3.5.2.7 Floating-Point/SIMD Operands
The MOVUPD from memory instruction performs two 64-bit loads, but requires additional μops to adjust the address and combine the loads into a single register. This same functionality can be obtained using MOVSD XMMREG1, MEM; MOVSD XMMREG2, MEM+8; UNPCKLPD XMMREG1, XMMREG2, which uses fewer μops and can be packed into the trace cache more effectively. The latter alternative has been found to provide a several percent performance improvement in some cases. Its encoding requires more instruction bytes, but this is seldom an issue for the Pentium 4 processor. The store version of MOVUPD is complex and slow, so much so that the sequence with two MOVSD and a UNPCKHPD should always be used.
Assembly/Compiler Coding Rule 44. (ML impact, L generality) Instead of using MOVUPD XMMREG1, MEM for a unaligned 128-bit load, use MOVSD XMMREG1, MEM; MOVSD XMMREG2, MEM+8; UNPCKLPD XMMREG1, XMMREG2. If the additional register is not available, then use MOVSD XMMREG1, MEM; MOVHPD XMMREG1, MEM+8.
Assembly/Compiler Coding Rule 45. (M impact, ML generality) Instead of using MOVUPD MEM, XMMREG1 for a store, use MOVSD MEM, XMMREG1; UNPCKHPD XMMREG1, XMMREG1; MOVSD MEM+8, XMMREG1 instead.
Section 6.5.1.2 Data Swizzling
Swizzling data from SoA to AoS format can apply to a number of application domains, including 3D geometry, video and imaging. Two different swizzling techniques can be adapted to handle floating-point and integer data. Example 6-3 illustrates a swizzle function that uses SHUFPS, MOVLHPS, MOVHLPS instructions.
The technique in Example 6-3 (loading 16 bytes, using SHUFPS and copying halves of XMM registers) is preferable over an alternate approach of loading halves of each vector using MOVLPS/MOVHPS on newer microarchitectures. This is because loading 8 bytes using MOVLPS/MOVHPS can create code dependency and reduce the throughput of the execution engine. The performance considerations of Example 6-3 and Example 6-4 often depends on the characteristics of each microarchitecture. For example, in Intel Core microarchitecture, executing a SHUFPS tend to be slower than a PUNPCKxxx instruction. In Enhanced Intel Core microarchitecture, SHUFPS and PUNPCKxxx instruction all executes with 1 cycle throughput due to the 128-bit shuffle execution unit. Then the next important consideration is that there is only one port that can execute PUNPCKxxx vs. MOVLHPS/MOVHLPS can execute on multiple ports. The performance of both techniques improves on Intel Core microarchitecture over previous microarchitectures due to 3 ports for executing SIMD instructions. Both techniques improves further on Enhanced Intel Core microarchitecture due to the 128-bit shuffle unit.
This question, "performance - Why is SSE aligned read + shuffle slower than unaligned read on some CPUs but not on others?", is on Stack Overflow: https://stackoverflow.com/questions/23212882/