While trying to optimize the misaligned reads needed by my finite-difference code, I changed this unaligned load:
__m128 pm1 = _mm_loadu_ps(&H[k-1]);
into this aligned-read + shuffle code:
__m128 p0   = _mm_load_ps(&H[k]);
__m128 pm4  = _mm_load_ps(&H[k-4]);
__m128 pm1  = _mm_shuffle_ps(p0, p0, 0x90);   // move 3 floats to higher positions
__m128 tpm1 = _mm_shuffle_ps(pm4, pm4, 0x03); // get missing lowest float
pm1 = _mm_move_ss(pm1, tpm1);                 // pack lowest float with 3 others
where H is 16-byte aligned; there were similar changes for H[k+1] and H[k±3], plus a movlhps & movhlps trick for H[k±2] (here is the full code of the loop).
I found that on my Core i7-930 the optimization for reading H[k±3] did appear to be fruitful, while adding the next optimization, for ±1, slowed down my loop (by units of percent). Switching between the ±1 and ±3 optimizations didn't change the results.
At the same time, on a Core 2 Duo 6300 and on a Core 2 Quad, enabling both optimizations (for ±1 and ±3) boosted performance (by tens of percent), while on a Core i7-4765T both of them slowed it down (by units of percent).
On a Pentium 4, all attempts to optimize the misaligned reads, including the one with movlhps/movhlps, led to slowdowns.
Why is it so different for different CPUs? Is it because of the increase in code size, so that the loop might not fit into some instruction cache? Or is it because some CPUs are insensitive to misaligned reads while others are much more sensitive? Or could it be that operations such as shuffles are slow on some CPUs?
Best answer
Intel introduces a new microarchitecture roughly every two years. The number of execution units may change; instructions that previously could execute in only one execution unit may have 2 or 3 units available in newer processors. The latency of an instruction may also change, for example when a shuffle execution unit is added.
Intel goes into some detail in their Optimization Reference Manual (here's the link); I've copied the relevant sections below.
Section 3.5.2.7 Floating-Point/SIMD Operands
The MOVUPD from memory instruction performs two 64-bit loads, but requires additional μops to adjust the address and combine the loads into a single register. This same functionality can be obtained using MOVSD XMMREG1, MEM; MOVSD XMMREG2, MEM+8; UNPCKLPD XMMREG1, XMMREG2, which uses fewer μops and can be packed into the trace cache more effectively. The latter alternative has been found to provide a several percent performance improvement in some cases. Its encoding requires more instruction bytes, but this is seldom an issue for the Pentium 4 processor. The store version of MOVUPD is complex and slow, so much so that the sequence with two MOVSD and a UNPCKHPD should always be used.
Assembly/Compiler Coding Rule 44. (ML impact, L generality) Instead of using MOVUPD XMMREG1, MEM for a unaligned 128-bit load, use MOVSD XMMREG1, MEM; MOVSD XMMREG2, MEM+8; UNPCKLPD XMMREG1, XMMREG2. If the additional register is not available, then use MOVSD XMMREG1, MEM; MOVHPD XMMREG1, MEM+8.
Assembly/Compiler Coding Rule 45. (M impact, ML generality) Instead of using MOVUPD MEM, XMMREG1 for a store, use MOVSD MEM, XMMREG1; UNPCKHPD XMMREG1, XMMREG1; MOVSD MEM+8, XMMREG1 instead.
Section 6.5.1.2 Data Swizzling
Swizzling data from SoA to AoS format can apply to a number of application domains, including 3D geometry, video and imaging. Two different swizzling techniques can be adapted to handle floating-point and integer data. Example 6-3 illustrates a swizzle function that uses SHUFPS, MOVLHPS, MOVHLPS instructions.
The technique in Example 6-3 (loading 16 bytes, using SHUFPS and copying halves of XMM registers) is preferable over an alternate approach of loading halves of each vector using MOVLPS/MOVHPS on newer microarchitectures. This is because loading 8 bytes using MOVLPS/MOVHPS can create code dependency and reduce the throughput of the execution engine. The performance considerations of Example 6-3 and Example 6-4 often depends on the characteristics of each microarchitecture. For example, in Intel Core microarchitecture, executing a SHUFPS tend to be slower than a PUNPCKxxx instruction. In Enhanced Intel Core microarchitecture, SHUFPS and PUNPCKxxx instruction all executes with 1 cycle throughput due to the 128-bit shuffle execution unit. Then the next important consideration is that there is only one port that can execute PUNPCKxxx vs. MOVLHPS/MOVHLPS can execute on multiple ports. The performance of both techniques improves on Intel Core microarchitecture over previous microarchitectures due to 3 ports for executing SIMD instructions. Both techniques improves further on Enhanced Intel Core microarchitecture due to the 128-bit shuffle unit.
This question, "performance - Why is SSE aligned read + shuffle slower than unaligned read on some CPUs but not on others?", is on Stack Overflow: https://stackoverflow.com/questions/23212882/