gcc - 为什么这个SSE2程序(整数)生成movaps(浮点)?

标签 gcc assembly x86 sse simd

以下循环将一个整数矩阵转置为另一个整数矩阵。当我有趣地编译时,它会生成 movaps 指令以将结果存储到输出矩阵中。为什么gcc这样做?

数据:

int __attribute__(( aligned(16))) t[N][M]  
  , __attribute__(( aligned(16))) c_tra[N][M];

循环:

for( i=0; i<N; i+=4){
    for(j=0; j<M; j+=4){

        row0 = _mm_load_si128((__m128i *)&t[i][j]);
        row1 = _mm_load_si128((__m128i *)&t[i+1][j]);
        row2 = _mm_load_si128((__m128i *)&t[i+2][j]);
        row3 = _mm_load_si128((__m128i *)&t[i+3][j]);

        __t0 = _mm_unpacklo_epi32(row0, row1);
        __t1 = _mm_unpacklo_epi32(row2, row3);
        __t2 = _mm_unpackhi_epi32(row0, row1);
        __t3 = _mm_unpackhi_epi32(row2, row3);

        /* values back into I[0-3] */
        row0 = _mm_unpacklo_epi64(__t0, __t1);
        row1 = _mm_unpackhi_epi64(__t0, __t1);
        row2 = _mm_unpacklo_epi64(__t2, __t3);
        row3 = _mm_unpackhi_epi64(__t2, __t3);

        _mm_store_si128((__m128i *)&c_tra[j][i], row0);
        _mm_store_si128((__m128i *)&c_tra[j+1][i], row1);
        _mm_store_si128((__m128i *)&c_tra[j+2][i], row2);
        _mm_store_si128((__m128i *)&c_tra[j+3][i], row3);



    }
}

汇编生成的代码:

.L39:
    lea rcx, [rsi+rdx]
    movdqa  xmm1, XMMWORD PTR [rdx]
    add rdx, 16
    add rax, 2048
    movdqa  xmm6, XMMWORD PTR [rcx+rdi]
    movdqa  xmm3, xmm1
    movdqa  xmm2, XMMWORD PTR [rcx+r9]
    punpckldq   xmm3, xmm6
    movdqa  xmm5, XMMWORD PTR [rcx+r10]
    movdqa  xmm4, xmm2
    punpckhdq   xmm1, xmm6
    punpckldq   xmm4, xmm5
    punpckhdq   xmm2, xmm5
    movdqa  xmm5, xmm3
    punpckhqdq  xmm3, xmm4
    punpcklqdq  xmm5, xmm4
    movdqa  xmm4, xmm1
    punpckhqdq  xmm1, xmm2
    punpcklqdq  xmm4, xmm2
    movaps  XMMWORD PTR [rax-2048], xmm5
    movaps  XMMWORD PTR [rax-1536], xmm3
    movaps  XMMWORD PTR [rax-1024], xmm4
    movaps  XMMWORD PTR [rax-512], xmm1
    cmp r11, rdx
    jne .L39

gcc -Wall -msse4.2 -masm="intel"-O2 -c -S 天湖 linuxmint

-mavx2-march=naticve 生成 VEX 编码:vmovaps

最佳答案

从功能上讲,这些指令是相同的。 我不喜欢复制+粘贴其他人的陈述作为我的陈述,所以解释它的链接很少:

Difference between MOVDQA and MOVAPS x86 instructions?

https://software.intel.com/en-us/forums/intel-isa-extensions/topic/279587

http://masm32.com/board/index.php?topic=1138.0

https://www.gamedev.net/blog/615/entry-2250281-demystifying-sse-move-instructions/

简短版本:

So for the most part, you should try to use the move instruction that corresponds with the operations you are going to use on those registers. However, there is an additional complication. Loads and stores to and from memory execute on a separate port from the integer and floating point units; thus instructions that load from memory into a register or store from a register into memory will experience the same delay regardless of the data type you attach to the move. Thus in this case, movaps, movapd, and movdqa will have the same delay no matter what data you use. Since movaps (and movups) is encoded in binary form with one less byte than the other two, it makes sense to use it for all reg-mem moves, regardless of the data type.

所以这是GCC优化。

关于gcc - 为什么这个SSE2程序(整数)生成movaps(浮点)?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/42250184/

相关文章:

c++ - 内部 if 语句杀死矢量化

c - 这个变量值在函数中如何变化?

c++ - 为什么 nasm 说我在 g++ 创建的程序集中有错误?

将程序集转换为等效的 C 代码

assembly - x86 asm 中 NOT 指令的简单示例

assembly - 具有列优先布局的 int8 x uint8 矩阵向量乘积

c - 在 Mac OS high sierra 上安装 GCC

c - __asm__ ("__isoc99_scanf") 在函数声明之后

linux - 小数值相减(汇编)

assembly - 如何在不支持硬件的情况下测试 AVX-512 指令?