performance - Why is `_mm_stream_si128` much slower than `_mm_storeu_si128` on Skylake-Xeon when writing parts of 2 cache lines? But less effect on Haswell

My code looks like this (a simple load, modify, store); I have simplified it to make it more readable:

__asm__ __volatile__ ( "vzeroupper" : : : );
while(...) {
  __m128i in = _mm_loadu_si128(inptr);
  __m128i out = in; // real code does more than this, but I've simplified it
  _mm_stream_si128(outptr,out);
  inptr  += 12;
  outptr += 16;
}

This code runs about 5 times faster on our older Haswell hardware than on our newer Skylake machines. For example, if the while loop runs about 16e9 iterations, it takes 14 seconds on Haswell and 70 seconds on Skylake.

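We upgraded to the latest microcode on the Skylake machines, and also added the vzeroupper instruction to avoid any AVX issues. Neither fix had any effect.

outptr is aligned to 16 bytes, so the stream instruction should be writing to aligned addresses. (I added checks to verify this.) inptr is unaligned by design. Commenting out the loads makes no difference; the limiting instructions are the stores. outptr and inptr point to different memory regions with no overlap.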
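
An alignment check of the kind mentioned above could look roughly like the following. This is only a sketch: `is_aligned` is a hypothetical helper name, not something taken from the original code, and it assumes the alignment passed in is a power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from the original code): returns true if p is
// aligned to `alignment` bytes. `alignment` is assumed to be a power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}
```

Calling something like `assert(is_aligned(outptr, 16))` before the loop would catch a misaligned destination in debug builds without affecting the timed loop itself.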

If I replace _mm_stream_si128 with _mm_storeu_si128, the code runs much faster on both machines, in about 2.9 seconds.

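So the two questions are:

1) Why is there such a big difference between Haswell and Skylake when writing with the _mm_stream_si128 intrinsic?

2) Why does the _mm_storeu_si128 version run 5x faster than the streaming version?

I'm a newbie when it comes to intrinsics.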

Addendum - Test case

Here is the whole test case: https://godbolt.org/z/toM2lB

Here is a summary of the benchmarks I ran on two different processors, E5-2680 v3 (Haswell) and 8180 (Skylake).

// icpc -std=c++14  -msse4.2 -O3 -DNDEBUG ../mre.cpp  -o mre
// The following benchmark times were observed on a Intel(R) Xeon(R) Platinum 8180 CPU @ 2.50GHz
// and Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz.
// The command line was
//    perf stat ./mre 100000
//
//   STORER               time (seconds)
//                     E5-2680   8180
// ---------------------------------------------------
//   _mm_stream_si128     1.65   7.29
//   _mm_storeu_si128     0.41   0.40

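The ratio of stream time to store time is about 4x on the E5-2680 and 18x on the 8180.

I am relying on the default new allocator to align my data to 16 bytes. I am getting lucky here that it is aligned. I have tested that this is true, and in my production application I use an aligned allocator to be absolutely sure of it, along with checks on the address, but I left that out of the example because I don't think it matters.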

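Second edit - 64B-aligned output

The comment from @Mystical made me check that the outputs were all cache-line aligned. The writes to the Tile structures are done in 64-B chunks, but the Tiles themselves are not 64-B aligned (only 16-B aligned).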
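
To make the consequence concrete, here is a small sketch that counts how many cache lines a write covers. The helper name `cache_lines_touched` and the constant `kCacheLine` are illustrative assumptions, and a 64-byte line size is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size in bytes

// Hypothetical helper: number of distinct cache lines covered by `len`
// bytes starting at address `addr`.
constexpr std::size_t cache_lines_touched(std::uintptr_t addr, std::size_t len) {
    if (len == 0) return 0;
    const std::uintptr_t first = addr / kCacheLine;
    const std::uintptr_t last = (addr + len - 1) / kCacheLine;
    return static_cast<std::size_t>(last - first + 1);
}

static_assert(cache_lines_touched(0, 64) == 1, "a 64-byte-aligned chunk fills one line");
static_assert(cache_lines_touched(16, 64) == 2, "a 16-byte offset straddles two lines");
static_assert(cache_lines_touched(32, 64) == 2, "a 32-byte offset straddles two lines");
```

In other words, a 64-byte chunk that starts at a 16-byte or 32-byte offset within a line only partially fills each of the two lines it touches.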

So I changed my test code like this:

#if 0
    std::vector<Tile> tiles(outputPixels/32);
#else
    std::vector<Tile, boost::alignment::aligned_allocator<Tile,64>> tiles(outputPixels/32);
#endif

Now the numbers are quite different:

//   STORER               time (seconds)
//                     E5-2680   8180
// ---------------------------------------------------
//   _mm_stream_si128     0.19   0.48
//   _mm_storeu_si128     0.25   0.52

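So everything is much faster. But Skylake is still slower than Haswell by a factor of 2.

Third edit - Purposeful misalignment

I tried the test suggested by @HaidBrais. I purposely allocated my vector classes aligned to 64 bytes, then added 16 bytes or 32 bytes inside the allocator so that the allocation is either 16-byte or 32-byte aligned, but NOT 64-byte aligned. I also increased the number of loops to 1,000,000, ran the test 3 times, and picked the smallest time.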

perf stat ./mre1  1000000

To reiterate, an alignment of 2^N here means that the address is NOT aligned to either 2^(N+1) or 2^(N+2).
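
One way to express this "exact" alignment in code is to isolate the lowest set bit of the address. The sketch below is illustrative only; `exact_alignment` is a hypothetical name and is not part of the benchmark.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: the largest power of two that divides the address,
// i.e. its exact alignment. Works by isolating the lowest set bit.
// Returns 0 for a null pointer.
inline std::uintptr_t exact_alignment(const void* p) {
    const std::uintptr_t addr = reinterpret_cast<std::uintptr_t>(p);
    return addr & (~addr + 1);
}
```

With this definition, a pointer reported as 16 is 16-byte aligned but not 32-byte or 64-byte aligned, which matches the meaning of the alignment column in the table below.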

//   STORER               alignment time (seconds)
//                        byte  E5-2680   8180
// ---------------------------------------------------
//   _mm_storeu_si128     16       3.15   2.69
//   _mm_storeu_si128     32       3.16   2.60
//   _mm_storeu_si128     64       1.72   1.71
//   _mm_stream_si128     16      14.31  72.14 
//   _mm_stream_si128     32      14.44  72.09 
//   _mm_stream_si128     64       1.43   3.38

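So it is clear that cache-line alignment gives the best results, but _mm_stream_si128 is better only on the 2680 processor and suffers some sort of penalty on the 8180 that I can't explain.

For future use, here is the misaligned allocator I used (I did not templatize the misalignment; you'll have to edit the 32 and change it to 0 or 16 as needed):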

template <class T>
struct Mallocator {
  typedef T value_type;
  Mallocator() = default;
  template <class U>
  constexpr Mallocator(const Mallocator<U>&) noexcept {}
  T* allocate(std::size_t n) {
    if (n > std::size_t(-1) / sizeof(T)) throw std::bad_alloc();
    uint8_t* p1 = static_cast<uint8_t*>(aligned_alloc(64, (n + 1) * sizeof(T)));
    if (!p1) throw std::bad_alloc();
    p1 += 32; // misalign on purpose
    return reinterpret_cast<T*>(p1);
  }
  void deallocate(T* p, std::size_t) noexcept {
    uint8_t* p1 = reinterpret_cast<uint8_t*>(p);
    p1 -= 32;
    std::free(p1);
  }
};
template <class T, class U>
bool operator==(const Mallocator<T>&, const Mallocator<U>&) { return true; }
template <class T, class U>
bool operator!=(const Mallocator<T>&, const Mallocator<U>&) { return false; }

...

std::vector<Tile, Mallocator<Tile>> tiles(outputPixels/32);
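
For readers who would rather not edit the constant by hand, a templated variant might look like the sketch below. It is not the code used for the measurements above: the name `MisalignedAllocator` and the `Offset` parameter are assumptions made for illustration, it assumes a C library that provides C11 `aligned_alloc` (for example glibc), and it assumes `T` does not itself require more alignment than the chosen offset leaves.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <stdlib.h>  // aligned_alloc, free (C11; assumed to be available)
#include <vector>

// Sketch: Offset is the deliberate misalignment in bytes (e.g. 0, 16 or 32)
// relative to a 64-byte boundary.
template <class T, std::size_t Offset>
struct MisalignedAllocator {
    static_assert(Offset < 64, "Offset must be smaller than the 64-byte base alignment");
    typedef T value_type;

    // Explicit rebind is needed because Offset is a non-type template parameter.
    template <class U>
    struct rebind { typedef MisalignedAllocator<U, Offset> other; };

    MisalignedAllocator() = default;
    template <class U>
    constexpr MisalignedAllocator(const MisalignedAllocator<U, Offset>&) noexcept {}

    T* allocate(std::size_t n) {
        if (n > (std::size_t(-1) - 128) / sizeof(T)) throw std::bad_alloc();
        // Leave room for the offset and keep the size a non-zero multiple of 64.
        const std::size_t bytes = (n * sizeof(T) + Offset) / 64 * 64 + 64;
        unsigned char* p = static_cast<unsigned char*>(aligned_alloc(64, bytes));
        if (!p) throw std::bad_alloc();
        return reinterpret_cast<T*>(p + Offset);  // misalign on purpose
    }

    void deallocate(T* p, std::size_t) noexcept {
        // Undo the offset so free() receives the pointer aligned_alloc returned.
        free(reinterpret_cast<unsigned char*>(p) - Offset);
    }
};

template <class T, class U, std::size_t O>
bool operator==(const MisalignedAllocator<T, O>&, const MisalignedAllocator<U, O>&) { return true; }
template <class T, class U, std::size_t O>
bool operator!=(const MisalignedAllocator<T, O>&, const MisalignedAllocator<U, O>&) { return false; }
```

It would be used in the same way as the allocator above, for instance `std::vector<Tile, MisalignedAllocator<Tile, 16>> tiles(outputPixels/32);`, with the offset chosen per benchmark run.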

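Best answer

The simplified code doesn't really show the actual structure of your benchmark. I don't think the simplified code will exhibit the slowness you mention.

The actual loop from your Godbolt code is: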

while (count > 0)
        {
            // std::cout << std::hex << (void*) ptr << " " << (void*) tile <<std::endl;
            __m128i value0 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 0 * diffBytes));
            __m128i value1 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 1 * diffBytes));
            __m128i value2 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 2 * diffBytes));
            __m128i value3 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ptr + 3 * diffBytes));

            __m128i tileVal0 = value0;
            __m128i tileVal1 = value1;
            __m128i tileVal2 = value2;
            __m128i tileVal3 = value3;

            STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 0), tileVal0);
            STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 1), tileVal1);
            STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 2), tileVal2);
            STORER(reinterpret_cast<__m128i*>(tile + ipixel + diffPixels * 3), tileVal3);

            ptr    += diffBytes * 4;
            count  -= diffBytes * 4;
            tile   += diffPixels * 4;
            ipixel += diffPixels * 4;
            if (ipixel == 32)
            {
                // go to next tile
                ipixel = 0;
                tileIter++;
                tile = reinterpret_cast<uint16_t*>(tileIter->pixels);
            }
        }

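Note the if (ipixel == 32) part. It jumps to a different tile every time ipixel reaches 32. Since diffPixels is 8, this happens on every iteration. Hence you are making only 4 streaming stores (64 bytes) per tile. Unless each tile happens to be 64-byte aligned, which is unlikely to happen by chance and cannot be relied on, each 64-byte write covers only part of two different cache lines. That is a known anti-pattern for streaming stores: to use streaming stores effectively, you need to write out full lines.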
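
As an illustration of the full-line pattern, the sketch below forces a tile onto a cache-line boundary and fills it with four 16-byte non-temporal stores. `Tile64` and `stream_tile` are hypothetical stand-ins rather than the types from the Godbolt code, and the sketch assumes an x86 target with SSE2 and 64-byte cache lines.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2: _mm_loadu_si128, _mm_stream_si128 (also provides _mm_sfence)

// Hypothetical stand-in for the Tile type: 32 uint16_t pixels = 64 bytes,
// forced onto a cache-line boundary.
struct alignas(64) Tile64 {
    std::uint16_t pixels[32];
};
static_assert(sizeof(Tile64) == 64 && alignof(Tile64) == 64, "one tile per cache line");

// Copy 64 bytes from a possibly unaligned source into one tile using four
// non-temporal stores that together cover exactly one cache line.
inline void stream_tile(Tile64* dst, const std::uint8_t* src) {
    __m128i* out = reinterpret_cast<__m128i*>(dst->pixels);
    for (int i = 0; i < 4; ++i) {
        const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + 16 * i));
        _mm_stream_si128(out + i, v);
    }
}
```

A caller would typically issue `_mm_sfence()` once after the whole loop, before other threads read the tiles. Note that `alignas(64)` on the type only helps if the container's allocator honors it: a plain `std::vector` does so from C++17 onward, while with `-std=c++14` an aligned allocator such as the Boost one shown in the question is still needed.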

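On to the performance differences: streaming stores have hugely varying performance on different hardware. These stores always occupy a line fill buffer (LFB) for some time, but how long varies. On many client chips a streaming store seems to occupy a buffer only for about the L3 latency; that is, once the streaming store reaches the L3 it can be handed off (the L3 tracks the rest of the work) and the LFB can be freed on the core. Server chips often have much longer latencies, especially multi-socket hosts.

Evidently, NT store performance is worse on the SKX box, and much worse for partial-line writes. The overall worse performance is probably related to the redesign of the L3 cache.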

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/57420025/
