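c++ - Fast dot product using SSE/AVX intrinsics

I am looking for a fast way to compute the dot product of vectors with 3 or 4 components. I have tried several things, but most examples online operate on an array of floats, whereas our data structure is different.

We use structs that are 16-byte aligned. Code excerpt (simplified):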

struct float3 {
    float x, y, z, w; // 4th component unused here
};

struct float4 {
    float x, y, z, w;
};

In earlier tests (using the SSE4 dot product intrinsic or FMA) I could not get any speedup over the following plain C++ code.

float dot(const float3 a, const float3 b) {
    return a.x*b.x + a.y*b.y + a.z*b.z;
}

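The tests were run on Intel Ivy Bridge/Haswell with gcc and clang. The time spent loading the data into SIMD registers and pulling the result back out seems to kill all the benefit.

I would appreciate any help and ideas on how to compute the dot product efficiently with our float3/float4 data structures. SSE4, AVX, or even AVX2 would be fine.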

Editor's note: for the 4-element case, see How to Calculate single-vector Dot Product using SSE intrinsic functions in C. The same approach combined with masking may work well for the 3-element case, too.
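To make the editor's note concrete, here is a minimal sketch of a single-vector dot product for the structs from the question. It is not taken from the linked question; the helper name hsum, the alignas(16) annotation, and the choice to zero the unused w lane with a bitwise AND are assumptions made for illustration. It uses only SSE/SSE2, which is baseline on x86-64, and it makes no claim about being faster than the scalar version.

```cpp
#include <emmintrin.h>  // SSE2 (also pulls in the SSE intrinsics)

struct alignas(16) float3 { float x, y, z, w; };  // w is unused and may hold garbage
struct alignas(16) float4 { float x, y, z, w; };

// Hypothetical helper: add the four lanes of v and return the sum as a scalar.
static inline float hsum(__m128 v) {
    __m128 shuf = _mm_shuffle_ps(v, v, _MM_SHUFFLE(2, 3, 0, 1)); // [v1, v0, v3, v2]
    __m128 sums = _mm_add_ps(v, shuf);                           // [v0+v1, v0+v1, v2+v3, v2+v3]
    shuf = _mm_movehl_ps(shuf, sums);                            // lane 0 now holds v2+v3
    sums = _mm_add_ss(sums, shuf);                               // lane 0 = v0+v1+v2+v3
    return _mm_cvtss_f32(sums);
}

// 4-component dot product: one multiply, then a horizontal sum.
static inline float dot(const float4& a, const float4& b) {
    __m128 prod = _mm_mul_ps(_mm_load_ps(&a.x), _mm_load_ps(&b.x));
    return hsum(prod);
}

// 3-component dot product: same as above, but the w lane is masked to zero
// so that whatever is stored in the unused component cannot affect the result.
static inline float dot(const float3& a, const float3& b) {
    const __m128 xyz_mask = _mm_castsi128_ps(_mm_set_epi32(0, -1, -1, -1));
    __m128 prod = _mm_mul_ps(_mm_load_ps(&a.x), _mm_load_ps(&b.x));
    return hsum(_mm_and_ps(prod, xyz_mask));
}
```

Note that the horizontal sum is a chain of dependent shuffles and adds that yields a single float, which is consistent with the observation in the question that this style rarely beats the scalar code.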

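Best answer

Algebraically, efficient SIMD looks almost identical to scalar code. So the right way to do the dot product is to operate on four float vectors at once with SSE (eight with AVX).

Consider structuring your code like this: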

#include <x86intrin.h>

// Thin wrapper around one SSE register holding four floats.
struct float4 {
    __m128 xmm;
    float4 () {}
    float4 (__m128 const & x) { xmm = x; }
    float4 & operator = (__m128 const & x) { xmm = x; return *this; }
    float4 & load(float const * p) { xmm = _mm_loadu_ps(p); return *this; }
    operator __m128() const { return xmm; }
};

static inline float4 operator + (float4 const & a, float4 const & b) {
    return _mm_add_ps(a, b);
}
static inline float4 operator * (float4 const & a, float4 const & b) {
    return _mm_mul_ps(a, b);
}

// Four vectors stored component-wise: x holds four x values, and so on.
struct block3 { float4 x, y, z; };
struct block4 { float4 x, y, z, w; };

static inline float4 dot(block3 const & a, block3 const & b) {
    return a.x*b.x + a.y*b.y + a.z*b.z;
}
static inline float4 dot(block4 const & a, block4 const & b) {
    return a.x*b.x + a.y*b.y + a.z*b.z + a.w*b.w;
}

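Notice that the last two functions look almost identical to your scalar dot function, except that float becomes float4 and float4 becomes block3 or block4. Each call returns four independent dot products, one per lane, with no horizontal operations. This is the most efficient way to do the dot product.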
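The answer does not show how data laid out as in the question gets into a block3. One possible way, sketched here under assumptions, is to load four float3 values and transpose them with the _MM_TRANSPOSE4_PS macro. The function names pack and dot4 are hypothetical, and this block3 uses raw __m128 members instead of the float4 wrapper above so that the example stays self-contained.

```cpp
#include <xmmintrin.h>  // SSE intrinsics and _MM_TRANSPOSE4_PS

struct float3 { float x, y, z, w; };  // layout from the question, w unused

// Four 3-component vectors stored component-wise (structure of arrays).
struct block3 { __m128 x, y, z; };

// Hypothetical helper: load four float3 values and transpose them so that
// each register holds one component of all four vectors.
static inline block3 pack(const float3 v[4]) {
    __m128 r0 = _mm_loadu_ps(&v[0].x);  // [x0, y0, z0, w0]
    __m128 r1 = _mm_loadu_ps(&v[1].x);
    __m128 r2 = _mm_loadu_ps(&v[2].x);
    __m128 r3 = _mm_loadu_ps(&v[3].x);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);  // r0 = all x, r1 = all y, r2 = all z, r3 = all w
    block3 b;
    b.x = r0;
    b.y = r1;
    b.z = r2;                           // r3 (the unused w lanes) is dropped
    return b;
}

// Four dot products at once: lane i holds dot(a_i, b_i).
static inline __m128 dot(const block3& a, const block3& b) {
    __m128 sum = _mm_mul_ps(a.x, b.x);
    sum = _mm_add_ps(sum, _mm_mul_ps(a.y, b.y));
    sum = _mm_add_ps(sum, _mm_mul_ps(a.z, b.z));
    return sum;
}

// Hypothetical convenience wrapper: compute out[i] = dot(a[i], b[i]) for i = 0..3.
static inline void dot4(const float3 a[4], const float3 b[4], float out[4]) {
    _mm_storeu_ps(out, dot(pack(a), pack(b)));
}
```

The transpose has a cost of its own, so this layout conversion is most likely to pay off when the data can be kept in block form across several operations instead of being repacked for every dot product.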

The original question, "c++ - Fast dot product using SSE/AVX intrinsics", can be found on Stack Overflow: https://stackoverflow.com/questions/30590487/
