c++ - Optimizing FMA operations more aggressively

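I want to build a data type that represents several (say N) values of an arithmetic type and offers the same interface as that arithmetic type through operator overloading, so that I end up with a data type similar to Agner Fog's vectorclass.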

Consider this example (Godbolt):

#include <array>

using std::size_t;

template<class T, size_t S>
class LoopSIMD : std::array<T,S>
{
public:
    friend LoopSIMD operator*(const T a, const LoopSIMD& x){
        LoopSIMD result;
        for(size_t i=0;i<S;++i)
            result[i] = a*x[i];
        return result;
    }

    LoopSIMD& operator +=(const LoopSIMD& x){
        for(size_t i=0;i<S;++i){
            (*this)[i] += x[i];
        }
        return *this;
    }
};

constexpr size_t N = 7;
typedef LoopSIMD<double,N> SIMD;

SIMD foo(double a, SIMD x, SIMD y){
    x += a*y;
    return x;
}

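This seems to work very well up to a certain number of elements: 6 for gcc-10 and 27 for clang-11. For larger element counts, the compilers no longer emit FMA instructions (e.g. vfmadd213pd). Instead, they perform the multiplication (e.g. vmulpd) and the addition (e.g. vaddpd) separately.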

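Questions:

Is there a good reason for this behavior?
Are there any compiler flags that let me raise the limits mentioned above (6 for gcc and 27 for clang)?

Thanks!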

Best answer

You could also simply write your own fma function:

#include <cmath>  // std::fma

template<class T, size_t S>
class LoopSIMD : std::array<T,S>
{
public:
    friend LoopSIMD fma(const LoopSIMD& x, const T y, const LoopSIMD& z) {
        LoopSIMD result;
        for (size_t i = 0; i < S; ++i) {
            result[i] = std::fma(x[i], y, z[i]);
        }
        return result;
    }
    friend LoopSIMD fma(const T y, const LoopSIMD& x, const LoopSIMD& z) {
        LoopSIMD result;
        for (size_t i = 0; i < S; ++i) {
            result[i] = std::fma(y, x[i], z[i]);
        }
        return result;
    }
    // And more variants, taking `const LoopSIMD&, const LoopSIMD&, const T`, `const LoopSIMD&, const T, const T`, etc
};

SIMD foo(double a, SIMD x, SIMD y){
    return fma(a, y, x);
}
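
To try the fma variant in isolation, here is a minimal self-contained sketch. It assumes public inheritance from std::array (the snippet above inherits privately) purely so that the result can be inspected from outside the class, and foo_matches_reference is a hypothetical helper that is not part of the answer.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

// Assumption: public inheritance, so callers can read and write elements.
template<class T, std::size_t S>
class LoopSIMD : public std::array<T, S>
{
public:
    // Element-wise fused multiply-add: result[i] = y * x[i] + z[i].
    friend LoopSIMD fma(const T y, const LoopSIMD& x, const LoopSIMD& z) {
        LoopSIMD result;
        for (std::size_t i = 0; i < S; ++i) {
            result[i] = std::fma(y, x[i], z[i]);
        }
        return result;
    }
};

using SIMD = LoopSIMD<double, 7>;

SIMD foo(double a, SIMD x, SIMD y) {
    return fma(a, y, x);  // a * y[i] + x[i] in every lane
}

// Hypothetical helper: compares foo against values that are exactly
// representable, so fused and unfused evaluation agree bit for bit.
bool foo_matches_reference() {
    SIMD x, y;
    for (std::size_t i = 0; i < x.size(); ++i) {
        x[i] = static_cast<double>(i);
        y[i] = 2.0;
    }
    const SIMD r = foo(3.0, x, y);
    for (std::size_t i = 0; i < r.size(); ++i) {
        if (r[i] != 6.0 + static_cast<double>(i)) return false;
    }
    return true;
}
```

Keep in mind that std::fma always returns the single-rounding result, so on a target without hardware FMA it may fall back to a slow software routine. On x86-64, building with something like -O2 -mfma (or -march=native on a capable machine) should let the loop compile to vfmadd instructions.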

But to get better optimization in the first place, you should align the array. If you do, your original code optimizes well:

constexpr size_t next_power_of_2_not_less_than(size_t n) {
    size_t pow = 1;
    while (pow < n) pow *= 2;
    return pow;
}

template<class T, size_t S>
class LoopSIMD : std::array<T,S>
{
public:
    // operators
} __attribute__((aligned(next_power_of_2_not_less_than(sizeof(T[S])))));

// Or with a c++11 attribute
/*
template<class T, size_t S>
class [[gnu::aligned(next_power_of_2_not_less_than(sizeof(T[S])))]] LoopSIMD : std::array<T,S>
{
public:
    // operators
};
*/

SIMD foo(double a, SIMD x, SIMD y){
    x += a * y;
    return x;
}
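
The same alignment can presumably be requested with the standard alignas specifier instead of the GNU attribute. The following is a sketch under that assumption rather than the answer's own code: it swaps in alignas, uses public inheritance so the elements can be inspected, and needs C++14 because of the loop inside the constexpr function. Whether the compiler then fuses the multiply and add still depends on the compiler version and flags.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t next_power_of_2_not_less_than(std::size_t n) {
    std::size_t pow = 1;
    while (pow < n) pow *= 2;
    return pow;
}

// Assumption: alignas replaces __attribute__((aligned(...))).
template<class T, std::size_t S>
class alignas(next_power_of_2_not_less_than(sizeof(T[S]))) LoopSIMD
    : public std::array<T, S>
{
public:
    friend LoopSIMD operator*(const T a, const LoopSIMD& x) {
        LoopSIMD result;
        for (std::size_t i = 0; i < S; ++i)
            result[i] = a * x[i];
        return result;
    }

    LoopSIMD& operator+=(const LoopSIMD& x) {
        for (std::size_t i = 0; i < S; ++i)
            (*this)[i] += x[i];
        return *this;
    }
};

using SIMD = LoopSIMD<double, 7>;

// Seven doubles are rounded up to the next power of two in bytes.
static_assert(alignof(SIMD) == next_power_of_2_not_less_than(7 * sizeof(double)),
              "SIMD should be aligned to the padded array size");

SIMD foo(double a, SIMD x, SIMD y) {
    x += a * y;
    return x;
}
```

With 8-byte doubles, the 56-byte array is padded and aligned to 64 bytes, so a whole object fits in full-width vector registers without a scalar tail.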

This question and answer about optimizing FMA operations more aggressively in C++ come from Stack Overflow: https://stackoverflow.com/questions/64682270/
