c++ - Modifying a function to use SSE intrinsics

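I'm trying to compute an approximation of the nested radical sqrt(i + sqrt(i + sqrt(i + ...))) using SSE, to get a speedup from vectorization (I have also read that the SIMD square-root function runs about 4.7 times faster than the native FPU square-root function). However, I'm having trouble getting the same behavior in the vectorized version; I get incorrect values, and I'm not sure why.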

My original function looks like this:

template <typename T>
T CalculateRadical( T tValue, T tEps = std::numeric_limits<T>::epsilon() )
{
    static std::unordered_map<T,T> setResults;

    auto it = setResults.find( tValue );
    if( it != setResults.end() )
    {
        return it->second;
    }

    T tPrev = std::sqrt(tValue + std::sqrt(tValue)), tCurr = std::sqrt(tValue + tPrev);

    // Keep iterating until we get convergence:
    while( std::abs( tPrev - tCurr ) > tEps )
    {
        tPrev = tCurr;
        tCurr = std::sqrt(tValue + tPrev);
    }

    setResults.insert( std::make_pair( tValue, tCurr ) );
    return tCurr;
}

The SIMD equivalent I wrote (for this template function instantiated with T = float and given tEps = 0.0005f) is:

// SSE intrinsics hard-coded function:
__m128 CalculateRadicals( __m128 values )
{
    static std::unordered_map<float, __m128> setResults;

    // Store our epsilon as a vector for quick comparison:
    __declspec(align(16)) float flEps[4] = { 0.0005f, 0.0005f, 0.0005f, 0.0005f };
    __m128 eps = _mm_load_ps( flEps );

    union U {
        __m128 vec;
        float flArray[4];
    };

    U u;
    u.vec = values;

    float flFirstVal = u.flArray[0];
    auto it = setResults.find( flFirstVal );
    if( it != setResults.end( ) )
    {
        return it->second;
    }

    __m128 prev = _mm_sqrt_ps( _mm_add_ps( values, _mm_sqrt_ps( values ) ) );
    __m128 curr = _mm_sqrt_ps( _mm_add_ps( values, prev ) );

    while( _mm_movemask_ps( _mm_cmplt_ps( _mm_sub_ps( curr, prev ), eps ) ) != 0xF )
    {
        prev = curr;
        curr = _mm_sqrt_ps( _mm_add_ps( values, prev ) );
    }

    setResults.insert( std::make_pair( flFirstVal, curr ) );
    return curr;
}

I call the function in a loop with the following code:

long long N;
std::cin >> N;

float flExpectation = 0.0f;
long long iMultipleOf4 = (N / 4LL) * 4LL;
for( long long i = iMultipleOf4; i > 0LL; i -= 4LL )
{
    __declspec(align(16)) float flArray[4] = { static_cast<float>(i - 3), static_cast<float>(i - 2), static_cast<float>(i - 1), static_cast<float>(i) };
    __m128 arg = _mm_load_ps( flArray );
    __m128 vec = CalculateRadicals( arg );

    float flSum = Sum( vec );
    flExpectation += flSum;
}

for( long long i = iMultipleOf4; i < N; ++i )
{
    flExpectation += CalculateRadical( static_cast<float>(i), 0.0005f );
}

flExpectation /= N;

For an input of 5 I get the following output:

With SSE version: 2.20873
With FPU verison: 1.69647

Where does the difference come from, and what am I doing wrong in the SIMD equivalent?

Edit: I realized that the Sum function is relevant here:

float Sum( __m128 vec1 )
{
    float flTemp[4];
    _mm_storeu_ps( flTemp, vec1 );
    return flTemp[0] + flTemp[1] + flTemp[2] + flTemp[3];
}

Best answer

SSE intrinsics can be really tedious sometimes...

But not here. You just messed up your loop:

for( long long i = iMultipleOf4; i > 0LL; i -= 4LL )

I doubt it does what you expect. If iMultipleOf4 is 4, your function computes with 4, 3, 2, 1, so 0 is never included. Your second loop then computes 4 a second time.
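To make the index mix-up concrete, the sketch below reproduces both reported averages for N = 5 without any SSE. It assumes the scalar (FPU) driver, which the question does not show, sums over i = 0 .. N-1; that assumption is consistent with the reported 1.69647. Instead of iterating, it uses the closed-form limit x = (1 + sqrt(1 + 4i)) / 2, which follows from x = sqrt(i + x). The function names are made up for illustration.

```cpp
#include <cmath>

// Closed-form limit of sqrt(i + sqrt(i + ...)) for i > 0: the positive root
// of x*x - x - i = 0. For i == 0 the iteration in the question stays at 0.
double RadicalLimit(long long i)
{
    if (i == 0)
        return 0.0;
    return (1.0 + std::sqrt(1.0 + 4.0 * static_cast<double>(i))) / 2.0;
}

// Assumed scalar reference: average over i = 0 .. N-1.
double AverageScalar(long long N)
{
    double sum = 0.0;
    for (long long i = 0; i < N; ++i)
        sum += RadicalLimit(i);
    return sum / static_cast<double>(N);
}

// Same index pattern as the two loops in the question.
double AverageAsInQuestion(long long N)
{
    double sum = 0.0;
    const long long multipleOf4 = (N / 4) * 4;

    // "Vector" loop: each step covers i-3 .. i, so it visits 1 .. multipleOf4.
    for (long long i = multipleOf4; i > 0; i -= 4)
        for (long long k = i - 3; k <= i; ++k)
            sum += RadicalLimit(k);

    // Tail loop: starts at multipleOf4, which the loop above already covered.
    for (long long i = multipleOf4; i < N; ++i)
        sum += RadicalLimit(i);

    return sum / static_cast<double>(N);
}
```

For N = 5, AverageScalar gives about 1.69647 and AverageAsInQuestion about 2.2088. The second number differs from the reported 2.20873 only in the last digits, which is expected because the question's code stops iterating at an epsilon of 0.0005.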

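The two functions give the same results for me, and once the loops are corrected they produce the same flExpectation. A slight difference remains, possibly because the FPU computes things slightly differently.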
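A corrected driver could look like the sketch below. It assumes the intended sum runs over i = 0 .. N-1, which is consistent with the FPU result for N = 5. The caches are left out to keep it short, `_mm_setr_ps` and `alignas` stand in for the MSVC-specific `__declspec(align(16))`, and the function names are hypothetical rather than taken from the question.

```cpp
#include <cmath>
#include <xmmintrin.h> // SSE intrinsics (x86 / x86-64 only)

// Scalar iteration, same stopping rule as CalculateRadical, without the cache.
float RadicalScalar(float v, float eps)
{
    float prev = std::sqrt(v + std::sqrt(v));
    float curr = std::sqrt(v + prev);
    while (std::abs(prev - curr) > eps)
    {
        prev = curr;
        curr = std::sqrt(v + prev);
    }
    return curr;
}

// Four-lane iteration; loops until every lane has curr - prev < eps.
// The sequence never decreases, so no absolute value is taken here.
__m128 RadicalsSSE(__m128 values, float eps)
{
    const __m128 vEps = _mm_set1_ps(eps);
    __m128 prev = _mm_sqrt_ps(_mm_add_ps(values, _mm_sqrt_ps(values)));
    __m128 curr = _mm_sqrt_ps(_mm_add_ps(values, prev));
    while (_mm_movemask_ps(_mm_cmplt_ps(_mm_sub_ps(curr, prev), vEps)) != 0xF)
    {
        prev = curr;
        curr = _mm_sqrt_ps(_mm_add_ps(values, prev));
    }
    return curr;
}

// Assumed reference: plain scalar average over i = 0 .. N-1.
float ExpectationScalar(long long N, float eps = 0.0005f)
{
    float sum = 0.0f;
    for (long long i = 0; i < N; ++i)
        sum += RadicalScalar(static_cast<float>(i), eps);
    return sum / static_cast<float>(N);
}

// Corrected split: vector loop covers 0 .. multipleOf4-1, tail covers the rest.
float ExpectationSSE(long long N, float eps = 0.0005f)
{
    float sum = 0.0f;
    const long long multipleOf4 = (N / 4) * 4;

    for (long long i = 0; i < multipleOf4; i += 4)
    {
        // _mm_setr_ps fills lanes 0..3 in the order the arguments are given.
        const __m128 arg = _mm_setr_ps(static_cast<float>(i),
                                       static_cast<float>(i + 1),
                                       static_cast<float>(i + 2),
                                       static_cast<float>(i + 3));
        alignas(16) float lanes[4];
        _mm_store_ps(lanes, RadicalsSSE(arg, eps));
        sum += lanes[0] + lanes[1] + lanes[2] + lanes[3];
    }

    // Scalar tail: multipleOf4 .. N-1, none of which the vector loop touched.
    for (long long i = multipleOf4; i < N; ++i)
        sum += RadicalScalar(static_cast<float>(i), eps);

    return sum / static_cast<float>(N);
}
```

With this split every index is visited exactly once. One plausible contributor to any remaining small difference is that the vector loop keeps iterating until all four lanes have converged, so some lanes run more iterations than their scalar counterparts would.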

This Q&A is based on a similar question on Stack Overflow: https://stackoverflow.com/questions/28237601/
