c++ - Horizontal minimum over 8 floats in an AVX2 register, with a matching shuffle of the paired registers

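After running a ray-triangle intersection test in 8-wide SIMD, what remains is updating t, u and v, which I currently do in scalar code as shown below (find the lowest t and, if it is lower than the previous t, update t, u and v). Is there a way to do this with SIMD instead of scalar code?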

int update_tuv(__m256 t, __m256 u, __m256 v, float* t_out, float* u_out, float* v_out)
{
    alignas(32) float ts[8]; _mm256_store_ps(ts, t);
    alignas(32) float us[8]; _mm256_store_ps(us, u);
    alignas(32) float vs[8]; _mm256_store_ps(vs, v);
    
    int min_index{0};    
    for (int i = 1; i < 8; ++i) {
        if (ts[i] < ts[min_index]) {
            min_index = i;
        }
    }

    if (ts[min_index] >= *t_out) { return -1; }

    *t_out = ts[min_index];
    *u_out = us[min_index];
    *v_out = vs[min_index];

    return min_index;
}

Other than permuting and testing for the minimum 8 times, I have not found a way to take the horizontal minimum of t and shuffle/permute u and v to match.

Best answer

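First find the horizontal minimum of the t vector. That alone is enough to reject values with your first test. Then find the index of the first minimum element, and extract and store that lane from the u and v vectors.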

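#include <immintrin.h> // AVX2 intrinsics
#include <bit>         // std::countr_zero (C++20)
#include <cstdint>     // uint32_t

// Horizontal minimum of the vector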
inline float horizontalMinimum( __m256 v )
{
    __m128 i = _mm256_extractf128_ps( v, 1 );
    i = _mm_min_ps( i, _mm256_castps256_ps128( v ) );
    i = _mm_min_ps( i, _mm_movehl_ps( i, i ) );
    i = _mm_min_ss( i, _mm_movehdup_ps( i ) );
    return _mm_cvtss_f32( i );
}

int update_tuv_avx2( __m256 t, __m256 u, __m256 v, float* t_out, float* u_out, float* v_out )
{
    // Find the minimum t; reject unless it is smaller than the current *t_out
    float current = *t_out;
    float ts = horizontalMinimum( t );
    if( ts >= current )
        return -1;
    // Should compile into vbroadcastss
    __m256 tMin = _mm256_set1_ps( ts );
    *t_out = ts;

    // Find the minimum index
    uint32_t mask = (uint32_t)_mm256_movemask_ps( _mm256_cmp_ps( t, tMin, _CMP_EQ_OQ ) );
    // If you don't have C++20 yet, use the _tzcnt_u32, _BitScanForward or __builtin_ctz intrinsics
    int minIndex = std::countr_zero( mask );

    // Prepare a permutation vector for the vpermps AVX2 instruction
    // We don't care what's in the highest 7 integer lanes in that vector, only need the first lane
    __m256i iv = _mm256_castsi128_si256( _mm_cvtsi32_si128( (int)minIndex ) );

    // Permute u and v vector, moving that element to the first lane
    u = _mm256_permutevar8x32_ps( u, iv );
    v = _mm256_permutevar8x32_ps( v, iv );

    // Update the outputs with the new numbers
    *u_out = _mm256_cvtss_f32( u );
    *v_out = _mm256_cvtss_f32( v );
    return minIndex;
}

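While relatively straightforward, and probably faster than your current approach of vector stores followed by scalar loads, the function above only performs well when that if branch is well predicted.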

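When that branch is unpredictable (statistically, it produces random outcomes), a completely branchless implementation might be a better fit. It will be more complicated though: load the old values with _mm_load_ss, conditionally update them with _mm_blendv_ps, and store them back with _mm_store_ss.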
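
As a rough illustration of that branchless variant, here is a minimal sketch. It is not code from the original answer. The names TUV_AVX2, lowestSetBit, horizontalMinimumVec, update_tuv_branchless and update_tuv_branchless_arrays are hypothetical, and the sketch assumes t contains no NaNs. The sentinel bit and the bit trick for the return value are one possible choice among several.

```cpp
#include <immintrin.h>
#include <cstdint>
#if defined(_MSC_VER) && !defined(__clang__)
#include <intrin.h>
#define TUV_AVX2
#else
// Assumption: GCC/Clang; lets these functions build without a global -mavx2 flag
#define TUV_AVX2 __attribute__((target("avx2")))
#endif

// Index of the lowest set bit; x must be non-zero
static inline int lowestSetBit( uint32_t x )
{
#if defined(_MSC_VER) && !defined(__clang__)
    unsigned long i;
    _BitScanForward( &i, x );
    return (int)i;
#else
    return __builtin_ctz( x );
#endif
}

// Horizontal minimum, left in lane 0 of an SSE register (upper lanes unspecified)
static inline TUV_AVX2 __m128 horizontalMinimumVec( __m256 v )
{
    __m128 i = _mm256_extractf128_ps( v, 1 );
    i = _mm_min_ps( i, _mm256_castps256_ps128( v ) );
    i = _mm_min_ps( i, _mm_movehl_ps( i, i ) );
    return _mm_min_ss( i, _mm_movehdup_ps( i ) );
}

// Same contract as update_tuv_avx2, but without a data-dependent branch
static inline TUV_AVX2 int update_tuv_branchless( __m256 t, __m256 u, __m256 v,
    float* t_out, float* u_out, float* v_out )
{
    const __m128 tMin = horizontalMinimumVec( t );
    const __m256 tMin8 = _mm256_broadcastss_ps( tMin );

    // Index of the first lane equal to the minimum.
    // Bit 8 is a sentinel so the bit scan never receives zero.
    const uint32_t mask = (uint32_t)_mm256_movemask_ps( _mm256_cmp_ps( t, tMin8, _CMP_EQ_OQ ) );
    const int minIndex = lowestSetBit( mask | 0x100u ) & 7;

    // Move the matching u and v elements to lane 0; only lane 0 of iv matters
    const __m256i iv = _mm256_castsi128_si256( _mm_cvtsi32_si128( minIndex ) );
    const __m128 uNew = _mm256_castps256_ps128( _mm256_permutevar8x32_ps( u, iv ) );
    const __m128 vNew = _mm256_castps256_ps128( _mm256_permutevar8x32_ps( v, iv ) );

    // Lane 0 is all ones when the new minimum is strictly smaller than *t_out
    const __m128 tOld = _mm_load_ss( t_out );
    const __m128 improved = _mm_cmplt_ss( tMin, tOld );

    // Unconditional stores; blendv keeps the old value when not improved
    _mm_store_ss( t_out, _mm_blendv_ps( tOld, tMin, improved ) );
    _mm_store_ss( u_out, _mm_blendv_ps( _mm_load_ss( u_out ), uNew, improved ) );
    _mm_store_ss( v_out, _mm_blendv_ps( _mm_load_ss( v_out ), vNew, improved ) );

    // improvedBits is -1 or 0, so this yields minIndex or -1
    const int improvedBits = _mm_cvtsi128_si32( _mm_castps_si128( improved ) );
    return ( minIndex & improvedBits ) | ~improvedBits;
}

// Convenience wrapper taking unaligned arrays of 8 floats
TUV_AVX2 int update_tuv_branchless_arrays( const float* t8, const float* u8, const float* v8,
    float* t_out, float* u_out, float* v_out )
{
    return update_tuv_branchless( _mm256_loadu_ps( t8 ), _mm256_loadu_ps( u8 ),
        _mm256_loadu_ps( v8 ), t_out, u_out, v_out );
}
```

The sketch always writes to all three outputs, which trades three extra stores for the removal of the branch. Whether that is a net win depends on how unpredictable the branch is in the actual workload, so it is worth measuring both versions.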

A similar question on this topic can be found on Stack Overflow: https://stackoverflow.com/questions/72859241/
