sse - Is there a way to mask one end of an __m128i register based on a mask length that is not known at compile time?

Tags: sse simd avx

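I have a seemingly simple problem. I load a string into an __m128i register (with _mm_loadu_si128) and then find the length of the string (with _mm_cmpistri). Now, assuming the length is less than 16, I want everything after the string's terminating zero to be zero as well. One way to achieve this is to copy only "len" bytes to another register; another is to AND the original register with a mask of 8 * len one bits. However, I could not find a simple way to create such a mask when it depends only on a length computed at run time.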
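For reference, the masking the question describes can be sketched as follows, assuming the length is already available as a plain integer. The function name keepFirstBytes is hypothetical and not part of the original question; the sketch compares the broadcast length against a vector of byte indices and uses only SSE2.

```cpp
#include <emmintrin.h> // SSE2

// Hypothetical helper: keep the first `len` bytes of `chars` and zero the rest.
// Assumes 0 <= len <= 16.
inline __m128i keepFirstBytes( __m128i chars, int len )
{
    // Byte indices 0..15, lowest byte first
    const __m128i indices = _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 );
    // Broadcast the length into every byte
    const __m128i length = _mm_set1_epi8( ( char )len );
    // 0xFF where len > index; the signed compare is safe because both sides are in [0, 16]
    const __m128i mask = _mm_cmpgt_epi8( length, indices );
    return _mm_and_si128( chars, mask );
}
```

This costs a broadcast, a compare and an AND, but it needs the length in a general-purpose register first, which is the round trip the answer below avoids.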

Best answer

I would do it like this. Untested.

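#include <smmintrin.h> // SSE4.1 header (also pulls in SSE2); _mm_testz_si128 and _mm_cmpeq_epi64 need SSE4.1

// Load 16 bytes and propagate the first zero towards the end of the register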
inline __m128i loadNullTerminated( const char* pointer )
{
    // Load 16 bytes
    const __m128i chars = _mm_loadu_si128( ( const __m128i* )pointer );

    const __m128i zero = _mm_setzero_si128();
    // 0xFF for bytes that were '\0', 0 otherwise
    __m128i zeroBytes = _mm_cmpeq_epi8( chars, zero );

    // If you have long strings and expect most calls to not have any zeros, uncomment the line below.
    // You can return a flag to the caller, to know when to stop.
    // if( _mm_testz_si128( zeroBytes, zeroBytes ) ) return chars;

    // Propagate the first "0xFF" byte towards the end of the register.
    // The following 8 instructions are fast, 1 cycle of latency each.
    // Pretty sure _mm_movemask_epi8 / _BitScanForward / _mm_loadu_si128 is slightly slower, even when the mask table is in L1D
    zeroBytes = _mm_or_si128( zeroBytes, _mm_slli_si128( zeroBytes, 1 ) );
    zeroBytes = _mm_or_si128( zeroBytes, _mm_slli_si128( zeroBytes, 2 ) );
    zeroBytes = _mm_or_si128( zeroBytes, _mm_slli_si128( zeroBytes, 4 ) );
    zeroBytes = _mm_or_si128( zeroBytes, _mm_slli_si128( zeroBytes, 8 ) );
    // Now apply that mask
    return _mm_andnot_si128( zeroBytes, chars );
}
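For comparison, the alternative mentioned in the comment above (_mm_movemask_epi8, a bit scan, then a mask load) might look roughly like this. This is a sketch, not part of the original answer: the names maskTable and loadNullTerminated_table are hypothetical, and __builtin_ctz is assumed as the GCC/Clang counterpart of _BitScanForward.

```cpp
#include <emmintrin.h> // SSE2
#include <cstdint>
#ifdef _MSC_VER
#include <intrin.h>
#endif

// Hypothetical lookup table: 16 bytes of 0xFF followed by 16 bytes of 0
alignas( 16 ) static const uint8_t maskTable[ 32 ] = {
    0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
    0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
};

inline __m128i loadNullTerminated_table( const char* pointer )
{
    // Load 16 bytes
    const __m128i chars = _mm_loadu_si128( ( const __m128i* )pointer );
    // 0xFF for bytes that were '\0', 0 otherwise
    const __m128i zeroBytes = _mm_cmpeq_epi8( chars, _mm_setzero_si128() );
    // One bit per byte; bit 16 is a sentinel so the scan always finds a set bit
    const uint32_t bits = ( uint32_t )_mm_movemask_epi8( zeroBytes ) | 0x10000u;
#ifdef _MSC_VER
    unsigned long len;
    _BitScanForward( &len, bits );
#else
    const unsigned len = ( unsigned )__builtin_ctz( bits );
#endif
    // Sliding window: starting at offset 16 - len yields len bytes of 0xFF, then zeros
    const __m128i mask = _mm_loadu_si128( ( const __m128i* )( maskTable + 16 - len ) );
    return _mm_and_si128( chars, mask );
}
```

Compared with the shift-and-OR version, this moves data from the vector unit to a general-purpose register and back through memory, which is the extra latency the comment refers to. Like the original, it always reads 16 bytes from the pointer.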

Update: here is another version, which uses Noah's idea of a 64-bit integer decrement (adding -1). It might be slightly faster.

__m128i loadNullTerminated_v2( const char* pointer )
{
    // Load 16 bytes
    const __m128i chars = _mm_loadu_si128( ( const __m128i* )pointer );

    const __m128i zero = _mm_setzero_si128();
    // 0xFF for bytes that were '\0', 0 otherwise
    const __m128i zeroBytes = _mm_cmpeq_epi8( chars, zero );

    // If you have long strings and expect most calls to not have any zeros, uncomment the line below.
    // You can return a flag to the caller, to know when to stop.
    // if( _mm_testz_si128( zeroBytes, zeroBytes ) ) return chars;

    // Using the fact that v-1 == v+(-1), and -1 has all bits set
    const __m128i ones = _mm_cmpeq_epi8( zero, zero );
    __m128i mask = _mm_add_epi64( zeroBytes, ones );
    // ~zeroBytes & ( zeroBytes - 1 ): 0xFF for every byte below the first '\0' in each 64-bit lane,
    // all ones for a lane that has no '\0' at all
    mask = _mm_andnot_si128( zeroBytes, mask );

    // Now need to propagate across 64-bit lanes

    // ULLONG_MAX if there were no zeros in the corresponding 8-byte long pieces of the string
    __m128i crossLaneMask = _mm_cmpeq_epi64( zeroBytes, zero );
    // Lower lane: copy of the lower lane of mask. Upper lane: the lower lane of crossLaneMask
    crossLaneMask = _mm_unpacklo_epi64( mask, crossLaneMask );
    // Update the mask.
    // Lower 8 bytes will not change because _mm_unpacklo_epi64 copied that part from the mask.
    // However, upper lane may become zeroed out.
    // Happens when _mm_cmpeq_epi64 detected at least 1 '\0' in any of the first 8 characters.
    mask = _mm_and_si128( mask, crossLaneMask );

    // Apply that mask
    return _mm_and_si128( mask, chars );
}
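The v-1 step is easier to follow on a single 64-bit integer. A small scalar sketch of the same arithmetic; the function name bytesBelowFirstZero is hypothetical and only illustrates what happens within one lane:

```cpp
#include <cstdint>

// z has 0xFF in every byte that was '\0' in the input and 0 elsewhere.
// Returns 0xFF in every byte below the first '\0', and 0 from that byte upwards.
inline uint64_t bytesBelowFirstZero( uint64_t z )
{
    // z - 1 flips the lowest set bit and every bit below it;
    // ANDing with ~z keeps only the bits below the lowest set bit.
    return ~z & ( z - 1 );
}
```

For example, with '\0' at bytes 3 and 6, z is 0x00FF0000FF000000 and the result is 0x0000000000FFFFFF. With no '\0' at all, z is 0 and the result is all ones, so that lane passes through unchanged.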

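Regarding "sse - Is there a way to mask one end of an __m128i register based on a mask length that is not known at compile time?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/65186226/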
