x86 - Mode 12 of _mm_cmpistri

From a 2016 paper on SIMD algorithms for substring search:

bool like(const uint8_t* string, __m128i pat, [...]) {
    size_t i = 0;
    while (i + 16 < str_len) {
        __m128i str = _mm_loadu_si128((const __m128i*)&string[i]);
        size_t j = _mm_cmpistri(pat, str, 12);  // mode 12
        if (j >= 16) i += 16;
        else {
            if (j + pat_len <= 16) return true;
            i += j;
        }
    }
    // Process remainder
    if (i + pat_len <= str_len) {
        __m128i str = _mm_loadu_si128((const __m128i*)&string[i]);
        size_t j = _mm_cmpestri(pat, pat_len,
                                str, str_len - i, 12);
        if (j < 16 && j + pat_len <= 16) return true;
    }
    return false;
}

What is mode 12 of _mm_cmpistri?

Is it slow?

Thanks.

Best answer

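pcmpistri has a throughput of one per 2 clocks on Ryzen and one per 3 clocks on Skylake. It is one of the faster SSE4.2 string instructions, faster than the explicit-length instructions (https://agner.org/optimize/). It is good for substring searches, but not for simpler strchr/memchr-style searches: see the Stack Overflow questions "How much faster are SSE4.2 string instructions than SSE2 for memcmp?" and "SSE42 & STTNI - PcmpEstrM is twice slower than PcmpIstrM, is it true?"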

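Note that your title mentions _mm_cmpestri, which is the slow version for explicit-length strings. But your code uses _mm_cmpistri, the fast version for implicit-length strings.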
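To make the implicit/explicit distinction concrete, here is a minimal sketch (not taken from the question or the paper) that runs both intrinsics on the same 16-byte buffers. The helper names pad16, first_match_implicit and first_match_explicit are made up for illustration. The target("sse4.2") attribute assumes GCC or Clang on an x86 CPU with SSE4.2; with another compiler, drop the attribute and enable SSE4.2 in the build settings instead.

```c
#include <assert.h>
#include <nmmintrin.h>  /* SSE4.2: _mm_cmpistri, _mm_cmpestri */
#include <string.h>

/* Hypothetical helper: copy a string of at most 16 bytes into a
   zero-padded 16-byte buffer so a full-width load is safe. */
static void pad16(char out[16], const char *s) {
    size_t n = strlen(s);
    assert(n <= 16);
    memset(out, 0, 16);
    memcpy(out, s, n);
}

/* Implicit length: each operand ends at its first zero byte,
   or spans all 16 bytes if it contains none. */
__attribute__((target("sse4.2")))
static int first_match_implicit(const char needle[16], const char hay[16]) {
    __m128i pat = _mm_loadu_si128((const __m128i *)needle);
    __m128i str = _mm_loadu_si128((const __m128i *)hay);
    return _mm_cmpistri(pat, str, 0x0C);  /* 0x0C == 12 */
}

/* Explicit length: the caller passes both lengths, so zero bytes
   inside the data are treated as ordinary characters. */
__attribute__((target("sse4.2")))
static int first_match_explicit(const char needle[16], int nlen,
                                const char hay[16], int hlen) {
    __m128i pat = _mm_loadu_si128((const __m128i *)needle);
    __m128i str = _mm_loadu_si128((const __m128i *)hay);
    return _mm_cmpestri(pat, nlen, str, hlen, 0x0C);
}
```

With the needle "We" and the 16-byte haystack "WhenWeWillBeWed!", both helpers should return 4 (the first occurrence), and 16 when the needle does not occur at all.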

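(The rest of the code in that search loop should compile quite efficiently. If the compiler uses a branch instead of cmov for the i += 16 vs. i += j condition, branch prediction plus speculative execution will hide the dependency, so multiple iterations can be in flight at once, at the cost of a branch miss most of the time you find a partial match at the end of an input vector. At least I think that's the condition. Using cmov would create a data dependency between input vectors, and the instruction's latency is about 2 or 3 times its throughput in cycles per instruction.)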

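I don't know how it compares to a well-tuned strstr that uses AVX2 and avoids the SSE4.2 string instructions. I'd guess it depends on the length of the substring you are searching for, or maybe on other properties of the data, such as how many false-positive candidates you find for the start or end of the string.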

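The microbenchmarks you have already found at https://github.com/WojciechMula/sse4-strstr should be good. Wojciech writes good code and knows enough about tuning for various x86 microarchitectures to really optimize well. I haven't looked at his string benchmarks, but I have looked at his popcnt code, which explores combining Harley-Seal with AVX512F vpternlogd for a big speedup.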

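Intel's ISA reference manual (vol. 2) has a whole section on the modes of the string instructions (section 4.1, "Imm8 Control Byte Operation for PCMPESTRI / PCMPESTRM / PCMPISTRI / PCMPISTRM"), separate from the per-instruction entry at https://www.felixcloutier.com/x86/pcmpistri.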

You would normally write the mode in hex or binary rather than decimal, because it consists of several bit fields. 12 = 0b00001100.

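Intel's intrinsics guide also has pseudocode with the full details of the operation, but it is pretty heavy going if you don't know the high-level purpose. Once you do, it can be helpful. https://software.intel.com/sites/landingpage/IntrinsicsGuide/#expand=2403,6062,4147,948&techs=SSE4_2,AVX,AVX2&text=pcmpi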

See also https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 for a more readable guide to the various modes. Quoting part of it here:

Aggregation operations

The heart of a string-processing instruction is the aggregation operation (immediate bits [3:2]).
...
Equal ordered (imm[3:2] = 11). Substring search (strstr). The first operand contains a string to search for, the second is a string to search in. The bit mask includes 1 if the substring is found at the corresponding position:
 operand2 = "WhenWeWillBeWed!", operand1 = "We"
 IntRes1  =  000010000000100
After computing the aggregation function, IntRes1 can be complemented, expanded into byte mask (_mm_cmpistrm) or shrinked into index (_mm_cmpistri). The result is written into xmm0 or ECX registers. Intel manual explains these details well, so there is no need to repeat them here.
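As a rough illustration of the "equal ordered" aggregation quoted above, here is a scalar model in plain C for unsigned bytes with positive polarity. It is a sketch of my reading of the documented behaviour, not Intel's pseudocode, and the names equal_ordered_mask and first_index are made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of "equal ordered" on a 16-byte window.
   Bit i of the result is set when every valid needle byte j equals
   hay[i + j]. Comparisons that would run past byte 15 are skipped,
   so a needle that only partly fits at the end of the window still
   sets its bit. */
static uint16_t equal_ordered_mask(const uint8_t *needle, size_t nlen,
                                   const uint8_t *hay, size_t hlen) {
    assert(nlen <= 16 && hlen <= 16);
    uint16_t mask = 0;
    for (size_t i = 0; i < 16; i++) {
        int ok = 1;
        for (size_t j = 0; i + j < 16; j++) {
            if (j >= nlen) break;  /* past the needle's end: counts as equal */
            if (i + j >= hlen || needle[j] != hay[i + j]) { ok = 0; break; }
        }
        if (ok) mask |= (uint16_t)(1u << i);
    }
    return mask;
}

/* Index of the lowest set bit, or 16 if the mask is empty
   (what the index form returns when bit 6 of imm8 is 0). */
static int first_index(uint16_t mask) {
    for (int i = 0; i < 16; i++)
        if (mask & (1u << i)) return i;
    return 16;
}
```

For the quoted example ("We" in "WhenWeWillBeWed!") this model sets bits 4 and 12, so the mask is 0x1010 and the first index is 4. If the window instead ends in a lone 'W', bit 15 is set even though only one byte matched, which is why the loop in the question checks j + pat_len <= 16 before reporting a hit.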

The low 2 bits of the byte select the character format: here 00, unsigned bytes.

(Signed vs. unsigned probably doesn't matter for compare-for-equality, as opposed to the range-based modes.)

I think bits 5:4 are the "polarity", used for handling the end of the string.

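Bit 6 is the bit-scan direction for the "index" versions of the instructions, which return an index rather than a mask (like bsr vs. bsf). Here it is 0, to find the start of the first match rather than the end of the last match.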

Bit 7 (the high bit of the 8-bit immediate) is unused/reserved.
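The field breakdown above can be sketched as a small decoder in plain C. The struct and function names (imm8_fields, decode_imm8) are made up for illustration; the bit positions follow the layout described in this answer.

```c
#include <stdint.h>

/* Fields of the imm8 control byte for the pcmp?str? instructions. */
struct imm8_fields {
    unsigned format;       /* bits 1:0: 0 = unsigned bytes, 1 = unsigned words,
                              2 = signed bytes, 3 = signed words */
    unsigned aggregation;  /* bits 3:2: 0 = equal any, 1 = ranges,
                              2 = equal each, 3 = equal ordered */
    unsigned polarity;     /* bits 5:4: 0 = positive (result used as-is) */
    unsigned msb_index;    /* bit 6: 0 = lowest set bit, 1 = highest set bit */
};

static struct imm8_fields decode_imm8(uint8_t imm8) {
    struct imm8_fields f;
    f.format      = imm8 & 3u;
    f.aggregation = (imm8 >> 2) & 3u;
    f.polarity    = (imm8 >> 4) & 3u;
    f.msb_index   = (imm8 >> 6) & 1u;
    return f;
}
```

Decoding 12 (0x0C) gives format 0, aggregation 3, polarity 0 and msb_index 0: unsigned bytes, equal ordered, positive polarity, lowest index. In real code it is usually more readable to build the immediate from the named constants in <nmmintrin.h>, for example _SIDD_UBYTE_OPS | _SIDD_CMP_EQUAL_ORDERED | _SIDD_LEAST_SIGNIFICANT, which also evaluates to 12.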

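See also https://www.officedaytime.com/simd512e/simdimg/str.php?f=pcmpistri for a table/diagram of the steps that lead to the result, and of how the different fields in the immediate modify or select what happens at each step.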

Source: the original Stack Overflow question on mode 12 of _mm_cmpistri, https://stackoverflow.com/questions/53926538/
