x86 - AVX load instruction with a stride

Tags: x86 vectorization simd avx

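Is there an AVX instruction that loads four double values from an aligned array at a regular stride? That is, I want a call like _mm256_load_pd(a), but with a stride of 4, so that the values loaded are not a[0], a[1], a[2] and a[3] but a[0], a[4], a[8] and a[12].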

Best answer

If you have AVX2 (Haswell and later), you can use a gather load, e.g. _mm256_i32gather_pd. From the Intel Intrinsics Guide:

Synopsis

__m256d _mm256_i32gather_pd (double const* base_addr, __m128i vindex, const int scale)

#include "immintrin.h"

Instruction: vgatherdpd ymm, vm64x, ymm

CPUID Flags: AVX2

Description

Gather double-precision (64-bit) floating-point elements from memory using 32-bit indices. 64-bit elements are loaded from addresses starting at base_addr and offset by each 32-bit element in vindex (each index is scaled by the factor in scale). Gathered elements are merged into dst. scale should be 1, 2, 4 or 8.
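Applied to the access pattern in the question, the gather can be sketched as follows. This is a minimal illustration, not a tuned implementation: the function names gather_stride4_avx2 and load_stride4 are hypothetical, the target attribute and __builtin_cpu_supports are GCC/Clang extensions, and the result is stored to a plain array only to keep the example easy to check. The indices are element indices (0, 4, 8, 12) and scale is 8, the size of a double in bytes.

```c
#include <immintrin.h>

/* Load a[0], a[4], a[8], a[12] with one AVX2 gather.
   The target attribute (GCC/Clang extension) lets this compile without
   -mavx2; the caller must still check that the CPU supports AVX2. */
__attribute__((target("avx2")))
void gather_stride4_avx2(const double *a, double out[4])
{
    /* Element indices, lowest lane first: 0, 4, 8, 12. */
    const __m128i vindex = _mm_setr_epi32(0, 4, 8, 12);
    /* scale = 8 = sizeof(double): each index becomes a byte offset. */
    const __m256d v = _mm256_i32gather_pd(a, vindex, 8);
    _mm256_storeu_pd(out, v);
}

/* Hypothetical wrapper: use the gather when AVX2 is available,
   otherwise fall back to plain scalar loads. */
void load_stride4(const double *a, double out[4])
{
    if (__builtin_cpu_supports("avx2")) {
        gather_stride4_avx2(a, out);
    } else {
        for (int i = 0; i < 4; ++i)
            out[i] = a[4 * i];
    }
}
```

Calling load_stride4(a, out) on an array of at least 13 doubles fills out with a[0], a[4], a[8] and a[12]. In real SIMD code you would normally keep the __m256d in a register rather than storing it back to memory.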

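As already noted in the comments, gather loads are slow on Haswell, but if you need this access pattern to feed subsequent 256-bit SIMD operations, they may still be worthwhile. Since you are using doubles (only four elements per vector), any benefit is likely to be small, so you may also want to benchmark against a conventional scalar implementation.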
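For such a benchmark, two gather-free baselines might look like the sketch below. Both functions are hypothetical names introduced here for illustration: load_stride4_avx builds the vector from four scalar loads with _mm256_set_pd and needs only AVX, not AVX2, while load_stride4_scalar is a plain loop. Which variant is fastest depends on the microarchitecture and the surrounding code, so treat this as a starting point for measurement rather than a recommendation.

```c
#include <immintrin.h>

/* AVX-only alternative (no AVX2 gather): four scalar loads combined
   into one 256-bit vector. The target attribute is a GCC/Clang extension. */
__attribute__((target("avx")))
void load_stride4_avx(const double *a, double out[4])
{
    /* _mm256_set_pd takes its arguments from the highest lane to the lowest,
       so lane 0 receives a[0] and lane 3 receives a[12]. */
    const __m256d v = _mm256_set_pd(a[12], a[8], a[4], a[0]);
    _mm256_storeu_pd(out, v);
}

/* Conventional scalar reference implementation. */
void load_stride4_scalar(const double *a, double out[4])
{
    for (int i = 0; i < 4; ++i)
        out[i] = a[4 * i];
}
```

Both functions produce the same four values as the gather version, so they can be swapped into the same timing harness.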

For more on "x86 - AVX load instruction with a stride", see the similar question on Stack Overflow: https://stackoverflow.com/questions/34200839/
