intel - _mm512_storenr_pd 和 _mm512_storenrngo_pd

标签 intel intrinsics xeon-phi avx512

_mm512_storenrngo_pd 之间有什么区别?和 _mm512_storenr_pd

_mm512_storenr_pd(void * mt, __m512d v):

Stores packed double-precision (64-bit) floating-point elements from v to memory address mt with a no-read hint to the processor.

我不清楚未读提示是什么意思。这是否意味着它是非缓存一致性写入。这是否意味着重用成本更高或不连贯?

_mm512_storenrngo_pd(void * mt, __m512d v):

Stores packed double-precision (64-bit) floating-point elements from v to memory address mt with a no-read hint and using a weakly-ordered memory consistency model (stores performed with this function are not globally ordered, and subsequent stores from the same thread can be observed before them).

storenr_pd 基本相同,但由于它使用弱一致性模型,这意味着进程可以在任何其他处理器之前查看自己的写入。但是另一个处理器的访问是不一致的还是更昂贵?

最佳答案

引自 Intel® Xeon Phi™ Coprocessor Vector Microarchitecture :

In general, in order to write to a cache line, the Xeon Phi™ coprocessor needs to read in a cache line before writing to it. This is known as read for ownership (RFO). One problem with this implementation is that the written data is not reused; we unnecessarily take up the BW for reading non-temporal data. The Intel® Xeon Phi™ coprocessor supports instructions that do not read in data if the data is a streaming store. These instructions, VMOVNRAP*, VMOVNRNGOAP* allow one to indicate that the data needs to be written without reading the data first. In the Xeon Phi ISA the VMOVNRAPS/VMOVNRPD instructions are able to optimize the memory BW in case of a cache miss by not going through the unnecessary read step.

The VMOVNRNGOAP* instructions are useful when the programmer tolerates weak write-ordering of the application data―that is, the stores performed by these instructions are not globally ordered. This means that the subsequent write by the same thread can be observed before the VMOVNRNGOAP instructions are executed. A memory-fencing operation should be used in conjunction with this operation if multiple threads are reading and writing to the same location.

似乎使用了“No-read hints”、“Streaming store”和“Non-temporal Stream/Store”在多个资源中可互换。

所以是的,它是非缓存一致性写入,尽管在 Knights Corner(KNC,vmovnrap* 和 vmovnrngoap* 都属于 KNC)中,存储恰好发生在 L2 缓存中,它不会绕过所有级别的缓存。

如上文所述,vmovnrngoap*vmovnrap* 不同,弱顺序内存一致性模型允许“同一线程的后续写入可以是在执行 VMOVNRNGOAP 指令之前观察到”,所以是的,另一个线程或处理器的访问是不一致的,应该使用屏蔽操作。尽管 CPUID 可以用作防护操作,但更好的选择是“LOCK ADD [RSP],0”(虚拟原子添加)或 XCHG(结合了存储和防护)。

更多细节:

NR Stores.The NR store instruction (vmovnr) is a standard vector store instruction that can always be used safely. An NR store instruction that misses in the local cache causes all potential copies of the cache line in remote caches to be invalidated, the cache line to be allocated (but not initialized) at the local cache in exclusive state, and the write-data in the instruction to be written to the cacheline. There is no data transfer from main memory which is what saves memory bandwidth. An NR store instruction and other load and/or store instructions from the same thread are globally ordered, which means that all observers of this sequence of instructions always see the same fixed execution order.

The NR.NGO (non-globally ordered) store instruction(vmovnrngo) relaxes the global ordering constraint of the NR store instruction.This relaxation makes the NR.NGO instruction have a lower latency than the NRinstruction, which can be used to achieve higher performance in streaming storeintensive applications. However, removing this restriction means that an NR.NGO store instruction and other load and/or store instructions from the same thread can be observed by two observers to have two different orderings. The use of NR.NGO store instructions is safe only when reordering the order of these instructions is verified not to change the outcome. Otherwise, using NR.NGO stores may lead to incorrect execution. Our compiler can generate NR.NGO store instructions for store instructions that it identifies to have non-temporal behavior. For instance, a parallel loop that is detected to be non-temporal by our compiler can make use of NR.NGO instructions. At the end of such a loop, to ensure all outstanding non-globally ordered stores are completed and all threads have a consistent view of memory, our compiler generates a fence (a lock instruction) after the loop. This fence is needed before continuing execution of the subsequent code fragment to ensure all threads have exactly the same view of memory.

一般的经验法则是,非临时存储有益于近期不会重用的内存访问 block 。因此,在这两种情况下,是的,重用都将是昂贵的。

关于intel - _mm512_storenr_pd 和 _mm512_storenrngo_pd,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/45715429/

相关文章:

c++ - Intel 指令集扩展和用户机器(AVX、IMCI...)

c - 卸载到 MIC (Xeon Phi) 时迭代加载的阵列时出错

c++ - omp_get_thread_num() 返回有问题的数字?

c++ - STLPort、英特尔编译器、构建错误(尽管应用运行良好!)

android - 在 Windows 10 上为 Android Studio 安装 HAXM

.net - 相机外部计算错误

opencl - 比较 Intel Xeon Phi 和 Nvidia Tesla K20 的基准

c++ - 英特尔 SGX 将 c++ 类/结构作为 void* 传递给 enclave 并将其强制转换回来

c - 内函数,无法定义 (C)

c++ - 在 x86_64 上非临时加载 32 位和 64 位值的 C/C++ 内在函数?