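On a compute capability 2.x device, how do I make sure the GPU uses coalesced memory accesses when working with mapped pinned memory, given that 2D data normally needs padding when it lives in global memory?
I can't seem to find information about this anywhere. Maybe I should look harder, or maybe I'm missing something. Any pointers in the right direction are welcome...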
Best answer
The same coalescing approach applies when you use zero-copy memory. Quoting the CUDA C Best Practices Guide:
Because the data is not cached on the GPU, mapped pinned memory should be read or written only once, and the global loads and stores that read and write the memory should be coalesced.
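A minimal sketch of what the guide's advice looks like in practice, assuming a 64-bit CUDA toolkit and a device that reports `canMapHostMemory`; `scaleOnce` and `runZeroCopyScale` are hypothetical names chosen for illustration. The kernel reads and writes each mapped element exactly once, and consecutive threads of a warp touch consecutive floats, so each warp's request covers one contiguous span.

```cuda
#include <cuda_runtime.h>

// Coalesced, read-once kernel: thread i of a warp accesses element base + i,
// so the 32 loads (and 32 stores) of a warp fall in one contiguous span.
__global__ void scaleOnce(const float* in, float* out, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i] * factor;  // one read and one write of mapped memory
    }
}

// Hypothetical helper: fills a mapped pinned input with 0..n-1, runs the kernel
// directly on host memory (zero copy), and stores the host-side sum of the
// outputs in *outSum. Returns 0 on success, -1 on any failure.
int runZeroCopyScale(int n, float factor, double* outSum)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess || !prop.canMapHostMemory)
        return -1;

    // Enable mapping of host allocations. Older runtimes reject this once the
    // context exists; that case is tolerated and the non-sticky error cleared.
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceMapHost);
    if (err != cudaSuccess && err != cudaErrorSetOnActiveProcess) return -1;
    cudaGetLastError();

    size_t bytes = (size_t)n * sizeof(float);
    float *hIn = nullptr, *hOut = nullptr;
    if (cudaHostAlloc((void**)&hIn, bytes, cudaHostAllocMapped) != cudaSuccess) return -1;
    if (cudaHostAlloc((void**)&hOut, bytes, cudaHostAllocMapped) != cudaSuccess) {
        cudaFreeHost(hIn);
        return -1;
    }
    for (int i = 0; i < n; ++i) hIn[i] = (float)i;

    // Device-side aliases of the same host buffers (no cudaMemcpy involved).
    float *dIn = nullptr, *dOut = nullptr;
    if (cudaHostGetDevicePointer((void**)&dIn, hIn, 0) != cudaSuccess ||
        cudaHostGetDevicePointer((void**)&dOut, hOut, 0) != cudaSuccess) {
        cudaFreeHost(hIn);
        cudaFreeHost(hOut);
        return -1;
    }

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    scaleOnce<<<blocks, threads>>>(dIn, dOut, n, factor);
    err = cudaGetLastError();                        // launch errors
    if (err == cudaSuccess) err = cudaDeviceSynchronize();  // results visible to host after this

    double sum = 0.0;
    for (int i = 0; i < n; ++i) sum += hOut[i];
    *outSum = sum;

    cudaFreeHost(hIn);
    cudaFreeHost(hOut);
    return err == cudaSuccess ? 0 : -1;
}
```

Called as `runZeroCopyScale(1 << 20, 2.0f, &sum)`, the outputs are 2i for i in 0..n-1, so the sum should equal n(n-1). Keeping the index mapping identical to an ordinary global-memory kernel is the point: nothing special is needed for zero copy beyond touching each element once.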
Quoting the book CUDA Programming by S. Cook:
If you think about what happens with access to global memory, an entire cache line is brought in from memory on compute 2.x hardware. Even on compute 1.x hardware the same 128 bytes, potentially reduced to 64 or 32, is fetched from global memory. NVIDIA does not publish the size of the PCI-E transfers it uses, or details on how zero copy is actually implemented. However, the coalescing approach used for global memory could be used with PCI-E transfer. The warp memory latency hiding model can equally be applied to PCI-E transfers, providing there is enough arithmetic density to hide the latency of the PCI-E transfers.
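For the 2D case in the question, one option is to pad the rows of the mapped buffer by hand, since `cudaMallocPitch` only allocates device memory. This sketch assumes a 128-byte row alignment, matching the cache-line size quoted above for compute 2.x; as the quote says, NVIDIA does not document how zero copy maps onto PCI-E transfers, so whether padding helps there is something to profile, not a guarantee. `paddedPitch`, `addRowIndex`, and `runPadded2D` are hypothetical names.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper: round a row's byte width up to a multiple of `align`
// (128 assumed here), mimicking what cudaMallocPitch does for device memory.
static size_t paddedPitch(size_t widthBytes, size_t align = 128)
{
    return (widthBytes + align - 1) / align * align;
}

// threadIdx.x walks along a row, so each warp reads 32 consecutive floats that
// start at an aligned row base. Each element is read once and written once.
__global__ void addRowIndex(float* img, size_t pitch, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height) {
        float* row = (float*)((char*)img + y * pitch);
        row[x] += (float)y;
    }
}

// Returns 0 on success; *outCheck receives img[height-1][width-1] after the kernel.
int runPadded2D(int width, int height, float* outCheck)
{
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceMapHost);
    if (err != cudaSuccess && err != cudaErrorSetOnActiveProcess) return -1;
    cudaGetLastError();  // clear the tolerated, non-sticky error

    // cudaHostAlloc returns page-aligned memory, so with a padded pitch every
    // row starts on a 128-byte boundary.
    size_t pitch = paddedPitch((size_t)width * sizeof(float));
    float* hImg = nullptr;
    if (cudaHostAlloc((void**)&hImg, pitch * height, cudaHostAllocMapped) != cudaSuccess)
        return -1;

    for (int y = 0; y < height; ++y) {
        float* row = (float*)((char*)hImg + y * pitch);
        for (int x = 0; x < width; ++x) row[x] = (float)x;
    }

    float* dImg = nullptr;
    if (cudaHostGetDevicePointer((void**)&dImg, hImg, 0) != cudaSuccess) {
        cudaFreeHost(hImg);
        return -1;
    }

    dim3 block(32, 8);  // blockDim.x == warp size: each warp covers one row segment
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    addRowIndex<<<grid, block>>>(dImg, pitch, width, height);
    err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();

    float* last = (float*)((char*)hImg + (size_t)(height - 1) * pitch);
    *outCheck = last[width - 1];
    cudaFreeHost(hImg);
    return err == cudaSuccess ? 0 : -1;
}
```

Making `blockDim.x` equal to the warp size keeps each warp on a single row, which is the same reasoning used for padded global memory; whether there is enough arithmetic per element to hide PCI-E latency is a separate question the quote raises.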
A similar question, "c++ - CUDA pinned memory and coalescing", is on Stack Overflow: https://stackoverflow.com/questions/19101354/