cuda - Clarifying memory transactions in CUDA

I am confused by the following statements in section 5.3.2.1 of the CUDA Programming Guide 4.0,
in the Performance Guidelines chapter.

Global memory resides in device memory and device memory is accessed
via 32-, 64-, or 128-byte memory transactions. 

These memory transactions must be naturally aligned: Only the 32-, 64-, or
128-byte segments of device memory
that are aligned to their size (i.e. whose first address is a
multiple of their size) can be read or written by memory
transactions.

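1)
My understanding of device memory was that accesses to it by threads are uncached: so if a thread accesses memory location a[i], it fetches only a[i] and none of the
values around a[i]. So the first statement seems to contradict this. Or perhaps I am misunderstanding the usage of the phrase "memory transaction"?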

2) The second sentence does not seem very clear. Can someone explain it?

Best answer

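Memory transactions are performed per warp, not per thread. So a 32-byte transaction is a warp-sized read of an 8-bit type, a 64-byte transaction is a warp-sized read of a 16-bit type, and a 128-byte transaction is a warp-sized read of a 32-bit type.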
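The arithmetic behind those three sizes can be sketched as follows. This is only an illustration, assuming a warp of 32 threads in which every thread reads one contiguous element; `WARP_SIZE` and `warp_request_bytes` are hypothetical names introduced here, not part of any CUDA API.

```python
# Illustration only: bytes requested by one warp when each of its threads
# reads a single element from a contiguous array.
WARP_SIZE = 32  # threads per warp (assumed, as on current NVIDIA GPUs)


def warp_request_bytes(element_size_bytes: int) -> int:
    """Return the total bytes one warp requests for one element per thread."""
    return WARP_SIZE * element_size_bytes


for bits in (8, 16, 32):
    size = warp_request_bytes(bits // 8)
    print(f"{bits}-bit elements: {size} bytes per warp")
```

So a single thread reading a[i] does not by itself pull in neighbouring values; the transaction size reflects the combined request of the whole warp.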

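This just means that all reads must be aligned to a natural word-size boundary. It is not possible for a warp to read a 128-byte transaction with a 1-byte offset. See this answer for more details.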
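The alignment rule can be modelled with a few lines of address arithmetic. The sketch below is a simplified model, not a description of what any particular GPU does: it assumes a fixed 128-byte segment size and ignores caching and compute-capability differences. `is_naturally_aligned` and `segments_touched` are hypothetical helper names.

```python
# Simplified model of naturally aligned segments (assumption: fixed segment
# size, no caching effects). Addresses are plain byte offsets.

def is_naturally_aligned(address: int, size: int) -> bool:
    """A segment is naturally aligned if its first address is a multiple of its size."""
    return address % size == 0


def segments_touched(start_address: int, num_bytes: int, segment_size: int = 128) -> list:
    """Return the start addresses of the aligned segments covering a byte range."""
    first = start_address // segment_size
    last = (start_address + num_bytes - 1) // segment_size
    return [index * segment_size for index in range(first, last + 1)]


# A 128-byte request starting at an aligned address fits in one segment.
print(segments_touched(0, 128))
# The same request shifted by 1 byte straddles two aligned segments.
print(segments_touched(1, 128))
```

In this model a request that starts 1 byte past a boundary cannot be served by one 128-byte transaction, because no aligned segment begins at that address; it has to be covered by the two aligned segments it overlaps.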

Regarding cuda - Clarifying memory transactions in CUDA, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/11908142/
