CUDA lets programmers write expressions such as a & b | ~c (where a, b, and c are unsigned int).
What does the GPU do internally? Does it somehow "emulate" bitwise operations on integers, or are they as efficient as on a traditional CPU?
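For reference, here is a minimal sketch of how the expression in the question is parsed. The function name `combine` is hypothetical and not part of the original question; it is written as plain C, and the same expression should be accepted unchanged inside a CUDA kernel.

```c
/* Hypothetical helper mirroring the expression from the question. */
static unsigned int combine(unsigned int a, unsigned int b, unsigned int c)
{
    /* a & b | ~c is parsed as (a & b) | (~c):
       unary ~ binds tightest, then &, then |.
       The parentheses are spelled out here only to make that explicit. */
    return (a & b) | (~c);
}
```

Each of the three operators is a single bitwise operation on 32-bit values, so the whole expression amounts to three such operations per thread.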
Best answer
According to the CUDA Programming Guide v2.3 (Section 5.1.1.1), bitwise operations run at full speed (8 operations per clock cycle):
Integer Arithmetic
Throughput of integer add is 8 operations per clock cycle.
Throughput of 32-bit integer multiplication is 2 operations per clock cycle, but mul24 provides 24-bit integer multiplication with a throughput of 8 operations per clock cycle. On future architectures however, mul24 will be slower than 32-bit integer multiplication, so we recommend to provide two kernels, one using mul24 and the other using generic 32-bit integer multiplication, to be called appropriately by the application.
Integer division and modulo operation are particularly costly and should be avoided if possible or replaced with bitwise operations whenever possible: If n is a power of 2, (i/n) is equivalent to (i>>log2(n)) and (i%n) is equivalent to (i&(n-1)); the compiler will perform these conversions if n is literal.
Comparison
Throughput of compare, min, max is 8 operations per clock cycle.
Bitwise Operations
Throughput of any bitwise operation is 8 operations per clock cycle.
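The power-of-2 rewrite mentioned in the quoted passage can be sketched as follows. This is an illustration, not code from the guide: the helper names `log2_pow2`, `div_pow2`, and `mod_pow2` are hypothetical, and the equivalence is shown for unsigned operands only (for negative signed values, a right shift and a division do not generally agree).

```c
#include <assert.h>

/* Hypothetical helper: log2(n), valid only when n is a power of 2. */
static unsigned int log2_pow2(unsigned int n)
{
    assert(n != 0 && (n & (n - 1u)) == 0);  /* precondition: power of 2 */
    unsigned int shift = 0;
    while ((n >> shift) != 1u) {
        ++shift;
    }
    return shift;
}

/* i / n rewritten as a right shift by log2(n). */
static unsigned int div_pow2(unsigned int i, unsigned int n)
{
    return i >> log2_pow2(n);
}

/* i % n rewritten as a mask with n - 1. */
static unsigned int mod_pow2(unsigned int i, unsigned int n)
{
    assert(n != 0 && (n & (n - 1u)) == 0);  /* precondition: power of 2 */
    return i & (n - 1u);
}
```

When n is a literal, the guide states that the compiler performs this conversion itself, so a hand-written version like this only matters when n is known to be a power of 2 but is not a compile-time constant.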
Regarding "cuda - How does a GPU (Geforce 9800) implement bitwise integer operations?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/4264824/