cuda - How does a GPU (GeForce 9800) implement bitwise integer operations?

Tags: cuda gpu bitwise-operators

CUDA lets the programmer write something like a & b | ~c (where a, b, and c are unsigned int).

What does the GPU do internally? Does it somehow "emulate" bitwise operations on integers, or are they as efficient as on a traditional CPU?

Best answer

According to the CUDA Programming Guide v2.3 (section 5.1.1.1), bitwise operations run at full speed (8 operations per clock cycle):

Integer Arithmetic

Throughput of integer add is 8 operations per clock cycle.

Throughput of 32-bit integer multiplication is 2 operations per clock cycle, but __mul24 provides 24-bit integer multiplication with a throughput of 8 operations per clock cycle. On future architectures however, __mul24 will be slower than 32-bit integer multiplication, so we recommend to provide two kernels, one using __mul24 and the other using generic 32-bit integer multiplication, to be called appropriately by the application.
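The two-kernel recommendation in the quoted paragraph could look roughly like the following. This is a minimal sketch, not the guide's own code: the names scaleMul24, scaleMul32, and scaleOnDevice are hypothetical, and the dispatch rule (use __mul24 on compute capability 1.x, the generic multiply otherwise) is an assumption about which devices count as "future architectures". Because __mul24 only multiplies the low 24 bits of each operand, the two kernels agree only when both operands fit in 24 bits. The code assumes a current CUDA toolkit, device 0, and n > 0.

```cuda
#include <cuda_runtime.h>

// Variant 1: 24-bit multiply intrinsic. Correct only if in[i] and factor
// both fit in 24 bits.
__global__ void scaleMul24(const int* in, int* out, int factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = __mul24(in[i], factor);
}

// Variant 2: generic 32-bit multiply.
__global__ void scaleMul32(const int* in, int* out, int factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * factor;
}

// Hypothetical host helper: picks one of the two kernels at run time.
// ASSUMPTION: compute capability 1.x is where __mul24 is the faster choice.
bool scaleOnDevice(const int* hostIn, int* hostOut, int factor, int n)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, 0) != cudaSuccess) return false;

    size_t bytes = static_cast<size_t>(n) * sizeof(int);
    int* dIn = nullptr;
    int* dOut = nullptr;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return false;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return false; }

    bool ok = (cudaMemcpy(dIn, hostIn, bytes, cudaMemcpyHostToDevice) == cudaSuccess);
    if (ok) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        if (prop.major < 2)
            scaleMul24<<<blocks, threads>>>(dIn, dOut, factor, n);
        else
            scaleMul32<<<blocks, threads>>>(dIn, dOut, factor, n);
        ok = (cudaGetLastError() == cudaSuccess) &&
             (cudaMemcpy(hostOut, dOut, bytes, cudaMemcpyDeviceToHost) == cudaSuccess);
    }
    cudaFree(dIn);
    cudaFree(dOut);
    return ok;
}
```

A caller would pass host arrays, for example scaleOnDevice(in, out, 7, 4), and check the returned flag before reading out.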

Integer division and modulo operation are particularly costly and should be avoided if possible or replaced with bitwise operations whenever possible: If n is a power of 2, (i/n) is equivalent to (i>>log2(n)) and (i%n) is equivalent to (i&(n-1)); the compiler will perform these conversions if n is literal.
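The shift-and-mask replacement described in the quoted paragraph can be sketched as below for the case where the power of two is only known at run time, so the compiler cannot do the conversion itself. The names divModPow2 and divModPow2OnDevice are hypothetical. The identities hold for non-negative values, so the sketch uses unsigned int throughout; it also assumes log2n < 32, n > 0 elements, and a current CUDA toolkit.

```cuda
#include <cuda_runtime.h>

// For divisor n = 2^log2n, computes in[i] / n and in[i] % n
// without a division instruction.
__global__ void divModPow2(const unsigned int* in, unsigned int* quot,
                           unsigned int* rem, unsigned int log2n, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) {
        unsigned int mask = (1u << log2n) - 1u;  // n - 1
        quot[i] = in[i] >> log2n;                // equivalent to in[i] / n
        rem[i]  = in[i] & mask;                  // equivalent to in[i] % n
    }
}

// Hypothetical host helper: copies the input over, runs the kernel,
// and copies both results back.
bool divModPow2OnDevice(const unsigned int* in, unsigned int* quot,
                        unsigned int* rem, unsigned int log2n, int count)
{
    size_t bytes = static_cast<size_t>(count) * sizeof(unsigned int);
    unsigned int* d[3] = {nullptr, nullptr, nullptr};  // in, quot, rem
    bool ok = true;
    for (int k = 0; k < 3 && ok; ++k)
        ok = (cudaMalloc(&d[k], bytes) == cudaSuccess);
    if (ok)
        ok = (cudaMemcpy(d[0], in, bytes, cudaMemcpyHostToDevice) == cudaSuccess);
    if (ok) {
        int threads = 256;
        int blocks = (count + threads - 1) / threads;
        divModPow2<<<blocks, threads>>>(d[0], d[1], d[2], log2n, count);
        ok = (cudaGetLastError() == cudaSuccess) &&
             (cudaMemcpy(quot, d[1], bytes, cudaMemcpyDeviceToHost) == cudaSuccess) &&
             (cudaMemcpy(rem, d[2], bytes, cudaMemcpyDeviceToHost) == cudaSuccess);
    }
    for (int k = 0; k < 3; ++k) cudaFree(d[k]);  // cudaFree(nullptr) is a no-op
    return ok;
}
```

When the divisor is a compile-time literal, writing i / 8 and i % 8 directly is simpler, since the quoted text says the compiler performs the same conversion.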

Comparison

Throughput of compare, min, max is 8 operations per clock cycle.

Bitwise Operations

Throughput of any bitwise operation is 8 operations per clock cycle.
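For reference, the expression from the question can be written as a kernel like the one below. This is only an illustrative sketch: combine, combineKernel, and combineOnDevice are hypothetical names, and the code assumes a current CUDA toolkit and n > 0. It shows nothing about the hardware beyond the fact that the operators are used exactly as in host C/C++.

```cuda
#include <cuda_runtime.h>

// The expression from the question. By C/C++ precedence, ~ binds tightest,
// then &, then |, so a & b | ~c parses as (a & b) | (~c).
__host__ __device__ inline unsigned int combine(unsigned int a, unsigned int b,
                                                unsigned int c)
{
    return (a & b) | ~c;
}

// One thread per element.
__global__ void combineKernel(const unsigned int* a, const unsigned int* b,
                              const unsigned int* c, unsigned int* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = combine(a[i], b[i], c[i]);
}

// Hypothetical host helper: copies the three inputs to the device,
// runs the kernel, and copies the result back.
bool combineOnDevice(const unsigned int* a, const unsigned int* b,
                     const unsigned int* c, unsigned int* out, int n)
{
    size_t bytes = static_cast<size_t>(n) * sizeof(unsigned int);
    unsigned int* d[4] = {nullptr, nullptr, nullptr, nullptr};  // a, b, c, out
    const unsigned int* h[3] = {a, b, c};
    bool ok = true;
    for (int k = 0; k < 4 && ok; ++k)
        ok = (cudaMalloc(&d[k], bytes) == cudaSuccess);
    for (int k = 0; k < 3 && ok; ++k)
        ok = (cudaMemcpy(d[k], h[k], bytes, cudaMemcpyHostToDevice) == cudaSuccess);
    if (ok) {
        int threads = 256;
        int blocks = (n + threads - 1) / threads;
        combineKernel<<<blocks, threads>>>(d[0], d[1], d[2], d[3], n);
        ok = (cudaGetLastError() == cudaSuccess) &&
             (cudaMemcpy(out, d[3], bytes, cudaMemcpyDeviceToHost) == cudaSuccess);
    }
    for (int k = 0; k < 4; ++k) cudaFree(d[k]);  // cudaFree(nullptr) is a no-op
    return ok;
}
```

Because combine is marked __host__ __device__, the same function can be called on the CPU to cross-check the values the kernel returns.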

The original question, "cuda - How does a GPU (GeForce 9800) implement bitwise integer operations?", can be found on Stack Overflow: https://stackoverflow.com/questions/4264824/
