c++ - GPU-accelerated sorting (~1GB) and merge sorting (~100GB)

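I'm looking for a C++ library that performs GPU-accelerated sorting (about 1GB of data) and merge sorting (say, about 100GB of data, though the size doesn't matter because merging is a streaming algorithm). The license must be LGPL, BSD, or similar. For portability I strongly prefer OpenCL (but I'm also interested in links to CUDA libraries). Links to papers and blog posts on this topic are appreciated.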
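To illustrate why the data size doesn't matter for the merge step, here is a minimal sketch of a streaming k-way merge in standard C++. It is not taken from any particular library: the names `merge_sorted_runs` and `merge_to_string` are hypothetical, and the input format (whitespace-separated unsigned 64-bit integers in text streams) is an assumption chosen to keep the example short. A real 100GB merge would more likely read binary records in large buffered blocks.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <istream>
#include <ostream>
#include <queue>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Merge already-sorted runs of whitespace-separated unsigned 64-bit integers
// into `out`. Only one pending value per run is held in memory, so the total
// input size does not affect the memory footprint.
// Returns the number of values written.
std::size_t merge_sorted_runs(const std::vector<std::istream*>& runs,
                              std::ostream& out) {
    using Item = std::pair<std::uint64_t, std::size_t>;  // (value, source run)
    std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;  // min-heap

    // Prime the heap with the first value of every non-empty run.
    for (std::size_t i = 0; i < runs.size(); ++i) {
        std::uint64_t v;
        if (*runs[i] >> v) heap.emplace(v, i);
    }

    std::size_t written = 0;
    while (!heap.empty()) {
        const Item top = heap.top();
        heap.pop();
        out << top.first << '\n';
        ++written;

        // Refill from the run that just supplied the smallest value.
        std::uint64_t next;
        if (*runs[top.second] >> next) heap.emplace(next, top.second);
    }
    return written;
}

// Convenience wrapper for small demos: collect the merged output in a string.
std::string merge_to_string(const std::vector<std::istream*>& runs) {
    std::ostringstream out;
    merge_sorted_runs(runs, out);
    return out.str();
}
```

In practice each run would be a `std::ifstream` over a chunk that was sorted earlier (on the GPU or CPU), and `out` would be a `std::ofstream`. The heap holds at most one element per run, which is what makes the merge independent of the total data size.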

Some background (please correct me if I'm wrong):

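A 2-way merge sort of 1GB (i.e. 128 000 000 8-byte entities) would consume about log₂(128 000 000)·1GB ≈ 27GB of memory bandwidth, which is about 1 second on a modern CPU with roughly 30GB/s of sequential memory bandwidth. (Any non-merge sort seems likely to take much longer, because non-sequential memory access is 10-100 times slower.)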

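Although I'm not familiar with modern GPUs, I suspect a 1GB merge sort would take 0.2 seconds or even less, since typical GPU memory bandwidth is about 150GB/s, as on the AMD/ATI 58xx (see, e.g., http://en.wikipedia.org/wiki/Comparison_of_AMD_graphics_processing_units#Evergreen_.28HD_5xxx.29_series).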

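That is at least a 5x speedup. (Transferring 1GB over 16x PCI-E 2.0 takes about 0.125 seconds, but it seems possible to overlap the PCI transfer with the sorting; however, that would probably require a video card with 2GB or 3GB of memory rather than 1GB.)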

I suspect it could be faster still with a more-than-2-way merge sort, or with some other kind of sort better suited to GPUs.
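The back-of-envelope estimates above can be reproduced with a few lines of C++. This is only a sketch of the question's own model, in which every merge pass streams the whole data set once at the quoted bandwidth. The function names are hypothetical, and the 8GB/s figure for 16x PCI-E 2.0 is the rate implied by the 0.125 second estimate. A real merge pass both reads and writes every element, so actual memory traffic may be closer to double on both CPU and GPU, which would leave the ratio between them unchanged.

```cpp
#include <cstdint>

// Number of passes a k-way merge sort makes over n elements, i.e.
// ceil(log_k(n)). Assumes k >= 2.
int merge_passes(std::uint64_t n, std::uint64_t k = 2) {
    int passes = 0;
    std::uint64_t run = 1;  // sorted run length before the next pass
    while (run < n) {
        ++passes;
        if (run > n / k) break;  // this pass already covers n (also avoids overflow)
        run *= k;
    }
    return passes;
}

// Seconds to sort under the simplified model: each pass streams the whole
// data set once at the given bandwidth (sizes in GB, bandwidth in GB/s).
double sort_seconds(double gigabytes, int passes, double bandwidth_gb_per_s) {
    return gigabytes * passes / bandwidth_gb_per_s;
}

// Seconds to move the data once over a link such as PCI-E.
double transfer_seconds(double gigabytes, double link_gb_per_s) {
    return gigabytes / link_gb_per_s;
}
```

With 128 000 000 elements, `merge_passes` gives 27 passes for a 2-way merge, so `sort_seconds(1.0, 27, 30.0)` is 0.9 seconds for the CPU figure and `sort_seconds(1.0, 27, 150.0)` is 0.18 seconds for the GPU figure, a ratio of 5. A 16-way merge would need only 7 passes under the same model, which is the intuition behind expecting a multi-way merge to be faster.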

Best answer

Have you looked at Thrust?

From the project page:

Thrust is a parallel algorithms library which resembles the C++ Standard Template Library (STL). Thrust's high-level interface greatly enhances developer productivity while enabling performance portability between GPUs and multicore CPUs. Interoperability with established technologies (such as CUDA, TBB and OpenMP) facilitates integration with existing software. Develop high-performance applications rapidly with Thrust!

The license is Apache, so it should suit you.

Source: a similar question on Stack Overflow: https://stackoverflow.com/questions/14593791/
