sorting - Thrust::sort 有多快以及最快的基数排序实现是什么

我是 GPU 编程的新手。最近，我正在尝试根据一个教程实现gpu bvh构建算法:http://devblogs.nvidia.com/parallelforall/thinking-parallel-part-iii-tree-construction-gpu/ 。在该算法的第一步中，计算并排序每个原语的莫顿代码(unsigned int)。教程给出了12K对象的morton代码计算和排序的引用时间成本:

0.02 毫秒，每个对象一个线程:计算边界框并分配 Morton 代码。
0.18 ms，并行基数排序:根据对象的 Morton 代码对对象进行排序。 ...

在我的实现中，第一步花费了 0.1ms，排序步骤花费了 1.8ms。我正在使用推力进行排序。那么在 GPU 上最快的基数排序实现是什么呢？

我使用的是 Geforce Titan GPU，它应该比本教程作者使用的 Geforce GTX690 更快。这是我的排序测试代码，即使大小为10，也花费大约1.5ms。

void testSort()
{
    int sz = 10;
    thrust::host_vector<unsigned int> h_keys(sz);
    for(int i=0; i<sz; i++)
        h_keys[i] = rand();
    thrust::device_ptr<unsigned int> keys = thrust::device_malloc<unsigned int>(sz);
    thrust::copy(h_keys.begin(),h_keys.end(),keys);
    cudaEvent_t estart, estop;
    cudaEventCreate( &estart );
    cudaEventCreate( &estop );
    cudaEventRecord( estart, 0 );
    thrust::stable_sort(keys,keys+sz);
    cudaEventRecord( estop, 0 ) ;
    cudaEventSynchronize( estop );
    float elapsedTime;
    cudaEventElapsedTime( &elapsedTime,
        estart, estop ) ;
    printf( "Time to sort: %3.1f ms\n", elapsedTime );
    cudaEventDestroy( estart ) ;
    cudaEventDestroy( estop ) ;
}

最佳答案

GPGPU 有一个基数排序实现 back40computing 。他们提供了一个性能比较图表，声称他们的实现是最快的。

关于sorting - Thrust::sort 有多快以及最快的基数排序实现是什么，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/20854838/

sorting - Thrust::sort 有多快以及最快的基数排序实现是什么

上一篇：physics - 带移动枢轴的摆锤

下一篇：Magento 1.9.1.0 - 无法生成站点地图