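gpgpu - Seem to be hitting a CUDA limit, but what limit is it?

I have a CUDA program that seems to be hitting some kind of limit on some resource, but I can't figure out what that resource is. Here is the kernel function: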

__global__ void DoCheck(float2* points, int* segmentToPolylineIndexMap, 
                        int segmentCount, int* output)
{
    int segmentIndex = threadIdx.x + blockIdx.x * blockDim.x;
    int pointCount = segmentCount + 1;

    if(segmentIndex >= segmentCount)
        return;

    int polylineIndex = segmentToPolylineIndexMap[segmentIndex];
    int result = 0;
    if(polylineIndex >= 0)
    {
        float2 p1 = points[segmentIndex];
        float2 p2 = points[segmentIndex+1];
        float2 A = p2;
        float2 a;
        a.x = p2.x - p1.x;
        a.y = p2.y - p1.y;

        for(int i = segmentIndex+2; i < segmentCount; i++)
        {
            int currentPolylineIndex = segmentToPolylineIndexMap[i];

            // legit only if the other segment belongs to a different
            // polyline and is not a fake segment
            bool isLegit = (currentPolylineIndex != polylineIndex && 
                currentPolylineIndex >= 0);      

            float2 p3 = points[i];
            float2 p4 = points[i+1];
            float2 B = p4;
            float2 b;
            b.x = p4.x - p3.x;
            b.y = p4.y - p3.y;

            float2 c;
            c.x = B.x - A.x;
            c.y = B.y - A.y;

            float2 b_perp;
            b_perp.x = -b.y;
            b_perp.y = b.x;

            float numerator = dot(b_perp, c);
            float denominator = dot(b_perp, a);
            bool isParallel = (denominator == 0.0);

            float quotient = numerator / denominator;
            float2 intersectionPoint;
            intersectionPoint.x = quotient * a.x + A.x;
            intersectionPoint.y = quotient * a.y + A.y;

            result = result | (isLegit && !isParallel && 
                intersectionPoint.x > min(p1.x, p2.x) && 
                intersectionPoint.x > min(p3.x, p4.x) && 
                intersectionPoint.x < max(p1.x, p2.x) && 
                intersectionPoint.x < max(p3.x, p4.x) && 
                intersectionPoint.y > min(p1.y, p2.y) && 
                intersectionPoint.y > min(p3.y, p4.y) && 
                intersectionPoint.y < max(p1.y, p2.y) && 
                intersectionPoint.y < max(p3.y, p4.y));
        }
    }

    output[segmentIndex] = result;
}

Here is the call that launches the kernel:

DoCheck<<<702, 32>>>(
    (float2*)devicePoints, 
    deviceSegmentsToPolylineIndexMap, 
    numSegments, 
    deviceOutput);

The sizes of the arguments are as follows:

devicePoints = 22,464 float2s = 179,712 bytes
deviceSegmentsToPolylineIndexMap = 22,463 ints = 89,852 bytes
numSegments = 1 int = 4 bytes
deviceOutput = 22,463 ints = 89,852 bytes

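When I execute this kernel, it crashes the video card. I seem to be hitting some kind of limit, because if I launch the kernel with DoCheck<<<300, 32>>>(...); it works. To be clear, the arguments are the same; only the number of blocks differs.

Any idea why one launch crashes the video driver while the other doesn't? The failing launch still seems to be within the card's limit on the number of blocks.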

Update with more information about my system configuration:

Video card: nVidia 8800GT
CUDA version: 1.1
OS: Windows Server 2008 R2

I also tried it on a laptop with the following configuration and got the same result:

Video card: nVidia Quadro FX 880M
CUDA version: 1.2
OS: Windows 7 64-bit

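Best answer

The resource being exhausted is time. On all current CUDA platforms, the display driver includes a watchdog timer that kills any kernel that runs for more than a few seconds. Any code running on a card that also drives a display is subject to this limit.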
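Whether a given card is subject to the watchdog can be queried from the CUDA runtime. The following is a minimal sketch, assuming a CUDA 5.0 or later toolkit (newer than the one in the question); the function name CountDevicesWithKernelTimeout is made up for illustration, while cudaDeviceGetAttribute and cudaDevAttrKernelExecTimeout are runtime API names.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns how many CUDA devices report a run-time
// limit on kernels (the display watchdog), or -1 if a runtime call fails.
int CountDevicesWithKernelTimeout()
{
    int deviceCount = 0;
    if (cudaGetDeviceCount(&deviceCount) != cudaSuccess)
        return -1;

    int limited = 0;
    for (int d = 0; d < deviceCount; ++d)
    {
        int timeoutEnabled = 0;
        if (cudaDeviceGetAttribute(&timeoutEnabled,
                                   cudaDevAttrKernelExecTimeout, d) != cudaSuccess)
            return -1;

        printf("device %d: run-time limit on kernels: %s\n",
               d, timeoutEnabled ? "yes" : "no");
        if (timeoutEnabled)
            ++limited;
    }
    return limited;
}
```

A device that reports "yes" here is subject to the watchdog, so a long-running kernel on it can be killed as described above.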

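On the WDDM Windows platforms you are using, there are three possible solutions/workarounds:

1. Buy a Tesla card and use the TCC driver, which eliminates the problem completely.
2. Try modifying the registry settings to increase the timer limit (search for the TdrDelay registry key for more information; I'm not a Windows user and can't be more specific than that).
3. Modify your kernel code to be "re-entrant", and process the data-parallel workload in several kernel launches rather than one. Kernel launch overhead isn't all that large, and processing the workload over several kernel runs is often pretty easy to achieve, depending on the algorithm you are using.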
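The third option can be sketched as follows for the kernel in the question. This is only one possible way to split the work, not the answerer's code: the inner loop over i is cut into ranges, and each launch ORs its partial result into output. The names SegmentsCross, DoCheckRange, RunDoCheckInChunks and the chunkSize parameter are hypothetical, the intersection test is rewritten with an early-out branch instead of the question's branch-free form, and cudaDeviceSynchronize assumes CUDA 4.0 or later (older toolkits used cudaThreadSynchronize).

```cuda
#include <cuda_runtime.h>

// Hypothetical helper: the question's intersection test for segments
// p1->p2 and p3->p4, written with an early return for parallel segments.
__device__ int SegmentsCross(float2 p1, float2 p2, float2 p3, float2 p4)
{
    float2 a = make_float2(p2.x - p1.x, p2.y - p1.y);
    float2 b = make_float2(p4.x - p3.x, p4.y - p3.y);
    float2 c = make_float2(p4.x - p2.x, p4.y - p2.y);
    float numerator   = -b.y * c.x + b.x * c.y;   // dot(b_perp, c)
    float denominator = -b.y * a.x + b.x * a.y;   // dot(b_perp, a)
    if (denominator == 0.0f)
        return 0;                                 // parallel segments
    float q = numerator / denominator;
    float x = q * a.x + p2.x;
    float y = q * a.y + p2.y;
    return x > fminf(p1.x, p2.x) && x > fminf(p3.x, p4.x) &&
           x < fmaxf(p1.x, p2.x) && x < fmaxf(p3.x, p4.x) &&
           y > fminf(p1.y, p2.y) && y > fminf(p3.y, p4.y) &&
           y < fmaxf(p1.y, p2.y) && y < fmaxf(p3.y, p4.y);
}

// One launch only tests candidate segments i in [iBegin, iEnd).
__global__ void DoCheckRange(const float2* points,
                             const int* segmentToPolylineIndexMap,
                             int segmentCount, int iBegin, int iEnd,
                             int* output)
{
    int segmentIndex = threadIdx.x + blockIdx.x * blockDim.x;
    if (segmentIndex >= segmentCount)
        return;

    int polylineIndex = segmentToPolylineIndexMap[segmentIndex];
    if (polylineIndex < 0)
        return;                       // output was zeroed by the host

    float2 p1 = points[segmentIndex];
    float2 p2 = points[segmentIndex + 1];
    int start = max(iBegin, segmentIndex + 2);
    int stop  = min(iEnd, segmentCount);

    int result = 0;
    for (int i = start; i < stop; i++)
    {
        int other = segmentToPolylineIndexMap[i];
        if (other == polylineIndex || other < 0)
            continue;                 // same polyline or fake segment
        result |= SegmentsCross(p1, p2, points[i], points[i + 1]);
    }

    // Each thread owns output[segmentIndex], so no atomics are needed.
    if (result)
        output[segmentIndex] |= result;
}

// Host side: zero the output once, then issue one short launch per range.
// chunkSize is an assumed tuning parameter, not a value from the question.
cudaError_t RunDoCheckInChunks(const float2* devicePoints, const int* deviceMap,
                               int numSegments, int* deviceOutput, int chunkSize)
{
    cudaError_t err = cudaMemset(deviceOutput, 0, numSegments * sizeof(int));
    if (err != cudaSuccess)
        return err;

    const int threads = 32;
    const int blocks = (numSegments + threads - 1) / threads;
    for (int iBegin = 0; iBegin < numSegments; iBegin += chunkSize)
    {
        int iEnd = (iBegin + chunkSize < numSegments) ? iBegin + chunkSize
                                                      : numSegments;
        DoCheckRange<<<blocks, threads>>>(devicePoints, deviceMap, numSegments,
                                          iBegin, iEnd, deviceOutput);
        err = cudaDeviceSynchronize();   // let each launch finish on its own
        if (err != cudaSuccess)
            return err;
    }
    return cudaSuccess;
}
```

With the question's data this would be called as RunDoCheckInChunks((float2*)devicePoints, deviceSegmentsToPolylineIndexMap, numSegments, deviceOutput, chunkSize), with chunkSize chosen small enough that a single launch stays well under the watchdog limit.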

Regarding "gpgpu - Seem to be hitting a CUDA limit, but what limit is it?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/52239489/
