我使用 CUDA 7.0 和 nVidia 980 GTX 进行一些图像处理。在特定迭代中,通过 15-20 次内核调用和多次 cuFFT FFT/IFFT API 调用独立处理多个图 block 。
因此,我将每个图 block 放在它自己的 CUDA 流中,这样每个图 block 都会相对于主机异步执行它的操作字符串。每个图 block 在一次迭代中大小相同,因此它们共享一个 cuFFT 计划。主机线程快速通过命令以尝试让 GPU 加载工作。虽然这些操作是并行处理的,但我遇到了周期性的竞争条件,尤其是对 cuFFT 有疑问。如果我使用 cuFFTSetStream() 将 cuFFT 计划放置在流 0 中,用于图 block 0,并且在主机将共享 cuFFT 计划的流设置为图 block 1 的流 1 之前,图 block 0 的 FFT 实际上尚未在 GPU 上执行它发布 tile 1 在 GPU 上的工作,对于这个计划,cuFFTExec() 的行为是什么?
更简洁地说,是否在 cufftExec() 调用时计划设置为流中执行对 cufftExec() 的调用,而不管 cuFFTSetStream() 是否用于在先前的 FFT 调用之前更改后续图 block 的流实际上已经开始/完成了吗?
很抱歉没有发布代码,但我无法发布我的实际来源。
最佳答案
编辑: 正如评论中所指出的,如果相同的计划(相同的创建句柄)用于通过流在同一设备上同时执行 FFT,则 the user is responsible for managing separate work areas for each usage of such plan .这个问题似乎侧重于流行为本身,我剩下的答案也侧重于此,但这是重要的一点。
If I place a cuFFT plan in a stream 0 using cuFFTSetStream() for tile 0, and the FFT for tile 0 hasn't actually been executed on the GPU yet before the host sets the shared cuFFT plan's stream to stream 1 for tile 1 before it issues tile 1's work on the GPU, what is the behavior of cuFFTExec() for this plan?
假设您说的是流 1 和流 2,这样我们就可以避免围绕 NULL 流的任何可能的混淆。
CUFFT 应该尊重在计划通过 cufftExecXXX()
传递给 CUFFT 时为计划定义的流。通过 cufftSetStream()
对计划进行的后续更改应该不会影响用于先前发出的 cufftExecXXX()
调用的流。
我们可以使用分析器通过相当简单的测试来验证这一点。考虑以下测试代码:
$ cat t1089.cu
// NOTE: this code omits independent work-area handling for each plan
// which is necessary for a plan that will be shared between streams
// and executed concurrently
#include <cufft.h>
#include <assert.h>
#include <nvToolsExt.h>
#define DSIZE 1048576
#define BATCH 100
int main(){
const int nx = DSIZE;
const int nb = BATCH;
size_t ws = 0;
cufftHandle plan;
cufftResult res = cufftCreate(&plan);
assert(res == CUFFT_SUCCESS);
res = cufftMakePlan1d(plan, nx, CUFFT_C2C, nb, &ws);
assert(res == CUFFT_SUCCESS);
cufftComplex *d;
cudaMalloc(&d, nx*nb*sizeof(cufftComplex));
cudaMemset(d, 0, nx*nb*sizeof(cufftComplex));
cudaStream_t s1, s2;
cudaStreamCreate(&s1);
cudaStreamCreate(&s2);
res = cufftSetStream(plan, s1);
assert(res == CUFFT_SUCCESS);
res = cufftExecC2C(plan, d, d, CUFFT_FORWARD);
assert(res == CUFFT_SUCCESS);
res = cufftSetStream(plan, s2);
assert(res == CUFFT_SUCCESS);
nvtxMarkA("plan stream change");
res = cufftExecC2C(plan, d, d, CUFFT_FORWARD);
assert(res == CUFFT_SUCCESS);
cudaDeviceSynchronize();
return 0;
}
$ nvcc -o t1089 t1089.cu -lcufft -lnvToolsExt
$ cuda-memcheck ./t1089
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$
我们只是连续执行两个前向 FFT,在两者之间切换流。我们将使用 nvtx marker清楚地识别计划流关联更改请求发生的时间点。现在让我们看看 nvprof --print-api-trace
输出(删除冗长的启动序言):
983.84ms 617.00us cudaMalloc
984.46ms 21.628us cudaMemset
984.48ms 37.546us cudaStreamCreate
984.52ms 121.34us cudaStreamCreate
984.65ms 995ns cudaPeekAtLastError
984.67ms 996ns cudaConfigureCall
984.67ms 517ns cudaSetupArgument
984.67ms 21.908us cudaLaunch (void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [416])
984.69ms 349ns cudaGetLastError
984.69ms 203ns cudaPeekAtLastError
984.70ms 296ns cudaConfigureCall
984.70ms 216ns cudaSetupArgument
984.70ms 8.8920us cudaLaunch (void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [421])
984.71ms 272ns cudaGetLastError
984.71ms 177ns cudaPeekAtLastError
984.72ms 314ns cudaConfigureCall
984.72ms 229ns cudaSetupArgument
984.72ms 9.9230us cudaLaunch (void spRadix0256B::kernel3Mem<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=2, L1, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, float>) [426])
984.73ms 295ns cudaGetLastError
984.77ms - [Marker] plan stream change
984.77ms 434ns cudaPeekAtLastError
984.78ms 357ns cudaConfigureCall
984.78ms 228ns cudaSetupArgument
984.78ms 10.642us cudaLaunch (void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [431])
984.79ms 287ns cudaGetLastError
984.79ms 193ns cudaPeekAtLastError
984.80ms 293ns cudaConfigureCall
984.80ms 208ns cudaSetupArgument
984.80ms 7.7620us cudaLaunch (void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [436])
984.81ms 297ns cudaGetLastError
984.81ms 178ns cudaPeekAtLastError
984.81ms 269ns cudaConfigureCall
984.81ms 214ns cudaSetupArgument
984.81ms 7.4130us cudaLaunch (void spRadix0256B::kernel3Mem<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=2, L1, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, float>) [441])
984.82ms 312ns cudaGetLastError
984.82ms 152.63ms cudaDeviceSynchronize
$
我们看到每个 FFT 运算需要 3 次内核调用。在这两者之间,我们看到我们的 nvtx 标记指示计划流更改的请求何时发出,并且这发生在前 3 个内核启动之后但在最后 3 个之前就不足为奇了。最后,我们注意到基本上所有的执行时间被吸收在最后的 cudaDeviceSynchronize()
调用中。前面的所有调用都是异步的,因此在执行的第一毫秒内或多或少地“立即”执行。最后的同步吸收了 6 个内核的所有处理时间,总计约 150 毫秒。
因此,如果 cufftSetStream
对 cufftExecC2C()
调用的第一次迭代有影响,我们希望看到前 3 个内核中的部分或全部启动到与用于最后 3 个内核的流相同的流中。但是当我们查看 nvprof --print-gpu-trace
输出时:
$ nvprof --print-gpu-trace ./t1089
==3757== NVPROF is profiling process 3757, command: ./t1089
==3757== Profiling application: ./t1089
==3757== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Name
974.74ms 7.3440ms - - - - - 800.00MB 106.38GB/s Quadro 5000 (0) 1 7 [CUDA memset]
982.09ms 23.424ms (25600 2 1) (32 8 1) 32 8.0000KB 0B - - Quadro 5000 (0) 1 13 void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [416]
1.00551s 21.172ms (25600 2 1) (32 8 1) 32 8.0000KB 0B - - Quadro 5000 (0) 1 13 void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [421]
1.02669s 27.551ms (25600 1 1) (16 16 1) 61 17.000KB 0B - - Quadro 5000 (0) 1 13 void spRadix0256B::kernel3Mem<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=2, L1, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, float>) [426]
1.05422s 23.592ms (25600 2 1) (32 8 1) 32 8.0000KB 0B - - Quadro 5000 (0) 1 14 void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [431]
1.07781s 21.157ms (25600 2 1) (32 8 1) 32 8.0000KB 0B - - Quadro 5000 (0) 1 14 void spRadix0064B::kernel1Mem<unsigned int, float, fftDirection_t=-1, unsigned int=32, unsigned int=4, CONSTANT, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix1_t, unsigned int, float>) [436]
1.09897s 27.913ms (25600 1 1) (16 16 1) 61 17.000KB 0B - - Quadro 5000 (0) 1 14 void spRadix0256B::kernel3Mem<unsigned int, float, fftDirection_t=-1, unsigned int=16, unsigned int=2, L1, ALL, WRITEBACK>(kernel_parameters_t<fft_mem_radix3_t, unsigned int, float>) [441]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$
我们看到实际上前 3 个内核被发布到第一个流中,最后 3 个内核被发布到第二个流中,正如所要求的那样。 (所有内核的总执行时间约为 150 毫秒,正如 api 跟踪输出所建议的那样。)由于底层内核启动是异步的,并且在 cufftExecC2C()
返回之前发出call,如果你仔细考虑一下,你会得出结论,它必须是这样的。启动内核的流是在内核启动时指定的。 (当然,我认为这被认为是“首选”行为。)
关于并发流中的 CUDA cuFFT API 行为,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/35488348/