c++ - How do I merge the output of multiple threads when it is variable-length?

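I'm implementing an OpenCL kernel in which each thread may generate a different amount of data. It is basically a radius search function, so each point can have a different number of elements around it.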

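I could of course run it twice, once just to figure out how many elements I need so I can allocate on the C++ side, but that is a poor approach. Is there a way to "save" my state somewhere in the kernel code, exit, reallocate the resources needed to pull the data out, and then have another kernel pull the data out?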

Best answer

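If you want to implement this robustly, you will need some kind of heuristic that answers "What is the maximum possible output my program can generate?". I don't know the details of your algorithm, so I can't judge how complicated that is to determine. My advice is to find a "lean" version of the algorithm whose only job is to evaluate, for each work item, "Does this generate a solution? If so, atomically increment a global variable."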

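//Host Code
cl_long zero = 0; //the counter must start at zero, so copy an initial value into the buffer
cl_mem mem = clCreateBuffer(context, CL_MEM_READ_WRITE | CL_MEM_COPY_HOST_PTR, sizeof(cl_long), &zero, nullptr);
clSetKernelArg(kernel, /**/, sizeof(cl_mem), &mem);
clEnqueueNDRangeKernel(/*...*/);
cl_long num_of_solutions;
//Blocking read; the last three arguments are the (empty) event wait list and the output event.
clEnqueueReadBuffer(queue, mem, CL_TRUE, 0, sizeof(cl_long), &num_of_solutions, 0, nullptr, nullptr);
//Increase the size of your final buffer to accommodate the number of solutions reported.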

//Kernel Code
//In OpenCL 1.x, 64-bit atomics are an optional extension: the device must support
//cl_khr_int64_base_atomics, it must be enabled explicitly, and the 64-bit functions
//are named atom_* (atomic_add only covers 32-bit integers).
#pragma OPENCL EXTENSION cl_khr_int64_base_atomics : enable
kernel void count_solutions(volatile global long * num_of_solutions) {
    size_t id = get_global_id(0);
    /* Implementation is dependent on you, but 'get_number_of_generated_solutions'
     * would, ideally, get the number of generated solutions *without* the heavy lifting
     * associated with actually generating those solutions at all. But that's dependent on
     * whether that's actually possible for your specific algorithm.
     */
    int found_solutions = get_number_of_generated_solutions(id);
    atom_add(num_of_solutions, (long)found_solutions);
}

Then allocate space based on that result.
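Once the total is known, each work item still needs to know where in the final buffer its results belong. One common variation (not part of the answer above) is to store a count per work item instead of a single global counter, then take a prefix sum of those counts to get each item's starting offset. The following is a minimal CPU-side sketch of that count, scan, fill pattern in plain C++, standing in for the two kernels. The names `Point`, `in_radius`, `Neighbors` and `radius_search` are hypothetical, and the brute-force neighbor test is only a placeholder for a real radius search.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical 2D point type used only for this illustration.
struct Point { float x, y; };

// Hypothetical helper: true when b lies within `radius` of a.
static bool in_radius(const Point& a, const Point& b, float radius) {
    const float dx = a.x - b.x;
    const float dy = a.y - b.y;
    return dx * dx + dy * dy <= radius * radius;
}

struct Neighbors {
    // offsets has size n + 1; point i's results occupy
    // indices[offsets[i]] .. indices[offsets[i + 1] - 1].
    std::vector<std::size_t> offsets;
    std::vector<std::size_t> indices;
};

Neighbors radius_search(const std::vector<Point>& pts, float radius) {
    assert(radius >= 0.0f);
    const std::size_t n = pts.size();
    Neighbors out;
    out.offsets.assign(n + 1, 0);

    // Pass 1 (the "lean" kernel): each item only counts its results.
    // The count for item i is stored at offsets[i + 1].
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            if (i != j && in_radius(pts[i], pts[j], radius))
                ++out.offsets[i + 1];

    // Prefix sum: counts become start offsets, and the last entry is the total.
    std::partial_sum(out.offsets.begin(), out.offsets.end(), out.offsets.begin());

    // Allocate the output exactly once, now that the total is known.
    out.indices.resize(out.offsets[n]);

    // Pass 2 (the "full" kernel): each item writes into its own disjoint
    // slice, so no atomics are needed while filling.
    for (std::size_t i = 0; i < n; ++i) {
        std::size_t write_pos = out.offsets[i];
        for (std::size_t j = 0; j < n; ++j)
            if (i != j && in_radius(pts[i], pts[j], radius))
                out.indices[write_pos++] = j;
    }
    return out;
}
```

On a GPU, the two loops over `i` would correspond to two kernel launches with the reallocation done on the host in between. The cost is that the neighbor test runs twice, which is the trade-off the question was hoping to avoid, but it keeps the output tightly packed and deterministic in order.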

Regarding "c++ - How do I merge the output of multiple threads when it is variable-length?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/42905039/
