performance - C++ - Fastest bilinear interpolation?

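I want to speed up my C++ bilinear interpolation code.

The setup is as follows: from a grayscale image img, I want to extract a rectangular patch pat at location cent with unit spacing and no up/downsampling.

Since cent is generally not an integer, I have to bilinearly interpolate the extracted patch.

The image img, the extracted patch pat, and the location cent are stored as float. The patch has size [2*pad+1], where pad is the padding to the left and right of the location cent.

The current solution looks like this: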

    void function(Eigen::Matrix<float, Eigen::Dynamic, 1>* pat,
                  const float* img,
                  const Eigen::Vector2f* cent)
    {
        Eigen::Vector4f we; // bilinear weight vector
        // ... [CROPPED: compute bilinear weights]

        float *pat_it = pat->data();
        for (y=cent[1]-pad; y <= cent[1]+pad; ++y)
        {
            int postmp_a = y * image_width;
            int postmp_b = (y-1) * image_width;

            for (x=cent[0]-pad; x <= cent[0]+pad; ++x, ++pat_it)
            {
                (*pat_it) = we[0] * img[  x + postmp_a] +
                            we[1] * img[x-1 + postmp_a] +
                            we[2] * img[  x + postmp_b] +
                            we[3] * img[x-1 + postmp_b];
            }
        }
    }
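For reference, the cropped weight computation and the loop can be written out in a self-contained form. This is only a sketch under assumptions the question does not state: row-major storage, weights taken from the fractional parts of cent, the floor of cent as the top-left neighbour (the code above instead reaches back to x-1 and y-1), and a patch that lies fully inside the image. The name extract_patch and its parameters are hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper: extracts a (2*pad+1) x (2*pad+1) patch centred at
// (cx, cy) from a row-major grayscale image using bilinear interpolation.
// Assumption: every sample and its right/bottom neighbour lie inside the image.
std::vector<float> extract_patch(const float* img, int width, int height,
                                 float cx, float cy, int pad)
{
    const int n  = 2 * pad + 1;
    const int x0 = static_cast<int>(std::floor(cx)) - pad; // leftmost integer column
    const int y0 = static_cast<int>(std::floor(cy)) - pad; // topmost integer row
    assert(x0 >= 0 && y0 >= 0 && x0 + n < width && y0 + n < height);

    // The fractional offset is the same for every sample in the patch,
    // so the four weights are computed once.
    const float fx = cx - std::floor(cx);
    const float fy = cy - std::floor(cy);
    const float w00 = (1.0f - fx) * (1.0f - fy); // top-left neighbour
    const float w10 = fx * (1.0f - fy);          // top-right neighbour
    const float w01 = (1.0f - fx) * fy;          // bottom-left neighbour
    const float w11 = fx * fy;                   // bottom-right neighbour

    std::vector<float> pat(static_cast<std::size_t>(n) * n);
    float* out = pat.data();
    for (int r = 0; r < n; ++r) {
        const float* row0 = img + (y0 + r) * width + x0; // upper image row
        const float* row1 = row0 + width;                // row below it
        for (int c = 0; c < n; ++c, ++out) {
            *out = w00 * row0[c] + w10 * row0[c + 1] +
                   w01 * row1[c] + w11 * row1[c + 1];
        }
    }
    return pat;
}
```

A convenient sanity check is an image filled with a linear ramp such as 2*x + 3*y: bilinear interpolation is exact for linear functions, so extract_patch(img, 16, 16, 7.25f, 8.5f, 2) should return the ramp evaluated at the fractional sample positions.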

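Is it possible to speed this up further? This function will be called millions of times in a real-time signal processing pipeline. There are no memory constraints.

Is there a specific Eigen function for this?

Since this is the most critical bottleneck in my code, I am also open to moving the code to a different programming language or architecture (assembler, CUDA, etc.). Any ideas or hints on that?

More generally, how would you approach profiling this systematically?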
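On the profiling question, one low-tech starting point is to time each candidate variant in isolation before reaching for heavier tools. The following is a minimal sketch, not a full benchmarking framework; the name mean_ns_per_call is hypothetical, and it assumes the callable has an observable side effect so the optimizer cannot remove the work.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper: returns the mean wall-clock time per call in nanoseconds.
// The callable is run once as a warm-up, then `iterations` times under the timer.
template <class F>
double mean_ns_per_call(F&& f, int iterations)
{
    assert(iterations > 0);
    f(); // warm-up call: touches caches and pages before timing starts

    const auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        f();
    }
    const auto t1 = std::chrono::steady_clock::now();

    return std::chrono::duration<double, std::nano>(t1 - t0).count() / iterations;
}
```

With aggressive flags such as -Ofast, work whose result is never used may be optimized away, so the timed callable should write its output somewhere observable (for example a volatile sink or the real patch buffer). A sampling profiler such as perf is a useful complement for finding where the time goes inside the pipeline as a whole.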

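More details: the code is compiled with "-Ofast -std=c++11" and already runs in parallel using OpenMP. The image size is about 1000x1200 pixels, and the padding is between 5 and 10 pixels.

Edit

By using pointers directly to the 4 corresponding image locations, I have managed to improve the speed by about 6%.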

    ...
    for (x=cent[0]-pad; x <= cent[0]+pad; ++x, ++pat_it, ++img_a, ++img_b, ++img_c, ++img_d)
    {
        (*pat_it) = we[0] * (*img_a) +
                    we[1] * (*img_b) +
                    we[2] * (*img_c) +
                    we[3] * (*img_d);
    }
    ...
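The snippet above elides how the four pointers are set up for each row. One way this could look is sketched below; it is an assumption, not the original code. It keeps the weight order of the first listing (we[0] for (x, y), we[1] for (x-1, y), we[2] for (x, y-1), we[3] for (x-1, y-1)) and assumes a row-major image. The function name and its integer parameters are hypothetical.

```cpp
#include <cassert>

// Hypothetical helper: fills an n x n patch starting at integer pixel
// (x_begin, y_begin), walking four pointers across each image row.
// Weight order follows the question: we[0] -> (x, y),   we[1] -> (x-1, y),
//                                    we[2] -> (x, y-1), we[3] -> (x-1, y-1).
void extract_patch_ptr(float* pat, const float* img, int width, int height,
                       int x_begin, int y_begin, int n, const float we[4])
{
    // Each sample also reads the pixel to its left and the pixel above it.
    assert(x_begin >= 1 && y_begin >= 1);
    assert(x_begin + n <= width && y_begin + n <= height);

    float* pat_it = pat;
    for (int r = 0; r < n; ++r) {
        const float* img_a = img + (y_begin + r) * width + x_begin; // (x, y)
        const float* img_b = img_a - 1;                             // (x-1, y)
        const float* img_c = img_a - width;                         // (x, y-1)
        const float* img_d = img_c - 1;                             // (x-1, y-1)
        for (int c = 0; c < n;
             ++c, ++pat_it, ++img_a, ++img_b, ++img_c, ++img_d) {
            *pat_it = we[0] * (*img_a) + we[1] * (*img_b) +
                      we[2] * (*img_c) + we[3] * (*img_d);
        }
    }
}
```

Whether this beats plain indexing depends on the compiler; an optimizer may already generate the same code for both forms, so it is worth measuring rather than assuming.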

Best answer

You could try letting Eigen streamline things a bit, for example:

    void function(Eigen::VectorXf* pat,
                  const float* img,
                  const Eigen::Vector2f* cent)
    {
        ...
        for (y=cent[1]-pad; y <= cent[1]+pad; ++y)
        {
            ...
            Eigen::Map<Eigen::Array4f, 0, Eigen::OuterStride<>> mp(
                img + cent[0]-pad -1 + postmp_b, 4,
                Eigen::OuterStride<>(image_width));
            for (x=cent[0]-pad; x <= cent[0]+pad; ++x, ++pat_it)
            {
                new (&mp) Eigen::Map<Eigen::Array4f>(
                    img + x-1 + postmp_b, 4,
                    Eigen::OuterStride<>(image_width));
                (*pat_it) = (mp * we.array()).sum();
                ...

Note: you may need to rearrange we to match the new order of the img elements.

You could try to do even better: instead of creating a bunch of maps, create one big map:

    void function(Eigen::VectorXf* pat,
                  const float* img,
                  const Eigen::Vector2f* cent)
    {
        ...
        Eigen::Map<Eigen::ArrayXXf, 0, Eigen::OuterStride<>> mp(
            img, image_width, image_height,
            Eigen::OuterStride<>(image_width));
        for (y=cent[1]-pad; y <= cent[1]+pad; ++y)
        {
            ...
            for (x=cent[0]-pad; x <= cent[0]+pad; ++x, ++pat_it)
            {
                (*pat_it) = (mp.block<2,2>(x,y) * we.array()).sum();
                ...

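You may be able to do better still. Disclaimer: I have not tested any of this, so you may need to change the InnerStride and OuterStride, as well as image_width and image_height, etc.

If this helps you, I would be interested to know how much of a speedup it gives.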

Regarding "performance - C++ - Fastest bilinear interpolation?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31440257/
