I have a vectorization problem.

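I have an array of structures, pDst, where each structure has 3 fields named 'red', 'green' and 'blue'.
The field type may be 'Char', 'Short' or 'Float'. This is given and cannot be changed.
There is also another array, pSrc, that represents an [RGB] image: an array of 3 pointers, each pointing to one plane of the image.
Each plane is built as an IPP planar image, that is, each plane is allocated independently with 'ippiMalloc_32f_C1': http://software.intel.com/sites/products/documentation/hpc/ipp/ippi/ippi_ch3/functn_Malloc.html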

We want to copy the planes into the structure array as described in the following code:

for(int y = 0; y < imageHeight; ++y)
{
    for(int x = 0; x < imageWidth; ++x)
    {
        pDst[x + y * pDstRowStep].red     = pSrc[0][x + y * pSrcRowStep];
        pDst[x + y * pDstRowStep].green   = pSrc[1][x + y * pSrcRowStep];
        pDst[x + y * pDstRowStep].blue    = pSrc[2][x + y * pSrcRowStep];
    }
}

However, in this form the compiler cannot vectorize the code.
At first it says:

"loop was not vectorized: existence of vector dependence.".

When I use #pragma ivdep to help the compiler (since there is no dependence), I get the following error:

"loop was not vectorized: dereference too complex.".

Does anyone know how to enable vectorization here?
I am using Intel Compiler 13.0.
Thanks.

Update:

If I edit the code as follows:

Ipp32f *redChannel      = pSrc[0];
Ipp32f *greenChannel  = pSrc[1];
Ipp32f *blueChannel     = pSrc[2];
for(int y = 0; y < imageHeight; ++y)
{
    #pragma ivdep
    for(int x = 0; x < imageWidth; ++x)
    {
        pDst[x + y * pDstRowStep].red     = redChannel[x + y * pSrcRowStep];
        pDst[x + y * pDstRowStep].green   = greenChannel[x + y * pSrcRowStep];
        pDst[x + y * pDstRowStep].blue    = blueChannel[x + y * pSrcRowStep];
    }
}

For the output types 'char' and 'short' I get vectorization.
However, for the type 'float' I do not.
Instead, I get the following message:

loop was not vectorized: vectorization possible but seems inefficient.

How is that possible?

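Best answer

In the following code, pragma ivdep does make the compiler ignore the assumed vector dependence, but the compiler's heuristics/cost analysis then concludes that vectorizing the loop is not efficient: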

Ipp32f *redChannel      = pSrc[0];
Ipp32f *greenChannel  = pSrc[1];
Ipp32f *blueChannel     = pSrc[2];
for(int y = 0; y < imageHeight; ++y)
{
    #pragma ivdep
    for(int x = 0; x < imageWidth; ++x)
    {
        pDst[x + y * pDstRowStep].red     = redChannel[x + y * pSrcRowStep];
        pDst[x + y * pDstRowStep].green   = greenChannel[x + y * pSrcRowStep];
        pDst[x + y * pDstRowStep].blue    = blueChannel[x + y * pSrcRowStep];
    }
}

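Vectorization would be inefficient because the operation copies contiguous blocks of memory from the source into non-contiguous memory locations in the destination, so the stores become a scatter. If you still want to force vectorization and see whether it gives any performance improvement over the non-vectorized version, use pragma simd instead of pragma ivdep, as shown below: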

#include<ipp.h>
struct Dest{
float red;
float green;
float blue;
};
void foo(Dest *pDst, Ipp32f **pSrc, int imageHeight, int imageWidth, int pSrcRowStep, int pDstRowStep){
    Ipp32f *redChannel      = pSrc[0];
    Ipp32f *greenChannel  = pSrc[1];
    Ipp32f *blueChannel     = pSrc[2];
    for(int y = 0; y < imageHeight; ++y)
    {
        #pragma simd
        for(int x = 0; x < imageWidth; ++x)
        {
            pDst[x + y * pDstRowStep].red     = redChannel[x + y * pSrcRowStep];
            pDst[x + y * pDstRowStep].green   = greenChannel[x + y * pSrcRowStep];
            pDst[x + y * pDstRowStep].blue    = blueChannel[x + y * pSrcRowStep];
        }
    }
    return;
}

The corresponding vectorization report is:

$ icpc -c test.cc -vec-report2
test.cc(14): (col. 9) remark: SIMD LOOP WAS VECTORIZED
test.cc(11): (col. 5) remark: loop was not vectorized: not inner loop
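
To make the scatter mentioned above concrete, here is a minimal sketch of a single row copy. The names `DestF`, `interleavedIndex` and `copyRow` are hypothetical and are not part of the original code or of IPP. It assumes the float structure has no padding. Each source plane is read with stride 1, but each channel lands at every third float of the destination.

```cpp
#include <cstddef>

// Hypothetical stand-in for the float variant of the destination structure.
struct DestF {
    float red;
    float green;
    float blue;
};

// Assumption: no padding, so an array of DestF is laid out as R G B R G B ...
static_assert(sizeof(DestF) == 3 * sizeof(float), "DestF is expected to be packed");

// Position, in units of float, of channel c (0 = red, 1 = green, 2 = blue)
// of pixel i when the DestF array is viewed as a flat float array.
inline std::size_t interleavedIndex(std::size_t i, std::size_t c) {
    return 3 * i + c;
}

// Copy one row of n pixels from three planar channels into interleaved structs.
// Reads are contiguous per plane; writes per channel have a stride of 3 floats,
// which is the non-contiguous store pattern described in the answer.
inline void copyRow(DestF* dst, const float* r, const float* g, const float* b,
                    std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i].red   = r[i];
        dst[i].green = g[i];
        dst[i].blue  = b[i];
    }
}
```

For example, the red values of pixels 0, 1, 2 are written to float positions 0, 3, 6, so one vector of consecutive red inputs cannot be stored with a single contiguous store.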

More documentation on pragma simd is available at https://software.intel.com/en-us/node/514582.
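
Since pragma simd is specific to the Intel compiler, a portable variant may be useful for comparison. The sketch below is not from the original answer: it swaps in OpenMP 4.0's `#pragma omp simd` as a standardized analogue and hoists the per-row base pointers out of the inner loop so the indexing is simpler. The names `Pixel` and `planarToInterleaved` are hypothetical. Row steps are counted in elements, as in the question's indexing; the IPP allocation functions report the step in bytes, so a conversion would be needed first. Whether the vectorized loop is actually faster still has to be measured.

```cpp
#include <cstddef>

// Hypothetical float pixel type with the same three fields as in the question.
struct Pixel {
    float red;
    float green;
    float blue;
};

// Copy a width x height planar RGB image (three separate planes) into an
// interleaved array of Pixel. Both row steps are in elements, not bytes.
inline void planarToInterleaved(Pixel* dst, std::ptrdiff_t dstRowStep,
                                const float* const src[3], std::ptrdiff_t srcRowStep,
                                int width, int height) {
    for (int y = 0; y < height; ++y) {
        // Hoist the row base pointers so the inner loop only indexes by x.
        Pixel* d = dst + y * dstRowStep;
        const float* r = src[0] + y * srcRowStep;
        const float* g = src[1] + y * srcRowStep;
        const float* b = src[2] + y * srcRowStep;

        // Only active when the code is compiled with OpenMP enabled
        // (for example -fopenmp on GCC or Clang); otherwise a plain loop.
#if defined(_OPENMP)
#pragma omp simd
#endif
        for (int x = 0; x < width; ++x) {
            d[x].red   = r[x];
            d[x].green = g[x];
            d[x].blue  = b[x];
        }
    }
}
```

The pragma asserts that the iterations are independent, so it is only valid if the destination does not overlap any of the source planes.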

A similar question on loop vectorization in C++ can be found on Stack Overflow: https://stackoverflow.com/questions/12707549/
