c - GCC:两个相似循环之间的矢量化差异

当使用 gcc -O3 编译时，为什么以下循环没有向量化(自动):

#define SIZE (65536)

int a[SIZE], b[SIZE], c[SIZE];

int foo () {
  int i, j;

  for (i=0; i<SIZE; i++){
    for (j=i; j<SIZE; j++) {
      a[i] = b[i] > c[j] ? b[i] : c[j];
    }
  }
  return a[0];
}

什么时候下一个呢？

#define SIZE (65536)

int a[SIZE], b[SIZE], c[SIZE];

int foov () {
  int i, j;

  for (i=0; i<SIZE; i++){
    for (j=i; j<SIZE; j++) {
      a[i] += b[i] > c[j] ? b[i] : c[j];
    }
  }
  return a[0];
}

唯一的区别是内循环中表达式的结果是赋给a[i]，还是加到a[i]。

作为引用，-ftree-vectorizer-verbose=6 为第一个(非矢量化)循环提供以下输出。

v.c:8: note: not vectorized: inner-loop count not invariant.
v.c:9: note: Unknown alignment for access: c
v.c:9: note: Alignment of access forced using peeling.
v.c:9: note: not vectorized: live stmt not supported: D.2700_5 = c[j_20];

v.c:5: note: vectorized 0 loops in function.

向量化循环的相同输出是:

v.c:8: note: not vectorized: inner-loop count not invariant.
v.c:9: note: Unknown alignment for access: c
v.c:9: note: Alignment of access forced using peeling.
v.c:9: note: vect_model_load_cost: aligned.
v.c:9: note: vect_model_load_cost: inside_cost = 1, outside_cost = 0 .
v.c:9: note: vect_model_simple_cost: inside_cost = 1, outside_cost = 1 .
v.c:9: note: vect_model_reduction_cost: inside_cost = 1, outside_cost = 6 .
v.c:9: note: cost model: prologue peel iters set to vf/2.
v.c:9: note: cost model: epilogue peel iters set to vf/2 because peeling for alignment is unknown .
v.c:9: note: Cost model analysis:
  Vector inside of loop cost: 3
  Vector outside of loop cost: 27
  Scalar iteration cost: 3
  Scalar outside cost: 7
  prologue iterations: 2
  epilogue iterations: 2
  Calculated minimum iters for profitability: 8

v.c:9: note:   Profitability threshold = 7

v.c:9: note: Profitability threshold is 7 loop iterations.
v.c:9: note: LOOP VECTORIZED.
v.c:5: note: vectorized 1 loops in function.

最佳答案

第一种情况:代码在每次迭代中覆盖相同的内存位置a[i]。由于循环迭代不是独立的，因此这从本质上对循环进行了顺序化。
(实际上，实际上只需要最后一次迭代。所以可以去掉整个内循环。)

在第二种情况下:GCC 将循环识别为缩减操作 - 为此它有特殊情况处理以进行矢量化。

编译器矢量化通常作为某种“模式匹配”来实现。这意味着编译器会分析代码以查看它是否符合能够向量化的特定模式。如果是这样，它将被矢量化。如果没有，那就没有。

这似乎是一个极端情况，其中第一个循环不适合 GCC 可以处理的任何预编码模式。但第二种情况符合“可向量化归约”模式。

这是 GCC 源代码的相关部分，其中吐出了 “未矢量化:不支持实时 stmt:” 消息:

http://svn.open64.net/svnroot/open64/trunk/osprey-gcc-4.2.0/gcc/tree-vect-analyze.c

if (STMT_VINFO_LIVE_P (stmt_info))
{
    ok = vectorizable_reduction (stmt, NULL, NULL);

    if (ok)
        need_to_vectorize = true;
    else
        ok = vectorizable_live_operation (stmt, NULL, NULL);

    if (!ok)
    {
        if (vect_print_dump_info (REPORT_UNVECTORIZED_LOOPS))
        {
            fprintf (vect_dump, 
                "not vectorized: live stmt not supported: ");
            print_generic_expr (vect_dump, stmt, TDF_SLIM);
        }
        return false;
    }
}

仅从行:

vectorizable_reduction (stmt, NULL, NULL);

很明显，GCC 正在检查它是否匹配“可向量化约简”模式。

关于c - GCC:两个相似循环之间的矢量化差异，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/12096599/

c - GCC:两个相似循环之间的矢量化差异

上一篇：c - Unix 域套接字 : Using datagram communication between one server process and several client processes

下一篇：c - 如何在 C 中播放 MP3 文件？