c++ - Copying from a local array is faster than from an array passed as an argument in C++?

While optimizing some code, I ran into something I did not expect. I wrote a simple program to illustrate what I found:

#include <string.h>
#include <chrono>
#include <iostream>

using namespace std;

int globalArr[1024][1024];

void initArr(int arr[1024][1024])
{
    memset(arr, 0, 1024 * 1024 * sizeof(int));
}


void run()
{
    int arr[1024][1024];
    initArr(arr);
    for(int i = 0; i < 1024; ++i)
    {
        for(int j = 0; j < 1024; ++j)
        {
            globalArr[i][j] = arr[i][j];
        }

    }
}

void run2(int arr[1024][1024])
{
    initArr(arr);
    for(int i = 0; i < 1024; ++i)
    {
        for(int j = 0; j < 1024; ++j)
        {
            globalArr[i][j] = arr[i][j];
        }

    }
}

int main()
{
    {
        auto start = chrono::high_resolution_clock::now();
        for(int i = 0; i < 256; ++i)
        {
            run();
        }
        auto duration = chrono::high_resolution_clock::now() - start;
        cout << "(run) Total time: " << chrono::duration_cast<chrono::microseconds>(duration).count() << " microseconds\n";
    }

    {
        auto start = chrono::high_resolution_clock::now();
        for(int i = 0; i < 256; ++i)
        {
            int arr[1024][1024];
            run2(arr);
        }
        auto duration = chrono::high_resolution_clock::now() - start;
        cout << "(run2) Total time: " << chrono::duration_cast<chrono::microseconds>(duration).count() << " microseconds\n";        
    }

    return 0;
}

I built the code with g++ version 6.4.0 20180424 using the -O3 flag. Below are the results from running it on a Ryzen 1700.

(run) Total time: 43493 microseconds
(run2) Total time: 134740 microseconds

I tried looking at the assembly on godbolt.org (the code is split across 2 URLs):

https://godbolt.org/g/aKSHH6

https://godbolt.org/g/zfK14x

But I still don't understand what actually causes the difference.

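So my questions are: 1. What's causing the performance difference? 2. Is it possible to pass an array as an argument and get the same performance as with a local array?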

Edit: Just some extra information. Below are the results when built with -O2:

(run) Total time: 94461 microseconds
(run2) Total time: 172352 microseconds

Edit again: Following xaxxon's comment, I tried removing the initArr call from both functions. With that change, run2 actually performs better than run:

(run) Total time: 45151 microseconds
(run2) Total time: 35845 microseconds

But I still don't understand the reason.

Best answer

What's causing the performance difference?

The compiler has to generate code for run2 that still works correctly if you call

run2(globalArr);

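or (worse) pass in some address that overlaps globalArr without being identical to it.

If you allow your C++ compiler to inline the call, and it chooses to do so, it can generate inlined code that knows whether the argument really aliases your global. The out-of-line version it generates must still be conservative, though.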
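To make the overlap case concrete, here is a minimal sketch in C++. The helper names copyForward and overlappingCopy are hypothetical and do not appear in the question; the loop has the same shape as the inner loop of run2, just in one dimension. When the destination overlaps the source, each store changes a value that a later iteration reads, so the compiler cannot blindly treat the loop as a block copy.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper (not from the question): copies n ints one at a time,
// in increasing index order, like the inner loop of run2.
void copyForward(int* dst, const int* src, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
    {
        dst[i] = src[i];
    }
}

// Calls copyForward with a destination that overlaps the source,
// shifted by one element, and returns the resulting buffer.
std::array<int, 5> overlappingCopy()
{
    std::array<int, 5> buf = {1, 2, 3, 4, 5};
    // Each store to buf[i + 1] overwrites the value the next iteration reads.
    copyForward(buf.data() + 1, buf.data(), 4);
    return buf;  // {1, 1, 1, 1, 1}, not the {1, 1, 2, 3, 4} a block copy would give
}
```

Because this call is legal, any out-of-line version of the loop has to produce exactly this element-by-element result, which typically means either extra runtime overlap checks or less aggressive vectorization.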

Is it possible to pass an array as an argument and get the same performance as with a local array?

You can certainly fix the aliasing problem in C with the restrict keyword, for example:

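/* Destination global. This declaration is assumed here so the snippet
   stands alone; it is sized to match the loop bounds below. */
int globalArr1[32][256];

void run2(int (* restrict globalArr2)[256])
{
    /* restrict promises that g and globalArr2 never refer to overlapping storage. */
    int (* restrict g)[256] = globalArr1;
    for(int i = 0; i < 32; ++i)
    {
        for(int j = 0; j < 256; ++j)
        {
            g[i][j] = globalArr2[i][j];
        }
    }
}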

(or probably, in C++, with the non-standard extension __restrict).
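As a rough sketch of what that could look like in C++, using the array size from the question: the function name run2Restrict and the constant N are made up for illustration. __restrict is a compiler extension (accepted by GCC, Clang and MSVC), not standard C++, and whether it actually closes the gap on a given compiler would need to be measured.

```cpp
#include <cassert>

constexpr int N = 1024;  // same dimensions as in the question
int globalArr[N][N];

// Hypothetical variant of run2 from the question. __restrict is a
// non-standard extension; it promises the compiler that arr does not
// overlap any other object accessed through g in this function.
void run2Restrict(int (* __restrict arr)[N])
{
    int (* __restrict g)[N] = globalArr;
    for (int i = 0; i < N; ++i)
    {
        for (int j = 0; j < N; ++j)
        {
            g[i][j] = arr[i][j];
        }
    }
}
```

The promise is on the caller: calling run2Restrict(globalArr), or passing anything else that overlaps globalArr, is undefined behavior with this signature.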

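This should leave the optimizer as free as it was in your original run, unless it is smart enough to elide the local entirely and simply set the global to zero.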

Regarding "c++ - Copying from a local array is faster than from an array passed as an argument in C++?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50523123/
