c++ - Vector sums with threads and no speedup

Tags: c++ multithreading gcc c++11 blas

I have a C++ program that essentially performs some matrix computations. For these I use LAPACK/BLAS, linking against MKL or ACML depending on the platform. Many of these matrix computations operate on different, independent matrices, so I use std::thread to run the operations in parallel. However, I noticed that I get no speedup when using more threads. I traced the problem down to the daxpy BLAS routine: it seems that if two threads use this routine in parallel, each thread takes twice as long, even though the two threads operate on different arrays.

The next thing I tried was writing a new, simple routine that performs vector additions, to replace daxpy. With one thread this new routine is as fast as the BLAS routine, but when compiled with gcc it suffers from the same problem as the BLAS routine: doubling the number of threads running in parallel also doubles the time each thread needs, so there is no speedup. With the Intel C++ compiler, however, the problem vanishes: the time a single thread needs stays constant as the number of threads grows.

However, I also need to compile on systems where no Intel compiler is available. So my questions are: why is there no speedup with gcc, and is there any possibility of improving the gcc performance?

I wrote a small program to demonstrate the effect:

// $(CC) -std=c++11 -O2 threadmatrixsum.cpp -o threadmatrixsum -pthread

#include <iostream>
#include <thread>
#include <vector>

#include "boost/date_time/posix_time/posix_time.hpp"
#include "boost/timer.hpp"

void simplesum(double* a, double* b, std::size_t dim);

int main() {

    for (std::size_t num_threads {1}; num_threads <= 4; num_threads++) {
        const std::size_t N { 936 };

        std::vector <std::size_t> times(num_threads, 0);    

        auto threadfunction = [&](std::size_t tid)
        {
            const std::size_t dim { N * N };
            double* pA = new double[dim];
            double* pB = new double[dim];

            // initialize all dim elements, not just the first N
            for (std::size_t i {0}; i < dim; ++i){
                pA[i] = i;
                pB[i] = 2*i;
            }

            boost::posix_time::ptime now1 = 
                boost::posix_time::microsec_clock::universal_time();    

            for (std::size_t n{0}; n < 1000; ++n){
                simplesum(pA, pB, dim);
            }

            boost::posix_time::ptime now2 = 
                boost::posix_time::microsec_clock::universal_time(); 
            boost::posix_time::time_duration dur = now2 - now1; 
            times[tid] += dur.total_milliseconds(); 
            delete[] pA;
            delete[] pB;
        };

        std::vector <std::thread> mythreads;

        // start threads
        for (std::size_t n {0} ; n < num_threads; ++n)
        {
            mythreads.emplace_back(threadfunction, n);
        }

        // wait for threads to finish
        for (std::size_t n {0} ; n < num_threads; ++n)
        {
            mythreads[n].join();
            std::cout << " Thread " << n+1 << " of " << num_threads 
                      << "  took " << times[n] << " msec" << std::endl;
        }
    }
}

void simplesum(double* a, double* b, std::size_t dim){

    // post-increment: touch a[0]..a[dim-1]; pre-increment would skip
    // element 0 and write one past the end of the array
    for(std::size_t i{0}; i < dim; ++i)
    {*(a++) += *(b++);}
}

Output with gcc:

Thread 1 of 1  took 532msec
Thread 1 of 2  took 1104msec
Thread 2 of 2  took 1103msec
Thread 1 of 3  took 1680msec
Thread 2 of 3  took 1821msec
Thread 3 of 3  took 1808msec
Thread 1 of 4  took 2542msec
Thread 2 of 4  took 2536msec
Thread 3 of 4  took 2509msec
Thread 4 of 4  took 2515msec

Output with icc:

Thread 1 of 1  took 663msec
Thread 1 of 2  took 674msec
Thread 2 of 2  took 674msec
Thread 1 of 3  took 681msec
Thread 2 of 3  took 681msec
Thread 3 of 3  took 681msec
Thread 1 of 4  took 688msec
Thread 2 of 4  took 689msec
Thread 3 of 4  took 687msec
Thread 4 of 4  took 688msec

So with icc, the time a single thread needs for the computations is constant (as I would have expected; my CPU has 4 physical cores), while with gcc the time per thread increases. Replacing the simplesum routine with BLAS::daxpy yields the same results for icc and gcc (unsurprisingly, since most of the time is spent in the library), which are almost identical to the gcc results shown above.

Best Answer

The answer is fairly simple: your threads are fighting for memory bandwidth!

Consider that you perform one floating-point addition per 2 stores (one initialization, one after the addition) and 2 reads (in the addition), so the arithmetic intensity of this kernel is very low. On top of that, most modern systems that provide multiple CPUs actually have to share the memory controller among several cores.

The following was run on a system with 2 physical CPU sockets and 12 cores (24 with HT). Your original code exhibits exactly your problem:

Thread 1 of 1  took 657msec
Thread 1 of 2  took 1447msec
Thread 2 of 2  took 1463msec
[...]
Thread 1 of 8  took 5516msec
Thread 2 of 8  took 5587msec
Thread 3 of 8  took 5205msec
Thread 4 of 8  took 5311msec
Thread 5 of 8  took 2731msec
Thread 6 of 8  took 5545msec
Thread 7 of 8  took 5551msec
Thread 8 of 8  took 4903msec

However, simply increasing the arithmetic density brings a significant improvement in scalability. To demonstrate, I changed your addition routine to also perform an exponentiation: *(a++) += std::exp(*(b++));. The result shows almost perfect scaling:

Thread 1 of 1  took 7671msec
Thread 1 of 2  took 7759msec
Thread 2 of 2  took 7759msec
[...]
Thread 1 of 8  took 9997msec
Thread 2 of 8  took 8135msec
Thread 3 of 8  took 10625msec
Thread 4 of 8  took 8169msec
Thread 5 of 8  took 10054msec
Thread 6 of 8  took 8242msec
Thread 7 of 8  took 9876msec
Thread 8 of 8  took 8819msec

But what about ICC?

First of all, ICC inlines simplesum. Proving that the inlining happens is simple: using icc, I disabled multi-file interprocedural optimization and moved simplesum into a translation unit of its own. The difference is astonishing. The performance went from

Thread 1 of 1  took 687msec
Thread 1 of 2  took 688msec
Thread 2 of 2  took 689msec
[...]
Thread 1 of 8  took 690msec
Thread 2 of 8  took 697msec
Thread 3 of 8  took 700msec
Thread 4 of 8  took 874msec
Thread 5 of 8  took 878msec
Thread 6 of 8  took 874msec
Thread 7 of 8  took 742msec
Thread 8 of 8  took 868msec

to

Thread 1 of 1  took 1278msec
Thread 1 of 2  took 2457msec
Thread 2 of 2  took 2445msec
[...]
Thread 1 of 8  took 8868msec
Thread 2 of 8  took 8434msec
Thread 3 of 8  took 7964msec
Thread 4 of 8  took 7951msec
Thread 5 of 8  took 8872msec
Thread 6 of 8  took 8286msec
Thread 7 of 8  took 5714msec
Thread 8 of 8  took 8241msec

This already explains why the library performs badly: ICC cannot inline it, and hence whatever else causes ICC to outperform g++ simply does not happen.

It also gives a hint as to what ICC might be doing here... What if, instead of executing simplesum 1000 times, it interchanges the loops so that it

  1. loads two doubles
  2. adds them 1000 times (or even computes a = a + 1000 * b)
  3. stores two doubles

This would increase the arithmetic density without adding any exponentials to the function... How can we prove this? Well, let us simply implement this optimization ourselves and see what happens! For the analysis we will look at the g++ performance. Recall our benchmark results:

Thread 1 of 1  took 640msec
Thread 1 of 2  took 1308msec
Thread 2 of 2  took 1304msec
[...]
Thread 1 of 8  took 5294msec
Thread 2 of 8  took 5370msec
Thread 3 of 8  took 5451msec
Thread 4 of 8  took 5527msec
Thread 5 of 8  took 5174msec
Thread 6 of 8  took 5464msec
Thread 7 of 8  took 4640msec
Thread 8 of 8  took 4055msec

Now, let us exchange

for (std::size_t n{0}; n < 1000; ++n){
    simplesum(pA, pB, dim);
}

for a version in which the inner loop is made the outer loop:

double* a = pA; double* b = pB;
for(std::size_t i{0}; i < dim; ++i, ++a, ++b)
{
    double x = *a, y = *b;
    for (std::size_t n{0}; n < 1000; ++n)
    {
        x += y;
    }
    *a = x;
}

The result shows that we are on the right track:

Thread 1 of 1  took 693msec
Thread 1 of 2  took 703msec
Thread 2 of 2  took 700msec
[...]
Thread 1 of 8  took 920msec
Thread 2 of 8  took 804msec
Thread 3 of 8  took 750msec
Thread 4 of 8  took 943msec
Thread 5 of 8  took 909msec
Thread 6 of 8  took 744msec
Thread 7 of 8  took 759msec
Thread 8 of 8  took 904msec

This proves that the loop interchange is indeed the main source of the superior performance ICC demonstrated here.

Note that none of the tested compilers (MSVC, ICC, g++ and clang) replaced the loop with a multiplication, which would improve performance by a factor of 200 in the single-threaded case and 15 in the 8-thread case. This is because the numerical instability of repeated addition may cause wildly different results when it is replaced by a single multiplication. When testing with integer data types instead of floating-point types, this optimization does happen.

How can we force g++ to perform this optimization?

Interestingly enough, the true killer for g++ is not an inability to perform loop interchange. When called with -floop-interchange, g++ can perform this optimization as well. But only when the odds are stacked significantly in its favor:

  1. All bounds must be expressed as ints, not as std::size_t. Not long, not unsigned int, but int. I still find it hard to believe, but it seems to be a hard requirement.

  2. Indexing instead of incrementing pointers: a[i] += b[i];

  3. g++ needs to be told -floop-interchange. A simple -O3 is not enough.

When all three criteria are met, the g++ performance is similar to what ICC delivered:

Thread 1 of 1  took 714msec
Thread 1 of 2  took 724msec
Thread 2 of 2  took 721msec
[...]
Thread 1 of 8  took 782msec
Thread 2 of 8  took 1221msec
Thread 3 of 8  took 1225msec
Thread 4 of 8  took 781msec
Thread 5 of 8  took 788msec
Thread 6 of 8  took 1262msec
Thread 7 of 8  took 1226msec
Thread 8 of 8  took 820msec

Note: the g++ version used for this experiment was 4.9.0 on x64 Arch Linux.

Regarding "c++ - Vector sums with threads and no speedup", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/23550907/
