C++ multithreading runtime question

Tags: c++ multithreading

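I have been studying C++ multithreading and have a question about it.

Here is my understanding of multithreading. One of the reasons we use it is to reduce run time, right? For example, I think if we use two threads we can expect half of the execution time. So I tried to write code to prove it. Here is the code.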

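#include <algorithm>   // for_each
#include <cstdlib>     // rand
#include <ctime>       // clock, CLOCKS_PER_SEC
#include <functional>  // mem_fn
#include <future>
#include <iostream>
#include <thread>
#include <vector>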

using namespace std;
#define iterationNumber 1000000

void myFunction(const int index, const int numberInThread, promise<unsigned long>&& p, const vector<int>& numberList) { 
    clock_t begin,end;
    int firstIndex = index * numberInThread;
    int lastIndex = firstIndex + numberInThread;
    vector<int>::const_iterator first = numberList.cbegin() + firstIndex;
    vector<int>::const_iterator last = numberList.cbegin() + lastIndex;

    vector<int> numbers(first,last);

    unsigned long result = 0;

    begin = clock();
    for(int i = 0 ; i < numbers.size(); i++) {
        result += numbers.at(i);
    }
    end = clock();
    cout << "thread" << index << " took " << ((float)(end-begin))/CLOCKS_PER_SEC << endl;

    p.set_value(result);

}


int main(void)
{
    vector<int> numberList;
    vector<thread> t;
    vector<future<unsigned long>> futures;
    vector<unsigned long> result;
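    // hardware_concurrency() may return 0; fall back to 2 threads
    // (written in standard C++ instead of the GNU "?:" extension)
    const unsigned int hardwareThreads = thread::hardware_concurrency();
    const int NumberOfThreads = hardwareThreads ? static_cast<int>(hardwareThreads) : 2;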
    int numberInThread = iterationNumber / NumberOfThreads;

    clock_t begin,end;


    for(int i = 0 ; i < iterationNumber ; i++) {
        int randomN =  rand() % 10000 + 1;
        numberList.push_back(randomN);
    }

    for(int j = 0 ; j < NumberOfThreads; j++){
        promise<unsigned long> promises;
        futures.push_back(promises.get_future());
        t.push_back(thread(myFunction, j, numberInThread, std::move(promises), numberList));
    }

    for_each(t.begin(), t.end(), std::mem_fn(&std::thread::join));

    for (int i = 0; i < futures.size(); i++) {
        result.push_back(futures.at(i).get());
    }

    unsigned long RRR = 0;

    begin = clock();
    for(int i = 0 ; i < numberList.size(); i++) {
        RRR += numberList.at(i);
    }
    end = clock();
    cout << "not by thread took " << ((float)(end-begin))/CLOCKS_PER_SEC << endl;

}

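Since the hardware concurrency of my laptop is 4, the program creates 4 threads, and each thread takes a quarter of numberList and sums its numbers.

However, the result was not what I expected.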

thread0 took 0.007232
thread1 took 0.007402
thread2 took 0.010035
thread3 took 0.011759
not by thread took 0.009654

Why? Why did the threads take more time than the serial version (the "not by thread" line)?

Best answer

For example, I think if we use two threads we can expect half of the execution time.

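You would think so, but sadly that is often not the case. The ideal "N cores means 1/N of the execution time" scenario occurs only when the N cores can execute completely in parallel, without any core's actions interfering with the performance of the others.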

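But all your threads are doing is summing different sub-sections of an array... surely that can benefit from parallel execution? The answer is that in principle it can, but on a modern CPU simple addition is so blindingly fast that it isn't really a factor in how long the loop takes to complete. What really limits the execution speed of the loop is access to RAM. Compared to the speed of the CPU, RAM access is very slow, and on most desktop computers each CPU has only one connection to RAM, regardless of how many cores it has. That means that what your program really measures is the speed at which a big array of integers can be read from RAM into the CPU, and that speed is roughly the same (equal to the CPU's memory-bus bandwidth) whether one core is doing the reading or four.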
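
To see how this plays out on a particular machine, the following is a minimal measurement sketch, not code from the original post. The helper names (`sumRange`, `parallelSum`, `elapsedMicros`, `compareSums`) are hypothetical. The sketch passes the array by reference so that no copies are made, and it times with `std::chrono::steady_clock`, i.e. wall-clock time. How close the multithreaded sum gets to a 1/N running time will depend on the array size, the cache sizes, and the memory bandwidth of the machine it runs on.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

// Sums data[first, last) into out. Hypothetical helper, not from the original post.
void sumRange(const std::vector<int>& data, std::size_t first, std::size_t last, std::uint64_t& out) {
    out = std::accumulate(data.begin() + first, data.begin() + last, std::uint64_t{0});
}

// Splits data into threadCount contiguous chunks, sums them on separate threads,
// and returns the total. The last chunk absorbs any remainder.
std::uint64_t parallelSum(const std::vector<int>& data, unsigned threadCount) {
    if (threadCount == 0) threadCount = 1;  // e.g. when hardware_concurrency() reports 0
    std::vector<std::uint64_t> partial(threadCount, 0);
    std::vector<std::thread> workers;
    const std::size_t chunk = data.size() / threadCount;
    for (unsigned t = 0; t < threadCount; ++t) {
        const std::size_t first = t * chunk;
        const std::size_t last = (t + 1 == threadCount) ? data.size() : first + chunk;
        // std::cref/std::ref avoid the argument copies that std::thread makes by default.
        workers.emplace_back(sumRange, std::cref(data), first, last, std::ref(partial[t]));
    }
    for (auto& w : workers) w.join();
    return std::accumulate(partial.begin(), partial.end(), std::uint64_t{0});
}

// Runs fn once and returns the elapsed wall-clock time in microseconds.
template <typename Fn>
long long elapsedMicros(Fn&& fn) {
    const auto begin = std::chrono::steady_clock::now();
    fn();
    const auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count();
}

// Prints single- vs multi-threaded timings for an array of elementCount ints
// and returns true if both approaches produced the same sum.
bool compareSums(std::size_t elementCount, unsigned threadCount) {
    std::vector<int> data(elementCount);
    for (std::size_t i = 0; i < data.size(); ++i) data[i] = static_cast<int>(i % 10000) + 1;

    std::uint64_t serial = 0, parallel = 0;
    const long long serialMicros = elapsedMicros([&] { sumRange(data, 0, data.size(), serial); });
    const long long parallelMicros = elapsedMicros([&] { parallel = parallelSum(data, threadCount); });

    std::cout << "single-threaded: " << serialMicros << " microseconds\n"
              << threadCount << " threads: " << parallelMicros << " microseconds\n";
    return serial == parallel;
}
```

For example, calling `compareSums(std::size_t{1} << 26, 4)` from `main` sums about 67 million ints (256 MB, assuming a 4-byte `int`) once on one thread and once on four, and prints both timings. Compile with optimizations and thread support enabled, for example `g++ -std=c++11 -O3 -pthread`.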

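To demonstrate how much of a factor RAM access is, below is a modified and simplified version of your test program. In this version I've removed the big vectors, and the computation is instead just a series of calls to the (relatively expensive) sin() function. Note that the loop now accesses only a few memory locations rather than thousands, so a core running the computation loop does not have to wait periodically for more data to be copied from RAM into its local cache: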

#include <vector>
#include <iostream>
#include <thread>
#include <chrono>
#include <math.h>

using namespace std;

static int iterationNumber = 1000000;

unsigned long long threadElapsedTimeMicros[10];
unsigned long threadResults[10];

void myFunction(const int index, const int numberInThread)
{
   unsigned long result = 666;

   std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
   for(int i=0; i<numberInThread; i++) result += 100*sin(result);
   std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();

   threadResults[index] = result;
   threadElapsedTimeMicros[index] = std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count();

   // We'll print out the value of threadElapsedTimeMicros[index] later on,
   // after all the threads have been join()'d.
   // If we printed it out now it might affect the timing of the other threads
   // that may still be executing
}

int main(void)
{
    vector<thread> t;
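    // hardware_concurrency() may return 0, and the result arrays above hold only 10 entries,
    // so clamp the thread count to the range [1, 10].
    const unsigned int hardwareThreads = thread::hardware_concurrency();
    const int NumberOfThreads = (hardwareThreads == 0) ? 1 : (hardwareThreads > 10 ? 10 : (int)hardwareThreads);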
    const int numberInThread  = iterationNumber / NumberOfThreads;

    // Multithreaded approach
    std::chrono::steady_clock::time_point allBegin = std::chrono::steady_clock::now();
    for(int j = 0 ; j < NumberOfThreads; j++) t.push_back(thread(myFunction, j, numberInThread));
    for(int j = 0 ; j < NumberOfThreads; j++) t[j].join();
    std::chrono::steady_clock::time_point allEnd = std::chrono::steady_clock::now();

    for(int j = 0 ; j < NumberOfThreads; j++) cout << " The computations in thread #" << j << ": result=" << threadResults[j] << ", took " << threadElapsedTimeMicros[j] << " microseconds" << std::endl;
    cout << " Total time spent doing multithreaded computations was " << std::chrono::duration_cast<std::chrono::microseconds>(allEnd - allBegin).count() << " microseconds in total" << std::endl;

    // And now, the single-threaded approach, for comparison
    unsigned long result = 666;
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    for(int i = 0 ; i < iterationNumber; i++) result += 100*sin(result);
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();

    cout << "result=" << result << ", single-threaded computation took " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " microseconds" << std::endl;
    return 0;
}

When I run the above program on my dual-core Mac mini (i7 with hyperthreading), here are the results I get:

Jeremys-Mac-mini:~ lcsuser1$ g++ -std=c++11 -O3 ./temp.cpp
Jeremys-Mac-mini:~ lcsuser1$ ./a.out
 The computations in thread #0: result=1062, took 11718 microseconds
 The computations in thread #1: result=1062, took 11481 microseconds
 The computations in thread #2: result=1062, took 11525 microseconds
 The computations in thread #3: result=1062, took 11230 microseconds
 Total time spent doing multithreaded computations was 16492 microseconds in total
result=1181, single-threaded computation took 49846 microseconds

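So in this case the results are more like what you expected. Because memory access is not a bottleneck, each core could run at full speed and finish its 25% of the total calculations in about 25% of the time the single thread took to complete 100% of them. And since the four threads ran truly in parallel, the total time spent on the calculations was about 33% of the time the single-threaded routine took (ideally it would be 25%, but starting up and shutting down the threads involves some overhead).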
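
As a quick check of those percentages against the output above, using thread #0 as the per-thread example (the labels "speedup" and "parallel efficiency" are shorthand added here, not terms from the original answer):

```latex
\begin{align*}
\text{per-thread share} &= \frac{11718}{49846} \approx 0.235 \\
\text{total multithreaded share} &= \frac{16492}{49846} \approx 0.331 \\
\text{speedup} &= \frac{49846}{16492} \approx 3.02 \\
\text{parallel efficiency} &= \frac{3.02}{4} \approx 0.76
\end{align*}
```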

Source: this question and answer come from Stack Overflow: https://stackoverflow.com/questions/37266845/
