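C++ multithreading run-time question

I've been studying C++ multithreading and have a question about it.

Here is my understanding of multithreading. One of the reasons we use multiple threads is to reduce run time, right? For example, I think that if we use two threads, we can expect the execution time to be halved. So I tried to write code to prove it. Here is the code.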

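#include <algorithm>   // for_each
#include <cstdlib>     // rand
#include <ctime>       // clock, CLOCKS_PER_SEC
#include <functional>  // mem_fn
#include <vector>
#include <iostream>
#include <thread>
#include <future>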

using namespace std;
#define iterationNumber 1000000

void myFunction(const int index, const int numberInThread, promise<unsigned long>&& p, const vector<int>& numberList) { 
    clock_t begin,end;
    int firstIndex = index * numberInThread;
    int lastIndex = firstIndex + numberInThread;
    vector<int>::const_iterator first = numberList.cbegin() + firstIndex;
    vector<int>::const_iterator last = numberList.cbegin() + lastIndex;

    vector<int> numbers(first,last);

    unsigned long result = 0;

    begin = clock();
    for(int i = 0 ; i < numbers.size(); i++) {
        result += numbers.at(i);
    }
    end = clock();
    cout << "thread" << index << " took " << ((float)(end-begin))/CLOCKS_PER_SEC << endl;

    p.set_value(result);

}


int main(void)
{
    vector<int> numberList;
    vector<thread> t;
    vector<future<unsigned long>> futures;
    vector<unsigned long> result;
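    // Portable spelling of the GNU "?:" extension: fall back to 2 if the count is unknown (0)
    const int NumberOfThreads = thread::hardware_concurrency() ? thread::hardware_concurrency() : 2;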
    int numberInThread = iterationNumber / NumberOfThreads;

    clock_t begin,end;


    for(int i = 0 ; i < iterationNumber ; i++) {
        int randomN =  rand() % 10000 + 1;
        numberList.push_back(randomN);
    }

    for(int j = 0 ; j < NumberOfThreads; j++){
        promise<unsigned long> promises;
        futures.push_back(promises.get_future());
        t.push_back(thread(myFunction, j, numberInThread, std::move(promises), numberList));
    }

    for_each(t.begin(), t.end(), std::mem_fn(&std::thread::join));

    for (int i = 0; i < futures.size(); i++) {
        result.push_back(futures.at(i).get());
    }

    unsigned long RRR = 0;

    begin = clock();
    for(int i = 0 ; i < numberList.size(); i++) {
        RRR += numberList.at(i);
    }
    end = clock();
    cout << "not by thread took " << ((float)(end-begin))/CLOCKS_PER_SEC << endl;

}

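Because my laptop's hardware concurrency is 4, the program creates 4 threads; each thread takes a quarter of numberList and sums its numbers.

However, the result is not what I expected.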

thread0 took 0.007232
thread1 took 0.007402
thread2 took 0.010035
thread3 took 0.011759
not by thread took 0.009654

Why? Why does it take more time than the serial version (the "not by thread" line)?

Best answer

For example, I think if we use two threads we can expect half of the execution time.

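You might think so, but sadly, that is often not the case. The ideal "N cores means 1/N of the execution time" scenario occurs only when the N cores can run completely in parallel, with no core's work interfering with the performance of the others.

But all your threads are doing is summing different sub-sections of an array... surely that can benefit from parallel execution? The answer is that in principle it can, but on a modern CPU, simple addition is so blindingly fast that it isn't really what determines how long the loop takes to complete. What really limits the loop's speed is access to RAM. Compared to the speed of the CPU, RAM access is very slow, and on most desktop computers each CPU has only one connection to RAM, regardless of how many cores it has. That means what you are really measuring in your program is the speed at which a large array of integers can be read from RAM into the CPU, and that speed is roughly the same (equal to the CPU's memory-bus bandwidth) whether one core is doing the reads or four.

To demonstrate how much of a factor RAM access is, below is a modified/simplified version of your test program. In this version I've removed the big vector, and the computation is instead just a series of calls to the (relatively expensive) sin() function. Note that in this version the loop only accesses a few memory locations rather than thousands, so the core running the computation loop doesn't have to periodically wait for more data to be copied from RAM into its local cache: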

#include <vector>
#include <iostream>
#include <thread>
#include <chrono>
#include <math.h>

using namespace std;

static int iterationNumber = 1000000;

unsigned long long threadElapsedTimeMicros[10];
unsigned long threadResults[10];

void myFunction(const int index, const int numberInThread)
{
   unsigned long result = 666;

   std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
   for(int i=0; i<numberInThread; i++) result += 100*sin(result);
   std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();

   threadResults[index] = result;
   threadElapsedTimeMicros[index] = std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count();

   // We'll print out the value of threadElapsedTimeMicros[index] later on,
   // after all the threads have been join()'d.
   // If we printed it out now it might affect the timing of the other threads
   // that may still be executing
}

int main(void)
{
    vector<thread> t;
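    // hardware_concurrency() may return 0 when the count is unknown, and the
    // global result arrays above hold at most 10 entries, so clamp to [1, 10]
    const unsigned hw = thread::hardware_concurrency();
    const int NumberOfThreads = (hw == 0) ? 2 : (hw > 10 ? 10 : (int)hw);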
    const int numberInThread  = iterationNumber / NumberOfThreads;

    // Multithreaded approach
    std::chrono::steady_clock::time_point allBegin = std::chrono::steady_clock::now();
    for(int j = 0 ; j < NumberOfThreads; j++) t.push_back(thread(myFunction, j, numberInThread));
    for(int j = 0 ; j < NumberOfThreads; j++) t[j].join();
    std::chrono::steady_clock::time_point allEnd = std::chrono::steady_clock::now();

    for(int j = 0 ; j < NumberOfThreads; j++) cout << " The computations in thread #" << j << ": result=" << threadResults[j] << ", took " << threadElapsedTimeMicros[j] << " microseconds" << std::endl;
    cout << " Total time spent doing multithreaded computations was " << std::chrono::duration_cast<std::chrono::microseconds>(allEnd - allBegin).count() << " microseconds in total" << std::endl;

    // And now, the single-threaded approach, for comparison
    unsigned long result = 666;
    std::chrono::steady_clock::time_point begin = std::chrono::steady_clock::now();
    for(int i = 0 ; i < iterationNumber; i++) result += 100*sin(result);
    std::chrono::steady_clock::time_point end = std::chrono::steady_clock::now();

    cout << "result=" << result << ", single-threaded computation took " << std::chrono::duration_cast<std::chrono::microseconds>(end - begin).count() << " microseconds" << std::endl;
    return 0;
}

When I run the above program on my dual-core Mac mini (an i7 with hyperthreading), I get the following results:

Jeremys-Mac-mini:~ lcsuser1$ g++ -std=c++11 -O3 ./temp.cpp
Jeremys-Mac-mini:~ lcsuser1$ ./a.out
 The computations in thread #0: result=1062, took 11718 microseconds
 The computations in thread #1: result=1062, took 11481 microseconds
 The computations in thread #2: result=1062, took 11525 microseconds
 The computations in thread #3: result=1062, took 11230 microseconds
 Total time spent doing multithreaded computations was 16492 microseconds in total
result=1181, single-threaded computation took 49846 microseconds

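So in this case the results are more like what you'd expect: because memory access was not a bottleneck, each core was able to run at full speed and complete its 25% share of the total calculations in about 25% of the time it took a single thread to complete 100% of them. And since the four cores were truly running in parallel, the total time spent on the calculations was about 33% of the time the single-threaded routine took (ideally it would be 25%, but there is some overhead involved in starting up and shutting down the threads, etc.).

This Q&A is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/37266845/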
