c++ - 了解 std::hardware_破坏性_interference_size 和 std::hardware_constructive_interference_size

标签 c++ multithreading concurrency c++17 cpu-cache

添加了 C++17 std::hardware_destructive_interference_size and std::hardware_constructive_interference_size .首先,我认为这只是获取 L1 缓存行大小的一种可移植方式,但这是过于简单化了。

问题:

  • 这些常量与 L1 缓存行大小有何关系?
  • 是否有一个很好的例子来展示他们的用例?
  • 两者都定义为static constexpr。如果您构建二进制文件并在具有不同缓存行大小的其他机器上执行它,这不是问题吗?当您不确定您的代码将在哪台机器上运行时,它如何防止错误共享?

最佳答案

这些常量的目的确实是获取缓存行大小。了解它们的基本原理的最佳位置是提案本身:

http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0154r1.html

为了便于阅读,我将在此引用一段基本原理:

[...] the granularity of memory that does not interfere (to the first-order) [is] commonly referred to as the cache-line size.

Uses of cache-line size fall into two broad categories:

  • Avoiding destructive interference (false-sharing) between objects with temporally disjoint runtime access patterns from different threads.
  • Promoting constructive interference (true-sharing) between objects which have temporally local runtime access patterns.

The most sigificant issue with this useful implementation quantity is the questionable portability of the methods used in current practice to determine its value, despite their pervasiveness and popularity as a group. [...]

We aim to contribute a modest invention for this cause, abstractions for this quantity that can be conservatively defined for given purposes by implementations:

  • Destructive interference size: a number that’s suitable as an offset between two objects to likely avoid false-sharing due to different runtime access patterns from different threads.
  • Constructive interference size: a number that’s suitable as a limit on two objects’ combined memory footprint size and base alignment to likely promote true-sharing between them.

In both cases these values are provided on a quality of implementation basis, purely as hints that are likely to improve performance. These are ideal portable values to use with the alignas() keyword, for which there currently exists nearly no standard-supported portable uses.


“这些常数与 L1 缓存行大小有何关系?”

理论上,相当直接。

假设编译器确切地知道您将在什么架构上运行 - 那么这些几乎肯定会准确地为您提供 L1 缓存行大小。 (如后所述,这是一个很大的假设。)

对于它的值(value),我几乎总是希望这些值是相同的。我相信它们被单独声明的唯一原因是为了完整性。 (也就是说,也许编译器想要估计 L2 缓存线大小而不是 L1 缓存线大小以进行建设性干扰;不过,我不知道这是否真的有用。)


“有没有很好的例子来展示他们的用例?”

在这个答案的底部,我附上了一个长的基准程序,演示了虚假共享和真实共享。

它通过分配一个 int 包装数组来演示虚假共享:在一种情况下,多个元素适合 L1 缓存行,而在另一种情况下,单个元素占用 L1 缓存行。在紧密循环中,从数组中选择一个固定元素并重复更新。

它通过在包装器中分配一对 int 来演示真正的共享:在一种情况下,这对中的两个 int 不适合 L1 缓存行大小,而在另一种情况下它们可以。在一个紧密的循环中,对中的每个元素都会重复更新。

请注意,访问被测对象的代码不会改变;唯一的区别是对象本身的布局和对齐方式。

我没有 C++17 编译器(假设大多数人目前也没有),所以我用我自己的替换了有问题的常量。您需要更新这些值以在您的机器上准确。也就是说,64 字节可能是典型现代桌面硬件上的正确值(在撰写本文时)。

警告:测试将使用您机器上的所有内核,并分配 ~256MB 内存。不要忘记进行优化编译!

在我的机器上,输出是:

Hardware concurrency: 16
sizeof(naive_int): 4
alignof(naive_int): 4
sizeof(cache_int): 64
alignof(cache_int): 64
sizeof(bad_pair): 72
alignof(bad_pair): 4
sizeof(good_pair): 8
alignof(good_pair): 4
Running naive_int test.
Average time: 0.0873625 seconds, useless result: 3291773
Running cache_int test.
Average time: 0.024724 seconds, useless result: 3286020
Running bad_pair test.
Average time: 0.308667 seconds, useless result: 6396272
Running good_pair test.
Average time: 0.174936 seconds, useless result: 6668457

I get ~3.5x speedup by avoiding false-sharing, and ~1.7x speedup by ensuring true-sharing.


"Both are defined static constexpr. Is that not a problem if you build a binary and execute it on other machines with different cache line sizes? How can it protect against false sharing in that scenario when you are not certain on which machine your code will be running?"

This will indeed be a problem. These constants are not guaranteed to map to any cache-line size on the target machine in particular, but are intended to be the best approximation the compiler can muster up.

This is noted in the proposal, and in the appendix they give an example of how some libraries try to detect cache-line size at compile time based on various environmental hints and macros. You are guaranteed that this value is at least alignof(max_align_t), which is an obvious lower bound.

In other words, this value should be used as your fallback case; you are free to define a precise value if you know it, e.g.:

constexpr std::size_t cache_line_size() {
#ifdef KNOWN_L1_CACHE_LINE_SIZE
  return KNOWN_L1_CACHE_LINE_SIZE;
#else
  return std::hardware_destructive_interference_size;
#endif
}

在编译期间,如果您想假设缓存行大小,只需定义 KNOWN_L1_CACHE_LINE_SIZE

希望这会有所帮助!

基准程序:

#include <chrono>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <future>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

// !!! YOU MUST UPDATE THIS TO BE ACCURATE !!!
constexpr std::size_t hardware_destructive_interference_size = 64;

// !!! YOU MUST UPDATE THIS TO BE ACCURATE !!!
constexpr std::size_t hardware_constructive_interference_size = 64;

constexpr unsigned kTimingTrialsToComputeAverage = 100;
constexpr unsigned kInnerLoopTrials = 1000000;

typedef unsigned useless_result_t;
typedef double elapsed_secs_t;

//////// CODE TO BE SAMPLED:

// wraps an int, default alignment allows false-sharing
struct naive_int {
    int value;
};
static_assert(alignof(naive_int) < hardware_destructive_interference_size, "");

// wraps an int, cache alignment prevents false-sharing
struct cache_int {
    alignas(hardware_destructive_interference_size) int value;
};
static_assert(alignof(cache_int) == hardware_destructive_interference_size, "");

// wraps a pair of int, purposefully pushes them too far apart for true-sharing
struct bad_pair {
    int first;
    char padding[hardware_constructive_interference_size];
    int second;
};
static_assert(sizeof(bad_pair) > hardware_constructive_interference_size, "");

// wraps a pair of int, ensures they fit nicely together for true-sharing
struct good_pair {
    int first;
    int second;
};
static_assert(sizeof(good_pair) <= hardware_constructive_interference_size, "");

// accesses a specific array element many times
template <typename T, typename Latch>
useless_result_t sample_array_threadfunc(
    Latch& latch,
    unsigned thread_index,
    T& vec) {
    // prepare for computation
    std::random_device rd;
    std::mt19937 mt{ rd() };
    std::uniform_int_distribution<int> dist{ 0, 4096 };

    auto& element = vec[vec.size() / 2 + thread_index];

    latch.count_down_and_wait();

    // compute
    for (unsigned trial = 0; trial != kInnerLoopTrials; ++trial) {
        element.value = dist(mt);
    }

    return static_cast<useless_result_t>(element.value);
}

// accesses a pair's elements many times
template <typename T, typename Latch>
useless_result_t sample_pair_threadfunc(
    Latch& latch,
    unsigned thread_index,
    T& pair) {
    // prepare for computation
    std::random_device rd;
    std::mt19937 mt{ rd() };
    std::uniform_int_distribution<int> dist{ 0, 4096 };

    latch.count_down_and_wait();

    // compute
    for (unsigned trial = 0; trial != kInnerLoopTrials; ++trial) {
        pair.first = dist(mt);
        pair.second = dist(mt);
    }

    return static_cast<useless_result_t>(pair.first) +
        static_cast<useless_result_t>(pair.second);
}

//////// UTILITIES:

// utility: allow threads to wait until everyone is ready
class threadlatch {
public:
    explicit threadlatch(const std::size_t count) :
        count_{ count }
    {}

    void count_down_and_wait() {
        std::unique_lock<std::mutex> lock{ mutex_ };
        if (--count_ == 0) {
            cv_.notify_all();
        }
        else {
            cv_.wait(lock, [&] { return count_ == 0; });
        }
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::size_t count_;
};

// utility: runs a given function in N threads
std::tuple<useless_result_t, elapsed_secs_t> run_threads(
    const std::function<useless_result_t(threadlatch&, unsigned)>& func,
    const unsigned num_threads) {
    threadlatch latch{ num_threads + 1 };

    std::vector<std::future<useless_result_t>> futures;
    std::vector<std::thread> threads;
    for (unsigned thread_index = 0; thread_index != num_threads; ++thread_index) {
        std::packaged_task<useless_result_t()> task{
            std::bind(func, std::ref(latch), thread_index)
        };

        futures.push_back(task.get_future());
        threads.push_back(std::thread(std::move(task)));
    }

    const auto starttime = std::chrono::high_resolution_clock::now();

    latch.count_down_and_wait();
    for (auto& thread : threads) {
        thread.join();
    }

    const auto endtime = std::chrono::high_resolution_clock::now();
    const auto elapsed = std::chrono::duration_cast<
        std::chrono::duration<double>>(
            endtime - starttime
            ).count();

    useless_result_t result = 0;
    for (auto& future : futures) {
        result += future.get();
    }

    return std::make_tuple(result, elapsed);
}

// utility: sample the time it takes to run func on N threads
void run_tests(
    const std::function<useless_result_t(threadlatch&, unsigned)>& func,
    const unsigned num_threads) {
    useless_result_t final_result = 0;
    double avgtime = 0.0;
    for (unsigned trial = 0; trial != kTimingTrialsToComputeAverage; ++trial) {
        const auto result_and_elapsed = run_threads(func, num_threads);
        const auto result = std::get<useless_result_t>(result_and_elapsed);
        const auto elapsed = std::get<elapsed_secs_t>(result_and_elapsed);

        final_result += result;
        avgtime = (avgtime * trial + elapsed) / (trial + 1);
    }

    std::cout
        << "Average time: " << avgtime
        << " seconds, useless result: " << final_result
        << std::endl;
}

int main() {
    const auto cores = std::thread::hardware_concurrency();
    std::cout << "Hardware concurrency: " << cores << std::endl;

    std::cout << "sizeof(naive_int): " << sizeof(naive_int) << std::endl;
    std::cout << "alignof(naive_int): " << alignof(naive_int) << std::endl;
    std::cout << "sizeof(cache_int): " << sizeof(cache_int) << std::endl;
    std::cout << "alignof(cache_int): " << alignof(cache_int) << std::endl;
    std::cout << "sizeof(bad_pair): " << sizeof(bad_pair) << std::endl;
    std::cout << "alignof(bad_pair): " << alignof(bad_pair) << std::endl;
    std::cout << "sizeof(good_pair): " << sizeof(good_pair) << std::endl;
    std::cout << "alignof(good_pair): " << alignof(good_pair) << std::endl;

    {
        std::cout << "Running naive_int test." << std::endl;

        std::vector<naive_int> vec;
        vec.resize((1u << 28) / sizeof(naive_int));  // allocate 256 mibibytes

        run_tests([&](threadlatch& latch, unsigned thread_index) {
            return sample_array_threadfunc(latch, thread_index, vec);
        }, cores);
    }
    {
        std::cout << "Running cache_int test." << std::endl;

        std::vector<cache_int> vec;
        vec.resize((1u << 28) / sizeof(cache_int));  // allocate 256 mibibytes

        run_tests([&](threadlatch& latch, unsigned thread_index) {
            return sample_array_threadfunc(latch, thread_index, vec);
        }, cores);
    }
    {
        std::cout << "Running bad_pair test." << std::endl;

        bad_pair p;

        run_tests([&](threadlatch& latch, unsigned thread_index) {
            return sample_pair_threadfunc(latch, thread_index, p);
        }, cores);
    }
    {
        std::cout << "Running good_pair test." << std::endl;

        good_pair p;

        run_tests([&](threadlatch& latch, unsigned thread_index) {
            return sample_pair_threadfunc(latch, thread_index, p);
        }, cores);
    }
}

关于c++ - 了解 std::hardware_破坏性_interference_size 和 std::hardware_constructive_interference_size,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/39680206/

相关文章:

c# - 在 C# 中异步处理项目队列

postgresql - 将列从 timestamp 更改为 timestamptz 会锁定表吗?

asp.net-mvc - 在编辑操作中使用 RowVersion 的 ASP.NET MVC 并发

c++ - 从 C 到 C++ 并返回

c++ - 什么是 C++ 中的 .pyw 文件?

multithreading - julia: 可以玩多线程吗

c++ - 从 Qt 中的 QMainWindow 的构造函数启动一个新线程

c++ - 航类目的地检查程序-C++

c++ - 使用 pantheios 从多个进程并发写入日志文件

Go goroutine 泄漏