c++ - How can I improve dispatch performance for std::function listeners?

In short, is there any obvious way to make the distributor.distribute() call in the code below run faster?

#include <iostream>
#include <memory>
#include <functional>
#include <vector>
#include <typeindex>
#include <unordered_map>
#include <chrono>


// ---------------------------------------------------------------------
// Things to get passed around
// ---------------------------------------------------------------------
class Base {
public:
  virtual ~Base() {};
};
class Derived : public Base {};

// ---------------------------------------------------------------------
// Base class for our Handler class so we can store them in a container
// ---------------------------------------------------------------------
class BaseHandler
{
public:
  virtual ~BaseHandler() {};
  virtual void handle(std::shared_ptr<const Base> ptr) = 0;
};

// ---------------------------------------------------------------------
// Handler class to wrap a std::function. This is helpful because it
// allows us to add metadata to the function call such as call priority
// (not implemented here for simplification)
// ---------------------------------------------------------------------
template <typename T>
class Handler : public BaseHandler
{
public:
  Handler(std::function<void(std::shared_ptr<const T>)> handlerFn)
  : handlerFn(handlerFn) {};
  void handle(std::shared_ptr<const Base> ptr) override {
    handlerFn(std::static_pointer_cast<const T>(ptr));
  }
private:
  std::function<void(std::shared_ptr<const T>)> handlerFn;
};

// ---------------------------------------------------------------------
// Distributor keeps a record of listeners by type and calls them when a
// corresponding object of that type needs to be distributed.
// ---------------------------------------------------------------------
class Distributor
{
public:
  template <typename T>
  void addHandler(std::shared_ptr<Handler<T>> handler)
  {
    handlerMap[std::type_index(typeid(T))].emplace_back(handler);
  }
  void distribute(std::shared_ptr<const Base> basePtr)
  {
    const Base& base = *basePtr;
    std::type_index typeIdx(typeid(base));

    for(auto& handler : handlerMap[typeIdx])
    {
      handler->handle(basePtr);
    }
  }
private:
  std::unordered_map<std::type_index, std::vector<std::shared_ptr<BaseHandler>>> handlerMap;
};

// ---------------------------------------------------------------------
// Benchmarking code
// ---------------------------------------------------------------------

// Test handler function
void handleDerived(std::shared_ptr<const Derived> derived) { }

int main ()
{
  size_t iters = 10000000;
  size_t numRuns = 10;

  Distributor distributor;

  // add our test handler
  distributor.addHandler(std::make_shared<Handler<Derived>>(&handleDerived));

  std::cout << "Raw Func Call\t|\tDistributor\t|\tRatio" << std::endl;
  std::cout << "-------------\t|\t-----------\t|\t-----" << std::endl;

  for(size_t i = 0; i < numRuns; i++)
  {
    auto evt = std::make_shared<Derived>();

    // time raw function calls
    auto start = std::chrono::steady_clock::now();
    for (size_t i = 0; i < iters; i++) {
      handleDerived(evt);
    }
    auto d = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start);

    // time calls through the distributor
    start = std::chrono::steady_clock::now();
    for (size_t i = 0; i < iters; i++) {
      distributor.distribute(evt);
    }
    auto d2 = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start);

    std::cout << d.count() << "\t\t|\t" << d2.count() << "\t\t|\t" << (d2*1.0/d) << std::endl;
  }


}

Results on a Windows 10 machine with MinGW-W64 g++ 8.1.0, compiled with the -O3 optimization flag:

Raw Func Call   |       Distributor     |       Ratio
-------------   |       -----------     |       -----
256             |       1256            |       4.90625
258             |       1224            |       4.74419
273             |       1222            |       4.47619
246             |       1261            |       5.12602
270             |       1257            |       4.65556
248             |       1276            |       5.14516
272             |       1274            |       4.68382
265             |       1208            |       4.55849
240             |       1224            |       5.1
239             |       1163            |       4.86611

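As you can see, the overhead of the distributor call causes a slowdown of roughly 4.5-5x (compared with a call that only requires the conversion from pointer-to-non-const to pointer-to-const). Even so, is there any clear way to improve this while keeping the given design pattern?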

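Handlers should be given a shared_ptr because I want them to be able to keep a reference to the passed object if they wish. In practice they may or may not want to keep one.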

I wonder whether there is some way to improve performance by avoiding shared_ptr copy construction, but I am not sure of the best approach.

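Edit: several aspects of this design are very important to me. They are as follows:

My actual use case requires that the original shared_ptr be a pointer to non-const and that the shared_ptr received by the handlers be a pointer to const. So I am essentially comparing the cost of the distribute call against the cost of calling a function that triggers that conversion, which serves as the reference point.
Users of the Distributor class should not need to worry about conversions. Any conversion to Base and back to the Derived class should be invisible to the user.
I would like to support almost any kind of handler function (lambdas, functors, member functions, function pointers, etc.), but I might change my mind if the performance benefit of being more restrictive is very significant.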

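Efficiency improvements to other aspects of the code (such as registering listeners) are also welcome but not as important. The main concern is getting the Distributor to call all listeners as efficiently as possible.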

Best answer

Side note:

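When a function takes a std::shared_ptr by value, each call involves chasing a pointer to the control block (a potential cache miss) and an atomic increment of the reference count (a relatively expensive operation). Avoid taking std::shared_ptr by value.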

First, change:

void distribute(std::shared_ptr<const Base> basePtr)

to:

void distribute(std::shared_ptr<const Base> const& basePtr)

And then do the same elsewhere.

At a higher level, though, you are comparing the cost of calling handleDerived directly with the cost of a call that involves:

a typeid call,
a hash lookup,
a vector iteration,
a virtual call,
a call through a function pointer.

That is a lot of overhead. You can reduce it a little by avoiding the virtual calls:

#include <iostream>
#include <memory>
#include <functional>
#include <vector>
#include <typeindex>
#include <unordered_map>
#include <chrono>

struct Base {
    virtual ~Base() {};
};
struct Derived :  Base {};

class Distributor
{
public:
    template <class T, typename F>
    void addHandler(F&& handler) {
        handlerMap[std::type_index(typeid(T))].emplace_back(std::forward<F>(handler));
    }

    void distribute(std::shared_ptr<const Base> const& basePtr) {
        std::type_index typeIdx(typeid(*basePtr));
        for(auto& handler : handlerMap[typeIdx])
            handler(basePtr);
    }

private:
    std::unordered_map<std::type_index, std::vector<std::function<void(std::shared_ptr<const Base> const&)>>> handlerMap;
};

void handleDerived(std::shared_ptr<const Derived> const&) { }

int main ()
{
    size_t iters = 10000000;
    size_t numRuns = 10;

    Distributor distributor;

    // add our test handler
    distributor.addHandler<Derived>([](std::shared_ptr<const Base> const& p) { 
        handleDerived(std::static_pointer_cast<const Derived>(p)); 
    });

    std::cout << "Raw Func Call\t|\tDistributor\t|\tRatio" << std::endl;
    std::cout << "-------------\t|\t-----------\t|\t-----" << std::endl;

    for(size_t i = 0; i < numRuns; i++)
    {
        auto evt = std::make_shared<Derived>();

        // time raw function calls
        auto start = std::chrono::steady_clock::now();
        for (size_t i = 0; i < iters; i++) {
            handleDerived(evt);
        }
        auto d = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start);

        // time calls through the distributor
        start = std::chrono::steady_clock::now();
        for (size_t i = 0; i < iters; i++) {
            distributor.distribute(evt);
        }
        auto d2 = std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start);

        std::cout << d.count() << "\t\t|\t" << d2.count() << "\t\t|\t" << (d2*1.0/d) << std::endl;
    }
}

Output:

Raw Func Call   |       Distributor     |       Ratio
-------------   |       -----------     |       -----
72              |       238             |       3.30556
72              |       238             |       3.30556
72              |       238             |       3.30556
72              |       238             |       3.30556
72              |       238             |       3.30556
72              |       238             |       3.30556
72              |       238             |       3.30556
72              |       238             |       3.30556
72              |       238             |       3.30556
72              |       238             |       3.30556

On my machine, the initial ratio was 4.5.

For "c++ - How can I improve dispatch performance for std::function listeners?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50993095/
