c++ - Why does a graph (around 10k to 100k edges and vertices) take so long to destroy?

Tags: c++ graph boost-graph

I'm writing a program that works with graphs. I define a graph inside a loop and add vertices and edges to it (around 4k vertices and 40k edges). But when one iteration of the loop finishes, it takes a very long time before the next iteration starts.

typedef adjacency_list<listS, vecS, bidirectionalS, property<vertex_name_t, LineLabel>> LineGraph;
for (int i = 0; i < num_comp; i++) {
        if (num_vertices(subgraphs[i]) == 1) {
            continue;
        }
        cout << "Subgraph " << i + 1 << ":" << endl;
        cout << "has " << num_vertices(subgraphs[i]) << " vertices" << endl;
        LineGraph llineg;
        get_line_graph(subgraphs[i], llineg);//vertices and edges are added here
    }

When debugging, it looks like the program jumps into a file called scoped_ptr.hpp at the end of the loop body and runs

~scoped_ptr() BOOST_SP_NOEXCEPT
    {
#if defined(BOOST_SP_ENABLE_DEBUG_HOOKS)
        boost::sp_scalar_destructor_hook( px );
#endif
        boost::checked_delete( px );
    }

Then it goes to checked_delete.hpp and runs

template<class T> inline void checked_delete(T * x) BOOST_NOEXCEPT
{
    // intentionally complex - simplification causes regressions
    typedef char type_must_be_complete[ sizeof(T)? 1: -1 ];
    (void) sizeof(type_must_be_complete);
    delete x;
}

My guess is that the graph has so many edges that freeing them all from memory takes a long time.

What should I do to make this faster?

I made a small test for this problem. The graph has 10617 vertices and 72172 edges.

#include <boost/graph/adjacency_list.hpp>
#include <boost/timer.hpp>
#include <fstream>
#include <iostream>
using namespace boost;
using namespace std;

int main() {
    typedef adjacency_list<setS, vecS, bidirectionalS> G;
    timer t; // boost::timer from <boost/timer.hpp>
    for (int round = 0; round < 2; round++) {
        cout << t.elapsed() << endl;
        int num_v = 10617;
        int num_e = 72172;
        G g;
        for (int i = 0; i < num_v; i++) {
            add_vertex(g);
        }
        ifstream infile("wordassociation-2011.txt"); // read edges from file
        int a, b;
        char c;
        for (int i = 0; i < num_e; ++i) {
            infile >> a >> c >> b;
            add_edge(a, b, g);
            if ((i + 1) % 5000 == 0) {
                cout << "reading..." << endl;
            }
        }
        cout << t.elapsed() << endl;
    }
}

The result shows that it takes about two minutes before the next iteration of the loop starts. The output (originally posted as a screenshot) looks roughly like this:

0
reading...
reading...
(… "reading..." printed 14 times in total …)
0.981
122.192
reading...
…

Best Answer

So, I couldn't leave this one alone :)

I downloaded the dataset from http://w3.usf.edu/FreeAssociation/AppendixA/index.html (apparently that is your source, since the counts match exactly).

I wrote a parser that reads into a Row data type:

// part of speech
enum POS { Noun, Verb, Adjective, Adverb, Pronoun, Preposition, Interjection, Conjunction, Other };

using Atom = AtomPool::Atom;

struct Association {
    Atom     cue, target;
    bool     normed_;
    unsigned _g;   // norm_group
    unsigned _p;   // partic_response
    double   fsg;  // forward_strength
    double   bsg;  // backward_strength
    double   msg;  // mediated_strength "2-step strength"
    double   osg;  // overlapping_strength
    unsigned _m;   // mediators
    unsigned mmia; // mediators_missing_in_action
    double   _o;   // overlapping_associates shared by cue and target
    unsigned omia; // overlapping_missing_in_action
};

struct WordAttributes {
    unsigned ss;  // set size
    unsigned fr;  // frequency
    double   con; // concreteness
    bool     h;   // is homograph?
    POS      ps;  // part of speech
    double   mc;  // mean connectivity among its associates
    double   pr;  // probablity of a resonant connection
    double   rsg; // resonant strength
    unsigned uc;  // use code
};

struct Row {
    Association    assoc;
    WordAttributes qattr, tattr;
};

I did some optimizing by using string atoms, since the strings are likely to be reused heavily. I was right (see the statistics printed by the demos below).
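
To illustrate what the atoms buy, here is a minimal usage sketch of the AtomPool defined in the full listing below (this fragment is mine, just to show the intended interface):

AtomPool pool;
AtomPool::Atom a = pool.atomize("BREAD"); // first time: the text is stored in the pool
AtomPool::Atom b = pool.atomize("BREAD"); // duplicate: the existing atom id is returned
assert(a == b);                           // equal ids, so comparing words is an integer compare
std::string_view text = pool.lookup(a);   // "BREAD"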

Just the parser

Live On Coliru

// #define BOOST_SPIRIT_X3_DEBUG
#include <boost/container/flat_set.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/spirit/home/x3.hpp>

#include <iomanip>
#include <iostream>

#include <chrono>
#include <iomanip>
#include <iostream>
#include <utility>

namespace {
    using boost::container::flat_set;

    static long elapsed() {
        auto now = std::chrono::high_resolution_clock::now;
        using namespace std::chrono_literals;
        static auto start = now();
        auto        t     = now();

        return (t - std::exchange(start, t)) / 1ms;
    }

    void trace(auto const&... args) {
        ((std::cout << std::setw(10) << elapsed() << "ms ") << ... << args) << std::endl;
    }
} // namespace

struct AtomPool {
    using Atom = std::uint32_t; // Backing::size_type;
    static constexpr Atom max() { return std::numeric_limits<Atom>::max(); }

    size_t size() const { return _index.size(); }
    size_t size_bytes() const { return _backing.size(); }

    AtomPool() {
        _backing.reserve(100'000);
        _index.reserve(12'000);
    }

    ~AtomPool() {
        _index.clear();
        _backing.clear();
        if (_insertions)
            trace("~AtomPool deduplicated ", (_dedup * 100.0 / _insertions), "% of ", _insertions,
                  " insertions");
    }

  private:
    using Backing = std::vector<char>;

    struct Cmp {
        AtomPool const& _pool;
        using is_transparent = void;

        template <typename... T> bool operator()(T const&... v) const { return (view(v) < ...); }

      private:
        std::string_view view(Atom id) const { return _pool.lookup(id); }
        std::string_view view(std::string_view sv) const { return sv; }
    };

    Backing             _backing;
    flat_set<Atom, Cmp> _index{Cmp{*this}};
    size_t              _insertions = 0, _dedup = 0;

  public:
    Atom atomize(std::string const& text) {
        if (_backing.size() > max())
            throw std::range_error("AtomPool");

        _insertions += 1;

        if (auto it = _index.find(text); it != _index.end()) {
            _dedup += 1;
            return *it;
        }

        auto pos = _backing.size();
        _backing.insert(_backing.end(), begin(text), end(text));
        _backing.push_back('\0');

        _index.insert(pos);
        return pos;
    }

    std::string_view lookup(Atom id) const {
        assert(id < _backing.size());
        return _backing.data() + id;
    }
};

namespace DataSet {

    /*
     * |--------------+-----------------------------------------------------|
     * | Abbreviation | Equivalence                                         |
     * |==============+=====================================================|
     * | CUE          | Normed Word                                         |
     * | TARGET       | Response to Normed Word                             |
     * | NORMED?      | Is Response Normed?                                 |
     * | #G           | Group size                                          |
     * | #            | P  Number of Participants Producing Response        |
     * | FSG          | Forward Cue-to-Target Strength                      |
     * | BSG          | Backward Target-to-Cue Strength                     |
     * | MSG          | Mediated Strength                                   |
     * | OSG          | Overlapping Associate Strength                      |
     * | #            | M  Number of Mediators                              |
     * | MMIA         | Number of Non-Normed Potential Mediating Associates |             
     * | #            | O  Number of Overlaping Associates                  |
     * | OMIA         | Number of Non-Normed Overlapping Associates         |
     * | QSS          | Cue: Set Size                                       |
     * | QFR          | Cue: Frequency                                      |
     * | QCON         | Cue: Concreteness                                   |
     * | QH           | Cue is a Homograph?                                 |
     * | QPS          | Cue: Part of Speech                                 |
     * | QMC          | Cue: Mean Connectivity Among Its Associates         |
     * | QPR          | Cue: Probability of a Resonant Connection           |
     * | QRSG         | Cue: Resonant Strength                              |
     * | QUC          | Cue: Use Code                                       |
     * | TSS          | Target: Set Size                                    |
     * | TFR          | Target: Frequency                                   |
     * | TCON         | Target: Concreteness                                |
     * | TH           | Target is a Homograph?                              |
     * | TPS          | Target: Part of Speech                              |
     * | TMC          | Target: Mean Connectivity Among Its Assoc.          |
     * | TPR          | Target: Probability of a Resonant Connection        |
     * | TRSG         | Target: Resonant Strength                           |
     * | TUC          | Target: Use Code                                    |
     * |--------------+-----------------------------------------------------|
     */
    enum ColId {
        CUE, TARGET, NORMED_, _G, _P, FSG, BSG, MSG, OSG, _M, MMIAS, _O, OMIAS, QSS, //
        QFR, QCON, QH, QPS, QMC, QPR, QRSG, QUC, TSS, TFR, TCON, TH, TPS, TMC, TPR, //
        TRSG, TUC, //
        NUMCOLUMNS
    };

    // part of speech
    enum POS { Noun, Verb, Adjective, Adverb, Pronoun, Preposition, Interjection, Conjunction, Other };

    using Atom = AtomPool::Atom;

    struct Association {
        Atom     cue, target;
        bool     normed_;
        unsigned _g;   // norm_group
        unsigned _p;   // partic_response
        double   fsg;  // forward_strength
        double   bsg;  // backward_strength
        double   msg;  // mediated_strength "2-step strength"
        double   osg;  // overlapping_strength
        unsigned _m;   // mediators
        unsigned mmia; // mediators_missing_in_action
        double   _o;   // overlapping_associates shared by cue and target
        unsigned omia; // overlapping_missing_in_action
    };

    struct WordAttributes {
        unsigned ss;  // set size
        unsigned fr;  // frequency
        double   con; // concreteness
        bool     h;   // is homograph?
        POS      ps;  // part of speech
        double   mc;  // mean connectivity among its associates
        double   pr;  // probablity of a resonant connection
        double   rsg; // resonant strength
        unsigned uc;  // use code
    };

    struct Row {
        Association    assoc;
        WordAttributes qattr, tattr;
    };
} // namespace DataSet
  //
#include <boost/fusion/adapted.hpp>

BOOST_FUSION_ADAPT_STRUCT(DataSet::WordAttributes, ss, fr, con, h, ps, mc, pr, rsg, uc)
BOOST_FUSION_ADAPT_STRUCT(DataSet::Association, cue, target, normed_, _g, _p, fsg, bsg, msg, osg, _m, mmia, _o, omia)
BOOST_FUSION_ADAPT_STRUCT(DataSet::Row, assoc, qattr, tattr)

struct Reader {
    using Atom        = DataSet::Atom;
    using Association = DataSet::Association;
    using Attributes  = DataSet::WordAttributes;
    using POS         = DataSet::POS;

    using Row  = DataSet::Row;
    using File = std::vector<Row>;

    Reader(std::string const& fname, AtomPool& pool) : _atoms(pool), _mfs{fname} { parse(); }

  private:
    AtomPool&                            _atoms;
    boost::iostreams::mapped_file_source _mfs;
    File                                 _rows;

    using It = char const*;

    void parse() try {
        It const f = _mfs.begin(), l = _mfs.end();

        long const total_bytes = l - f;
        _rows.reserve(total_bytes / 150);
        namespace x3  = boost::spirit::x3;
        namespace enc = x3::standard; // iso8859_1;
        using enc::blank;
        using enc::char_;
        using enc::lit;
        using enc::space;
        // using enc::symbols;

        auto atom = [this](auto& ctx) {
            auto& vec = _attr(ctx);
            _val(ctx) = _atoms.atomize(std::string(vec.begin(), vec.end()));
        };

        auto normed = x3::symbols<bool>{}.add("NO", false)("YES", true).sym;
        normed.name("YES|NO");

        auto col = x3::rule<Atom, Atom>{"col"} = x3::lexeme[*~char_(",\r\n")][atom];

        auto header = x3::eps   //
            >> "CUE" >> ','     //
            >> "TARGET" >> ','  //
            >> "NORMED?" >> ',' //
            >> "#G" >> ','      //
            >> "#P" >> ','      //
            >> "FSG" >> ','     //
            >> "BSG" >> ','     //
            >> "MSG" >> ','     //
            >> "OSG" >> ','     //
            >> "#M" >> ','      //
            >> "MMIAS" >> ','   //
            >> "#O" >> ','      //
            >> "OMIAS" >> ','   //
            >> "QSS" >> ','     //
            >> "QFR" >> ','     //
            >> "QCON" >> ','    //
            >> "QH" >> ','      //
            >> "QPS" >> ','     //
            >> "QMC" >> ','     //
            >> "QPR" >> ','     //
            >> "QRSG" >> ','    //
            >> "QUC" >> ','     //
            >> "TSS" >> ','     //
            >> "TFR" >> ','     //
            >> "TCON" >> ','    //
            >> "TH" >> ','      //
            >> "TPS" >> ','     //
            >> "TMC" >> ','     //
            >> "TPR" >> ','     //
            >> "TRSG" >> ','    //
            >> "TUC" >> x3::eol;

        auto yen = lit('\xa5') | x3::iso8859_1::lit("¥");

        auto eoc = &(',' | x3::eol | x3::eoi); // end-of-column predicate

        auto strength            //
            = x3::double_ >> eoc //
            | -yen >> x3::attr(NAN);

        auto count                                    //
            = x3::uint_ >> &(',' | x3::eol | x3::eoi) //
            | -yen >> x3::attr(-1u);

        auto homograph = x3::matches[enc::graph - ','];
        auto part_of_speech = //
            x3::symbols<POS>{}
                .add("N", POS::Noun) //
            ("V", POS::Verb)         //
            ("AJ", POS::Adjective)   //
            ("AD", POS::Adverb)      //
            ("P", POS::Pronoun)      //
            ("PP", POS::Preposition) //
            ("I", POS::Interjection) //
            ("C", POS::Conjunction)  //
            // these were undocumented but deduced from context
            ("INT", POS::Interjection) //
            ("A", POS::Adverb)         //
            ("AV", POS::Adverb)        //
            ("ADV", POS::Adverb)       //
            ("ADJ", POS::Adjective)    //
            ("PRP", POS::Pronoun)      // "ACCOUNT" probably mistagged in dataset
                .sym |
            eoc >> x3::attr(POS::Other);

        auto attributes = x3::rule<Attributes, Attributes>{"attributes"} = x3::eps

            > count > ','          // SS
            > count > ','          // FR
            > strength > ','       // CON
            > homograph > ','      // H
            > part_of_speech > ',' // PS
            > strength > ','       // M
            > strength > ','       // PR
            > strength > ','       // RSG
            > count;               // UC

        auto association = x3::rule<Association, Association>{"association"} = (*header >> !x3::eoi)

            > col > ','      // CUE
            > col > ','      // TARGET
            > normed > ','   // NORMED?
            > count > ','    // #G
            > count > ','    // #P
            > strength > ',' // FSG
            > strength > ',' // BSG
            > strength > ',' // MSG
            > strength > ',' // OSG
            > count > ','    // #M
            > count > ','    // MMIAS
            > strength > ',' // #O
            > count;         // OMIAS

        auto row = x3::rule<Row, Row>{"row"} = //
            (*header >> !x3::eoi)              //
            > association > ','                // assoc
            > attributes > ','                 // qattr
            > attributes;                      // tattr

        auto file = x3::rule<File, File>{"file"} = *(row >> x3::eol);

        phrase_parse(f, l, x3::eps > file > x3::eoi, blank, _rows);

        using DataSet::NUMCOLUMNS;
        trace("Parsed: ", total_bytes / 1024.0 / 1024, "MiB ", "into ", _rows.size(), " rows");

        trace(_atoms.size(), " atoms totaling ", _atoms.size_bytes(), " bytes of backing storage");
    } catch (boost::spirit::x3::expectation_failure<It> const& ef) {
        It   b = _mfs.begin(), e = _mfs.end();
        auto w = ef.where();
        auto sol = b + std::string_view(b, w).find_last_of("\r\n") + 1;

        auto line = std::string_view(sol, e);
        line      = line.substr(0, line.find_first_of("\r\n"));

        auto l = 1 + std::count(b, w, '\n'), c = 1 + (w - sol);
        trace("Expected ", ef.which(), " at L:", l, ":", c, " at\n", //
              "    ", line, "\n",                                    //
              std::setw(c + 4), '^', "----- here");                  //
        std::exit(1);
    }
};

int main(int argc, char** argv) {
    AtomPool pool;

    trace("Start");
    {
        for (std::string fname : std::vector(argv+1, argv+argc))
            Reader r{fname, pool};
    }
    trace("Done");
}

For the small demo dataset it prints:

./a.out /Archive2/bd/de4002f5d4d3ee/main.cpp
     0ms Start
    54ms Parsed: 0.00769901MiB into 40 rows
     0ms 70 atoms totaling 462 bytes of backing storage
     0ms Done
     0ms ~AtomPool deduplicated 12.5% of 80 insertions

Locally, the whole dataset parses in about 90 ms:

/build/sotest dataset/Cue_Target_Pairs.* dataset/full.txt 

     0ms Start
    14ms Parsed: 1.15371MiB into 8476 rows
     0ms 3732 atoms totaling 26643 bytes of backing storage
    13ms Parsed: 1.03647MiB into 7590 rows
     0ms 5340 atoms totaling 39238 bytes of backing storage
    19ms Parsed: 1.42218MiB into 10500 rows
     0ms 6901 atoms totaling 51720 bytes of backing storage
    15ms Parsed: 1.14505MiB into 8440 rows
     0ms 7917 atoms totaling 59821 bytes of backing storage
    14ms Parsed: 1.26198MiB into 9287 rows
     0ms 8709 atoms totaling 66346 bytes of backing storage
    15ms Parsed: 1.35513MiB into 9888 rows
     0ms 9456 atoms totaling 72769 bytes of backing storage
    14ms Parsed: 1.25216MiB into 9214 rows
     0ms 10055 atoms totaling 77640 bytes of backing storage
    15ms Parsed: 1.19052MiB into 8781 rows
     0ms 10617 atoms totaling 82271 bytes of backing storage
    86ms Parsed: 9.8172MiB into 72176 rows
     0ms 10617 atoms totaling 82271 bytes of backing storage
     0ms Done
     0ms ~AtomPool deduplicated 96.3225% of 288704 insertions

Creating the graph

Let's create a vertex for every (normed and non-normed) word, and add every association (input row) as an edge.

#include <boost/graph/adjacency_list.hpp>
using VertexProp = DataSet::WordAttributes;
using EdgeProp   = DataSet::Association;
using Graph      = boost::adjacency_list<boost::vecS, boost::vecS, boost::directedS, VertexProp, EdgeProp>;

Using a temporary map from atoms to vertex descriptors, the implementation of buildGraph can be quite simple:

Graph buildGraph() const {
    using V = Graph::vertex_descriptor;

    Graph g;
    g.m_vertices.reserve(_atoms.size());

    std::map<Atom, V> vmap; // not flat-map for stability

    for (auto& [assoc, qattr, tattr] : _rows) {
        auto s = vmap.find(assoc.cue);
        auto t = vmap.find(assoc.target);
        if (s == vmap.end()) s = vmap.emplace(assoc.cue, add_vertex(qattr, g)).first;
        if (t == vmap.end()) t = vmap.emplace(assoc.target, add_vertex(tattr, g)).first;

        assert(g[s->second] == qattr); // check consistent input for qattr/tattr
        assert(g[t->second] == tattr);

        add_edge(s->second, t->second, assoc, g);
    }

    return g;
}

In our main, let's build the graph right after parsing:

int main(int argc, char** argv) {
    AtomPool pool;

    trace("Start");
    for (std::string fname : std::vector(argv + 1, argv + argc)) {
        Reader r{fname, pool};
        trace("Parsed ", fname);
        {
            Graph g = r.buildGraph();
            trace("Graph creation done; ", num_vertices(g), " vertices and ", num_edges(g), " edges");
        }
        trace("Graph destruction done");
    }
    trace("Done");
}

Live On Coliru

./a.out /Archive2/bd/de4002f5d4d3ee/main.cpp
     0ms Start
   322ms Parsed: 0.00769901MiB into 40 rows
     0ms Parsed /Archive2/bd/de4002f5d4d3ee/main.cpp
     0ms Graph creation done; 70 vertices and 40 edges
     0ms Graph destruction done
     0ms Done
     0ms ~AtomPool deduplicated 12.5% of 80 insertions

For the full dataset the output becomes:

     0ms Start
    93ms Parsed: 9.8172MiB into 72176 rows
     0ms Parsed dataset/full.txt
    16ms Graph creation done; 10617 vertices and 72176 edges
     1ms Graph destruction done
     0ms Done
     0ms ~AtomPool deduplicated 92.6451% of 144352 insertions

NOTE

There are four more edges than you reported in your question. Perhaps you haven't added the undocumented part-of-speech codes like I did?


As you can see, destruction takes practically no time. This version even copies all the data out of the Reader, so the Reader can be disposed of after the graph is created, and the vertex/edge properties remain mutable.

If you don't need that, you could use pointers instead for memory efficiency.
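
One way to do that (a sketch of the idea only, not code from the measured program) is to store pointers into the Reader's rows instead of copies. This assumes the Reader outlives the graph:

// Assumption: the Reader (and its _rows vector) stays alive as long as the graph.
using VertexProp = DataSet::WordAttributes const*;
using EdgeProp   = DataSet::Association const*;
using Graph      = boost::adjacency_list<boost::vecS, boost::vecS, boost::directedS, VertexProp, EdgeProp>;

// In buildGraph, add pointers instead of copying the bundles:
//     s = vmap.emplace(assoc.cue, add_vertex(&qattr, g)).first;
//     add_edge(s->second, t->second, &assoc, g);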

listS EdgeContainerSelector?

I don't see a significant difference, although graph destruction does now take an extra millisecond:

     0ms Start
    90ms Parsed: 9.8172MiB into 72176 rows
     0ms Parsed dataset/full.txt
    13ms Graph creation done; 10617 vertices and 72176 edges
     2ms Graph destruction done
     0ms Done
     0ms ~AtomPool deduplicated 92.6451% of 144352 insertions
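
For reference, the only change assumed for that measurement is the out-edge container selector in the Graph typedef (mirroring the listS from the question); everything else stays the same:

// listS instead of vecS for the per-vertex out-edge container (sketch of the assumed change)
using Graph = boost::adjacency_list<boost::listS, boost::vecS, boost::directedS, VertexProp, EdgeProp>;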

Bidirectional

This doesn't seem to make much sense in the application domain, and it comes with considerable storage and insertion overhead. However, with certain access patterns it may pay itself back in query performance. YMMV.

Live On Coliru, or locally against the full dataset:

     0ms Start
    90ms Parsed: 9.8172MiB into 72176 rows
     0ms Parsed dataset/full.txt
    17ms Graph creation done; 10617 vertices and 72176 edges
     6ms Graph destruction done
     0ms Done
     0ms ~AtomPool deduplicated 92.6451% of 144352 insertions

As predicted, the added inefficiency does show up in the timings. But even though destruction is 6× slower, it is still only 6 ms, which hardly matters.
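
For completeness, a sketch of the bidirectional variant measured above, plus the kind of query it enables (reverse traversal via in_edges). The helper function and its name are mine, not part of the measured code:

#include <boost/graph/adjacency_list.hpp>
#include <boost/range/iterator_range.hpp>
#include <iostream>

using BiGraph = boost::adjacency_list<boost::vecS, boost::vecS, boost::bidirectionalS, VertexProp, EdgeProp>;

// bidirectionalS additionally stores in-edges, so asking "which cues lead to this
// target?" becomes a cheap local query instead of a scan over all edges:
void printIncoming(BiGraph const& g, BiGraph::vertex_descriptor v) {
    for (auto e : boost::make_iterator_range(in_edges(v, g)))
        std::cout << source(e, g) << " -> " << v << "\n";
}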

1 The correlation between forward and backward strength for cues whose targets have been normed is positive but not high, r = .29 (n = 63,619), and the chances of correctly guessing back strengths from knowledge of forward strengths are low

For "c++ - Why does a graph (around 10k to 100k edges and vertices) take so long to destroy?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/75947227/
