I'm writing a program that processes graphs. Inside a loop I define a graph and add vertices and edges to it (about 4k vertices and 40k edges). But once one iteration of the loop finishes, it takes a very long time before the next iteration starts.
typedef adjacency_list<listS, vecS, bidirectionalS, property<vertex_name_t, LineLabel>> LineGraph;
for (int i = 0; i < num_comp; i++) {
    if (num_vertices(subgraphs[i]) == 1) {
        continue;
    }
    cout << "Subgraph " << i + 1 << ":" << endl;
    cout << "has " << num_vertices(subgraphs[i]) << " vertices" << endl;
    LineGraph llineg;
    get_line_graph(subgraphs[i], llineg); // vertices and edges are added here
}
While debugging, it seems that when the loop body ends, the program jumps into a file named scoped_ptr.hpp and runs
~scoped_ptr() BOOST_SP_NOEXCEPT
{
#if defined(BOOST_SP_ENABLE_DEBUG_HOOKS)
    boost::sp_scalar_destructor_hook( px );
#endif
    boost::checked_delete( px );
}
Then it goes into checked_delete.hpp and runs
template<class T> inline void checked_delete(T * x) BOOST_NOEXCEPT
{
    // intentionally complex - simplification causes regressions
    typedef char type_must_be_complete[ sizeof(T)? 1: -1 ];
    (void) sizeof(type_must_be_complete);
    delete x;
}
My guess is that the graph has so many edges that deleting them from memory takes a long time.
What can I do to make this faster?
I put together a small test for this. The graph has 10617 vertices and 72172 edges.
int main() {
    typedef adjacency_list<setS, vecS, bidirectionalS> G;
    timer t;
    for (int j = 0; j < 2; j++) {
        cout << t.elapsed() << endl;
        int num_v = 10617;
        int num_e = 72172;
        G g;
        for (int i = 0; i < num_v; i++) {
            add_vertex(g);
        }
        ifstream infile("wordassociation-2011.txt"); // read edges from file
        int a, b;
        char c;
        for (int i = 0; i < num_e; ++i) { // was `num_a` (undefined) with a shadowed `j`
            infile >> a >> c >> b;
            add_edge(a, b, g);
            if ((i + 1) % 5000 == 0) {
                cout << "reading..." << endl;
            }
        }
        cout << t.elapsed() << endl;
    }
}
The result shows it takes about 2 minutes before the next iteration starts. The output (from a screenshot), with the timings after the progress lines:

0
reading...   (repeated 14 times)
0.981
122.192
reading...
...
Best answer

So, I couldn't leave this one alone :)
I downloaded the dataset from http://w3.usf.edu/FreeAssociation/AppendixA/index.html (apparently your source, since the counts match exactly).
I wrote a parser that reads Row records. The data types:
// part of speech
enum POS { Noun, Verb, Adjective, Adverb, Pronoun, Preposition, Interjection, Conjunction, Other };
using Atom = AtomPool::Atom;

struct Association {
    Atom cue, target;
    bool normed_;
    unsigned _g;   // norm_group
    unsigned _p;   // partic_response
    double fsg;    // forward_strength
    double bsg;    // backward_strength
    double msg;    // mediated_strength "2-step strength"
    double osg;    // overlapping_strength
    unsigned _m;   // mediators
    unsigned mmia; // mediators_missing_in_action
    double _o;     // overlapping_associates shared by cue and target
    unsigned omia; // overlapping_missing_in_action
};

struct WordAttributes {
    unsigned ss;  // set size
    unsigned fr;  // frequency
    double con;   // concreteness
    bool h;       // is homograph?
    POS ps;       // part of speech
    double mc;    // mean connectivity among its associates
    double pr;    // probability of a resonant connection
    double rsg;   // resonant strength
    unsigned uc;  // use code
};

struct Row {
    Association assoc;
    WordAttributes qattr, tattr;
};
I optimized a bit by interning the strings into atoms, since strings are likely to be heavily reused. I was right (see the statistics printed by the demo below).
Just the parser:
// #define BOOST_SPIRIT_X3_DEBUG
#include <boost/container/flat_set.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/spirit/home/x3.hpp>
#include <cassert>
#include <chrono>
#include <iomanip>
#include <iostream>
#include <utility>
namespace {
using boost::container::flat_set;
static long elapsed() {
auto now = std::chrono::high_resolution_clock::now;
using namespace std::chrono_literals;
static auto start = now();
auto t = now();
return (t - std::exchange(start, t)) / 1ms;
}
void trace(auto const&... args) {
((std::cout << std::setw(10) << elapsed() << "ms ") << ... << args) << std::endl;
}
} // namespace
struct AtomPool {
using Atom = std::uint32_t; // Backing::size_type;
static constexpr Atom max() { return std::numeric_limits<Atom>::max(); }
size_t size() const { return _index.size(); }
size_t size_bytes() const { return _backing.size(); }
AtomPool() {
_backing.reserve(100'000);
_index.reserve(12'000);
}
~AtomPool() {
_index.clear();
_backing.clear();
if (_insertions)
trace("~AtomPool deduplicated ", (_dedup * 100.0 / _insertions), "% of ", _insertions,
" insertions");
}
private:
using Backing = std::vector<char>;
struct Cmp {
AtomPool const& _pool;
using is_transparent = void;
template <typename... T> bool operator()(T const&... v) const { return (view(v) < ...); }
private:
std::string_view view(Atom id) const { return _pool.lookup(id); }
std::string_view view(std::string_view sv) const { return sv; }
};
Backing _backing;
flat_set<Atom, Cmp> _index{Cmp{*this}};
size_t _insertions = 0, _dedup = 0;
public:
Atom atomize(std::string const& text) {
if (_backing.size() > max())
throw std::range_error("AtomPool");
_insertions += 1;
if (auto it = _index.find(text); it != _index.end()) {
_dedup += 1;
return *it;
}
auto pos = _backing.size();
_backing.insert(_backing.end(), begin(text), end(text));
_backing.push_back('\0');
_index.insert(pos);
return pos;
}
std::string_view lookup(Atom id) const {
assert(id < _backing.size());
return _backing.data() + id;
}
};
namespace DataSet {
/*
* |--------------+-----------------------------------------------------|
* | Abbreviation | Equivalence |
* |==============+=====================================================|
* | CUE | Normed Word |
* | TARGET | Response to Normed Word |
* | NORMED? | Is Response Normed? |
* | #G | Group size |
* | # | P Number of Participants Producing Response |
* | FSG | Forward Cue-to-Target Strength |
* | BSG | Backward Target-to-Cue Strength |
* | MSG | Mediated Strength |
* | OSG | Overlapping Associate Strength |
* | # | M Number of Mediators |
* | MMIA | Number of Non-Normed Potential Mediating Associates |
* | # | O Number of Overlaping Associates |
* | OMIA | Number of Non-Normed Overlapping Associates |
* | QSS | Cue: Set Size |
* | QFR | Cue: Frequency |
* | QCON | Cue: Concreteness |
* | QH | Cue is a Homograph? |
* | QPS | Cue: Part of Speech |
* | QMC | Cue: Mean Connectivity Among Its Associates |
* | QPR | Cue: Probability of a Resonant Connection |
* | QRSG | Cue: Resonant Strength |
* | QUC | Cue: Use Code |
* | TSS | Target: Set Size |
* | TFR | Target: Frequency |
* | TCON | Target: Concreteness |
* | TH | Target is a Homograph? |
* | TPS | Target: Part of Speech |
* | TMC | Target: Mean Connectivity Among Its Assoc. |
* | TPR | Target: Probability of a Resonant Connection |
* | TRSG | Target: Resonant Strength |
* | TUC | Target: Use Code |
* |--------------+-----------------------------------------------------|
*/
enum ColId {
CUE, TARGET, NORMED_, _G, _P, FSG, BSG, MSG, OSG, _M, MMIAS, _O, OMIAS, QSS, //
QFR, QCON, QH, QPS, QMC, QPR, QRSG, QUC, TSS, TFR, TCON, TH, TPS, TMC, TPR, //
TRSG, TUC, //
NUMCOLUMNS
};
// part of speech
enum POS { Noun, Verb, Adjective, Adverb, Pronoun, Preposition, Interjection, Conjunction, Other };
using Atom = AtomPool::Atom;
struct Association {
Atom cue, target;
bool normed_;
unsigned _g; // norm_group
unsigned _p; // partic_response
double fsg; // forward_strength
double bsg; // backward_strength
double msg; // mediated_strength "2-step strength"
double osg; // overlapping_strength
unsigned _m; // mediators
unsigned mmia; // mediators_missing_in_action
double _o; // overlapping_associates shared by cue and target
unsigned omia; // overlapping_missing_in_action
};
struct WordAttributes {
unsigned ss; // set size
unsigned fr; // frequency
double con; // concreteness
bool h; // is homograph?
POS ps; // part of speech
double mc; // mean connectivity among its associates
double pr; // probability of a resonant connection
double rsg; // resonant strength
unsigned uc; // use code
};
struct Row {
Association assoc;
WordAttributes qattr, tattr;
};
} // namespace DataSet
//
#include <boost/fusion/adapted.hpp>
BOOST_FUSION_ADAPT_STRUCT(DataSet::WordAttributes, ss, fr, con, h, ps, mc, pr, rsg, uc)
BOOST_FUSION_ADAPT_STRUCT(DataSet::Association, cue, target, normed_, _g, _p, fsg, bsg, msg, osg, _m, mmia, _o, omia)
BOOST_FUSION_ADAPT_STRUCT(DataSet::Row, assoc, qattr, tattr)
struct Reader {
using Atom = DataSet::Atom;
using Association = DataSet::Association;
using Attributes = DataSet::WordAttributes;
using POS = DataSet::POS;
using Row = DataSet::Row;
using File = std::vector<Row>;
Reader(std::string const& fname, AtomPool& pool) : _atoms(pool), _mfs{fname} { parse(); }
private:
AtomPool& _atoms;
boost::iostreams::mapped_file_source _mfs;
File _rows;
using It = char const*;
void parse() try {
It const f = _mfs.begin(), l = _mfs.end();
long const total_bytes = l - f;
_rows.reserve(total_bytes / 150);
namespace x3 = boost::spirit::x3;
namespace enc = x3::standard; // iso8859_1;
using enc::blank;
using enc::char_;
using enc::lit;
using enc::space;
// using enc::symbols;
auto atom = [this](auto& ctx) {
auto& vec = _attr(ctx);
_val(ctx) = _atoms.atomize(std::string(vec.begin(), vec.end()));
};
auto normed = x3::symbols<bool>{}.add("NO", false)("YES", true).sym;
normed.name("YES|NO");
auto col = x3::rule<Atom, Atom>{"col"} = x3::lexeme[*~char_(",\r\n")][atom];
auto header = x3::eps //
>> "CUE" >> ',' //
>> "TARGET" >> ',' //
>> "NORMED?" >> ',' //
>> "#G" >> ',' //
>> "#P" >> ',' //
>> "FSG" >> ',' //
>> "BSG" >> ',' //
>> "MSG" >> ',' //
>> "OSG" >> ',' //
>> "#M" >> ',' //
>> "MMIAS" >> ',' //
>> "#O" >> ',' //
>> "OMIAS" >> ',' //
>> "QSS" >> ',' //
>> "QFR" >> ',' //
>> "QCON" >> ',' //
>> "QH" >> ',' //
>> "QPS" >> ',' //
>> "QMC" >> ',' //
>> "QPR" >> ',' //
>> "QRSG" >> ',' //
>> "QUC" >> ',' //
>> "TSS" >> ',' //
>> "TFR" >> ',' //
>> "TCON" >> ',' //
>> "TH" >> ',' //
>> "TPS" >> ',' //
>> "TMC" >> ',' //
>> "TPR" >> ',' //
>> "TRSG" >> ',' //
>> "TUC" >> x3::eol;
auto yen = lit('\xa5') | x3::iso8859_1::lit("¥");
auto eoc = &(',' | x3::eol | x3::eoi); // end-of-column predicate
auto strength //
= x3::double_ >> eoc //
| -yen >> x3::attr(NAN);
auto count //
= x3::uint_ >> &(',' | x3::eol | x3::eoi) //
| -yen >> x3::attr(-1u);
auto homograph = x3::matches[enc::graph - ','];
auto part_of_speech = //
x3::symbols<POS>{}
.add("N", POS::Noun) //
("V", POS::Verb) //
("AJ", POS::Adjective) //
("AD", POS::Adverb) //
("P", POS::Pronoun) //
("PP", POS::Preposition) //
("I", POS::Interjection) //
("C", POS::Conjunction) //
// these were undocumented but deduced from context
("INT", POS::Interjection) //
("A", POS::Adverb) //
("AV", POS::Adverb) //
("ADV", POS::Adverb) //
("ADJ", POS::Adjective) //
("PRP", POS::Pronoun) // "ACCOUNT" probably mistagged in dataset
.sym |
eoc >> x3::attr(POS::Other);
auto attributes = x3::rule<Attributes, Attributes>{"attributes"} = x3::eps
> count > ',' // SS
> count > ',' // FR
> strength > ',' // CON
> homograph > ',' // H
> part_of_speech > ',' // PS
> strength > ',' // M
> strength > ',' // PR
> strength > ',' // RSG
> count; // UC
auto association = x3::rule<Association, Association>{"association"} = (*header >> !x3::eoi)
> col > ',' // CUE
> col > ',' // TARGET
> normed > ',' // NORMED?
> count > ',' // #G
> count > ',' // #P
> strength > ',' // FSG
> strength > ',' // BSG
> strength > ',' // MSG
> strength > ',' // OSG
> count > ',' // #M
> count > ',' // MMIAS
> strength > ',' // #O
> count; // OMIAS
auto row = x3::rule<Row, Row>{"row"} = //
(*header >> !x3::eoi) //
> association > ',' // assoc
> attributes > ',' // qattr
> attributes; // tattr
auto file = x3::rule<File, File>{"file"} = *(row >> x3::eol);
phrase_parse(f, l, x3::eps > file > x3::eoi, blank, _rows);
using DataSet::NUMCOLUMNS;
trace("Parsed: ", total_bytes / 1024.0 / 1024, "MiB ", "into ", _rows.size(), " rows");
trace(_atoms.size(), " atoms totaling ", _atoms.size_bytes(), " bytes of backing storage");
} catch (boost::spirit::x3::expectation_failure<It> const& ef) {
It b = _mfs.begin(), e = _mfs.end();
auto w = ef.where();
auto sol = b + std::string_view(b, w).find_last_of("\r\n") + 1;
auto line = std::string_view(sol, e);
line = line.substr(0, line.find_first_of("\r\n"));
auto l = 1 + std::count(b, w, '\n'), c = 1 + (w - sol);
trace("Expected ", ef.which(), " at L:", l, ":", c, " at\n", //
" ", line, "\n", //
std::setw(c + 4), '^', "----- here"); //
std::exit(1);
}
};
int main(int argc, char** argv) {
    AtomPool pool;
    trace("Start");
    {
        for (std::string fname : std::vector(argv + 1, argv + argc))
            Reader r{fname, pool};
    }
    trace("Done");
}
For the small demo dataset this prints:
./a.out /Archive2/bd/de4002f5d4d3ee/main.cpp
0ms Start
54ms Parsed: 0.00769901MiB into 40 rows
0ms 70 atoms totaling 462 bytes of backing storage
0ms Done
0ms ~AtomPool deduplicated 12.5% of 80 insertions
Locally, parsing the entire dataset takes about 90 ms:
/build/sotest dataset/Cue_Target_Pairs.* dataset/full.txt
0ms Start
14ms Parsed: 1.15371MiB into 8476 rows
0ms 3732 atoms totaling 26643 bytes of backing storage
13ms Parsed: 1.03647MiB into 7590 rows
0ms 5340 atoms totaling 39238 bytes of backing storage
19ms Parsed: 1.42218MiB into 10500 rows
0ms 6901 atoms totaling 51720 bytes of backing storage
15ms Parsed: 1.14505MiB into 8440 rows
0ms 7917 atoms totaling 59821 bytes of backing storage
14ms Parsed: 1.26198MiB into 9287 rows
0ms 8709 atoms totaling 66346 bytes of backing storage
15ms Parsed: 1.35513MiB into 9888 rows
0ms 9456 atoms totaling 72769 bytes of backing storage
14ms Parsed: 1.25216MiB into 9214 rows
0ms 10055 atoms totaling 77640 bytes of backing storage
15ms Parsed: 1.19052MiB into 8781 rows
0ms 10617 atoms totaling 82271 bytes of backing storage
86ms Parsed: 9.8172MiB into 72176 rows
0ms 10617 atoms totaling 82271 bytes of backing storage
0ms Done
0ms ~AtomPool deduplicated 96.3225% of 288704 insertions
Creating the graph

Let's create all (normed and non-normed) words as vertices, and add all associations (input rows) as edges.
#include <boost/graph/adjacency_list.hpp>
using VertexProp = DataSet::WordAttributes;
using EdgeProp = DataSet::Association;
using Graph = boost::adjacency_list<boost::vecS, boost::vecS, boost::directedS, VertexProp, EdgeProp>;
Using a temporary map from atoms to vertex descriptors, the implementation of buildGraph can be pretty simple:
Graph buildGraph() const {
    using V = Graph::vertex_descriptor;
    Graph g;
    g.m_vertices.reserve(_atoms.size());

    std::map<Atom, V> vmap; // not flat-map for stability

    for (auto& [assoc, qattr, tattr] : _rows) {
        auto s = vmap.find(assoc.cue);
        auto t = vmap.find(assoc.target);
        if (s == vmap.end()) s = vmap.emplace(assoc.cue, add_vertex(qattr, g)).first;
        if (t == vmap.end()) t = vmap.emplace(assoc.target, add_vertex(tattr, g)).first;

        assert(g[s->second] == qattr); // check consistent input for qattr/tattr
        assert(g[t->second] == tattr);

        add_edge(s->second, t->second, assoc, g);
    }
    return g;
}
In our main, let's build the graph after parsing:
int main(int argc, char** argv) {
    AtomPool pool;
    trace("Start");
    for (std::string fname : std::vector(argv + 1, argv + argc)) {
        Reader r{fname, pool};
        trace("Parsed ", fname);
        {
            Graph g = r.buildGraph();
            trace("Graph creation done; ", num_vertices(g), " vertices and ", num_edges(g), " edges");
        }
        trace("Graph destruction done");
    }
    trace("Done");
}
./a.out /Archive2/bd/de4002f5d4d3ee/main.cpp
0ms Start
322ms Parsed: 0.00769901MiB into 40 rows
0ms Parsed /Archive2/bd/de4002f5d4d3ee/main.cpp
0ms Graph creation done; 70 vertices and 40 edges
0ms Graph destruction done
0ms Done
0ms ~AtomPool deduplicated 12.5% of 80 insertions
For the full dataset the output becomes:
0ms Start
93ms Parsed: 9.8172MiB into 72176 rows
0ms Parsed dataset/full.txt
16ms Graph creation done; 10617 vertices and 72176 edges
1ms Graph destruction done
0ms Done
0ms ~AtomPool deduplicated 92.6451% of 144352 insertions
NOTE
There are four more edges than you reported in your question. Perhaps you didn't add the undocumented part-of-speech codes like I did?
As you can see, destruction takes almost no time. This version even copies all the data out of the Reader — so the Reader can be disposed of after the graph is created — and the vertex/edge properties remain mutable.
If you don't need that, you could store pointers instead for better memory efficiency.
listS as the EdgeContainerSelector?
I don't see a significant difference, although graph destruction now does take an extra millisecond:
0ms Start
90ms Parsed: 9.8172MiB into 72176 rows
0ms Parsed dataset/full.txt
13ms Graph creation done; 10617 vertices and 72176 edges
2ms Graph destruction done
0ms Done
0ms ~AtomPool deduplicated 92.6451% of 144352 insertions
bidirectionalS?
That doesn't seem to make much sense in the application domain, and it carries considerable storage and insertion overhead. Under some access patterns, though, that may pay for itself in query performance. YMMV.
Comparing live, or locally with the full dataset:
0ms Start
90ms Parsed: 9.8172MiB into 72176 rows
0ms Parsed dataset/full.txt
17ms Graph creation done; 10617 vertices and 72176 edges
6ms Graph destruction done
0ms Done
0ms ~AtomPool deduplicated 92.6451% of 144352 insertions
As predicted, the added inefficiency does show up in the timings. But although destruction is 6× slower, it's still only 6 ms, which is negligible.
On "c++ - Why does a graph (around 10k to 100k edges and vertices) take so long to delete?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/75947227/