C++: Fast way to read a mapped file into a matrix

I'm trying to read a mapped file into a matrix. The file looks like this:

name;phone;city\n
Luigi Rossi;02341567;Milan\n
Mario Bianchi;06567890;Rome\n
....

And it is small. The code I wrote works, but it is not very fast:

#include <cstddef>
#include <iostream>
#include <fstream>
#include <string>
#include <vector>
#include <boost/iostreams/device/mapped_file.hpp>

using namespace std;
using boost::iostreams::mapped_file_source;

int main() {

    size_t j = 0; // current row (line)
    size_t k = 0; // current column (field)

    // One row per line, three string fields per row
    vector< vector<string> > M(10000000, vector<string>(3));

    mapped_file_source file("file.csv");

    // Check if file was successfully opened
    if (file.is_open()) {

        // Get pointer to the mapped data
        const char* c = file.data();
        const size_t size = file.size();

        // Stop at size: c[size] lies outside the mapping
        for (size_t i = 0; i < size; i++) {
            if (c[i] == '\n') {
                j = j + 1;
                k = 0;
            } else if (c[i] == ';') {
                k = k + 1;
            } else {
                M[j][k] += c[i];
            }
        } // end for

    } // end if

    return 0;
}

Is there a faster way? I have read something about memcpy, but I don't know how to use it to speed up my code.

Best answer

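I have written a lot of examples like this, and similar ones, on SO.

Let me list the most relevant:

I have done quite a lot of benchmarking on this. Yes, for sequential reads, read/scanf has a slight edge (see e.g. scanf/iostreams and files vs. mappings and parsing floats, or read being slightly faster for 1-pass sequential read).
An interesting approach is lazy parsing (why copy the whole input into memory? What would be the point of the memory mapping then?). The answer here shows this approach (modelling a multimap there):
- Using boost::iostreams::mapped_file_source with std::multimap (approach #2)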
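To make the lazy-parsing idea concrete, here is a minimal sketch using only the standard library. It is not the code from the linked answer. `index_lines` and `field_at` are hypothetical helper names, `std::string_view` (C++17) stands in for a view over the mapped bytes, and the `;`-separated layout is taken from the question. Only the line start offsets are stored up front; a field is located when it is actually requested.

```cpp
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper: record where each line starts, without copying text.
// `data` stands in for the bytes of the memory-mapped file.
std::vector<std::size_t> index_lines(std::string_view data) {
    std::vector<std::size_t> starts;
    std::size_t pos = 0;
    while (pos < data.size()) {
        starts.push_back(pos);
        const std::size_t nl = data.find('\n', pos);
        if (nl == std::string_view::npos) break; // last line has no newline
        pos = nl + 1;
    }
    return starts;
}

// Hypothetical helper: return field `col` (0-based) of the line that begins
// at offset `start`. The result is a view into `data`, not a copy.
// A missing field yields an empty view.
std::string_view field_at(std::string_view data, std::size_t start, std::size_t col) {
    std::size_t end = data.find('\n', start);
    if (end == std::string_view::npos) end = data.size();
    std::string_view line = data.substr(start, end - start);
    if (!line.empty() && line.back() == '\r') line.remove_suffix(1); // CRLF input

    for (; col > 0; --col) {
        const std::size_t sep = line.find(';');
        if (sep == std::string_view::npos) return {};
        line.remove_prefix(sep + 1); // skip one field and its separator
    }
    return line.substr(0, line.find(';'));
}
```

With the sample file from the question, `field_at(data, starts[1], 0)` would refer to `Luigi Rossi` inside the original buffer. The trade-off is that every access rescans its line, so this suits workloads that touch only a fraction of the fields.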

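In all other cases, consider throwing a Boost Spirit Qi parser at it, potentially using boost::string_ref instead of vector<char> (unless the mapped file is not "const", of course).

string_ref is also shown in the last answer linked above. Another interesting example (lazily converting to unescaped string values) is here: How to parse mustache with Boost.Xpressive correctly?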
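As a rough illustration of storing references instead of copies, the sketch below fills the question's three-column "matrix" with views into the input. It assumes C++17 and uses `std::string_view` as the standard-library counterpart of `boost::string_ref`. `Row` and `split_rows` are hypothetical names, not part of either post.

```cpp
#include <array>
#include <cstddef>
#include <string_view>
#include <vector>

// One row of the "matrix": three views into the input, no character copies.
using Row = std::array<std::string_view, 3>;

// Hypothetical helper: split ';'-separated lines into rows of views.
// `data` stands in for the mapped file and must outlive the returned rows.
std::vector<Row> split_rows(std::string_view data) {
    std::vector<Row> rows;
    while (!data.empty()) {
        const std::size_t nl = data.find('\n');
        std::string_view line = data.substr(0, nl);
        // Advance past this line (and its newline, if there is one)
        data.remove_prefix(nl == std::string_view::npos ? data.size() : nl + 1);

        Row row{};
        for (std::size_t k = 0; k < row.size(); ++k) {
            const std::size_t sep = line.find(';');
            row[k] = line.substr(0, sep);
            if (sep == std::string_view::npos) break; // no more fields
            line.remove_prefix(sep + 1);
        }
        rows.push_back(row);
    }
    return rows;
}
```

Each cell is just a pointer and a length, so building a row never allocates per field, unlike appending characters to a `std::string` one at a time. The views are only valid while the mapping stays open.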

Demo

Here is that Qi parser at work:

It parses a 994 MiB file of about 32 million lines in 2.9 seconds into a vector of
```
struct Line {
    boost::string_ref name, city;
    long id;
};
```
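Note that we parse the number, and store the strings by referring to their location in the memory map plus a length (string_ref).
It pretty-prints 10 random lines of the data.
If you reserve 32m elements in the vector up front, the run time drops to as little as 2.5 s; in that case the program performs only one memory allocation.
Note: on a 64-bit system, the in-memory representation grows larger than the input if the average line length is less than 40 bytes. This is because a string_ref is 16 bytes.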
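The notes above can be sketched without Boost. This is a hand-written stand-in, not the Qi parser used in the demo: `LineView`, `parse_line` and `payload_bytes` are hypothetical names, `std::string_view` replaces `boost::string_ref`, and `std::from_chars` (C++17) parses the number. Unlike the Qi grammar, it does not skip blanks around fields.

```cpp
#include <charconv>
#include <cstddef>
#include <optional>
#include <string_view>
#include <system_error>

// Standard-library stand-in for the answer's Line struct.
struct LineView {
    std::string_view name, city; // refer to the input buffer, no copy
    long id;                     // the number is parsed eagerly
};

// Hypothetical helper: parse one "name;id;city" line.
// Returns an empty optional when a separator is missing or the id is not a number.
std::optional<LineView> parse_line(std::string_view line) {
    const std::size_t first = line.find(';');
    if (first == std::string_view::npos) return std::nullopt;
    const std::size_t second = line.find(';', first + 1);
    if (second == std::string_view::npos) return std::nullopt;

    LineView out{line.substr(0, first), line.substr(second + 1), 0};
    const char* begin = line.data() + first + 1;
    const char* end = line.data() + second;
    const auto [ptr, ec] = std::from_chars(begin, end, out.id);
    if (ec != std::errc{} || ptr != end) return std::nullopt; // trailing junk or no digits
    return out;
}

// Bytes of element storage a vector needs for `lines` parsed records.
constexpr std::size_t payload_bytes(std::size_t lines) {
    return lines * sizeof(LineView);
}
```

On a typical 64-bit platform with 16-byte views and an 8-byte `long`, `sizeof(LineView)` comes to 16 + 16 + 8 = 40 bytes, which is presumably where the 40-byte threshold in the note comes from: once the average line is shorter than one record, the vector outgrows the file.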

Live On Coliru

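#include <boost/fusion/adapted/struct.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/iostreams/device/mapped_file.hpp>
#include <boost/utility/string_ref.hpp>
#include <cstdlib>   // rand
#include <iostream>  // std::cout
#include <iterator>  // std::distance
#include <vector>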

namespace qi = boost::spirit::qi;
using sref   = boost::string_ref;

namespace boost { namespace spirit { namespace traits {
    template <typename It>
    struct assign_to_attribute_from_iterators<sref, It, void> {
        static void call(It f, It l, sref& attr) { attr = { f, size_t(std::distance(f,l)) }; }
    };
} } }

struct Line {
    sref name, city;
    long id;
};

BOOST_FUSION_ADAPT_STRUCT(Line, (sref,name)(long,id)(sref,city))

int main() {
    boost::iostreams::mapped_file_source mmap("input.txt");

    using namespace qi;

    std::vector<Line> parsed;
    parsed.reserve(32000000);
    if (phrase_parse(mmap.begin(), mmap.end(), 
                omit[+graph] >> eol >>
                (raw[*~char_(";\r\n")] >> ';' >> long_ >> ';' >> raw[*~char_(";\r\n")]) % eol,
                qi::blank, parsed))
    {
        std::cout << "Parsed " << parsed.size() << " lines\n";
    } else {
        std::cout << "Failed after " << parsed.size() << " lines\n";
    }

    std::cout << "Printing 10 random items:\n";
    for(int i=0; i<10 && !parsed.empty(); ++i) { // skip when nothing was parsed: rand() % 0 is undefined
        auto& line = parsed[rand() % parsed.size()];
        std::cout << "city: '" << line.city << "', id: " << line.id << ", name: '" << line.name << "'\n";
    }
}

The input was generated with something like

do grep -v "'" /etc/dictionaries-common/words | sort -R | xargs -d\\n -n 3 | while read a b c; do echo "$a $b;$RANDOM;$c"; done

The output is, for example,

Parsed 31609499 lines
Printing 10 random items:
city: 'opted', id: 14614, name: 'baronets theosophy'
city: 'denominated', id: 24260, name: 'insignia ophthalmic'
city: 'mademoiselles', id: 10791, name: 'smelter orienting'
city: 'ducked', id: 32155, name: 'encircled flippantly'
city: 'garotte', id: 3080, name: 'keeling South'
city: 'emirs', id: 14511, name: 'Aztecs vindicators'
city: 'characteristically', id: 5473, name: 'constancy Troy'
city: 'savvy', id: 3921, name: 'deafer terrifically'
city: 'misfitted', id: 14617, name: 'Eliot chambray'
city: 'faceless', id: 24481, name: 'shade forwent'

This Q&A on a fast way to read a mapped file into a matrix in C++ is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/28340430/
