c++ - How can I make my split work only on one real line and be able to skip quoted parts of a string?

Tags: c++ string parsing boost split

So we have a simple split:

#include <iostream>
#include <string>
#include <vector>
#include <algorithm>
#include <iterator>
using namespace std;

vector<string> split(const string& s, const string& delim, const bool keep_empty = true) {
    vector<string> result;
    if (delim.empty()) {
        result.push_back(s);
        return result;
    }
    string::const_iterator substart = s.begin(), subend;
    while (true) {
        subend = search(substart, s.end(), delim.begin(), delim.end());
        string temp(substart, subend);
        if (keep_empty || !temp.empty()) {
            result.push_back(temp);
        }
        if (subend == s.end()) {
            break;
        }
        substart = subend + delim.size();
    }
    return result;
}

or boost split. And we have a simple main like:

int main() {
    const vector<string> words = split("close no \"\n matter\" how \n far", " ");
    copy(words.begin(), words.end(), ostream_iterator<string>(cout, "\n"));
}

How can we make it output something like

close 
no
"\n matter"
how
end symbol found.

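We want to introduce into split two things: structures that must be kept unsplit (quoted parts), and characters that end the parsing process (a real line end). How can this be done?

Best answer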

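Updated: By way of "thank you" for awarding me the bonus, I implemented 4 features that I had initially skipped as "You Ain't Gonna Need It" (YAGNI).

  1. Now supports partially quoted columns

    This is the problem you reported: e.g. with a delimiter , only test,"one,two",three would be valid, not test,one","two","three. Now both are accepted.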

  2. Now supports custom delimiter expressions

    You could only specify single characters as delimiters. Now you can specify any Spirit Qi parser expression as the delimiter rule. E.g.

      splitInto(input, output, ' ');                           // single space
      splitInto(input, output, +qi::lit(' '));                 // one or more spaces
      splitInto(input, output, +qi::char_(" \t"));             // one or more spaces or tabs
      splitInto(input, output, qi::double_ >> !qi::lit('#'));  // -- any parse expression
    

    Note this changes behaviour for the default overload

    The old version treated repeated spaces as a single delimiter by default. You now have to explicitly specify that (2nd example) if you want it.

  3. Now supports quotes ("") inside quoted values (instead of just making them disappear)

    See the code sample. Quite simple of course. Note that the sequence "" outside a quoted construct still represents the empty string (for compatibility with e.g. existing CSV output formats which quote empty strings redundantly)

  4. Now supports boost ranges in addition to containers as input (e.g. char[])

    Well, you ain't gonna need it (but it was rather handy for me in order to just be able to write splitInto("a char array", ...) :)

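As I half expected, you were going to need partially quoted fields (see your comment [1]). Well, here you are (the bottleneck was getting it to work consistently across different versions of Boost).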
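For readers who want to see the quoting rules in isolation, here is a minimal hand-written sketch in standard C++ only. It is not the Spirit grammar from this answer: split_quoted is a hypothetical helper, it takes a single delimiter character, and it only scans the first line. It does follow the rules described above: a quoted section may start in the middle of a field, delimiters and line ends inside quotes are literal, and "" inside a quoted section yields one literal quote.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of the answer's code): splits the first line
// of `s` on `delim`, keeping quoted sections intact and dropping the quotes.
std::vector<std::string> split_quoted(const std::string& s, char delim = ' ')
{
    std::vector<std::string> fields;
    std::string current;
    bool in_quotes = false;

    for (std::size_t i = 0; i < s.size(); ++i) {
        const char ch = s[i];
        if (in_quotes) {
            if (ch != '"') {
                current += ch;          // delimiters and line ends are literal here
            } else if (i + 1 < s.size() && s[i + 1] == '"') {
                current += '"';         // "" inside quotes is one literal quote
                ++i;
            } else {
                in_quotes = false;      // closing quote
            }
        } else if (ch == '"') {
            in_quotes = true;           // opening quote, possibly mid-field
        } else if (ch == delim) {
            fields.push_back(current);  // end of field (may be empty)
            current.clear();
        } else if (ch == '\n' || ch == '\r') {
            break;                      // an unquoted line end stops the scan
        } else {
            current += ch;
        }
    }
    fields.push_back(current);          // last field of the line
    return fields;
}
```

With the input from the question, this yields close, no, the quoted field (newline included), how, and then one empty field, because a delimiter directly precedes the unquoted line end. Unlike the output asked for in the question, the surrounding quotes are dropped, which is also what the Spirit version does.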

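Introduction

Random notes and observations for the reader:

  • The splitInto template function happily supports whatever you throw at it:

    • input from a vector or std::string or std::wstring
    • output to -- some combinations shown in the demo --
      • vector<string> (all lines flattened)
      • vector<vector<string>> (tokens per line)
      • list<list<string>> (if you prefer)
      • set<set<string>> (unique linewise token sets)
      • ...any container you can dream up
  • For demo purposes, it shows off karma output generation (especially the handling of nested containers)
    • Note: \n is shown as ? in the output to make it easier to read (safechars)
  • Complete with handy plumbing for new Spirit users (legible rule naming, a commented-out DEBUG define in case you want to play around with things)
  • You can specify any Spirit parse expression to match the delimiter. This means that by passing +qi::lit(' ') instead of the default (' '), you will skip empty fields (i.e. repeated delimiters)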
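The last point, the difference between a single delimiter and "one or more" delimiters, can be sketched without Spirit. The helper below is hypothetical (split_plain and its collapse flag are not part of the answer's code) and ignores quoting entirely; it only illustrates why repeated delimiters produce empty fields in one mode and not in the other.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper, for illustration only:
//   collapse == false  behaves like a single-character delimiter (' ')
//   collapse == true   behaves like "one or more" of it (+qi::lit(' '))
std::vector<std::string> split_plain(const std::string& s, char delim, bool collapse)
{
    std::vector<std::string> fields;
    std::size_t start = 0;
    while (true) {
        const std::size_t end = s.find(delim, start);
        if (end == std::string::npos) {
            fields.push_back(s.substr(start));            // last field
            break;
        }
        fields.push_back(s.substr(start, end - start));   // may be empty
        start = end + 1;
        if (collapse) {
            // Swallow the rest of a run of delimiters as one separator.
            while (start < s.size() && s[start] == delim)
                ++start;
        }
    }
    return fields;
}
```

For the input "a  b c" (two spaces after a), the single-delimiter mode gives four fields with an empty one in second position, while the collapsing mode gives the three fields a, b and c.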

Versions required/tested

This was compiled using

  • gcc 4.4.5,
  • gcc 4.5.1 and
  • gcc 4.6.1.

It works (tested) against

  • boost 1.42.0 (possibly earlier versions too) all the way through
  • boost 1.47.0.

Note: The flattening of output containers only seems to work for Spirit V2.5 (boost 1.47.0).
(This might be something as simple as needing an extra include for older versions?)

The code!

//#define BOOST_SPIRIT_DEBUG
#define BOOST_SPIRIT_DEBUG_PRINT_SOME 80

// YAGNI #4 - support boost ranges in addition to containers as input (e.g. char[])
#define SUPPORT_BOOST_RANGE // our own define for splitInto
#include <boost/spirit/include/lex_lexertl.hpp>
#include <boost/spirit/include/qi.hpp>
#include <boost/spirit/include/karma.hpp>
#include <boost/spirit/include/phoenix.hpp> // for pre 1.47.0 boost only
#include <boost/spirit/version.hpp>
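#include <boost/range.hpp> // boost::begin/end, range_const_iterator (SUPPORT_BOOST_RANGE)
#include <iostream>
#include <list>
#include <set>
#include <sstream>
#include <string>
#include <vector>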

namespace /*anon*/
{
    namespace phx=boost::phoenix;
    namespace qi =boost::spirit::qi;
    namespace karma=boost::spirit::karma;

    template <typename Iterator, typename Output> 
        struct my_grammar : qi::grammar<Iterator, Output()>
    {
        typedef qi::rule<Iterator> delim_t;

        //my_grammar(delim_t const& _delim) : delim(_delim),
        // the base class is always initialized before members, so list it first
        my_grammar(delim_t _delim) : my_grammar::base_type(rule, "quoted_delimited"),
            delim(_delim)
        {
            using namespace qi;

            noquote = char_ - '"';
            plain   = +((!delim) >> (noquote - eol));
            quoted  = lit('"') > *(noquote | '"' >> char_('"')) > '"';

#if SPIRIT_VERSION >= 0x2050 // boost 1.47.0
            mixed   = *(quoted|plain);
#else
            // manual folding
            mixed   = *( (quoted|plain) [_a << _1]) [_val=_a.str()];
#endif

            // you gotta love simple truths:
            rule    = mixed % delim % eol;

            BOOST_SPIRIT_DEBUG_NODE(rule);
            BOOST_SPIRIT_DEBUG_NODE(plain);
            BOOST_SPIRIT_DEBUG_NODE(quoted);
            BOOST_SPIRIT_DEBUG_NODE(noquote);
            BOOST_SPIRIT_DEBUG_NODE(delim);
        }

      private:
        qi::rule<Iterator>                  delim;
        qi::rule<Iterator, char()>          noquote;
#if SPIRIT_VERSION >= 0x2050 // boost 1.47.0
        qi::rule<Iterator, std::string()>   plain, quoted, mixed;
#else
        qi::rule<Iterator, std::string()>   plain, quoted;
        qi::rule<Iterator, std::string(), qi::locals<std::ostringstream> > mixed;
#endif
        qi::rule<Iterator, Output()> rule;
    };
}

template <typename Input, typename Container, typename Delim>
    bool splitInto(const Input& input, Container& result, Delim delim)
{
#ifdef SUPPORT_BOOST_RANGE
    typedef typename boost::range_const_iterator<Input>::type It;
    It first(boost::begin(input)), last(boost::end(input));
#else
    typedef typename Input::const_iterator It;
    It first(input.begin()), last(input.end());
#endif

    try
    {
        my_grammar<It, Container> parser(delim);

        bool r = qi::parse(first, last, parser, result);

        r = r && (first == last);

        if (!r)
            std::cerr << "parsing failed at: \"" << std::string(first, last) << "\"\n";
        return r;
    }
    catch (const qi::expectation_failure<It>& e)
    {
        std::cerr << "FIXME: expected " << e.what_ << ", got '";
        std::cerr << std::string(e.first, e.last) << "'" << std::endl;
        return false;
    }
}

template <typename Input, typename Container>
    bool splitInto(const Input& input, Container& result)
{
    return splitInto(input, result, ' '); // default space delimited
}


/********************************************************************
 * replaces '\n' character by '?' so that the demo output is more   *
 * comprehensible (see when a \n was parsed and when one was output *
 * deliberately)                                                    *
 ********************************************************************/
void safechars(char& ch)
{
    switch (ch) { case '\r': case '\n': ch = '?'; break; }
}

int main()
{
    using namespace karma; // demo output generators only :)
    std::string input;

#if SPIRIT_VERSION >= 0x2050 // boost 1.47.0
    // sample invocation: simple vector of elements in order - flattened across lines
    std::vector<std::string> flattened;

    input = "actually on\ntwo lines";
    if (splitInto(input, flattened))
        std::cout << format(*char_[safechars] % '|', flattened) << std::endl;
#endif
    std::list<std::set<std::string> > linewise, custom;

    // YAGNI #1 - now supports partially quoted columns
    input = "partially q\"oute\"d columns";
    if (splitInto(input, linewise))
        std::cout << format(( "set[" << ("'" << *char_[safechars] << "'") % ", " << "]") % '\n', linewise) << std::endl;

    // YAGNI #2 - now supports custom delimiter expressions
    input="custom delimiters: 1997-03-14 10:13am"; 
    if (splitInto(input, custom, +qi::char_("- 0-9:"))
     && splitInto(input, custom, +(qi::char_ - qi::char_("0-9"))))
        std::cout << format(( "set[" << ("'" << *char_[safechars] << "'") % ", " << "]") % '\n', custom) << std::endl;

    // YAGNI #3 - now supports quotes ("") inside quoted values (instead of just making them disappear)
    input = "would like ne\"\"sted \"quotes like \"\"\n\"\" that\"";
    custom.clear();
    if (splitInto(input, custom, qi::char_("() ")))
        std::cout << format(( "set[" << ("'" << *char_[safechars] << "'") % ", " << "]") % '\n', custom) << std::endl;

    return 0;
}

Output

The output of the sample looks like this:

actually|on|two|lines
set['columns', 'partially', 'qouted']
set['am', 'custom', 'delimiters']
set['', '03', '10', '13', '14', '1997']
set['like', 'nested', 'quotes like "?" that', 'would']

Updated output for your previously failing test case:

--server=127.0.0.1:4774/|--username=robota|--userdescr=robot A ? I am cool robot ||--robot|>|echo.txt

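[1] I must admit I had a good laugh when I read that "it crashed" [sic]. That sounds a lot like my end users. Just to be precise: a crash is an unrecoverable application failure. What you ran into was a handled error, and from your point of view no more than "unexpected behaviour". Anyway, that's fixed now :)

Regarding "c++ - How can I make my split work only on one real line and be able to skip quoted parts of a string?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/7436481/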
