c++ - Parsing HTML escape sequences with Boost.Spirit

I am trying to parse text that contains HTML escape sequences, and I want to replace these escapes with their UTF-8 equivalents:

&nbsp; - 0xC2A0 utf8 representation
&shy; - 0xC2AD utf8 representation

I wrote this grammar to do it:

template <typename Iterator>
struct HTMLEscape_grammar : qi::grammar<Iterator, std::string()>
{
    HTMLEscape_grammar() :
        HTMLEscape_grammar::base_type(text)
    {
        htmlescapes.add("&nbsp;", 0xC2AD);
        htmlescapes.add("&shy;", 0xC2AD);

        text = +((+(qi::char_ - htmlescapes)) | htmlescapes);
    }

private:
    qi::symbols<char, uint32_t> htmlescapes;
    qi::rule<Iterator, std::string()> text;
};

But when we parse:

std::string l_test = "test&shy;as test simple&shy;test";
HTMLEscape_grammar<std::string::const_iterator> l_gramar;

std::string l_ast;
bool result = qi::parse(l_test.begin(), l_test.end(), l_gramar, l_ast);

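we do not get a UTF-8 string. The 0xC2 part of each UTF-8 sequence is simply cut off, and we end up with a plain ASCII string. This parser is a building block of a larger system, so UTF-8 output is required.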

Best answer

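I don't know how you figured that exposing a uint32_t would magically output a Unicode code point, let alone that something would magically perform the UTF-8 encoding for it.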
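The cut-off 0xC2 described in the question is consistent with a plain integer-to-char narrowing when the uint32_t value ends up in a std::string. The sketch below only illustrates that narrowing. It does not use Boost.Spirit, and `append_narrowed` is a hypothetical helper name, not part of any library.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper (illustration only): appends a 32-bit value to a byte
// string the way a plain integer-to-char conversion would. Only the low
// 8 bits survive; every higher byte is discarded.
inline void append_narrowed(std::string& out, std::uint32_t value) {
    out.push_back(static_cast<char>(value & 0xFF));
}
```

Calling `append_narrowed(s, 0xC2AD)` on an empty string leaves a single byte, 0xAD, in `s`, which matches the symptom the question reports.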

Now let me get this straight. You want the selected HTML entity references to be replaced by 슭 (HANGUL SYLLABLE SEULG), which is what U+C2AD is. In UTF-8, that would be 0xEC 0x8A 0xAD.
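To see where those three bytes come from, here is a minimal sketch of UTF-8 encoding for a single code point. `encode_utf8` is a hypothetical helper written for this illustration, not something Boost.Spirit provides, and it assumes the input is a valid code point (it does not reject surrogates or values above U+10FFFF).

```cpp
#include <string>

// Hypothetical helper (illustration only): encodes one Unicode code point
// as a UTF-8 byte sequence. No validation of surrogates or out-of-range
// values is performed.
inline std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

With this helper, `encode_utf8(0xC2AD)` yields the bytes EC 8A AD. The byte pairs listed in the question, C2 A0 and C2 AD, are what you get from `encode_utf8(0xA0)` (NO-BREAK SPACE) and `encode_utf8(0xAD)` (SOFT HYPHEN), which is presumably what was actually intended.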

Just do the encoding yourself (you are writing an output stream of UTF-8 code units anyway):


#include <boost/spirit/include/qi.hpp>
#include <iomanip>
#include <iostream>
#include <string>
#include <vector>

namespace qi = boost::spirit::qi;

template <typename Iterator>
struct HTMLEscape_grammar : qi::grammar<Iterator, std::string()>
{
    HTMLEscape_grammar() :
        HTMLEscape_grammar::base_type(text)
    {
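        // Each entity maps to the UTF-8 code units of U+C2AD (EC 8A AD),
        // stored as raw bytes so they are appended to the output unchanged.
        htmlescapes.add("&nbsp;", { '\xEC', '\x8A', '\xAD' });
        htmlescapes.add("&shy;",  { '\xEC', '\x8A', '\xAD' });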

        text = *(htmlescapes | qi::char_);
    }

private:
    qi::symbols<char, std::vector<char> > htmlescapes;
    qi::rule<Iterator, std::string()> text;
};

int main() {
    std::string const l_test = "test&shy;as test simple&shy;test";
    HTMLEscape_grammar<std::string::const_iterator> l_gramar;

    std::string l_ast;
    bool result = qi::parse(l_test.begin(), l_test.end(), l_gramar, l_ast);

    if (result) {
        std::cout << "Parse success\n";
        for (unsigned char ch : l_ast)
            std::cout << std::setw(2) << std::setfill('0') << std::hex << std::showbase << static_cast<int>(ch) << " ";
    } else
    {
        std::cout << "Parse failure\n";
    }
}

Prints

Parse success
0x74 0x65 0x73 0x74 0xec 0x8a 0xad 0x61 0x73 0x20 0x74 0x65 0x73 0x74 0x20 0x73 0x69 0x6d 0x70 0x6c 0x65 0xec 0x8a 0xad 0x74 0x65 0x73 0x74

This Q&A on parsing HTML escape sequences with Boost.Spirit comes from a question on Stack Overflow: https://stackoverflow.com/questions/30482582/
