c++ - Can boost::ireplace treat special characters like their base characters? (e.g. 'ź' as 'z')

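I'm writing a word filter that replaces bad words with asterisks. If people use special characters such as ąężźć, though, a single word can be spelled in a huge number of ways.

How can I make boost::ireplace_all treat those characters as the base characters aezzc?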

So, would

boost::ireplace_all("żąć", "a", "*"); and boost::ireplace_all("zac", "a", "*");

produce ż*ć and z*c, respectively?

Edit / extended example:

const std::set<std::string> badwords =
{
    "<not nice word>",
    "<another not nice word>"
};

void FilterBadWords(std::string& s)
{
    for (auto &badword : badwords)
        boost::ireplace_all(s, badword, "*");
}


int main()
{
    std::string a("hello you <not nice word> person");
    std::string b("hęlló you <nót Nićę wórd> person");
    FilterBadWords(a);
    FilterBadWords(b);
    //a equals "hello you * person"
    //b equals "hęlló you * person"
    //or as many * as the replaced string length, both are fine
}

Best answer

Boost Locale supports primary collation through ICU:

http://www.boost.org/doc/libs/1_53_0/libs/locale/doc/html/collation.html

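Getting it to work turned out to be quite tricky. Basically, it is a no-go with char strings, because Boost String Algorithm knows nothing about code points and simply iterates over (and compares) the input sequences byte by byte (well, char by char, but that is slightly confusing here).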
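To make the byte-versus-code-point problem concrete, here is a minimal standard-library sketch. The helper names count_code_points and contains_byte_sequence are hypothetical and not part of Boost; the code assumes well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <string>

// Counts Unicode code points in a UTF-8 string by skipping continuation
// bytes (those of the form 10xxxxxx). Assumes well-formed UTF-8 input.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) ++n;
    }
    return n;
}

// Plain byte-wise search, which is all a char-based algorithm can do.
bool contains_byte_sequence(const std::string& haystack,
                            const std::string& needle) {
    return haystack.find(needle) != std::string::npos;
}
```

The UTF-8 encoding of "żąć" is six bytes for three code points, and none of those bytes equals 'a', so a char-by-char comparison never gets the chance to treat 'ą' as 'a'.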

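Hence the solution lies in converting to UTF-32 strings (with GCC this is possible using std::wstring, because wchar_t is 32 bits wide there). UTF-16 should generally "work" too, but it still has the traversal pitfall I just outlined, only more rarely.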
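The UTF-16 caveat can be illustrated with a short sketch. The helper names is_high_surrogate and utf16_code_points are hypothetical, and the code assumes well-formed UTF-16 input.

```cpp
#include <cstddef>
#include <string>

// True if the UTF-16 code unit is the first half of a surrogate pair.
bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// Counts code points in UTF-16 by treating a high surrogate plus the
// following unit as one character. Assumes well-formed UTF-16 input.
std::size_t utf16_code_points(const std::u16string& s) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i])) ++i; // skip the paired low surrogate
        ++n;
    }
    return n;
}
```

A letter such as 'ź' (U+017A) fits in a single UTF-16 unit, so element-wise comparison happens to work for it. A character outside the Basic Multilingual Plane, such as U+1F600, takes two UTF-16 units but only one UTF-32 unit, which is why UTF-32 is the robust choice.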

Now, I created a quick-and-dirty custom Finder predicate:

template <typename CharT>
struct is_primcoll_equal
{
    is_primcoll_equal(const std::locale& Loc=std::locale()) :
        m_Loc(Loc), comp(Loc, boost::locale::collator_base::primary) {}

    template< typename T1, typename T2 >
        bool operator()(const T1& Arg1, const T2& Arg2) const {
            // TODO use `do_compare` methods on the collation itself that
            // don't construct basic_string<> instances
            return 0 == comp(std::basic_string<CharT>(1, Arg1), std::basic_string<CharT>(1, Arg2));
        }

  private:
    std::locale m_Loc;
    boost::locale::comparator<CharT> comp;
};

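It is extremely inefficient, because it constructs single-character strings on every invocation. That is because the do_compare methods are not part of the public API of collator<>. I leave deriving a custom collator<> and using that instead as an exercise for the reader.

Next, we mimic the replace_all interface by wrapping find_format_all instead: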

template<typename SequenceT, typename Range1T, typename Range2T>
inline void collate_replace_all(
        SequenceT& Input,
        const Range1T& Search,
        const Range2T& Format,
        const std::locale& Loc=std::locale() )
{
    ::boost::algorithm::find_format_all(
            Input,
            ::boost::algorithm::first_finder(Search, is_primcoll_equal<typename SequenceT::value_type>(Loc)),
            ::boost::algorithm::const_formatter(Format) );
}
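For readers without an ICU-enabled Boost build, the same idea (search with a custom element-equality predicate, then substitute a constant) can be sketched with the standard library alone. Everything here is hypothetical and for illustration only: fold_base is a tiny hand-written table that stands in for real primary-level collation, and replace_all_if is a made-up name, not a Boost function.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical stand-in for primary-level collation: folds ASCII case and
// maps a handful of Polish letters to their base letters. A real collator
// (for example ICU via Boost.Locale) covers far more than this table.
char32_t fold_base(char32_t c) {
    if (c >= U'A' && c <= U'Z') return static_cast<char32_t>(c - U'A' + U'a');
    switch (c) {
        case U'\u0105': case U'\u0104': return U'a'; // ą Ą
        case U'\u0107': case U'\u0106': return U'c'; // ć Ć
        case U'\u0119': case U'\u0118': return U'e'; // ę Ę
        case U'\u0142': case U'\u0141': return U'l'; // ł Ł
        case U'\u0144': case U'\u0143': return U'n'; // ń Ń
        case U'\u00F3': case U'\u00D3': return U'o'; // ó Ó
        case U'\u015B': case U'\u015A': return U's'; // ś Ś
        case U'\u017A': case U'\u0179': return U'z'; // ź Ź
        case U'\u017C': case U'\u017B': return U'z'; // ż Ż
        default: return c;
    }
}

// Element-wise equality after folding; plays the role of the predicate.
bool base_equal(char32_t a, char32_t b) { return fold_base(a) == fold_base(b); }

// Replaces every match of `search` in `input` with `format`, where two
// elements match when `equal` says so. Returns the rewritten string.
template <typename Pred>
std::u32string replace_all_if(const std::u32string& input,
                              const std::u32string& search,
                              const std::u32string& format,
                              Pred equal) {
    if (search.empty()) return input; // nothing sensible to search for
    std::u32string out;
    auto pos = input.begin();
    while (true) {
        auto hit = std::search(pos, input.end(),
                               search.begin(), search.end(), equal);
        out.append(pos, hit);          // copy the unmatched stretch
        if (hit == input.end()) break; // no further match
        out += format;
        pos = hit + static_cast<std::ptrdiff_t>(search.size());
    }
    return out;
}
```

Calling replace_all_if(U"zac", U"a", U"*", base_equal) yields U"z*c". A one-to-one table like this cannot handle cases where one letter collates as several (or several as one), which a real collator does handle.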

Now we just need the string widening conversions and we're good to go:

void FilterBadWords(std::string& s) {
    using namespace boost::locale::conv;

    std::wstring widened = utf_to_utf<wchar_t>(s, stop);

    for (auto& badword : badwords) {
        detail::collate_replace_all(widened, badword, L"*"/*, loc*/);
    }

    s = utf_to_utf<char>(widened, stop);
}
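To show what the widening and narrowing steps do, here is a hand-written sketch of a UTF-8 to UTF-32 round trip. The function names utf8_to_utf32 and utf32_to_utf8 are hypothetical, and the sketch uses std::u32string instead of std::wstring so that it does not depend on the platform's wchar_t width. It rejects truncated or structurally malformed input but, unlike a production converter, it does not check for overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Decodes UTF-8 into UTF-32. Throws std::invalid_argument on a bad lead
// byte, a bad continuation byte, or a truncated sequence.
std::u32string utf8_to_utf32(const std::string& in) {
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len = 0;
        char32_t cp = 0;
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1Fu; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0Fu; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07u; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (i + len > in.size())
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = static_cast<char32_t>((cp << 6) | (c & 0x3Fu));
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Encodes UTF-32 as UTF-8. Assumes every code point is at most U+10FFFF.
std::string utf32_to_utf8(const std::u32string& in) {
    std::string out;
    for (char32_t cp : in) {
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

After decoding, each element of the string is one whole code point, so an element-wise predicate can compare 'ę' with 'e' directly. Encoding the result again restores an ordinary UTF-8 std::string.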

Full program

Live Broken¹ On Coliru

#include <boost/algorithm/string/replace.hpp>
#include <boost/locale.hpp>
#include <cstdint>   // uint32_t, used by the static_assert below
#include <iostream>
#include <locale>
#include <set>
#include <string>

const std::set<std::string> badwords =
{
    "<not nice word>", 
    "<another not nice word>" 
};

namespace detail {
    template <typename CharT>
    struct is_primcoll_equal
    {
        is_primcoll_equal(const std::locale& Loc=std::locale()) :
            m_Loc(Loc), comp(Loc, boost::locale::collator_base::primary) {}

        template< typename T1, typename T2 >
            bool operator()(const T1& Arg1, const T2& Arg2) const {
                // assert(0 == comp(L"<not nice word>", L"<nót Nićę wórd>"));
                // TODO use `do_compare` methods on the collation itself that
                // don't construct basic_string<> instances
                return 0 == comp(std::basic_string<CharT>(1, Arg1), std::basic_string<CharT>(1, Arg2));
            }

      private:
        std::locale m_Loc;
        boost::locale::comparator<CharT> comp;
    };

    template<typename SequenceT, typename Range1T, typename Range2T>
        inline void collate_replace_all( 
                SequenceT& Input,
                const Range1T& Search,
                const Range2T& Format,
                const std::locale& Loc=std::locale() )
        {
            ::boost::algorithm::find_format_all( 
                    Input, 
                    ::boost::algorithm::first_finder(Search, is_primcoll_equal<typename SequenceT::value_type>(Loc)),
                    ::boost::algorithm::const_formatter(Format) );
        }
}

void FilterBadWords(std::string& s) {
    using namespace boost::locale::conv;

    std::wstring widened = utf_to_utf<wchar_t>(s, stop);

    for (auto& badword : badwords) {
        detail::collate_replace_all(widened, badword, L"*"/*, loc*/);
    }

    s = utf_to_utf<char>(widened, stop);
}

static_assert(sizeof(wchar_t) == sizeof(uint32_t), "Required for robustness (surrogate pairs, anyone?)");

int main()
{
    auto loc = boost::locale::generator().generate("");
    std::locale::global(loc);

    std::string a("hello you <not nice word> person");
    std::string b("hęlló you <nót Nićę wórd> person");

    FilterBadWords(a);
    FilterBadWords(b);
    std::cout << a << "\n";
    std::cout << b << "\n";
}

Output

On my system:

hello you * person
hęlló you * person

¹ Apparently, locale support in the Coliru execution environment is incomplete.

Source: this question, "c++ - Can boost::ireplace treat special characters like their base characters? (e.g. 'ź' as 'z')", comes from Stack Overflow: https://stackoverflow.com/questions/27883220/
