c++ - Is std::regex always locale-aware?

Tags: c++ c++11

According to the std::basic_regex reference, one of the flags for the constructor of a std::regex is collate, which specifies that:

Character ranges of the form "[a-b]" will be locale sensitive.

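This indicates, to me, that std::regex is not, by default, (entirely) locale-aware. I can't find anything that explicitly claims it is locale-aware, but we do have std::regex_traits, which suggests that some locale awareness is going on.

How locale-aware is std::regex? Is it possible to read a UTF-8 string and store it in a plain std::string and just use regex classes such as [:w:] and [:punct:]? Specifically, [:w:] might be a problem. [:punct:] is not important.

This is for a C++ library that must run on MacOS (which has UTF-8 locales) and Windows (which, as far as I know, does not).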

Best answer

one of the flags for the constructor of a std::regex is collate, which specifies that:

Character ranges of the form "[a-b]" will be locale sensitive.

For a comprehensive explanation, see Regexp Ranges and Locales: A Long Sad Story:

However, the standard changed the interpretation of range expressions. In the "C" and "POSIX" locales, a range expression like ‘[a-dx-z]’ is still equivalent to ‘[abcdxyz]’, as in ASCII. But outside those locales, the ordering was defined to be based on collation order.

What does that mean? In many locales, ‘A’ and ‘a’ are both less than ‘B’. In other words, these locales sort characters in dictionary order, and ‘[a-dx-z]’ is typically not equivalent to ‘[abcdxyz]’; instead, it might be equivalent to ‘[ABCXYabcdxyz]’, for example.

This point needs to be emphasized: much literature teaches that you should use ‘[a-z]’ to match a lowercase character. But on systems with non-ASCII locales, this also matches all of the uppercase characters except ‘A’ or ‘Z’! This was a continuous cause of confusion, even well into the twenty-first century.
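To make the effect of the collate flag concrete, here is a minimal sketch. The helper names matches_with_collation and check_classic_ranges are hypothetical and exist only for illustration. The checks run against the classic "C" locale only, where ranges follow ASCII order as the quote describes. What a named locale such as "en_US.UTF-8" does with the same range depends on the platform and the standard library, so it is not asserted here.

```cpp
#include <cassert>
#include <locale>
#include <regex>
#include <string>

// Hypothetical helper: match `text` against `pattern` with the collate flag
// set, so that ranges like [a-d] are evaluated using the collation order of
// `loc` instead of raw character codes.
bool matches_with_collation(const std::string& pattern,
                            const std::string& text,
                            const std::locale& loc) {
    std::regex re;
    re.imbue(loc);  // must happen before the pattern is assigned
    re.assign(pattern, std::regex::ECMAScript | std::regex::collate);
    return std::regex_match(text, re);
}

// Usage example: in the classic "C" locale, [a-d] behaves like [abcd].
void check_classic_ranges() {
    const std::locale& c_locale = std::locale::classic();
    assert(matches_with_collation("[a-d]", "b", c_locale));
    assert(!matches_with_collation("[a-d]", "B", c_locale));
    assert(!matches_with_collation("[a-d]", "e", c_locale));
}
```

Passing a locale built with std::locale("...") instead of std::locale::classic() is how you would probe the dictionary-order behavior on a given system. That constructor throws std::runtime_error if the named locale is not installed.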


This indicates, to me, that std::regex is not, by default, (entirely) locale-aware.

Not quite.

The Modified ECMAScript regular expression grammar says:

Character classes

...

The exact meaning of each of these character class escapes in C++ is defined in terms of the locale-dependent named character classes, and not by explicitly listing the acceptable characters as in ECMAScript.

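In other words, character classes such as [:alpha:] are resolved through the locale imbued in the regex, which by default is the global locale in effect when the regex is constructed.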
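If relying on the global locale is undesirable, the locale can also be set per regex. The following is a minimal sketch. make_regex and check_alpha_class are hypothetical names, and the classic "C" locale is used only because its classification of ASCII letters is predictable everywhere.

```cpp
#include <cassert>
#include <locale>
#include <regex>
#include <string>

// Hypothetical helper: build a regex whose named character classes
// ([[:alpha:]], \w, ...) are looked up through `loc` rather than through
// whatever the global locale happens to be.
std::regex make_regex(const std::string& pattern, const std::locale& loc) {
    std::regex re;
    re.imbue(loc);       // imbue() resets the regex, so call it first
    re.assign(pattern);  // the pattern is compiled using `loc`
    return re;
}

// Usage example with the classic "C" locale.
void check_alpha_class() {
    const std::regex letters =
        make_regex("[[:alpha:]]+", std::locale::classic());
    assert(std::regex_match("abc", letters));
    assert(!std::regex_match("abc1", letters));
}
```

Note the order of the calls: imbue() discards any pattern that was already compiled, so the locale has to be set before assign().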


Is it possible to read a UTF-8 string and store it in a plain std::string and just use regex classes such as [:w:] and [:punct:]? Specifically, [:w:] might be a problem. [:punct:] is not important.

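std::string is not aware of what encoding its contents are in; they can be UTF-8 or any other encoding.

You need to decode the std::string into a std::wstring. One way is to use the facilities provided by std::codecvt_utf8, and then use std::wregex.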
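A minimal sketch of that approach follows, with several caveats. utf8_to_wide, is_single_word and check_utf8_word_matching are hypothetical names. std::wstring_convert and std::codecvt_utf8 are deprecated since C++17, so compilers may warn. On Windows wchar_t is 16 bits, so std::codecvt_utf8<wchar_t> only covers the Basic Multilingual Plane. Whether \w accepts a given non-ASCII letter still depends on the locale used by the regex, so the checks below only assert the decoding step and ASCII matching.

```cpp
#include <cassert>
#include <codecvt>
#include <locale>
#include <regex>
#include <string>

// Decode UTF-8 bytes into wide characters.
// Throws std::range_error if the input is not valid UTF-8.
std::wstring utf8_to_wide(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<wchar_t>> converter;
    return converter.from_bytes(utf8);
}

// Hypothetical helper: true if the decoded text is one run of \w characters.
bool is_single_word(const std::string& utf8) {
    static const std::wregex word(L"\\w+");
    return std::regex_match(utf8_to_wide(utf8), word);
}

// Usage example.
void check_utf8_word_matching() {
    const std::wstring wide = utf8_to_wide("h\xC3\xA9");  // "h" + U+00E9
    assert(wide.size() == 2);      // two characters, not three bytes
    assert(wide[1] == L'\u00E9');  // the two-byte sequence became one wchar_t
    assert(is_single_word("hello"));
    assert(!is_single_word("hello world"));
}
```

The point of decoding first is that the regex engine then works on whole characters rather than on individual UTF-8 bytes, which is what makes classes such as \w meaningful for non-ASCII text at all.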

Regarding "c++ - Is std::regex always locale-aware?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/48222974/
