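Good morning. We are trying to match the German string 'DAS tausendschöne Jungfräulein tausendschçne' using the C/C++ PCRE regular expression "\x{00F6}.*\x{00E4}.*\x{00E7}". The PCRE regular expression matches only once, at byte positions 14 and 43. Is our PCRE regular expression correct, or should it be corrected? Thank you.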
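Best answer
You're misinterpreting the returned data.
PCRE returns the start and end position of the match. It matched only once in each case, but the match covers the entire matched string, including the parts matched by "boring" things such as .*
So for your input string, it matched these parts: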
DAS tausendschöne Jungfräulein tausendschçne
..............mmmmmmmmmmmmmmmmmmmmmmmmmmmm..
Or, equivalently, it matched these bytes:
0         1         2         3         4
01234567890123456789012345678901234567890123456789
DAS tausendschöne Jungfräulein tausendschçne
..............mmmmmmmmmmmmmmmmmmmmmmmmmmmmmmm..
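To see where those offsets come from without involving PCRE itself, here is a minimal C++ sketch that hand-rolls the same search on the raw UTF-8 bytes. It assumes the subject is UTF-8 encoded and contains no newline (so that .* can span it). The names find_span and question_subject are hypothetical helpers for illustration, not part of the PCRE API.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// UTF-8 encodings of the three code points used in the pattern.
static const char* const kOUml = "\xC3\xB6";  // U+00F6 ö
static const char* const kAUml = "\xC3\xA4";  // U+00E4 ä
static const char* const kCCed = "\xC3\xA7";  // U+00E7 ç
static const std::size_t kSeqLen = 2;         // each of them is 2 bytes long

// Hypothetical stand-in for the greedy pattern \x{00F6}.*\x{00E4}.*\x{00E7}
// on a single-line UTF-8 subject. Returns {start, end} as byte offsets, with
// `end` one past the match (the ovector[0] / ovector[1] convention), or
// {npos, npos} when there is no match.
std::pair<std::size_t, std::size_t> find_span(const std::string& subject) {
    const std::size_t npos = std::string::npos;
    const std::size_t start = subject.find(kOUml);    // leftmost ö
    if (start == npos) return {npos, npos};
    const std::size_t last_c = subject.rfind(kCCed);  // greedy: last ç
    if (last_c == npos || last_c < start) return {npos, npos};
    const std::size_t a = subject.find(kAUml, start + kSeqLen);
    if (a == npos || a > last_c) return {npos, npos};  // need an ä in between
    assert(a > start && a < last_c);  // invariant: ö, then ä, then ç
    return {start, last_c + kSeqLen};
}

// The subject from the question, spelled out as UTF-8 bytes.
std::string question_subject() {
    return std::string("DAS tausendsch") + kOUml + "ne Jungfr" + kAUml +
           "ulein tausendsch" + kCCed + "ne";
}
```

Under those assumptions, find_span(question_subject()) yields the pair (14, 45): ö starts at byte 14, ç occupies bytes 43 and 44, and the end offset points one past it. That is a single match, reported as one start/end pair.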
This behavior is correct. From http://www.pcre.org/pcre.txt:
When a match is successful, information about captured substrings is returned in pairs of integers, starting at the beginning of ovector, and continuing up to two-thirds of its length at the most. The first element of each pair is set to the byte offset of the first character in a substring, and the second is set to the byte offset of the first character after the end of a substring. Note: these values are always byte offsets, even in UTF-8 mode. They are not character counts.
The first pair of integers, ovector[0] and ovector[1], identify the portion of the subject string matched by the entire pattern. The next pair is used for the first capturing subpattern, and so on.
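Because the returned values are byte offsets rather than character counts, it can help to convert them when comparing against a character-based view of the string. The following is a small sketch of one way to do that, assuming the subject is valid UTF-8 and the offset falls on a code point boundary. The function name byte_offset_to_char_index is made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Counts the UTF-8 code points that start before `byte_offset`.
// Continuation bytes have the bit pattern 10xxxxxx, so every other byte
// (ASCII or a lead byte) starts a new code point.
// Assumes `s` is valid UTF-8 and `byte_offset` lies on a code point boundary.
std::size_t byte_offset_to_char_index(const std::string& s,
                                      std::size_t byte_offset) {
    assert(byte_offset <= s.size());
    std::size_t count = 0;
    for (std::size_t i = 0; i < byte_offset; ++i) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) ++count;  // not a continuation byte
    }
    return count;
}
```

For the subject in the question, byte offset 14 maps to character index 14 (everything before ö is ASCII), while byte offset 45 maps to character index 42, because three two-byte characters precede it.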
Source: Stack Overflow, "c++ - Is it possible to construct a PCRE UTF-8 regular expression that matches 3 or more non-contiguous UTF code points": https://stackoverflow.com/questions/11226534/