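What are the consequences of inserting a positive lookbehind for n bytes, (?<=\C{n}), at the beginning of an arbitrary regular expression, particularly when the pattern is used for replace operations?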

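At least in PHP, the regex matching functions preg_match and preg_match_all allow matching to begin after a given byte offset. None of the other PCRE functions in PHP has a corresponding feature. With preg_replace, for example, you can limit the number of replacements, but you cannot require that the replaced matches occur after n bytes.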

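There will obviously be some (let's call them trivial) consequences for performance and readability, but would there be any non-trivial impact, such as matches becoming non-matches (other than those that are not offset by n bytes) or replacements becoming malformed?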

Some examples:

/some expression/ becomes /(?<=\C{4})some expression/ for a 4-byte offset

/(this) has (groups)/i becomes /(?<=\C{2})(this) has (groups)/i for a 2-byte offset

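As far as I can tell from the limited tests I have run, adding this lookbehind effectively simulates the offset parameter and does not interfere with any other lookbehinds, substitutions, or control patterns; but I am no regex expert.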
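A minimal sketch of the idea, for illustration only. The helper name with_byte_offset is hypothetical (it is not a PHP function), and it assumes non-UTF mode, '/' as the delimiter, and a pattern body with no leading verbs. It wraps the body in a non-capturing group, because without that a top-level alternation such as a|b would apply the lookbehind to the first alternative only.

```php
<?php
// Hypothetical helper (not part of PHP): prepend a byte-offset lookbehind
// to a pattern body. Assumes non-UTF mode and '/' as the delimiter.
function with_byte_offset(string $body, int $offset, string $flags = ''): string
{
    // (?:...) keeps a top-level alternation inside the offset restriction
    // and does not change the numbering of capture groups.
    return '/(?<=\C{' . $offset . '})(?:' . $body . ')/' . $flags;
}

$subject = 'cat cat cat';

// Only matches that start at byte 4 or later are replaced.
$replaced = preg_replace(with_byte_offset('cat', 4), 'dog', $subject);

// For comparison: preg_match with a real offset argument of 4.
preg_match('/cat/', $subject, $m, PREG_OFFSET_CAPTURE, 4);
$firstMatchOffset = $m[0][1];

var_dump($replaced);         // string(11) "cat dog dog"
var_dump($firstMatchOffset); // int(4)
```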

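I am trying to determine whether there are any possible consequences of building an extension to the replace/filter functions by inserting an n-byte lookbehind into the pattern. It should behave like the offset parameter of the matching functions, so simply running the expression against substr( $subject, $offset ) will not work, for the same reasons it does not work with preg_match (most notably, it cuts off any lookbehinds, and ^ then incorrectly matches the start of the substring rather than the start of the original string).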
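The difference between a real offset and substr() can be seen in a short experiment. This is only a sketch with a made-up two-byte subject; the variable names are illustrative.

```php
<?php
$subject = 'ab';

// Real offset: ^ still refers to the start of the original subject,
// so it cannot match at byte 1.
$anchorWithOffset = preg_match('/^b/', $subject, $m, 0, 1);

// substr(): the substring has a new start, so ^ matches there.
$anchorWithSubstr = preg_match('/^b/', substr($subject, 1));

// Real offset: the lookbehind can still see the byte before the offset.
$lookbehindWithOffset = preg_match('/(?<=a)b/', $subject, $m, 0, 1);

// substr(): the byte the lookbehind needs has been cut off.
$lookbehindWithSubstr = preg_match('/(?<=a)b/', substr($subject, 1));

// The inserted lookbehind agrees with the real offset in both cases.
$anchorWithInserted = preg_match('/(?<=\C{1})^b/', $subject);
$lookbehindWithInserted = preg_match('/(?<=\C{1})(?<=a)b/', $subject);

var_dump($anchorWithOffset, $anchorWithSubstr);         // int(0) int(1)
var_dump($lookbehindWithOffset, $lookbehindWithSubstr); // int(1) int(0)
var_dump($anchorWithInserted, $lookbehindWithInserted); // int(0) int(1)
```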

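Best answer

Short answer

In non-UTF mode, UTF-8 library

Assuming that the PCRE library bundled with PHP is compiled as the 8-bit library (UTF-8), then in non-UTF mode

\C

is equivalent to

[\x00-\xff]

and

(?s:.)

Any of them can be used inside a lookbehind as a substitute for the offset argument of preg_match and preg_match_all.

In non-UTF mode, each of them matches exactly 1 data unit, which is 1 byte in the 8-bit (UTF-8) PCRE library, and each of them matches all 256 possible values.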
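The equivalence can be checked with a quick sketch. The three-byte subject below is made up for the test: it contains a newline and a high byte, the two cases where a plain dot or a narrower class would differ.

```php
<?php
// Three bytes: an ASCII letter, a newline, and the high byte 0xFF.
$subject = "a\n\xFF";

// Count how many single data units each construct matches in non-UTF mode.
$counts = [];
foreach (['\C', '[\x00-\xff]', '(?s:.)'] as $atom) {
    $counts[$atom] = preg_match_all('/' . $atom . '/', $subject);
}

// A plain dot (no s flag) skips the newline, so it is not a substitute.
$plainDot = preg_match_all('/./', $subject);

var_dump($counts);   // every entry is int(3)
var_dump($plainDot); // int(2)
```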

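In UTF mode, UTF-8 library

UTF mode can be activated with the u flag in the pattern passed to the preg_* functions, or by specifying one of the (*UTF), (*UTF8), (*UTF16), (*UTF32) verbs at the start of the pattern.

In UTF mode, a character class [] and the dot metacharacter . match one code point within the valid range of Unicode characters that is not a surrogate. Since one code point can be encoded as 1 to 4 bytes in UTF-8, and because of the way the UTF-8 encoding scheme works, it is not possible to use a character class to match a single byte with a value in the range 0x80 to 0xFF.

Although \C is specifically designed to match one data unit (one byte in UTF-8) regardless of whether UTF mode is on, it is not supported inside a lookbehind in UTF mode.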
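A short sketch of the difference between the two modes, using a made-up subject. The @ operator is used here only to silence the compilation warning so that the return value can be inspected.

```php
<?php
// "é" encoded in UTF-8 is the two bytes C3 A9, followed by an ASCII "x".
$subject = "\xC3\xA9x";

// Non-UTF mode: the dot matches one byte at a time.
$bytesMatched = preg_match_all('/./', $subject);

// UTF mode: the dot matches one code point at a time.
$codePointsMatched = preg_match_all('/./u', $subject);

// Non-UTF mode: the byte lookbehind compiles, and "x" sits at byte offset 2.
$nonUtf = preg_match('/(?<=\C{2})x/', $subject);

// UTF mode: \C inside a lookbehind is rejected at compile time,
// so preg_match returns false instead of 0 or 1.
$utf = @preg_match('/(?<=\C{2})x/u', $subject);

var_dump($bytesMatched, $codePointsMatched); // int(3) int(2)
var_dump($nonUtf, $utf);                     // int(1) bool(false)
```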

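UTF-16 and UTF-32 libraries

I do not know whether anyone has actually compiled the 16-bit or 32-bit PCRE library, included it in a PHP build, and got it to work. If anyone knows of such a build being widely used in the wild, please ping me. I do not actually know how strings and offsets from PHP are passed to the C API of PCRE; depending on that, the results of the preg_* functions may differ.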

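More details

At the C API level of the PCRE library, you can only work in data units: 8 bits for the 8-bit library, 16 bits for the 16-bit library, and 32 bits for the 32-bit library.

For the 8-bit library (UTF-8), 1 data unit is 8 bits, or 1 byte, so there is little obstacle to specifying an offset in bytes, whether as an argument to a function or as a regex construct.

Regex constructs

In non-UTF mode, a character class [], the dot ., and \C each match exactly 1 data unit.

\C matches 1 data unit in both UTF mode and non-UTF mode. However, it cannot be used in a lookbehind in UTF mode.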

MATCHING A SINGLE DATA UNIT

Outside a character class, the escape sequence \C matches any one data unit, whether or not a UTF mode is set.
. matches 1 data unit in non-UTF mode.
General comments about UTF modes

[...]
1. The dot metacharacter matches one UTF character instead of a single data unit.
A character class matches 1 data unit in non-UTF mode. The documentation does not state this explicitly, but the wording implies it.

SQUARE BRACKETS AND CHARACTER CLASSES

[...]

A character class matches a single character in the subject. In a UTF mode, the character may be more than one data unit long.

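The same conclusion can be drawn from the upper limits of the \x{hh...} syntax, which specifies a character by its hexadecimal code, in non-UTF mode. From testing, the last clause about surrogates does not seem to apply in non-UTF mode.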
Characters that are specified using octal or hexadecimal numbers are limited to certain values, as follows:
```
 8-bit non-UTF mode    less than 0x100
 8-bit UTF-8 mode      less than 0x10ffff and a valid codepoint
 16-bit non-UTF mode   less than 0x10000
 16-bit UTF-16 mode    less than 0x10ffff and a valid codepoint
 32-bit non-UTF mode   less than 0x100000000
 32-bit UTF-32 mode    less than 0x10ffff and a valid codepoint
```
Invalid Unicode codepoints are the range 0xd800 to 0xdfff (the so-called "surrogate" codepoints), and 0xffef.

Offsets

All offsets, both supplied and returned, are in numbers of data units:

The string to be matched by pcre_exec()

The subject string is passed to pcre_exec() as a pointer in subject, a length in length, and a starting offset in startoffset. The units for length and startoffset are bytes for the 8-bit library, 16-bit data items for the 16-bit library, and 32-bit data items for the 32-bit library.

How pcre_exec() returns captured substrings

[...]

When a match is successful, information about captured substrings is returned in pairs of integers, starting at the beginning of ovector, and continuing up to two-thirds of its length at the most. The first element of each pair is set to the offset of the first character in a substring, and the second is set to the offset of the first character after the end of a substring. These values are always data unit offsets, even in UTF mode.
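On the PHP side, the same behaviour can be observed with PREG_OFFSET_CAPTURE. The sketch below assumes the usual 8-bit PCRE build and a made-up subject; it only shows that both the returned offsets and the offset argument are counted in bytes, even with the u flag.

```php
<?php
// "é" (two bytes in UTF-8: C3 A9) followed by "b".
$subject = "\xC3\xA9b";

// Returned offset: "b" is the second character but starts at byte 2.
preg_match('/b/u', $subject, $m, PREG_OFFSET_CAPTURE);
$returnedOffset = $m[0][1];

// Supplied offset: 2 is a byte count and lands exactly on "b".
$foundFromByteTwo = preg_match('/b/u', $subject, $m2, 0, 2);

var_dump($returnedOffset);   // int(2)
var_dump($foundFromByteTwo); // int(1)
```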

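This question about the consequences of inserting a positive lookbehind into an arbitrary regex to simulate a byte offset in PHP comes from a similar question on Stack Overflow: https://stackoverflow.com/questions/27183857/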
