c# - Regex not working with Unicode character ranges

Tags: c# .net regex unicode

NOTE

Another question, "C# Regular Expressions with \Uxxxxxxxx characters in the pattern", has already been asked. This question differs in that it is not about how surrogate pairs are calculated, but about how to express Unicode planes higher than 0 in a regex. It should be clear from my question that I already understand why these code points are being expressed as 2 characters: they are surrogate pairs (which is what the other question asks about). My question is how I can convert them generically (since I have no control over what the regex being fed to the program looks like) so they can be consumed by the .NET Regex engine.

Note I now have a way to do this and would like to add my answer to my question, but since this is now marked as a duplicate I cannot add my answer.

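I have some test data being passed to a Java library that I am porting to C#. I have isolated one particular problem case as an example. The original character class is in UTF-32 form, \U0001BCA0-\U0001BCA3, which .NET does not readily consume: we get an "Unrecognized escape sequence \U" error.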

I tried converting to UTF-16, and I have confirmed that the results for \U0001BCA0 and \U0001BCA3 are what should be expected.

UTF-32      | Codepoint   | High Surrogate  | Low Surrogate  | UTF-16
---------------------------------------------------------------------------
0x0001BCA0  | 113824      | 55343           | 56480          | \uD82F\uDCA0
0x0001BCA3  | 113827      | 55343           | 56483          | \uD82F\uDCA3
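The table values can be reproduced with a minimal sketch like the following. It relies on the framework method char.ConvertFromUtf32; the class name SurrogateDemo and the method ToUtf16Escapes are hypothetical names chosen for illustration only.

```csharp
using System;
using System.Text;

public static class SurrogateDemo
{
    // Hypothetical helper: returns the \uXXXX escape form of a code point
    // as it is encoded in UTF-16 (one escape per 16-bit code unit).
    public static string ToUtf16Escapes(int codePoint)
    {
        // ConvertFromUtf32 yields one char for the BMP and a surrogate pair
        // (two chars) for planes 1 through 16.
        string utf16 = char.ConvertFromUtf32(codePoint);

        var sb = new StringBuilder();
        foreach (char codeUnit in utf16)
        {
            sb.Append("\\u").Append(((int)codeUnit).ToString("X4"));
        }
        return sb.ToString();
    }
}
```

Calling SurrogateDemo.ToUtf16Escapes(0x1BCA0) gives the same \uD82F\uDCA0 pair listed in the first row of the table.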

However, when I pass the string "([\uD82F\uDCA0-\uD82F\uDCA3])" to the constructor of the Regex class, I get the exception "[x-y] range in reverse order".

Although the characters are clearly specified in the correct order (it works in Java), I tried them in reverse and got the same error message.

I also tried changing the UTF-32 characters from \U0001BCA0-\U0001BCA3 to \x01BCA0-\x01BCA3, but I still get the exception "[x-y] range in reverse order".

So, how do I get the .NET Regex class to successfully parse this character range?

NOTE: I tried changing the code to generate a regex character class that includes all of the characters instead of a range and it seems to work, but that is going to turn my regexes that are a few dozen characters into several thousand characters, which surely isn't going to do wonders for performance.

Actual regex example

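Again, the above is an isolated example of a failure within a larger string. I am looking for a generic way to convert regexes like this one so they can be parsed by the .NET Regex class.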

"([\\u0000-\\u0009\\u000B\\u000C\\u000E-\\u001F\\u007F-\\u009F\\u00AD" +
"\\u061C\\u180E\\u200B\\u200E\\u200F\\u2028-\\u202E\\u2060-\\u206F\\uD800-" +
"\\uDFFF\\uFEFF\\uFFF0-\\uFFFB\\U0001BCA0-\\U0001BCA3\\U0001D173-" +
"\\U0001D17A\\U000E0000-\\U000E001F\\U000E0080-\\U000E00FF\\U000E01F0-\\U000E0FFF] " +
"| [\\u000D] | [\\u000A]) ()"

Best answer

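You are assuming that Regex will recognize "\uD82F\uDCA0" as a single compound character. That is not the case, because strings in .NET are internally represented as UTF-16 and the Regex engine works on individual 16-bit code units. It therefore reads your class as the code unit \uD82F, the range \uDCA0-\uD82F, and the code unit \uDCA3; since 0xDCA0 is greater than 0xD82F, that range is in reverse order.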

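Unicode has the concept of code points, which is an abstract concept independent of the physical representation. Depending on the encoding actually used, not every code point fits in a single code unit. In UTF-8 this becomes very obvious, since all code points above 127 require two or more bytes. In .NET a char is a UTF-16 code unit, which means that code points in planes greater than 0 need a surrogate pair. The Regex engine, however, still treats the two halves of the pair as individual characters.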
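As a small illustration of the difference between code units and code points, the sketch below counts both for the same string. The class and method names (CodeUnitDemo, CodeUnitCount, DotMatchCount, CodePointCount) are hypothetical; string.Length, Regex.Matches, and char.IsSurrogatePair are the framework calls it relies on.

```csharp
using System;
using System.Text.RegularExpressions;

public static class CodeUnitDemo
{
    // Number of UTF-16 code units (chars) in the string.
    public static int CodeUnitCount(string s) => s.Length;

    // Number of matches of ".", which consumes exactly one char per match.
    public static int DotMatchCount(string s) => Regex.Matches(s, ".").Count;

    // Number of code points, counting each surrogate pair once.
    public static int CodePointCount(string s)
    {
        int count = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (char.IsSurrogatePair(s, i))
            {
                i++; // skip the low surrogate of the pair
            }
            count++;
        }
        return count;
    }
}
```

For the single code point "\U0001BCA0", the first two methods report 2 while CodePointCount reports 1, which is why a range written between two surrogate pairs is not seen as a range of code points.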

Long story short: do not treat the surrogate pairs as code points; treat them as individual characters. So in your case, the regex would be:

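using System;
using System.Text.RegularExpressions;

public class Program
{
    public static void Main()
    {
        // The high surrogate \uD82F is matched literally; only the low surrogate is a range.
        var regex = new Regex("(\uD82F[\uDCA0-\uDCA3])");
        // U+1BCA2 is \uD82F\uDCA2 in UTF-16, so this prints True.
        Console.WriteLine(regex.Match("\uD82F\uDCA2").Success);
    }
}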
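The same idea can be generalized to ranges whose endpoints have different high surrogates. What follows is only a sketch under stated assumptions, not the asker's own solution: SupplementaryRange, ToPattern, Esc, and Class are hypothetical names, both endpoints are assumed to lie in planes 1 through 16, and the result is a non-capturing alternation that has to replace the range outside of any surrounding [...] class, since it cannot live inside one.

```csharp
using System;
using System.Collections.Generic;

public static class SupplementaryRange
{
    // Hypothetical helper: formats one UTF-16 code unit as a regex \uXXXX escape.
    private static string Esc(int codeUnit) => "\\u" + codeUnit.ToString("X4");

    // Hypothetical helper: a single escape, or a [from-to] class of code units.
    private static string Class(int from, int to) =>
        from == to ? Esc(from) : "[" + Esc(from) + "-" + Esc(to) + "]";

    // Converts a code point range above U+FFFF into a pattern over surrogates.
    public static string ToPattern(int first, int last)
    {
        if (first < 0x10000 || last > 0x10FFFF || first > last)
        {
            throw new ArgumentOutOfRangeException(
                nameof(first), "Expected 0x10000 <= first <= last <= 0x10FFFF.");
        }

        // Standard UTF-16 surrogate arithmetic for both endpoints.
        int h1 = 0xD800 + ((first - 0x10000) >> 10);
        int l1 = 0xDC00 + ((first - 0x10000) & 0x3FF);
        int h2 = 0xD800 + ((last - 0x10000) >> 10);
        int l2 = 0xDC00 + ((last - 0x10000) & 0x3FF);

        var parts = new List<string>();
        if (h1 == h2)
        {
            // Same high surrogate: only the low surrogate varies.
            parts.Add(Esc(h1) + Class(l1, l2));
        }
        else
        {
            // Tail of the first block, whole blocks in between, head of the last block.
            parts.Add(Esc(h1) + Class(l1, 0xDFFF));
            if (h2 - h1 > 1)
            {
                parts.Add(Class(h1 + 1, h2 - 1) + Class(0xDC00, 0xDFFF));
            }
            parts.Add(Esc(h2) + Class(0xDC00, l2));
        }
        return "(?:" + string.Join("|", parts) + ")";
    }
}
```

For the range in the question, SupplementaryRange.ToPattern(0x1BCA0, 0x1BCA3) produces (?:\uD82F[\uDCA0-\uDCA3]), the same shape as the regex above. A complete converter would still need to parse the incoming character class and split it into a BMP class plus alternations like this one; that part is not covered here.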


Regarding c# - Regex not working with Unicode character ranges, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/47605037/
