c# - Finding the nearest safe split in a byte array containing UTF-8 data

Tags: c# string encoding utf-8

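I want to split a large array of UTF-8 encoded data so that it can be decoded into characters in parallel.

There seems to be no way to find out how many bytes Encoding.GetCharCount read. I also can't use GetByteCount(GetChars(...)), because that decodes the entire array anyway, which is exactly what I'm trying to avoid.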

Best answer

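UTF-8 has well-defined byte sequences and is considered self-synchronizing, meaning that given any position in bytes, you can find where the character at that position begins.

The UTF-8 specification (Wikipedia is the easiest link) defines the following byte sequences: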

0_______ : ASCII (0-127) char
10______ : Continuation
110_____ : Two-byte character
1110____ : Three-byte character
11110___ : Four-byte character
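
As a rough illustration of the table above, the following sketch classifies a single byte by its leading bits. The class and method names (Utf8ByteKind, SequenceLength) are hypothetical and not part of the original answer. Note that in C# the mask has to be parenthesized, because == binds more tightly than &.

```csharp
using System;

public static class Utf8ByteKind
{
    // Returns the total sequence length announced by a lead byte,
    // 0 for a continuation byte (10xxxxxx),
    // or -1 for a byte that matches none of the patterns in the table (11111xxx).
    public static int SequenceLength(byte b)
    {
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx : ASCII
        if ((b & 0xC0) == 0x80) return 0; // 10xxxxxx : continuation
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx : two-byte character
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx : three-byte character
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx : four-byte character
        return -1;
    }
}
```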

So the following method (or something similar) should get you your result:

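  1. Get the number of bytes in bytes (bytes.Length, etc.)
  2. Determine how many sections to split into
  3. Select the byte at byteCount / sectionCount
  4. Test the byte against the table:
    1. If (byte & 0x80) == 0x00, you can make this byte part of either section
    2. If (byte & 0xE0) == 0xC0, you need to look ahead one byte and keep it with the current section
    3. If (byte & 0xF0) == 0xE0, you need to look ahead two bytes and keep them with the current section
    4. If (byte & 0xF8) == 0xF0, you need to look ahead three bytes and keep them with the current section
    5. If (byte & 0xC0) == 0x80, you are in a continuation and should look ahead until the first byte that does not fit (val & 0xC0) == 0x80, then keep everything up to (but not including) that byte in the current section
  5. Select byteStart through byteCount + offset, where offset can be determined by the tests above
  6. Repeat for each section.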
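
One possible way to sketch the look-ahead idea from step 4: since every byte that is not a continuation starts a character, it is enough to move forward past any continuation bytes and cut there. This is a simplification of the five cases, not the answer's own code, and the names Utf8ForwardBoundary and NextCharStart are made up for illustration.

```csharp
public static class Utf8ForwardBoundary
{
    // Returns the first position at or after index that is not a
    // continuation byte, or arr.Length if the end of the array is reached.
    // Cutting the array at the returned position never splits a character.
    public static int NextCharStart(byte[] arr, int index)
    {
        while (index < arr.Length && (arr[index] & 0xC0) == 0x80)
        {
            index++;
        }
        return index;
    }
}
```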

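Of course, if we redefine the test to return the start position of the current character, we only have two cases:

  1. If (byte[i] & 0xC0) == 0x80, we are on a continuation byte and need to move backward in the array
  2. Otherwise, return the current i (since it is not a continuation)

This gives us the following method: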

public static int GetCharStart(ref byte[] arr, int index) =>
    (arr[index] & 0xC0) == 0x80 ? GetCharStart(ref arr, index - 1) : index;
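
For well-formed UTF-8 the recursion above goes at most three levels deep, but on malformed input (a long run of continuation bytes) it could recurse much further or walk off the front of the array. A loop-based variant might look like the sketch below; it assumes a non-empty array and a valid starting index, and the names Utf8Boundary and GetCharStartIterative are hypothetical.

```csharp
public static class Utf8Boundary
{
    // Steps backward while the current byte is a continuation byte (10xxxxxx).
    // Stops at index 0 even if that byte is also a continuation byte,
    // so malformed input cannot produce a negative index.
    public static int GetCharStartIterative(byte[] arr, int index)
    {
        while (index > 0 && (arr[index] & 0xC0) == 0x80)
        {
            index--;
        }
        return index;
    }
}
```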

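Next, we want to get each section. The simplest way is to use a state machine (or abuse one, depending on how you look at it) to return the sections: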

public static IEnumerable<byte[]> GetByteSections(byte[] utf8Array, int sectionCount)
{
    var sectionStart = 0;
    var sectionEnd = 0;

    for (var i = 0; i < sectionCount; i++)
    {
        sectionEnd = i == (sectionCount - 1) ? utf8Array.Length : GetCharStart(ref utf8Array, (int)Math.Round((double)utf8Array.Length / sectionCount * i));
        yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
        sectionStart = sectionEnd;
    }
}

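I built it this way because I want to use Parallel.ForEach to demonstrate the result, which becomes very easy if we have an IEnumerable. It also lets the processing be lazy: we only keep collecting sections as they are needed, so the work happens on demand.

Lastly, we need to be able to get a section of bytes, so we have the GetSection method: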

public static byte[] GetSection(ref byte[] array, int start, int end)
{
    var result = new byte[end - start];
    for (var i = 0; i < result.Length; i++)
    {
        result[i] = array[i + start];
    }
    return result;
}
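
GetSection copies every section into a new array. If that copy is unwanted, one alternative is to yield only (start, count) ranges and decode each range with the Encoding.GetString(byte[], int, int) overload. The sketch below is an assumption-laden variant rather than the answer's method: the names Utf8Ranges, GetRanges and CharStart are hypothetical, it requires sectionCount >= 1, and it uses value tuples (C# 7 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class Utf8Ranges
{
    // Moves backward from index to the nearest byte that is not a continuation byte.
    static int CharStart(byte[] arr, int index)
    {
        if (index >= arr.Length) return arr.Length;
        while (index > 0 && (arr[index] & 0xC0) == 0x80) index--;
        return index;
    }

    // Yields sectionCount ranges that together cover the whole array.
    // Every range except possibly the first starts on a character boundary.
    public static IEnumerable<(int Start, int Count)> GetRanges(byte[] utf8, int sectionCount)
    {
        var start = 0;
        for (var i = 1; i <= sectionCount; i++)
        {
            // The last range always runs to the end of the array.
            var end = i == sectionCount
                ? utf8.Length
                : CharStart(utf8, (int)((long)utf8.Length * i / sectionCount));
            yield return (start, end - start);
            start = end;
        }
    }
}
```

Each range r can then be decoded with Encoding.UTF8.GetString(utf8, r.Start, r.Count) without allocating an intermediate byte array.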

Finally, the demo:

var sourceText = "Some test 平仮名, ひらがな string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well.";
var source = Encoding.UTF8.GetBytes(sourceText);
Console.WriteLine(sourceText);

var results = new ConcurrentBag<string>();
Parallel.ForEach(GetByteSections(source, 10),
                    new ParallelOptions { MaxDegreeOfParallelism = 1 },
                    x => { Console.WriteLine(Encoding.UTF8.GetString(x)); results.Add(Encoding.UTF8.GetString(x)); });

Console.WriteLine();
Console.WriteLine("Assemble the result: ");
Console.WriteLine(string.Join("", results.Reverse()));
Console.ReadLine();

Result:

Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well.

Some test ???, ??
?? string that should b
e decoded in parallel, thi
s demonstrates that we work
 flawlessly with Parallel.
ForEach. The only downside
to using `Parallel.ForEach`
 the way I demonstrate is
that it doesn't take order into account, but oh-well.

Assemble the result:
Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well.

It's not perfect, but it does work. If we change MaxDegreeOfParallelism to a higher value, our string gets jumbled:

Some test ???, ??
e decoded in parallel, thi
 flawlessly with Parallel.
?? string that should b
to using `Parallel.ForEach`
ForEach. The only downside
that it doesn't take order into account, but oh-well.
s demonstrates that we work
 the way I demonstrate is

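So, as you can see, it's pretty simple. You'll need to make modifications to allow reassembly in the correct order, but this should demonstrate the trick.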
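
One possible modification for ordered reassembly is to decode by index into a pre-sized array instead of adding to a ConcurrentBag. The sketch below assumes the sections have already been materialized into a list, which gives up the lazy enumeration; OrderedDecode and DecodeInOrder are hypothetical names.

```csharp
using System.Collections.Generic;
using System.Text;
using System.Threading.Tasks;

public static class OrderedDecode
{
    // Decodes each section in parallel and writes the result into the slot
    // matching its position, so concatenation preserves the original order.
    public static string DecodeInOrder(IReadOnlyList<byte[]> sections)
    {
        var decoded = new string[sections.Count];
        Parallel.For(0, sections.Count, i =>
        {
            // Each iteration writes to its own slot, so no locking is needed.
            decoded[i] = Encoding.UTF8.GetString(sections[i]);
        });
        return string.Concat(decoded);
    }
}
```

With the methods from this answer, a call could look like OrderedDecode.DecodeInOrder(GetByteSections(source, 10).ToList()), which needs System.Linq for ToList.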

If we modify the GetByteSections method as follows, the last section is no longer ~2x the size of the others:

public static IEnumerable<byte[]> GetByteSections(byte[] utf8Array, int sectionCount)
{
    var sectionStart = 0;
    var sectionEnd = 0;
    var sectionSize = (int)Math.Ceiling((double)utf8Array.Length / sectionCount);

    for (var i = 0; i < sectionCount; i++)
    {
        if (i == (sectionCount - 1))
        {
            sectionEnd = GetCharStart(ref utf8Array, i * sectionSize);
            yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
            sectionStart = sectionEnd;
            sectionEnd = utf8Array.Length;
            yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
        }
        else
        {
            sectionEnd = GetCharStart(ref utf8Array, i * sectionSize);
            yield return GetSection(ref utf8Array, sectionStart, sectionEnd);
            sectionStart = sectionEnd;
        }
    }
}

Result:

Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well. We can continue to increase the length of this string to demonstrate that the last section is usually about double the size of the other sections, we could fix that if we really wanted to. In fact, with a small modification it does so, we just have to remember that we'll end up with `sectionCount + 1` results.

Some test ???, ???? string that should be de
coded in parallel, this demonstrates that we work flawless
ly with Parallel.ForEach. The only downside to using `Para
llel.ForEach` the way I demonstrate is that it doesn't tak
e order into account, but oh-well. We can continue to incr
ease the length of this string to demonstrate that the las
t section is usually about double the size of the other se
ctions, we could fix that if we really wanted to. In fact,
 with a small modification it does so, we just have to rem
ember that we'll end up with `sectionCount + 1` results.

Assemble the result:
Some test ???, ???? string that should be decoded in parallel, this demonstrates that we work flawlessly with Parallel.ForEach. The only downside to using `Parallel.ForEach` the way I demonstrate is that it doesn't take order into account, but oh-well. We can continue to increase the length of this string to demonstrate that the last section is usually about double the size of the other sections, we could fix that if we really wanted to. In fact, with a small modification it does so, we just have to remember that we'll end up with `sectionCount + 1` results.

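Lastly, if for some reason you split into an abnormally large number of sections compared to the input size (my input of about 578 bytes, at 250 characters, demonstrated this), you will hit an IndexOutOfRangeException in GetCharStart. The following version fixes that: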

public static int GetCharStart(ref byte[] arr, int index)
{
    // Clamp to the last valid index; index == arr.Length is also out of range.
    if (index >= arr.Length)
    {
        index = arr.Length - 1;
    }

    return (arr[index] & 0xC0) == 0x80 ? GetCharStart(ref arr, index - 1) : index;
}

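Of course, this leaves you with a bunch of empty results, but the string doesn't change when you reassemble it, so I'm not going to post the full scenario test here. (I'll leave that for you to experiment with.)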

Regarding c# - Finding the nearest safe split in a byte array containing UTF-8 data, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/45201142/
