c# - String implementation in Java and C#

Tags: c# java string arrays

In the Java and C# implementations of String, is the underlying data a null-terminated char array, as in C/C++?

(Apart from other information such as the size.)

Best answer

No. It is a sequence of UTF-16 code units plus a length. Java and C# strings can contain embedded NULs.
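A quick way to convince yourself of the embedded-NUL claim on the Java side. This is only a small sketch; the class name EmbeddedNulDemo and the constant SAMPLE are made up for illustration.

```java
class EmbeddedNulDemo {
    // A newline, a NUL code unit, and another newline.
    static final String SAMPLE = "\n\0\n";

    public static void main(String[] args) {
        // The NUL counts as an ordinary code unit; it does not end the string.
        System.out.println(SAMPLE.length());
        // The code unit at index 1 has the numeric value 0.
        System.out.println((int) SAMPLE.charAt(1));
        // Searching for NUL works like searching for any other char.
        System.out.println(SAMPLE.indexOf('\0'));
        // Concatenation keeps everything after the NUL.
        System.out.println((SAMPLE + "x").length());
    }
}
```

The length is stored separately, so nothing in the string's content is special.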

Each UTF-16 code unit occupies two bytes, so you can think of the string "\n\0\n" as:

{
  length: 3,  // 3 pairs of bytes == 3 UTF-16 code units
  bytes:  [0, 10, // \n
           0, 0,  // \0
           0, 10] // \n
}

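Note that the last byte in bytes is not 0. The length field counts code units (3 here), which tells you how many bytes are in use (6). This allows substring to be very efficient: reuse the same byte array with a different length (and an offset, if your VM implementation cannot point into the middle of an array).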
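To make the layout and the substring trick concrete, here is a rough sketch in Java. Utf16Slice is a hypothetical class written for this answer, not how any particular JDK or CLR stores strings; real runtimes may well copy on substring instead of sharing. The bigEndianBytes helper just asks the standard UTF-16BE encoder for the bytes so they can be compared with the picture above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical (array, offset, length) string view; names are illustrative only.
final class Utf16Slice {
    private final char[] units;  // shared backing array, never modified
    private final int offset;    // index of the first code unit of this view
    private final int length;    // number of code units in this view

    Utf16Slice(String s) {
        this(s.toCharArray(), 0, s.length());
    }

    private Utf16Slice(char[] units, int offset, int length) {
        this.units = units;
        this.offset = offset;
        this.length = length;
    }

    int length() {
        return length;
    }

    char charAt(int index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return units[offset + index];
    }

    // No copying: the new view points at the same array with new bounds.
    Utf16Slice substring(int begin, int end) {
        if (begin < 0 || end > length || begin > end) {
            throw new IndexOutOfBoundsException(begin + ".." + end);
        }
        return new Utf16Slice(units, offset + begin, end - begin);
    }

    boolean sharesStorageWith(Utf16Slice other) {
        return units == other.units;
    }

    @Override
    public String toString() {
        return new String(units, offset, length);
    }

    // The big-endian UTF-16 bytes of a string, without a byte order mark.
    static byte[] bigEndianBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE);
    }
}
```

Sharing is only safe because the backing array is never mutated. The trade-off is that a small slice keeps the whole array alive.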

UTF-16 (16-bit Unicode Transformation Format) is a character encoding for Unicode capable of encoding 1,112,064 numbers (called code points) in the Unicode code space from 0 to 0x10FFFF. It produces a variable-length result of either one or two 16-bit code units per code point.

From the javadoc:

A String represents a string in the UTF-16 format in which supplementary characters are represented by surrogate pairs (see the section Unicode Character Representations in the Character class for more information). Index values refer to char code units, so a supplementary character uses two positions in a String.
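The "two positions" remark is easy to see with a character outside the Basic Multilingual Plane. A minimal sketch follows; the class name SurrogatePairDemo and the choice of U+1F600 are mine, not from the javadoc.

```java
class SurrogatePairDemo {
    // U+1F600 is a supplementary character, so UTF-16 needs two code units for it.
    static final String GRIN = new String(Character.toChars(0x1F600));
    static final String TEXT = "a" + GRIN + "b";

    public static void main(String[] args) {
        // length() counts code units: 'a', high surrogate, low surrogate, 'b'.
        System.out.println(TEXT.length());
        // codePointCount() counts code points: 'a', U+1F600, 'b'.
        System.out.println(TEXT.codePointCount(0, TEXT.length()));
        // Index 1 holds only the first half of the pair.
        System.out.println(Character.isHighSurrogate(TEXT.charAt(1)));
        System.out.println(Character.isLowSurrogate(TEXT.charAt(2)));
        // codePointAt() reassembles the pair into the full code point.
        System.out.println(Integer.toHexString(TEXT.codePointAt(1)));
    }
}
```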

The C# System.String definition is similar:

Each Unicode character in a string is defined by a Unicode scalar value, also called a Unicode code point or the ordinal (numeric) value of the Unicode character. Each code point is encoded using UTF-16 encoding, and the numeric value of each element of the encoding is represented by a Char. The resulting collection of Char objects constitutes the String.

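I'm not sure whether C# prevents orphaned surrogates, but the text above seems to conflate the terms "scalar value" and "code point", which is confusing. unicode.org defines a scalar value as: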

Any Unicode code point except high-surrogate and low-surrogate code points

Java definitely takes the code point view, and makes no attempt to prevent invalid scalar values in strings.
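For instance, Java will happily hold an unpaired surrogate. In the sketch below the class name LoneSurrogateDemo and the helper roundTripThroughUtf8 are made up for illustration. The helper relies on the documented behaviour of String.getBytes(Charset), which substitutes the charset's replacement bytes for input it cannot encode.

```java
import java.nio.charset.StandardCharsets;

class LoneSurrogateDemo {
    // 'x', then a high surrogate with no low surrogate after it, then 'y'.
    static final String LONE = "x\uD83Dy";

    // The unpaired surrogate only becomes a problem when the string is encoded:
    // getBytes swaps in the encoder's replacement instead of failing.
    static String roundTripThroughUtf8(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // The string is accepted as-is and has three code units.
        System.out.println(LONE.length());
        // The middle element is a surrogate code point, not a scalar value.
        System.out.println(Character.isSurrogate(LONE.charAt(1)));
        System.out.println(Integer.toHexString(LONE.codePointAt(1)));
        // After a UTF-8 round trip the original string is not recovered.
        System.out.println(LONE.equals(roundTripThroughUtf8(LONE)));
    }
}
```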

"Strings Immutability and Persistence" explains the efficiency advantages of this representation.

One of the benefits of the immutable data types I've talked about here previously is that they are not just immutable, they are also "persistent". By "persistent", I mean an immutable data type such that common operations on that type (like adding a new item to a queue, or removing an item from a tree) can re-use most or all of the memory of an existing data structure. Since it is all immutable, you can re-use its parts without worrying about them changing on you.

Edit: The above is true both conceptually and in practice, but VMs and the CLR are free to do things differently in some situations.

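The Java language specification requires strings to be laid out a certain way in .class files, and its JNI jstring type abstracts away the details of the in-memory representation. So, in theory, a VM could represent a string in memory as a NUL-terminated UTF-8 string, with the two-byte form used for embedded NUL characters, instead of the int32 length plus uint16[] bytes representation that allows efficient random access to code units.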
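The "two-byte form" for embedded NULs can be observed from plain Java, because DataOutputStream.writeUTF uses the same modified UTF-8 rule for U+0000. The following is just a sketch: ModifiedUtf8Demo and its helper are hypothetical names, and it only shows the encoding rule, not what any VM does in memory.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;

class ModifiedUtf8Demo {
    // Encodes s in modified UTF-8 and drops the 2-byte length prefix
    // that writeUTF puts in front of the payload.
    static byte[] modifiedUtf8(String s) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            out.writeUTF(s);
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new UncheckedIOException(e);
        }
        byte[] all = buffer.toByteArray();
        return Arrays.copyOfRange(all, 2, all.length);
    }

    public static void main(String[] args) {
        // Print each payload byte in hex for the string "\n\0\n".
        for (byte b : modifiedUtf8("\n\0\n")) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Because U+0000 is written as the two bytes C0 80, the payload never contains a zero byte, which is what would leave 0 free to act as a terminator.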

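But VMs don't do this in practice. "The Most Expensive One-byte Mistake" argues that NUL-terminated strings were a huge mistake in C, so I doubt that VMs would adopt them internally for efficiency reasons.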

The best candidate I have been able to come up with is the C/Unix/Posix use of NUL-terminated text strings. The choice was really simple: Should the C language represent strings as an address + length tuple or just as the address with a magic character (NUL) marking the end?

...

Thinking a bit about virtual memory systems settles that question for us. Optimizing the movement of a known-length string of bytes can take advantage of the full width of memory buses and cache lines, without ever touching a memory location that is not part of the source or destination string.

One example is FreeBSD's libc, where the bcopy(3)/memcpy(3) implementation will move as much data as possible in chunks of "unsigned long," typically 32 or 64 bits, and then "mop up any trailing bytes" as the comment describes it, with byte-wide operations.

If the source string is NUL terminated, however, attempting to access it in units larger than bytes risks attempting to read characters after the NUL. If the NUL character is the last byte of a [virtual memory] page and the next [virtual memory] page is not defined, this would cause the process to die from an unwarranted "page not present" fault.

For "c# - String implementation in Java and C#", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/7352611/
