C99 standard - fprintf - s conversion with precision

Tags: c unicode c99

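Assume that all we have is the C99 Standard paper, and that the printf library function has to be implemented according to this standard so that it works with UTF-16 encoding. Could you clarify the expected behavior of the s conversion when a precision is specified?

The C99 standard (7.19.6.1) says this about the s conversion: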

If no l length modifier is present, the argument shall be a pointer to the initial element of an array of character type. Characters from the array are written up to (but not including) the terminating null character. If the precision is specified, no more than that many bytes are written. If the precision is not specified or is greater than the size of the array, the array shall contain a null character.

If an l length modifier is present, the argument shall be a pointer to the initial element of an array of wchar_t type. Wide characters from the array are converted to multibyte characters (each as if by a call to the wcrtomb function, with the conversion state described by an mbstate_t object initialized to zero before the first wide character is converted) up to and including a terminating null wide character. The resulting multibyte characters are written up to (but not including) the terminating null character (byte). If no precision is specified, the array shall contain a null wide character. If a precision is specified, no more than that many bytes are written (including shift sequences, if any), and the array shall contain a null wide character if, to equal the multibyte character sequence length given by the precision, the function would need to access a wide character one past the end of the array. In no case is a partial multibyte character written.

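I don't quite understand this paragraph, especially the sentence "If the precision is specified, no more than that many bytes are written".

For example, take the UTF-16 string "TEST" (byte sequence: 0x54, 0x00, 0x45, 0x00, 0x53, 0x00, 0x54, 0x00).

What is expected to be written to the output buffer in the following cases:

  • If the precision is 3
  • If the precision is 9 (one byte longer than the string)
  • If the precision is 12 (several bytes longer than the string)

Then there is also "Wide characters from the array are converted to multibyte characters". Does that mean the UTF-16 should be converted to UTF-8 first? That would be strange if I expect to work with UTF-16 only.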

Best answer

Converting comments into a slightly expanded answer.

What is the value of CHAR_BIT in your implementation?

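  • If CHAR_BIT == 8, you cannot handle UTF-16 with %s; you would use %ls and pass a wchar_t * as the corresponding argument. You would then have to read the second paragraph of the specification.

  • If CHAR_BIT == 16, then the data cannot contain an odd number of octets. You then need to know how wchar_t relates to char (are they the same size? do they have the same signedness?) and interpret the two paragraphs so that they produce a uniform effect, unless you decide to use wchar_t to represent UTF-32.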

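The key point is that UTF-16 cannot be handled as a C string if CHAR_BIT == 8, because too many useful characters are encoded with one byte holding zero, and those zero bytes mark the end of a null-terminated string. To handle UTF-16, either the plain char type has to be a 16-bit (or larger) type (so CHAR_BIT > 8), or you have to use wchar_t (with sizeof(wchar_t) > sizeof(char)).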

Note that the specification requires the wide characters to be converted to a suitable multibyte representation.

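If you want wide characters output natively, you have to use fwprintf() and the related functions from <wchar.h> (standardized in C99, having first appeared in Amendment 1 to C90). The specification there has a lot in common with that of fprintf(), but there are (unsurprisingly) important differences.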

7.29.2.1 The fwprintf function (C11 section numbering; 7.24.2.1 in C99)

s
If no l length modifier is present, the argument shall be a pointer to the initial element of a character array containing a multibyte character sequence beginning in the initial shift state. Characters from the array are converted as if by repeated calls to the mbrtowc function, with the conversion state described by an mbstate_t object initialized to zero before the first multibyte character is converted, and written up to (but not including) the terminating null wide character. If the precision is specified, no more than that many wide characters are written. If the precision is not specified or is greater than the size of the converted array, the converted array shall contain a null wide character.

If an l length modifier is present, the argument shall be a pointer to the initial element of an array of wchar_t type. Wide characters from the array are written up to (but not including) a terminating null wide character. If the precision is specified, no more than that many wide characters are written. If the precision is not specified or is greater than the size of the array, the array shall contain a null wide character.

For more on "C99 standard - fprintf - s conversion with precision", see the original question on Stack Overflow: https://stackoverflow.com/questions/39686771/
