My manager asked me to explain why I call jdom's checkCharacterData
before passing my string to an XMLStreamWriter,
so I consulted the XML specifications and ended up confused.
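XML 1.0 and XML 1.1 state that the legal XML characters are “tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646”. That sounds silly: tab, carriage return, and line feed are legal Unicode characters. Then there is the comment “any Unicode character, excluding the surrogate blocks, FFFE, and FFFF”, which was modified in XML 1.1 to refer to U+0000 – U+10FFFF, excluding U+0000, U+D800 – U+DFFF, and U+FFFE – U+FFFF; note that NUL is excluded. Then there is the note saying that authors are “discouraged” from using compatibility characters, including some characters that the BNF already excludes.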
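Question: What is/was a legal Unicode character? Is NUL a valid Unicode character? (I found a pdf of ISO 10646 (2nd edition, 2010) which doesn’t seem to exclude U+0000.) Did ISO 10646 or Unicode change between the 2000 edition and the 2010 edition to include control characters that were previously excluded? And as for XML, is there a reason that the text is so lenient/sloppy while the BNF is strict?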
Best answer
Question: What is/was a legal Unicode character?
The Unicode Glossary defines it as follows:
Character. (1) The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding. (2) Synonym for abstract character. (3) The basic unit of encoding for the Unicode character encoding. (4) The English name for the ideographic written elements of Chinese origin. [See ideograph (2).]
Is NUL a valid Unicode character? (I found a pdf of ISO 10646 (2nd edition, 2010) which doesn’t seem to exclude U+0000.)
NUL is a code point, and it falls under the definition of “abstract character”, so it is a character in sense (2) above.
Did ISO 10646 or Unicode change between the 2000 edition and the 2010 edition to include control characters that were previously excluded?
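NUL was already a control character in the earlier versions. Appendix D contains the list of changes.
Table D-2 shows that the number of control characters stayed at 65 from Version 1.0 through Version 3.0.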
Table D-2 documents the number of characters assigned in the different versions of the Unicode standard.
            V1.0  V1.1  V2.0  V2.1  V3.0  ...
Controls      65    65    65    65    65
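The count in that table can be cross-checked against the Unicode data shipped with a JDK. The following is only a sketch, not part of the original answer: the class name `ControlCharCount` and its methods are made up for illustration, and it assumes that `Character.getType` reflects the Unicode General Category (`Cc` for control characters).

```java
// Illustrative sketch; ControlCharCount is a hypothetical name, not a library class.
final class ControlCharCount {

    /** True if the JDK's Unicode tables place the code point in General Category Cc. */
    static boolean isControl(int codePoint) {
        return Character.getType(codePoint) == Character.CONTROL;
    }

    /** Counts the code points in U+0000..U+10FFFF whose category is Cc. */
    static int countControls() {
        int count = 0;
        for (int cp = Character.MIN_CODE_POINT; cp <= Character.MAX_CODE_POINT; cp++) {
            if (isControl(cp)) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println("U+0000 is a control character: " + isControl(0x0000));
        System.out.println("Number of control characters: " + countControls());
    }
}
```

The control characters are U+0000 – U+001F, U+007F, and U+0080 – U+009F, which is 32 + 1 + 32 = 65 code points, so the program should report 65 and should classify NUL as one of them.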
And as for XML, is there a reason that the text is so lenient/sloppy while the BNF is strict?
Writing a specification that is both complete and concise is hard. When the text disagrees with the BNF, trust the BNF.
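To make “trust the BNF” concrete, here is a minimal sketch that transcribes the `Char` productions of XML 1.0 and XML 1.1 into Java. The class and method names (`XmlCharCheck`, `isXml10Char`, `isXml11Char`, `firstIllegalXml10`) are hypothetical; this is not jdom's implementation of checkCharacterData, only an illustration of the kind of check that could run before text is handed to an XMLStreamWriter.

```java
// Illustrative sketch; XmlCharCheck is a hypothetical name, not jdom's Verifier.
final class XmlCharCheck {

    /** XML 1.0: Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] */
    static boolean isXml10Char(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** XML 1.1: Char ::= [#x1-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] */
    static boolean isXml11Char(int cp) {
        return (cp >= 0x1 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /**
     * Returns the UTF-16 index of the first code point that is not an
     * XML 1.0 Char, or -1 if every code point is legal. An unpaired
     * surrogate is reported as illegal, because codePointAt returns the
     * surrogate value itself and that value lies outside the production.
     */
    static int firstIllegalXml10(String s) {
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (!isXml10Char(cp)) {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }
}
```

Both productions exclude U+0000, the surrogates, U+FFFE, and U+FFFF. They differ only in the C0 controls: XML 1.1 admits U+0001 – U+001F as `Char`, while XML 1.0 admits only tab, line feed, and carriage return from that range.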
Source: the Stack Overflow question “XML and Unicode specifications: what’s a legal character?”: https://stackoverflow.com/questions/9526951/