What are the special reserved character entities in HTML and XML?
The information I have says:
HTML:
- & (replace with &amp;)
- < (replace with &lt;)
- > (replace with &gt;) [struck out]
- " (replace with &quot;)
- ' (replace with &apos;) [struck out]
XML:
- < (replace with &lt;)
- > (replace with &gt;)
- & (replace with &amp;)
- ' (replace with &apos;)
- " (replace with &quot;)
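But I cannot find documentation for either of these.
The W3C, in Extensible Markup Language (XML) 1.0 (Fifth Edition), does mention some predefined entity references. But it says only that these entities are predefined (in the same way that &copy; is predefined), not that they must be escaped: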
4.6 Predefined Entities
[Definition: Entity and character references may both be used to escape the left angle bracket, ampersand, and other delimiters. A set of general entities (amp, lt, gt, apos, quot) is specified for this purpose. Numeric character references may also be used; they are expanded immediately when recognized and must be treated as character data, so the numeric character references "&#60;" and "&#38;" may be used to escape < and & when they occur in character data.]
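To see the quoted definition in action, here is a minimal sketch using Python's standard-library xml.etree.ElementTree. The choice of parser is my own assumption, not something the specification prescribes. It resolves the five predefined entities and the two numeric character references mentioned above. It also probes &copy;, which comes from HTML's entity set rather than from XML's five, so a plain XML parser with no DTD is expected to reject it.

```python
import xml.etree.ElementTree as ET

# The five general entities predefined by XML 1.0, section 4.6.
predefined_text = ET.fromstring("<r>&lt; &gt; &amp; &apos; &quot;</r>").text

# Numeric character references for < (decimal 60) and & (decimal 38).
numeric_text = ET.fromstring("<r>&#60; &#38;</r>").text

# &copy; is not one of the five, so parsing it without a DTD should fail.
try:
    ET.fromstring("<r>&copy;</r>")
    copy_is_defined = True
except ET.ParseError:
    copy_is_defined = False

print(predefined_text)
print(numeric_text)
print(copy_is_defined)
```

Being predefined only means these names are available for escaping; which characters actually have to be escaped is a separate rule.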
Which characters must be escaped as entity references in HTML? Which characters must be escaped as entity references in XML?
Update:
From Extensible Markup Language (XML) 1.0 (Fifth Edition):
2.4 Character Data and Markup
The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively.
The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.
To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as "&apos;", and the double-quote character (") as "&quot;".
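I read the former as saying that:
- < (&lt;) must be escaped
- & (&amp;) must be escaped
- > (&gt;) may be escaped, but must be when it appears as part of ]]>
and that ' and " need not be escaped at all, unless you want to use the quote mark inside a quoted attribute value.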
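As a rough illustration of that reading, the sketch below uses Python's standard-library xml.sax.saxutils helpers. Python is my choice here; the question does not name a language. escape() replaces &, < and every >, which is stricter than XML requires for > but also covers the "]]>" case, and quoteattr() handles quote marks inside attribute values. The element name r and attribute name a are made up for the example.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr

text = 'AT&T says 1 < 2, and "]]>" ends a CDATA section'

# escape() replaces &, < and > in character data; quotes are left alone.
escaped = escape(text)

# quoteattr() escapes the value and wraps it in whichever quote style fits,
# falling back to &quot; when the value contains both kinds of quote.
attr = quoteattr('it\'s "quoted"')

# Round-trip through a parser to confirm the escaped document is well-formed.
xml_doc = f"<r a={attr}>{escaped}</r>"
root = ET.fromstring(xml_doc)

print(escaped)
print(attr)
```

Parsing the result back yields the original text and attribute value, which is a cheap way to check that nothing was over- or under-escaped.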
From HTML 4.01 Specification, HTML Document Representation:
5.3.2 Character entity references
Authors wishing to put the "<" character in text should use "&lt;" (ASCII decimal 60) to avoid possible confusion with the beginning of a tag (start tag open delimiter). Similarly, authors should use "&gt;" (ASCII decimal 62) in text instead of ">" to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter) when it appears in quoted attribute values.
Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.
Some authors use the character entity reference "&quot;" to encode instances of the double quote mark (") since that character may be used to delimit attribute values.
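HTML is vaguer about the rules, but it sounds like I should:
- replace < with &lt;
- replace > with &gt;
- replace & with &amp;
- replace " with &quot;
And if " can be written as an entity reference, I might as well also replace ' with &apos;.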
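For comparison, Python's standard-library html.escape() follows roughly this "escape all of them" policy. This is only an illustration of one existing helper, not something the HTML 4.01 text prescribes. Note that it writes the apostrophe as the numeric reference &#x27; rather than &apos;, which matters because HTML 4.01 does not define &apos;.

```python
import html

s = '<a href="x">Tom & Jerry\'s</a>'

# Default (quote=True): escapes &, <, > and both kinds of quote mark.
full = html.escape(s)

# quote=False: escapes only &, < and >, which is enough for text content.
text_only = html.escape(s, quote=False)

print(full)
print(text_only)
```

html.unescape(full) returns the original string, so the numeric form of the apostrophe round-trips like the named references do.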
Update two
From HTML5 - A vocabulary and associated APIs for HTML and XHTML:
8.3 Serializing HTML fragments
Escaping a string (for the purposes of the algorithm above) consists of running the following steps:
- Replace any occurrence of the "&" character by the string "&amp;".
- Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;".
- If the algorithm was invoked in the attribute mode, replace any occurrences of the """ character by the string "&quot;".
- If the algorithm was not invoked in the attribute mode, replace any occurrences of the "<" character by the string "&lt;", and any occurrences of the ">" character by the string "&gt;".
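I read this as, for HTML:
- & becomes &amp; always
- the no-break space (U+00A0) becomes &nbsp; always
- " becomes &quot; if it is inside an attribute
- < becomes &lt; if it is not inside an attribute (i.e. attributes can contain <)
- > becomes &gt; if it is not inside an attribute (i.e. attributes can contain >)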
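The steps quoted from the HTML5 draft translate almost line for line into code. The following is a minimal sketch of the algorithm as quoted above; the function name escape_html5 and its attribute_mode parameter are hypothetical names chosen for this example, not part of any library.

```python
def escape_html5(s: str, attribute_mode: bool = False) -> str:
    """Escape a string following the four serialization steps quoted above."""
    # Step 1: ampersands first, so later replacements are not double-escaped.
    s = s.replace("&", "&amp;")
    # Step 2: U+00A0 NO-BREAK SPACE becomes a named reference.
    s = s.replace("\u00a0", "&nbsp;")
    if attribute_mode:
        # Step 3: only the double quote is escaped inside attribute values.
        s = s.replace('"', "&quot;")
    else:
        # Step 4: angle brackets are escaped only outside attribute values.
        s = s.replace("<", "&lt;").replace(">", "&gt;")
    return s


print(escape_html5('a < b & "c"'))
print(escape_html5('a < b & "c"', attribute_mode=True))
```

The ampersand has to be handled first; replacing it last would turn the & of an already-produced &lt; into &amp;lt;.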
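Best answer
First, you are comparing the HTML 4.01 specification with the HTML 5 one. HTML5 is tied more closely to XML than HTML 4.01 ever was (that is why we have XHTML), so this answer will stick to HTML 5 and XML.
The references you quoted are all consistent on the following points:
- < should always be represented as &lt; when it is not acting as markup (such as a tag or processing instruction delimiter)
- > should always be represented as &gt; when it is not acting as markup (such as a tag or processing instruction delimiter)
- & should always be represented as &amp;
- except within <![CDATA[ ]]> (which applies only to XML)
I agree 100% with this. You never want the parser to mistake literal text for instructions, so always encoding these characters (whitespace aside; see below) is a solid idea. Good parsers know that anything contained within <![CDATA[ ]]> is not an instruction, so no encoding is needed there.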
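The CDATA exception is easy to check with a parser. This sketch again assumes Python's standard-library xml.etree.ElementTree, which is my choice for illustration. It parses the same content once inside a CDATA section and once as plain character data.

```python
import xml.etree.ElementTree as ET

# Inside CDATA, < and & are taken literally and need no escaping.
cdata_doc = "<r><![CDATA[if (a < b && c > d) {}]]></r>"
cdata_text = ET.fromstring(cdata_doc).text

# The same content as ordinary character data is not well-formed XML.
try:
    ET.fromstring("<r>if (a < b && c > d) {}</r>")
    plain_is_well_formed = True
except ET.ParseError:
    plain_is_well_formed = False

print(cdata_text)
print(plain_is_well_formed)
```

The one sequence that cannot appear inside a CDATA section is "]]>", since it ends the section.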
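In practice, I never encode ' or " unless
- it appears within an attribute value (XML or HTML)
- it appears within the text of an XML tag. (<tag>"Yoinks!", he said.</tag>)
Both specifications also agree on this.
So the only point of contention is &nbsp; (the no-break space). The only mention of it in either specification is in the context of serialization. Outside of that, you should always use the literal (space) character. Unless you are writing your own parser, I see no need to do any kind of serialization, so this is beside the point.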
Regarding xml - Which are the HTML and XML special characters?, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/7248958/