xml - Which are the HTML and XML special characters?

Tags: xml http special-characters htmlspecialchars entityreference

What are the special reserved character entities in HTML and in XML?

The information I have says:

HTML:

  • & (replace with &amp;)
  • < (replace with &lt;)
  • > (replace with &gt;) (struck out)
  • " (replace with &quot;)
  • ' (replace with &apos;) (struck out)

XML:

  • < (replace with &lt;)
  • > (replace with &gt;)
  • & (replace with &amp;)
  • ' (replace with &apos;)
  • " (replace with &quot;)

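But I cannot find documentation on either of these.

The W3C does mention certain predefined entity references in Extensible Markup Language (XML) 1.0 (Fifth Edition). But it only says that these entities are predefined (in the same way that &copy; is predefined), not that they must be escaped: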

4.6 Predefined Entities

[Definition: Entity and character references may both be used to escape the left angle bracket, ampersand, and other delimiters. A set of general entities (amp, lt, gt, apos, quot) is specified for this purpose. Numeric character references may also be used; they are expanded immediately when recognized and must be treated as character data, so the numeric character references " &#60; " and " &#38; " may be used to escape < and & when they occur in character data.]

Which characters must be escaped as entity references in HTML? Which characters must be escaped as entity references in XML?


Update:

From Extensible Markup Language (XML) 1.0 (Fifth Edition):

2.4 Character Data and Markup

The ampersand character (&) and the left angle bracket (<) must not appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section. If they are needed elsewhere, they must be escaped using either numeric character references or the strings "&amp;" and "&lt;" respectively.

The right angle bracket (>) may be represented using the string "&gt;", and must, for compatibility, be escaped using either "&gt;" or a character reference when it appears in the string "]]>" in content, when that string is not marking the end of a CDATA section.

To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as "&apos;", and the double-quote character (") as "&quot;".

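I read the former as saying:

Must be escaped:

  • < (&lt;) must be escaped
  • & (&amp;) must be escaped

May be escaped, but must be when it appears in ]]>:

  • > (&gt;) must be escaped when it appears in ]]>

And ' and " never need to be escaped at all, unless you want to use a quote character inside an attribute value that is delimited by that same quote character.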
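
The reading above can be sketched in code. This is only an illustration, assuming Python's standard library as the tool (the question names no language); the element name, attribute name, and sample strings are made up. `xml.sax.saxutils.escape` covers character data and `quoteattr` covers attribute values.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET

# Character data: & and < must be escaped. escape() also escapes >,
# which covers the "]]>" case from section 2.4.
text = escape("if a < b && c then ]]> done")

# Attribute value: quoteattr() chooses a delimiter and only escapes
# the double quote when the value contains both kinds of quote.
attr = quoteattr('say "hi" & it\'s done')

# Hypothetical document; it parses only if the escaping is well-formed.
doc = f"<note title={attr}>{text}</note>"
root = ET.fromstring(doc)
```

After parsing, `root.text` and `root.get("title")` hold the original unescaped strings, which is a quick way to check that nothing was escaped twice.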


From HTML 4.01 Specification, HTML Document Representation:

5.3.2 Character entity references

Authors wishing to put the "<" character in text should use "&lt;" (ASCII decimal 60) to avoid possible confusion with the beginning of a tag (start tag open delimiter).

Similarly, authors should use "&gt;" (ASCII decimal 62) in text instead of ">" to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter) when it appears in quoted attribute values.

Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.

Some authors use the character entity reference "&quot;" to encode instances of the double quote mark (") since that character may be used to delimit attribute values.

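HTML is more wishy-washy about the rules, but it sounds like:

  • < should be replaced with &lt;
  • > should be replaced with &gt;
  • & should be replaced with &amp;
  • " should be replaced with &quot;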

And if " can be written as an entity reference, I should also replace ' (with &apos;).


Update two

From HTML5 - A vocabulary and associated APIs for HTML and XHTML:

8.3 Serializing HTML fragments

Escaping a string (for the purposes of the algorithm above) consists of running the following steps:

Replace any occurrence of the "&" character by the string "&amp;".

Replace any occurrences of the U+00A0 NO-BREAK SPACE character by the string "&nbsp;".

If the algorithm was invoked in the attribute mode, replace any occurrences of the """ character by the string "&quot;".

If the algorithm was not invoked in the attribute mode, replace any occurrences of the "<" character by the string "&lt;", and any occurrences of the ">" character by the string "&gt;".

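I read this as, for HTML:

  • & by &amp; always
  • U+00A0 (no-break space) by &nbsp; always
  • " by &quot; if it is inside an attribute
  • < by &lt; if it is not inside an attribute (i.e. attributes can contain <)
  • > by &gt; if it is not inside an attribute (i.e. attributes can contain >)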
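
The four steps quoted from section 8.3 can be sketched as a small function. This is a minimal illustration in Python, not the specification's own code; the function name `escape_html` and the `attribute_mode` parameter are hypothetical.

```python
def escape_html(s: str, attribute_mode: bool = False) -> str:
    """Hypothetical helper that follows the escaping steps quoted above."""
    # Step 1: & goes first, so the entities produced below are not re-escaped.
    s = s.replace("&", "&amp;")
    # Step 2: U+00A0 NO-BREAK SPACE becomes &nbsp; in both modes.
    s = s.replace("\u00a0", "&nbsp;")
    if attribute_mode:
        # Step 3: only the double quote is escaped inside attribute values.
        s = s.replace('"', "&quot;")
    else:
        # Step 4: < and > are escaped only outside attribute values.
        s = s.replace("<", "&lt;").replace(">", "&gt;")
    return s


in_text = escape_html('1 < 2 & "x"')
in_attr = escape_html('1 < 2 & "x"', attribute_mode=True)
```

Here `in_text` leaves the double quotes alone and escapes `<`, while `in_attr` does the opposite, which matches the two mode-dependent steps.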

Best answer

First, you are comparing an HTML 4.01 specification with an HTML 5 one. HTML5 ties in more closely with XML than HTML 4.01 ever does (that is why we have XHTML), so this answer will stick to HTML 5 and XML.

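Your quoted references are all consistent on the following points:

  • < should always be represented with &lt; when not indicating a processing instruction
  • > should always be represented with &gt; when not indicating a processing instruction
  • & should always be represented with &amp;
  • except when within <![CDATA[ ]]> (which only applies to XML)

I agree 100% with this. You never want the parser to mistake literals for instructions, so it is a solid idea to always encode any non-space (see below) character. Good parsers know that anything contained within <![CDATA[ ]]> is not an instruction, so the encoding is not necessary there.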
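
As a quick check of the CDATA point, here is a small sketch assuming Python's `xml.etree.ElementTree` as the parser; the element name `t` and the sample text are invented for the example. The escaped form and the CDATA form should yield the same character data.

```python
import xml.etree.ElementTree as ET

# Same content written two ways: with entity references and inside CDATA.
escaped = ET.fromstring("<t>a &lt; b &amp; c</t>")
cdata = ET.fromstring("<t><![CDATA[a < b & c]]></t>")

# The parser hands back identical text in both cases.
same = escaped.text == cdata.text
```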

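In practice, I never encode ' or " unless:

  • it appears within the value of an attribute (XML or HTML)
  • it appears within the text of XML tags (<tag>&quot;Yoinks!&quot;, he said.</tag>)

Both specifications also agree with this.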
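
The attribute-value case can be illustrated with Python's `html.escape`, whose `quote` flag controls whether quotation marks are escaped. This is only a sketch of that one library's behavior, reusing the sample sentence from the list above.

```python
import html

sentence = '"Yoinks!", he said.'

# quote=False: only &, < and > are escaped, so the quotes stay literal.
as_text = html.escape(sentence, quote=False)

# quote=True (the default): " becomes &quot; and ' becomes &#x27;,
# which is what an attribute value needs.
as_attr = html.escape(sentence, quote=True)
```

Note that `html.escape` emits the numeric reference `&#x27;` for the apostrophe rather than `&apos;`.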

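So, the only point of contention is the no-break space (&nbsp;). The only mention of it in either specification is when serialization is attempted. When you are not serializing, you should always use the literal character. Unless you are writing your own parser, I don't see the need to do any kind of serialization, so this is beside the point.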

Regarding "xml - Which are the HTML and XML special characters?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/7248958/
