python-2.7 - Why does ElementTree reject a UTF-16 XML declaration with "encoding incorrect"?

In Python 2.7, when I pass a unicode string whose XML declaration contains encoding="UTF-16" to ElementTree's fromstring() method, I get a ParseError saying that the specified encoding is incorrect:

>>> from xml.etree import ElementTree
>>> data = u'<?xml version="1.0" encoding="utf-16"?><root/>'
>>> ElementTree.fromstring(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1300, in XML
    parser.feed(text)
  File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "C:\Program Files (x86)\Python 2.7\lib\xml\etree\ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: encoding specified in XML declaration is incorrect: line 1, column 30

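What does that mean? What makes ElementTree think so?

After all, I am passing in unicode code points, not a byte string. No encoding is involved here. How can it be incorrect?

Of course, one could argue that any encoding is incorrect, because these unicode code points are not encoded at all. But then why is UTF-8 not rejected as an "incorrect encoding" too?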

>>> ElementTree.fromstring(u'<?xml version="1.0" encoding="utf-8"?><root/>')

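I can easily work around this, either by encoding the unicode string into a UTF-16 byte string and passing that to fromstring(), or by replacing encoding="utf-16" with encoding="utf-8" in the unicode string, but I would like to understand why the exception is raised. The documentation of ElementTree does not say that only byte strings are accepted.

Specifically, I would like to avoid these extra operations because my input data can get quite large, and I want to avoid holding it in memory twice and spending more CPU time on it than absolutely necessary.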

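Best answer

I am not going to try to justify this behavior, only explain why it actually happens with the code as written.

In short: the XML parser that Python uses, expat, operates on bytes, not on unicode characters. You must call .encode('utf-16-be') or .encode('utf-16-le') on the string before passing it to ElementTree.fromstring: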

ElementTree.fromstring(data.encode('utf-16-be'))
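
As a slightly fuller sketch of that workaround, the snippet below encodes a unicode string that declares UTF-16 and contains a non-ASCII character, then parses the resulting bytes. The sample document and the variable names are illustrative assumptions, not part of the original question.

```python
from xml.etree import ElementTree

# Illustrative input: declares UTF-16 and contains a non-ASCII character.
data = u'<?xml version="1.0" encoding="utf-16"?><root>\u2163</root>'

# Encode explicitly so the parser receives bytes that match the declaration.
encoded = data.encode('utf-16-be')

root = ElementTree.fromstring(encoded)
tag = root.tag    # element name
text = root.text  # element text, decoded back to unicode by the parser
```

The explicit encode does create a second copy of the data in memory, which is the cost the question hoped to avoid.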


Proof: ElementTree.fromstring ultimately calls pyexpat.xmlparser.Parse, which is implemented in pyexpat.c:

static PyObject *
xmlparse_Parse(xmlparseobject *self, PyObject *args)
{
    char *s;
    int slen;
    int isFinal = 0;

    if (!PyArg_ParseTuple(args, "s#|i:Parse", &s, &slen, &isFinal))
        return NULL;

    return get_parse_result(self, XML_Parse(self->itself, s, slen, isFinal));
}

So the unicode argument you pass in is converted using s#. The docs for PyArg_ParseTuple say:

s# (string, Unicode or any read buffer compatible object) [const char *, int (or Py_ssize_t, see below)] This variant on s stores into two C variables, the first one a pointer to a character string, the second one its length. In this case the Python string may contain embedded null bytes. Unicode objects pass back a pointer to the default encoded string version of the object if such a conversion is possible. All other read-buffer compatible objects pass back a reference to the raw internal data representation.
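
To make the quoted behavior concrete, here is a minimal sketch that compares the bytes the default conversion would produce with the bytes the declaration promises. It assumes Python 2.7's usual default encoding of 'ascii' and spells that conversion out with an explicit encode; the variable names are illustrative. A plausible reading of the error is that expat receives one byte per character while the declaration names a two-byte encoding.

```python
data = u'<?xml version="1.0" encoding="utf-16"?><root/>'

# Roughly what "s#" hands to expat under the assumed default 'ascii' codec:
# one byte per character.
as_default = data.encode('ascii')

# What the XML declaration claims the input is: two bytes per character.
as_utf16 = data.encode('utf-16-be')

# The two byte strings differ in length by a factor of two.
same_length_as_text = len(as_default) == len(data)
double_length = len(as_utf16) == 2 * len(data)
```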

Let's try it:

from xml.etree import ElementTree
data = u'<?xml version="1.0" encoding="utf-8"?><root>\u2163</root>'
print ElementTree.fromstring(data)

This gives the error:

UnicodeEncodeError: 'ascii' codec can't encode character u'\u2163' in position 44: ordinal not in range(128)

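This means that when you specified encoding="utf-8", you were simply lucky that your input contained no non-ASCII characters, so the Unicode string could be encoded to ASCII. If you add the following before parsing, UTF-8 works as expected for that example: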

import sys
reload(sys).setdefaultencoding('utf8')

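However, setting the default encoding to 'utf-16-be' or 'utf-16-le' does not work, because the Python parts of ElementTree do direct string comparisons, and those start to fail in a UTF-16 world.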

The original question is on Stack Overflow: https://stackoverflow.com/questions/24045892/
