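A list of XML and HTML character references is available at: https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references.
However, some references that are not defined in that list at all were used in older HTML scripts. While processing the Senseval-2 format (with fixes) datasets from http://www.d.umn.edu/~tpederse/data.html, I came across the following words, which break my script that tries to parse the data with xml.etree.ElementTree.
What are the Unicode equivalents of these words?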
&and.
&and.A
&and.B
&and.D
&and.L's
&backquote.alim)
&backquote.ulema
&dash
&dash.
&dash."
&dashq.
&degree.
&degree.C
&ellip
&ellip.
&ellip.0
&ellip.1
&ellip.11
&ellip.2
&ellip.23
&ellip.28
&ellip.38
&ellip.4
&ellip.6
&ellip.64
&ellip.?"
&ellip.two
&times.
My script:
import xml.etree.ElementTree as et
s1 = 'train-fix.xml' # from http://www.d.umn.edu/~tpederse/Data/Sval1to2.fix.tar.gz
tree = et.parse(s1)
root = tree.getroot()
It gives this traceback:
Traceback (most recent call last):
File "senseval.py", line 4, in <module>
tree = et.parse(s1)
File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1182, in parse
tree.parse(source, parser)
File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
parser.feed(data)
File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1642, in feed
self._raiseerror(v)
File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 41, column 113
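Best answer
The "words" look like malformed entity references. A valid entity reference ends with a semicolon. I looked at test-fix.xml (in Sval1to2.fix.tar.gz), and it seems very likely that &dash (or &dash.) stands for some kind of dash or hyphen. The file has an .xml extension, and it would be very close to well-formed XML if the bad entity references were fixed.
The page you linked to (http://www.d.umn.edu/~tpederse/data.html) says: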
Please note that our converted data will not "parse" as true xml text. This is due to the fact that in the original sense-tagged text, characters that require special handling in xml are not escaped, and so forth. We are considering ways to make this data "true" xml, and would be most grateful for any feedback on how to best do this.
So although the document looks a lot like XML, it is not XML, and the people who published it are well aware of that.
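If the goal is still to load the data with xml.etree.ElementTree, one option is to rewrite the pseudo-references before parsing. The following is only a minimal sketch, assuming Python 3. The mapping GUESSED_ENTITIES and the helper fix_pseudo_entities are hypothetical names introduced here, and the chosen Unicode characters are guesses based on the token names, not something the data set documents. The sample string is made up for illustration.

```python
import re
import xml.etree.ElementTree as et

# Guessed meanings only: the data set does not document these names.
GUESSED_ENTITIES = {
    "and": "&amp;",       # ampersand; must remain escaped to stay well-formed
    "backquote": "`",
    "dash": "\u2014",     # em dash (could equally be a plain hyphen)
    "dashq": "\u2014",    # assumed to be a dash variant
    "degree": "\u00b0",   # degree sign
    "ellip": "\u2026",    # horizontal ellipsis
    "times": "\u00d7",    # multiplication sign
}

# "&name" followed by an optional ";" (real reference) or "." (pseudo-reference).
PSEUDO_ENTITY = re.compile(r"&([A-Za-z]+)([;.])?")


def fix_pseudo_entities(text):
    """Replace '&name.'-style tokens so the text can be fed to an XML parser."""
    def replace(match):
        name, terminator = match.group(1), match.group(2)
        if terminator == ";":
            return match.group(0)             # already a proper reference
        if name in GUESSED_ENTITIES:
            return GUESSED_ENTITIES[name]     # also drops the trailing "."
        return "&amp;" + match.group(0)[1:]   # unknown name: escape the bare "&"
    return PSEUDO_ENTITY.sub(replace, text)


# Made-up sample in the style of the tokens listed in the question.
sample = "<doc>AT&and.T said 20&degree.C &dash. wait&ellip.2 &amp; done</doc>"
root = et.fromstring(fix_pseudo_entities(sample))
print(root.text)
```

For the real file, the text would have to be read into a string first (its encoding is not stated on the download page) and passed through fix_pseudo_entities before calling et.fromstring. Since the publishers say that other special characters are also left unescaped, a bare "&" or "<" elsewhere in the data may still stop the parser, so this step alone may not be enough.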
Source: the Stack Overflow question "python - Converting XML-illegal &char to utf8": https://stackoverflow.com/questions/19030728/