Python - Reading a malformed XML file

How can I read an XML file whose name attribute contains forbidden characters such as <, >, " or '? The XML file is over 30k lines long, and the target is a pandas.DataFrame.

<rows>
<row number="164" item="9860404" name="160-30 Bracket" qty="1"/>
<row number="164" item="9860405" name="200-30 <> Bracket" qty="1" />
<row number="164" item="9860406" name="250-30 3/4" Bracket" qty="3" />
<row number="164" item="9860407" name="315-30 <-> Bracket" qty="4"/>
</rows>

Best answer

You can parse the sample data with the HTMLParser parser from lxml.etree:

>>> from lxml import etree
>>> parser = etree.HTMLParser()
>>> doc = etree.parse('data.xml', parser=parser)
>>> [elem.get('name') for elem in doc.xpath('//row')]
['160-30 Bracket', '200-30 <> Bracket', '250-30 3/4', '315-30 <-> Bracket']
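
Since the stated target is a pandas.DataFrame, the same parse can feed a DataFrame directly. The following is a minimal sketch, not part of the original answer: the helper name `rows_to_dataframe`, the `SAMPLE` constant and the `columns` parameter are illustrative assumptions, and the column list is taken from the attributes in the question's sample. As in the output above, the name of the third row is cut off at the stray quote.

```python
import io

import pandas as pd
from lxml import etree

# Sample data copied from the question (hypothetical constant name).
SAMPLE = b'''<rows>
<row number="164" item="9860404" name="160-30 Bracket" qty="1"/>
<row number="164" item="9860405" name="200-30 <> Bracket" qty="1" />
<row number="164" item="9860406" name="250-30 3/4" Bracket" qty="3" />
<row number="164" item="9860407" name="315-30 <-> Bracket" qty="4"/>
</rows>'''


def rows_to_dataframe(source, columns=("number", "item", "name", "qty")):
    """Parse malformed XML with lxml's HTML parser; one DataFrame row per <row>."""
    parser = etree.HTMLParser()
    doc = etree.parse(source, parser=parser)
    # Keep only the expected attributes; anything else (such as the stray
    # "bracket" attribute produced by the third row) is ignored.
    records = [
        {col: elem.get(col) for col in columns}
        for elem in doc.xpath("//row")
    ]
    return pd.DataFrame(records, columns=list(columns))


# A file path such as 'data.xml' can be passed instead of the BytesIO object.
df = rows_to_dataframe(io.BytesIO(SAMPLE))
# Attribute values arrive as strings; convert numeric columns explicitly.
df["qty"] = pd.to_numeric(df["qty"])
```

Selecting the columns by name keeps the DataFrame shape stable even when the HTML parser invents extra attributes from broken values.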

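Note that parsing the data with an HTML parser wraps the document in <html> and <body> elements and, in the third row, turns the text after the stray quote into a separate empty bracket attribute, so the document structure ends up looking like: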

<html><body><rows>
<row number="164" item="9860404" name="160-30 Bracket" qty="1"/>
<row number="164" item="9860405" name="200-30 &lt;&gt; Bracket" qty="1"/>
<row number="164" item="9860406" name="250-30 3/4" bracket="" qty="3"/>
<row number="164" item="9860407" name="315-30 &lt;-&gt; Bracket" qty="4"/>
</rows>
</body></html>
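
If the truncated name in the third row matters, one alternative (not part of the answer above) is to bypass the parser and extract the attributes with a regular expression. This is only a sketch under strong assumptions: every <row> sits on a single line, and the attributes always appear in the order number, item, name, qty. The names `ROW_RE`, `parse_rows` and `SAMPLE` are illustrative.

```python
import re

# Sample data copied from the question (hypothetical constant name).
SAMPLE = '''<rows>
<row number="164" item="9860404" name="160-30 Bracket" qty="1"/>
<row number="164" item="9860405" name="200-30 <> Bracket" qty="1" />
<row number="164" item="9860406" name="250-30 3/4" Bracket" qty="3" />
<row number="164" item="9860407" name="315-30 <-> Bracket" qty="4"/>
</rows>'''

# Assumption: fixed attribute order. The greedy "name" group runs up to the
# last '" qty="' on the line, so quotes and angle brackets inside it survive.
ROW_RE = re.compile(
    r'<row\s+number="(?P<number>[^"]*)"\s+item="(?P<item>[^"]*)"\s+'
    r'name="(?P<name>.*)"\s+qty="(?P<qty>[^"]*)"\s*/>'
)


def parse_rows(lines):
    """Return one dict per line that matches the expected <row> layout."""
    records = []
    for line in lines:
        match = ROW_RE.search(line)
        if match:
            records.append(match.groupdict())
    return records


records = parse_rows(SAMPLE.splitlines())
print(records[2]["name"])  # 250-30 3/4" Bracket
```

The resulting list of dicts can be passed to pandas.DataFrame(records). For a real file, iterating over the open file object line by line avoids loading all 30k lines at once.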

This question, "Python - Reading a malformed XML file", is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/59586925/
