How can I parse a large XML file and process its elements as ObjectifiedElement (using the objectify parser)?
I haven't found a better solution than the following:
from lxml import etree, objectify

for event, elt in etree.iterparse('onebigfile.xml', tag='MyTag'):
    oelt = objectify.fromstring(etree.tostring(elt))
    my_process(oelt)
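How can I avoid this intermediate string representation?
Best answer
I think it is really easy to use iterparse to build a custom data extractor that does not need objectify at all.
For the sake of this example, I've used a .NET reference XML file that looks a bit like this: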
<doc>
  <assembly>
    <name>System.IO</name>
  </assembly>
  <members>
    <member name="T:System.IO.BinaryReader">
      <summary>Reads primitive data types as binary values in a specific encoding.</summary>
      <filterpriority>2</filterpriority>
    </member>
    <member name="M:System.IO.BinaryReader.#ctor(System.IO.Stream)">
      <summary>Initializes a new instance of the <see cref="T:System.IO.BinaryReader" /> class based on the specified stream and using UTF-8 encoding.</summary>
      <param name="input">The input stream. </param>
      <exception cref="T:System.ArgumentException">The stream does not support reading, is null, or is already closed. </exception>
    </member>
    <member name="M:System.IO.BinaryReader.#ctor(System.IO.Stream,System.Text.Encoding)">
      <summary>Initializes a new instance of the <see cref="T:System.IO.BinaryReader" /> class based on the specified stream and character encoding.</summary>
      <param name="input">The input stream. </param>
      <param name="encoding">The character encoding to use. </param>
      <exception cref="T:System.ArgumentException">The stream does not support reading, is null, or is already closed. </exception>
      <exception cref="T:System.ArgumentNullException">
        <paramref name="encoding" /> is null. </exception>
    </member>
    <!-- ... many more members like this -->
  </members>
</doc>
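Let's say you want to extract all members with their names, summaries and parameters (the latter under '@'-prefixed keys) as a list of dicts like this: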
{
    'summary': 'Reads primitive data types as binary values in a specific encoding.',
    'name': 'T:System.IO.BinaryReader'
}
{
    'summary': 'Initializes a new instance of the class based on the specified stream and using UTF-8 encoding.',
    '@input': 'The input stream.',
    'name': 'M:System.IO.BinaryReader.#ctor(System.IO.Stream)'
}
{
    'summary': 'Initializes a new instance of the class based on the specified stream and character encoding.',
    '@input': 'The input stream.',
    '@encoding': 'The character encoding to use.',
    'name': 'M:System.IO.BinaryReader.#ctor(System.IO.Stream,System.Text.Encoding)'
}
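You can do it like this:
- Use lxml.etree.iterparse with start and end events.
- When a <member> element starts, prepare a new dict (item).
- While we are inside a <member> element, add anything we are interested in to the dict.
- When the <member> element ends, finalize the dict and yield it.
- Setting item to None serves as the "inside/outside of <member>" flag.
In code: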
from lxml import etree

def text_content(elt):
    # Join every text node below elt, including text that follows child elements.
    return ' '.join([t.strip() for t in elt.itertext()])

def extract_data(xmlfile):
    item = None  # None means "currently outside of a <member> element"
    for event, elt in etree.iterparse(xmlfile, events=['start', 'end']):
        if elt.tag == 'member':
            if event == 'start':
                item = {}
            else:
                item['name'] = elt.attrib['name']
                elt.clear()  # release the finished subtree to keep memory use low
                yield item
                item = None

        if item is None:
            continue

        if event == 'end':
            if elt.tag in ('summary', 'returns'):
                item[elt.tag] = text_content(elt)
                continue

            if elt.tag == 'param':
                item['@' + elt.attrib['name']] = text_content(elt)
                continue

testfile = r'C:\Program Files (x86)\Reference Assemblies\Microsoft\Framework\.NETCore\v4.5.1\System.IO.xml'
for item in extract_data(testfile):
    print(item)
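If you don't have the .NET reference file at hand, the same approach can be tried on a small in-memory document. This is only a sketch: the SAMPLE document and its Demo.* names are made up for illustration, and the fallback to the standard library's xml.etree.ElementTree is my assumption that works here because the extractor only uses calls both libraries share (iterparse with events, itertext, attrib).

```python
import io

try:
    from lxml import etree
except ImportError:
    # Assumption: the stdlib parser is an acceptable stand-in for this demo.
    import xml.etree.ElementTree as etree

# Hypothetical sample data, shaped like the .NET reference file.
SAMPLE = b"""<doc>
  <members>
    <member name="T:Demo.Reader">
      <summary>Reads values.</summary>
    </member>
    <member name="M:Demo.Reader.#ctor(Demo.Stream)">
      <summary>Creates a <see cref="T:Demo.Reader" /> from a stream.</summary>
      <param name="input">The input stream. </param>
    </member>
  </members>
</doc>"""

def text_content(elt):
    # Stripped text pieces joined by single spaces; child tags are dropped.
    return ' '.join(t.strip() for t in elt.itertext())

def extract_data(source):
    item = None  # None means "outside of a <member> element"
    for event, elt in etree.iterparse(source, events=('start', 'end')):
        if elt.tag == 'member':
            if event == 'start':
                item = {}
            else:
                item['name'] = elt.attrib['name']
                yield item
                item = None
        if item is None:
            continue
        if event == 'end':
            if elt.tag in ('summary', 'returns'):
                item[elt.tag] = text_content(elt)
            elif elt.tag == 'param':
                item['@' + elt.attrib['name']] = text_content(elt)

items = list(extract_data(io.BytesIO(SAMPLE)))
for item in items:
    print(item)
```

Note that text_content() silently drops the <see /> reference, so the second summary comes out as 'Creates a from a stream.'. If you need the cref value, you would have to handle <see> explicitly.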
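This way you get the fastest and most memory-efficient parsing, with fine control over which data you look at. Using objectify would be more wasteful than that, even without the intermediate tostring()/fromstring() round trip.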
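On the memory side, iterparse still builds the tree as it goes, so for really large files it helps to discard elements once they have been processed. The following is a minimal sketch of a commonly used lxml idiom, not part of the code above: iter_members and the sample document are hypothetical names of mine, and the tag= filter is lxml-specific (as in the question).

```python
import io
from lxml import etree

def iter_members(source):
    """Yield (name, summary) per <member>, freeing each subtree once consumed."""
    for _event, elt in etree.iterparse(source, events=('end',), tag='member'):
        yield elt.get('name'), elt.findtext('summary')
        # Drop the element's own children, text and attributes ...
        elt.clear()
        # ... and detach already-processed siblings that the parent still holds.
        parent = elt.getparent()
        while elt.getprevious() is not None:
            del parent[0]

# Hypothetical sample document for illustration.
sample = b"""<doc><members>
<member name="A"><summary>first</summary></member>
<member name="B"><summary>second</summary></member>
</members></doc>"""

results = list(iter_members(io.BytesIO(sample)))
print(results)
```

The cleanup runs only after the consumer asks for the next item, so each yielded value must be fully extracted (plain strings here) before the element is cleared.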
Source: "python - lxml iterparse with objectify" on Stack Overflow: https://stackoverflow.com/questions/49880545/