Given an XML file, is there a way to use lxml to get all the leaf nodes along with their names and attributes?
Here is the XML file of interest:
<?xml version="1.0" encoding="UTF-8"?>
<clinical_study>
<!-- This xml conforms to an XML Schema at:
http://clinicaltrials.gov/ct2/html/images/info/public.xsd
and an XML DTD at:
http://clinicaltrials.gov/ct2/html/images/info/public.dtd -->
<id_info>
<org_study_id>3370-2(-4)</org_study_id>
<nct_id>NCT00753818</nct_id>
<nct_alias>NCT00222157</nct_alias>
</id_info>
<brief_title>Developmental Effects of Infant Formula Supplemented With LCPUFA</brief_title>
<sponsors>
<lead_sponsor>
<agency>Mead Johnson Nutrition</agency>
<agency_class>Industry</agency_class>
</lead_sponsor>
</sponsors>
<source>Mead Johnson Nutrition</source>
<oversight_info>
<authority>United States: Institutional Review Board</authority>
</oversight_info>
<brief_summary>
<textblock>
The purpose of this study is to compare the effects on visual development, growth, cognitive
development, tolerance, and blood chemistry parameters in term infants fed one of four study
formulas containing various levels of DHA and ARA.
</textblock>
</brief_summary>
<overall_status>Completed</overall_status>
<phase>N/A</phase>
<study_type>Interventional</study_type>
<study_design>N/A</study_design>
<primary_outcome>
<measure>visual development</measure>
</primary_outcome>
<secondary_outcome>
<measure>Cognitive development</measure>
</secondary_outcome>
<number_of_arms>4</number_of_arms>
<condition>Cognitive Development</condition>
<condition>Growth</condition>
<arm_group>
<arm_group_label>1</arm_group_label>
<arm_group_type>Experimental</arm_group_type>
</arm_group>
<arm_group>
<arm_group_label>2</arm_group_label>
<arm_group_type>Experimental</arm_group_type>
</arm_group>
<arm_group>
<arm_group_label>3</arm_group_label>
<arm_group_type>Experimental</arm_group_type>
</arm_group>
<arm_group>
<arm_group_label>4</arm_group_label>
<arm_group_type>Other</arm_group_type>
<description>Control</description>
</arm_group>
<intervention>
<intervention_type>Other</intervention_type>
<intervention_name>DHA and ARA</intervention_name>
<description>various levels of DHA and ARA</description>
<arm_group_label>1</arm_group_label>
<arm_group_label>2</arm_group_label>
<arm_group_label>3</arm_group_label>
</intervention>
<intervention>
<intervention_type>Other</intervention_type>
<intervention_name>Control</intervention_name>
<arm_group_label>4</arm_group_label>
</intervention>
</clinical_study>
What I want is a dictionary that looks like this:
{
'id_info_org_study_id': '3370-2(-4)',
'id_info_nct_id': 'NCT00753818',
'id_info_nct_alias': 'NCT00222157',
'brief_title': 'Developmental Effects...'
}
Is this possible with lxml, or with any other Python library?
Update:
This is what I ended up doing:
response = requests.get(url)
tree = lxml.etree.fromstring(response.content)
mydict = self._recurse_over_nodes(tree, None, {})

def _recurse_over_nodes(self, tree, parent_key, data):
    for branch in tree:
        # Skip comments and processing instructions; their .tag is not a string.
        if not isinstance(branch.tag, str):
            continue
        key = branch.tag
        if parent_key:
            key = '%s_%s' % (parent_key, key)
        if len(branch):
            # The element has children, so recurse with the extended key.
            data = self._recurse_over_nodes(branch, key, data)
        elif key in data:
            # Repeated leaf paths are joined into one comma-separated string.
            data[key] = data[key] + ', %s' % branch.text
        else:
            data[key] = branch.text
    return data
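For readers who want to try the approach above outside of a class, here is a minimal standalone sketch. The function name `flatten_leaves` and the shortened sample document are assumptions made for illustration, not part of the original code. The import falls back to the standard library's `xml.etree.ElementTree`, which supports the same calls used here, in case lxml is not installed.

```python
try:
    from lxml import etree
except ImportError:  # fall back to the standard library
    import xml.etree.ElementTree as etree


def flatten_leaves(element, parent_key=None, data=None):
    """Collect leaf text into a flat dict keyed by underscore-joined tag paths."""
    if data is None:
        data = {}
    for child in element:
        # Skip comments and processing instructions; their .tag is not a string.
        if not isinstance(child.tag, str):
            continue
        key = child.tag if parent_key is None else '%s_%s' % (parent_key, child.tag)
        if len(child):
            # Not a leaf: descend, carrying the path built so far.
            flatten_leaves(child, key, data)
        elif key in data:
            # Same path seen before: join the values with a comma.
            data[key] = '%s, %s' % (data[key], child.text)
        else:
            data[key] = child.text
    return data


# Shortened, hypothetical sample modelled on the file in the question.
sample = b'''<clinical_study>
  <!-- a comment that should be ignored -->
  <id_info>
    <org_study_id>3370-2(-4)</org_study_id>
    <nct_id>NCT00753818</nct_id>
  </id_info>
  <condition>Cognitive Development</condition>
  <condition>Growth</condition>
</clinical_study>'''

result = flatten_leaves(etree.fromstring(sample))
print(result)
```

Note that repeated leaves such as `condition` end up as one comma-separated string; storing a list per key would be an alternative if the values themselves may contain commas.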
Best answer
Use the iter method.
http://lxml.de/api/lxml.etree._Element-class.html#iter
Here is a working example.
#!/usr/bin/python
from lxml import etree
xml='''
<book>
<chapter id="113">
<sentence id="1" drums='Neil'>
<word id="128160" bass='Geddy'>
<POS Tag="V"/>
<grammar type="STEM"/>
<Aspect type="IMPV"/>
<Number type="S"/>
</word>
<word id="128161">
<POS Tag="V"/>
<grammar type="STEM"/>
<Aspect type="IMPF"/>
</word>
</sentence>
<sentence id="2">
<word id="128162">
<POS Tag="P"/>
<grammar type="PREFIX"/>
<Tag Tag="bi+"/>
</word>
</sentence>
</chapter>
</book>
'''
filename='/usr/share/sri/configurations/saved/test1.xml'
if __name__ == '__main__':
    root = etree.fromstring(xml)
    # iter('*') yields every element in the document
    for node in root.iter('*'):
        # elements with no children (length zero) are leaf nodes
        if len(node) == 0:
            print(node)
Here is the output:
$ ./verifyXmlWithDirs.py
<Element POS at 0x176dcf8>
<Element grammar at 0x176da70>
<Element Aspect at 0x176dc20>
<Element Number at 0x176dcf8>
<Element POS at 0x176dc20>
<Element grammar at 0x176dcf8>
<Element Aspect at 0x176da70>
<Element POS at 0x176da70>
<Element grammar at 0x176dc20>
<Element Tag at 0x176dcf8>
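The output above shows only the element objects. Since the question also asks for names and attributes, the following sketch reads `node.tag` and `node.attrib` from each leaf found with the same `iter('*')` and `len(node) == 0` test. The shortened XML and the variable name `leaves` are assumptions for illustration; the import falls back to the standard library if lxml is unavailable.

```python
try:
    from lxml import etree
except ImportError:  # fall back to the standard library
    import xml.etree.ElementTree as etree

# Shortened, hypothetical version of the XML used in the answer.
xml = '''
<book>
  <chapter id="113">
    <sentence id="1">
      <word id="128160">
        <POS Tag="V"/>
        <grammar type="STEM"/>
      </word>
    </sentence>
  </chapter>
</book>
'''

root = etree.fromstring(xml)

# A leaf is an element with no child elements; keep its tag and a plain
# dict copy of its attributes.
leaves = [(node.tag, dict(node.attrib))
          for node in root.iter('*')
          if len(node) == 0]

for tag, attrs in leaves:
    print(tag, attrs)
```

Leaf text, if any, is available as `node.text` in the same loop.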
Original question on Stack Overflow, "python - lxml: get all leaf nodes?": https://stackoverflow.com/questions/29567684/