python - lxml: Getting all leaf nodes?

Given an XML file, is there a way to use lxml to get all the leaf nodes along with their names and attributes?

Here is the XML file in question:

<?xml version="1.0" encoding="UTF-8"?>
<clinical_study>
  <!-- This xml conforms to an XML Schema at:
    http://clinicaltrials.gov/ct2/html/images/info/public.xsd
 and an XML DTD at:
    http://clinicaltrials.gov/ct2/html/images/info/public.dtd -->
  <id_info>
    <org_study_id>3370-2(-4)</org_study_id>
    <nct_id>NCT00753818</nct_id>
    <nct_alias>NCT00222157</nct_alias>
  </id_info>
  <brief_title>Developmental Effects of Infant Formula Supplemented With LCPUFA</brief_title>
  <sponsors>
    <lead_sponsor>
      <agency>Mead Johnson Nutrition</agency>
      <agency_class>Industry</agency_class>
    </lead_sponsor>
  </sponsors>
  <source>Mead Johnson Nutrition</source>
  <oversight_info>
    <authority>United States: Institutional Review Board</authority>
  </oversight_info>
  <brief_summary>
    <textblock>
      The purpose of this study is to compare the effects on visual development, growth, cognitive
      development, tolerance, and blood chemistry parameters in term infants fed one of four study
      formulas containing various levels of DHA and ARA.
    </textblock>
  </brief_summary>
  <overall_status>Completed</overall_status>
  <phase>N/A</phase>
  <study_type>Interventional</study_type>
  <study_design>N/A</study_design>
  <primary_outcome>
    <measure>visual development</measure>
  </primary_outcome>
  <secondary_outcome>
    <measure>Cognitive development</measure>
  </secondary_outcome>
  <number_of_arms>4</number_of_arms>
  <condition>Cognitive Development</condition>
  <condition>Growth</condition>
  <arm_group>
    <arm_group_label>1</arm_group_label>
    <arm_group_type>Experimental</arm_group_type>
  </arm_group>
  <arm_group>
    <arm_group_label>2</arm_group_label>
    <arm_group_type>Experimental</arm_group_type>
  </arm_group>
  <arm_group>
    <arm_group_label>3</arm_group_label>
    <arm_group_type>Experimental</arm_group_type>
  </arm_group>
  <arm_group>
    <arm_group_label>4</arm_group_label>
    <arm_group_type>Other</arm_group_type>
    <description>Control</description>
  </arm_group>
  <intervention>
    <intervention_type>Other</intervention_type>
    <intervention_name>DHA and ARA</intervention_name>
    <description>various levels of DHA and ARA</description>
    <arm_group_label>1</arm_group_label>
    <arm_group_label>2</arm_group_label>
    <arm_group_label>3</arm_group_label>
  </intervention>
  <intervention>
    <intervention_type>Other</intervention_type>
    <intervention_name>Control</intervention_name>
    <arm_group_label>4</arm_group_label>
  </intervention>
</clinical_study>

What I want is a dictionary that looks like this:

{
   'id_info_org_study_id': '3370-2(-4)', 
   'id_info_nct_id': 'NCT00753818', 
   'id_info_nct_alias': 'NCT00222157', 
   'brief_title': 'Developmental Effects...'
}

Is this possible with lxml, or with any other Python library?

Update:

This is what I ended up doing:

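import lxml.etree
import requests

def recurse_over_nodes(tree, parent_key, data):
    """Flatten leaf elements into data, keyed by their underscore-joined tag path."""
    for branch in tree:
        # Skip comments and processing instructions, whose .tag is not a string.
        if not isinstance(branch.tag, str):
            continue
        key = '%s_%s' % (parent_key, branch.tag) if parent_key else branch.tag
        if len(branch):
            # The element has children: descend, carrying the accumulated key.
            recurse_over_nodes(branch, key, data)
        elif key in data:
            # Repeated leaf path: append the value to the existing entry.
            data[key] = '%s, %s' % (data[key], branch.text)
        else:
            data[key] = branch.text
    return data

response = requests.get(url)  # url: address of the XML document
tree = lxml.etree.fromstring(response.content)
mydict = recurse_over_nodes(tree, None, {})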

Best answer

Use the iter method.

http://lxml.de/api/lxml.etree._Element-class.html#iter

Here is a working example.

#!/usr/bin/python
from lxml import etree

xml='''
<book>
    <chapter id="113">

        <sentence id="1" drums='Neil'>
            <word id="128160" bass='Geddy'>
                <POS Tag="V"/>
                <grammar type="STEM"/>
                <Aspect type="IMPV"/>
                <Number type="S"/>
            </word>
            <word id="128161">
                <POS Tag="V"/>
                <grammar type="STEM"/>
                <Aspect type="IMPF"/>
            </word>
        </sentence>

        <sentence id="2">
            <word id="128162">
                <POS Tag="P"/>
                <grammar type="PREFIX"/>
                <Tag Tag="bi+"/>
            </word>
        </sentence>

    </chapter>
</book>
'''

filename='/usr/share/sri/configurations/saved/test1.xml'

if __name__ == '__main__':
    root = etree.fromstring(xml)

    # iter('*') yields every element in the document, in document order
    #
    for node in root.iter('*'):

        # an element of length zero has no children, so it is a leaf node
        #
        if len(node) == 0:
            print(node)

Here is the output:

$ ./verifyXmlWithDirs.py
<Element POS at 0x176dcf8>
<Element grammar at 0x176da70>
<Element Aspect at 0x176dc20>
<Element Number at 0x176dcf8>
<Element POS at 0x176dc20>
<Element grammar at 0x176dcf8>
<Element Aspect at 0x176da70>
<Element POS at 0x176da70>
<Element grammar at 0x176dc20>
<Element Tag at 0x176dcf8>

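The output above shows only element objects, while the question asks for names, attributes, and a flattened dictionary. The following is a minimal sketch of how the same iter-based leaf test could be extended to produce those. The helper names (iter_leaves, leaf_summary, flatten_leaves), the trimmed sample XML, and the kind attribute are hypothetical and not part of the original answer. Joining repeated keys with ', ' mirrors the behavior of the code in the question's update.

```python
from lxml import etree

# Trimmed, hypothetical sample shaped like the XML in the question.
# Bytes are used because lxml rejects str input with an encoding declaration.
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<clinical_study>
  <!-- a comment, which iter('*') does not yield -->
  <id_info>
    <org_study_id>3370-2(-4)</org_study_id>
    <nct_id>NCT00753818</nct_id>
  </id_info>
  <brief_title>Developmental Effects</brief_title>
  <condition>Cognitive Development</condition>
  <condition>Growth</condition>
  <intervention kind="demo"/>
</clinical_study>
"""


def iter_leaves(root):
    """Yield every element under root (root included) that has no children."""
    for node in root.iter('*'):
        if len(node) == 0:
            yield node


def leaf_summary(root):
    """Return a list of (tag, attributes) pairs, one per leaf element."""
    return [(leaf.tag, dict(leaf.attrib)) for leaf in iter_leaves(root)]


def flatten_leaves(root, sep='_'):
    """Map each leaf's tag path (excluding root) to its stripped text."""
    result = {}
    for leaf in iter_leaves(root):
        # iterancestors() walks from the parent outwards; leave out the root.
        path = [leaf.tag]
        path.extend(a.tag for a in leaf.iterancestors() if a is not root)
        key = sep.join(reversed(path))
        text = (leaf.text or '').strip()
        # Repeated paths are joined with ', ' instead of being overwritten.
        result[key] = '%s, %s' % (result[key], text) if key in result else text
    return result


if __name__ == '__main__':
    root = etree.fromstring(XML)
    for tag, attrs in leaf_summary(root):
        print(tag, attrs)
    print(flatten_leaves(root))
```

For this sample, flatten_leaves should yield keys such as 'id_info_org_study_id' and 'brief_title', with the two condition elements merged into 'Cognitive Development, Growth'. Whether merging repeated keys is the right choice depends on the data; collecting values into a list may suit other cases better.
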
This question and its answer come from Stack Overflow: https://stackoverflow.com/questions/29567684/
