python - lxml does not emit whitespace when writing, even with pretty_print=True

Tags: python xml xml-parsing lxml

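I am using the lxml library to read an XML template, insert/change some elements, and save the resulting XML. One of the elements is created on the fly with the etree.Element and etree.SubElement methods: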

tree = etree.parse(r'xml_archive\templates\metadata_template_pts.xml')
root = tree.getroot()

stream = []
for element in root.iter():
    if isinstance(element.tag, basestring):
        stream.append(element.tag)

        # Find "keywords" element and insert a new "theme" element
        if element.tag == 'keywords' and 'theme' not in stream:
            theme = etree.Element('theme')
            themekt = etree.SubElement(theme, 'themekt').text = 'None'
            for tk in themekeys:
                themekey = etree.SubElement(theme, 'themekey').text = tk
            element.insert(0, theme)

It prints nicely to the screen with print etree.tostring(theme, pretty_print=True):

<theme>
  <themekt>None</themekt>
  <themekey>Hydrogeology</themekey>
  <themekey>Stratigraphy</themekey>
  <themekey>Floridan aquifer system</themekey>
  <themekey>Geology</themekey>
  <themekey>Regional Groundwater Availability Study</themekey>
  <themekey>USGS</themekey>
  <themekey>United States Geological Survey</themekey>
  <themekey>thickness</themekey>
  <themekey>altitude</themekey>
  <themekey>extent</themekey>
  <themekey>regions</themekey>
  <themekey>upper confining unit</themekey>
  <themekey>FAS</themekey>
  <themekey>base</themekey>
  <themekey>geologic units</themekey>
  <themekey>geology</themekey>
  <themekey>extent</themekey>
  <themekey>inlandWaters</themekey>
</theme>

However, when I write the XML out with etree.ElementTree(root).write(out_xml_file, method='xml', pretty_print=True), this element is flattened in the output file:

<theme><themekt>None</themekt><themekey>Hydrogeology</themekey><themekey>Stratigraphy</themekey><themekey>Floridan aquifer system</themekey><themekey>Geology</themekey><themekey>Regional Groundwater Availability Study</themekey><themekey>USGS</themekey><themekey>United States Geological Survey</themekey><themekey>thickness</themekey><themekey>altitude</themekey><themekey>extent</themekey><themekey>regions</themekey><themekey>upper confining unit</themekey><themekey>FAS</themekey><themekey>base</themekey><themekey>geologic units</themekey><themekey>geology</themekey><themekey>extent</themekey><themekey>inlandWaters</themekey></theme>

The rest of the file is written fine, but this particular element is a (purely aesthetic) nuisance. Any ideas about what I am doing wrong?


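Below is a piece of markup from the template XML file (save it as "template.xml" to run it with the code snippet at the bottom). The tags are only flattened when I parse an existing file and insert a new element, not when I create the XML from scratch with lxml.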

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="fgdc_classic.xsl"?>
<metadata xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:noNamespaceSchemaLocation="http://water.usgs.gov/GIS/metadata/usgswrd/fgdc-std-001-1998.xsd">
    <keywords>
       <theme>
            <themekt>ISO 19115 Topic Categories</themekt>
            <themekey>environment</themekey>
            <themekey>geoscientificInformation</themekey>
            <themekey>inlandWaters</themekey>
        </theme>
        <place>
            <placekt>None</placekt>
            <placekey>Florida</placekey>
            <placekey>Georgia</placekey>
            <placekey>Alabama</placekey>
            <placekey>South Carolina</placekey>
        </place>
    </keywords>

</metadata>

Here is the code snippet to use with the markup snippet above:

# Python 2 snippet (it relies on basestring)
from lxml import etree

# Create new theme element to insert into root
themekeys = ['Hydrogeology', 'Stratigraphy', 'inlandWaters']

tree = etree.parse(r'template.xml')
root = tree.getroot()

stream = []
for element in root.iter():
    if isinstance(element.tag, basestring):
        stream.append(element.tag)

        # Edit theme keywords
        if element.tag == 'keywords':
            theme = etree.Element('theme')
            themekt = etree.SubElement(theme, 'themekt').text = 'None'
            for tk in themekeys:
                themekey = etree.SubElement(theme, 'themekey').text = tk
            element.insert(0, theme)

# Write XML to new file
out_xml_file = 'test.xml'
etree.ElementTree(root).write(out_xml_file, method='xml', pretty_print=True)
with open(out_xml_file, 'r') as f:
    lines = f.readlines()

with open(out_xml_file, 'w') as f:
    f.write('<?xml version="1.0" encoding="UTF-8"?>\n')
    for line in lines:
        f.write(line)

Accepted answer

If you replace this line:

tree = etree.parse(r'template.xml')

with these lines:

parser = etree.XMLParser(remove_blank_text=True)
tree = etree.parse(r'template.xml', parser)

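then it will work as expected. The trick is to parse with an XMLParser whose remove_blank_text option is set to True. Any existing ignorable whitespace is removed at parse time, so it cannot interfere with the subsequent pretty-print. Without that option, the indentation whitespace already present inside keywords in the template is kept as text content, and pretty_print does not re-indent an element that already contains text, which is why the inserted theme element comes out on a single line.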
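To see the difference side by side, here is a minimal sketch that parses the same markup with the default parser and with remove_blank_text=True, then compares the two serialized results. It uses Python 3 rather than the Python 2 of the question, a shortened inline template instead of template.xml, and the helper names insert_theme and render, all of which are assumptions made for illustration.

```python
from io import BytesIO

from lxml import etree

# Shortened stand-in for template.xml (assumption: the real file is larger).
TEMPLATE = b"""<metadata>
    <keywords>
        <place>
            <placekt>None</placekt>
        </place>
    </keywords>
</metadata>
"""


def insert_theme(root, themekeys):
    """Hypothetical helper: build a <theme> element and insert it first in <keywords>."""
    keywords = root.find('keywords')
    theme = etree.Element('theme')
    etree.SubElement(theme, 'themekt').text = 'None'
    for tk in themekeys:
        etree.SubElement(theme, 'themekey').text = tk
    keywords.insert(0, theme)


def render(parser=None):
    """Hypothetical helper: parse the template, insert <theme>, pretty-print to a string."""
    tree = etree.parse(BytesIO(TEMPLATE), parser)
    insert_theme(tree.getroot(), ['Hydrogeology', 'Stratigraphy'])
    return etree.tostring(tree, pretty_print=True).decode()


# Default parser: the template's own indentation is kept as text nodes.
default_output = render()

# Blank text stripped at parse time: pretty_print can indent everything.
stripped_output = render(etree.XMLParser(remove_blank_text=True))

print(default_output)
print(stripped_output)
```

In the first printout the new theme element should appear flattened on one line, as in the question, while in the second every element sits on its own indented line. Note that the second variant also replaces the template's original indentation with lxml's own, so the output file will not match the template's whitespace exactly.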

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/31274414/
