python - 如何将 .txt 文件解析为 .xml?

标签 python xml python-2.7 parsing elementtree

这是我的txt文件:

In File Name:   C:\Users\naqushab\desktop\files\File 1.m1
Out File Name:  C:\Users\naqushab\desktop\files\Output\File 1.m2
In File Size:   Low:    22636   High:   0
Total Process time: 1.859000
Out File Size:  Low:    77619   High:   0

In File Name:   C:\Users\naqushab\desktop\files\File 2.m1
Out File Name:  C:\Users\naqushab\desktop\files\Output\File 2.m2
In File Size:   Low:    20673   High:   0
Total Process time: 3.094000
Out File Size:  Low:    94485   High:   0

In File Name:   C:\Users\naqushab\desktop\files\File 3.m1
Out File Name:  C:\Users\naqushab\desktop\files\Output\File 3.m2
In File Size:   Low:    66859   High:   0
Total Process time: 3.516000
Out File Size:  Low:    217268  High:   0

我正在尝试将其解析为这样的 XML 格式:

<?xml version='1.0' encoding='utf-8'?>
<root>
    <filedata>
        <InFileName>File 1.m1</InFileName>
        <OutFileName>File 1.m2</OutFileName>
        <InFileSize>22636</InFileSize>
        <OutFileSize>77619</OutFileSize>
        <ProcessTime>1.859000</ProcessTime>
    </filedata>
    <filedata>
        <InFileName>File 2.m1</InFileName>
        <OutFileName>File 2.m2</OutFileName>
        <InFileSize>20673</InFileSize>
        <OutFileSize>94485</OutFileSize>
        <ProcessTime>3.094000</ProcessTime>
    </filedata>
    <filedata>
        <InFileName>File 3.m1</InFileName>
        <OutFileName>File 3.m2</OutFileName>
        <InFileSize>66859</InFileSize>
        <OutFileSize>217268</OutFileSize>
        <ProcessTime>3.516000</ProcessTime>
    </filedata>
</root>

这是我试图实现的代码(我使用的是 Python 2):

import re
import xml.etree.ElementTree as ET

rex = re.compile(r'''(?P<title>In File Name:
                       |Out File Name:
                       |In File Size:   Low:
                       |Total Process time:
                       |Out File Size:  Low:
                     )
                     (?P<value>.*)
                     ''', re.VERBOSE)

root = ET.Element('root')
root.text = '\n'    # newline before the celldata element

with open('Performance.txt') as f:
    celldata = ET.SubElement(root, 'filedata')
    celldata.text = '\n'    # newline before the collected element
    celldata.tail = '\n\n'  # empty line after the celldata element
    for line in f:
        # Empty line starts new celldata element (hack style, uggly)
        if line.isspace():
            celldata = ET.SubElement(root, 'filedata')
            celldata.text = '\n'
            celldata.tail = '\n\n'

        # If the line contains the wanted data, process it.
        m = rex.search(line)
        if m:
            # Fix some problems with the title as it will be used
            # as the tag name.
            title = m.group('title')
            title = title.replace('&', '')
            title = title.replace(' ', '')

            e = ET.SubElement(celldata, title.lower())
            e.text = m.group('value')
            e.tail = '\n'

# Display for debugging
ET.dump(root)

# Include the root element to the tree and write the tree
# to the file.
tree = ET.ElementTree(root)
tree.write('Performance.xml', encoding='utf-8', xml_declaration=True)

但是我得到的是空值,是否可以将此 txt 解析为 XML?

最佳答案

对你的正则表达式的更正:它应该是

m = re.search('(?P<title>(In File Name)|(Out File Name)|(In File Size: *Low)|(Total Process time)|(Out File Size: *Low)):(?P<value>.*)',line)

而不是你给的。因为在您的正则表达式中,In File Name|Out File Name 意味着,它将检查 In File Nam 后跟 eO 后跟 ut 文件名 等等。

建议,

您无需使用正则表达式即可完成。 xml.dom.minidom 可用于美化您的 xml 字符串。

为了更好地理解,我添加了内联评论!

Node.toprettyxml([indent=""[, newl=""[, encoding=""]]])

Return a pretty-printed version of the document. indent specifies the indentation string and defaults to a tabulator; newl specifies the string emitted at the end of each line and defaults to

编辑

import itertools as it
[line[0] for line in it.groupby(lines)]

you can use groupby of itertools package to group consucutive dedup in list lines

所以,

import xml.etree.ElementTree as ET
root = ET.Element('root')

with open('file1.txt') as f:
    lines = f.read().splitlines()

#add first subelement
celldata = ET.SubElement(root, 'filedata')

import itertools as it
#for every line in input file
#group consecutive dedup to one 
for line in it.groupby(lines):
    line=line[0]
    #if its a break of subelements  - that is an empty space
    if not line:
        #add the next subelement and get it as celldata
        celldata = ET.SubElement(root, 'filedata')
    else:
        #otherwise, split with : to get the tag name
        tag = line.split(":")
        #format tag name
        el=ET.SubElement(celldata,tag[0].replace(" ",""))
        tag=' '.join(tag[1:]).strip()
        
        #get file name from file path
        if 'File Name' in line:
            tag = line.split("\\")[-1].strip()
        elif 'File Size' in line:
            splist =  filter(None,line.split(" "))
            tag = splist[splist.index('Low:')+1]
            #splist[splist.index('High:')+1]
        el.text = tag

#prettify xml
import xml.dom.minidom as minidom
formatedXML = minidom.parseString(
                          ET.tostring(
                                      root)).toprettyxml(indent=" ",encoding='utf-8').strip()
# Display for debugging
print formatedXML

#write the formatedXML to file.
with open("Performance.xml","w+") as f:
    f.write(formatedXML)

输出: 性能.x​​ml

<?xml version="1.0" encoding="utf-8"?>
<root>
 <filedata>
  <InFileName>File 1.m1</InFileName>
  <OutFileName>File 1.m2</OutFileName>
  <InFileSize>22636</InFileSize>
  <TotalProcesstime>1.859000</TotalProcesstime>
  <OutFileSize>77619</OutFileSize>
 </filedata>
 <filedata>
  <InFileName>File 2.m1</InFileName>
  <OutFileName>File 2.m2</OutFileName>
  <InFileSize>20673</InFileSize>
  <TotalProcesstime>3.094000</TotalProcesstime>
  <OutFileSize>94485</OutFileSize>
 </filedata>
 <filedata>
  <InFileName>File 3.m1</InFileName>
  <OutFileName>File 3.m2</OutFileName>
  <InFileSize>66859</InFileSize>
  <TotalProcesstime>3.516000</TotalProcesstime>
  <OutFileSize>217268</OutFileSize>
 </filedata>
</root>

希望对您有所帮助!

关于python - 如何将 .txt 文件解析为 .xml?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/42835956/

相关文章:

python - python中除零之外的列表排序

python - 如何从Oracle表中删除大量没有主键的记录

python - 值错误: Location based indexing can only have [labels (MUST BE IN THE INDEX)

c# - 如何摆脱子元素中的空命名空间 "xmlns="""?

python - 为 emcee 中的每个参数定义自定义先验

python-2.7 - 为 Sklearn reshape 数据

python - 如何使用 pandas.shift() 而不丢失数据?

C# 架构验证返回 false 成功

对于单个 Unicode 字符串,Python 返回长度为 2

c# - 反序列化 XML 的 SQL 查询