python - Parsing XML and writing to a CSV file

Tags: python xml python-2.7 csv minidom

I am parsing a simple XML document with a simple script I wrote (with a few tweaks). Here is the XML:

<?xml version="1.0" ?>
<library owner="John Franks">
 <book>
  <title>Sandman Volume 1: Preludes and Nocturnes</title>
  <author>Neil Gaiman</author>
 </book>
 <book>
  <title>Good Omens</title>
  <author>Neil Gamain</author>
  <author>Terry Pratchett</author>
 </book>
 <book>
  <title>The Man And The Goat</title>
  <author>Bubber Elderidge</author>
 </book>
 <book>
  <title>Once Upon A Time in LA</title>
  <author>Dr Dre</author>
 </book>
 <book>
  <title>There Will Never Be Justice</title>
  <author>IR Jury</author>
 </book>
 <book>
  <title>Beginning Python</title>
  <author>Peter Norton, et al</author>
 </book>
</library>

Here is my Python script:

from xml.dom.minidom import parse
import xml.dom.minidom
import csv

def writeToCSV(myLibrary):
  csvfile = open('output.csv', 'w')
  fieldnames = ['title', 'author', 'author']
  writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
  writer.writeheader()
  
  books = myLibrary.getElementsByTagName("book")
  for book in books:
    titleValue = book.getElementsByTagName("title")[0].childNodes[0].data
    for author in book.getElementsByTagName("author"):
      authorValue = author.childNodes[0].data
      writer.writerow({'title': titleValue, 'author': authorValue})


doc = parse('library.xml')
myLibrary = doc.getElementsByTagName("library")[0]

# Write one CSV row per (title, author) pair
writeToCSV(myLibrary)

Here is my output:

title,author
Sandman Volume 1: Preludes and Nocturnes,Neil Gaiman
Good Omens,Neil Gamain
Good Omens,Terry Pratchett
The Man And The Goat,Bubber Elderidge
Once Upon A Time in LA,Dr Dre
There Will Never Be Justice,IR Jury
Beginning Python,"Peter Norton, et al"

Note that the book "Good Omens" has 2 authors, and they appear on two separate rows. What I actually want is for it to look like this:

title,author,author
Sandman Volume 1: Preludes and Nocturnes,Neil Gaiman,,
Good Omens,Neil Gamain,Terry Pratchett
The Man And The Goat,Bubber Elderidge,,
Once Upon A Time in LA,Dr Dre,,
There Will Never Be Justice,IR Jury,,
Beginning Python,"Peter Norton, et al",,

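As you can see, there are 3 columns, so both authors appear on the same row. Books with only one author simply get a blank entry, which is why two commas sit next to each other.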

Best answer

A good way to solve this is to use lxml:

>>> from lxml import etree
>>> with open('doc.xml') as f:
...     doc = etree.XML(f.read())
...     for e in doc.xpath('book'):
...         print (e.xpath('author/text()'), e.xpath('title/text()')[0])
...
(['Neil Gaiman'], 'Sandman Volume 1: Preludes and Nocturnes')
(['Neil Gamain', 'Terry Pratchett'], 'Good Omens')
(['Bubber Elderidge'], 'The Man And The Goat')
(['Dr Dre'], 'Once Upon A Time in LA')
(['IR Jury'], 'There Will Never Be Justice')
(['Peter Norton, et al'], 'Beginning Python')
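
If installing lxml is not an option, roughly the same extraction can be sketched with the standard library's xml.etree.ElementTree, whose findall and findtext methods accept simple paths. This is not part of the original answer: the extract_books helper and the inline XML_TEXT string (a shortened stand-in for the library file) are assumptions made for illustration, and the code is written for Python 3.

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the library XML from the question (assumption).
XML_TEXT = """<library owner="John Franks">
 <book>
  <title>Good Omens</title>
  <author>Neil Gamain</author>
  <author>Terry Pratchett</author>
 </book>
 <book>
  <title>Beginning Python</title>
  <author>Peter Norton, et al</author>
 </book>
</library>"""


def extract_books(xml_text):
    """Return a list of (title, [authors]) tuples, one per <book>."""
    root = ET.fromstring(xml_text)
    books = []
    for book in root.findall('book'):
        title = book.findtext('title')
        # A book may have any number of <author> children.
        authors = [a.text for a in book.findall('author')]
        books.append((title, authors))
    return books


for title, authors in extract_books(XML_TEXT):
    print(title, authors)
```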

Then, to generate the CSV, you can do the following:

import csv

# Python 2: open with 'wb'; Python 3: open('output.csv', 'w', newline='')
with open('output.csv', 'w') as fout:
    fieldnames = ['title', 'authors']
    writer = csv.DictWriter(fout, fieldnames=fieldnames)
    writer.writeheader()
    for e in doc.xpath('book'):
        title, authors = e.xpath('title/text()')[0], e.xpath('author/text()')
        # Join all authors into a single ';'-separated field
        writer.writerow({'title': title, 'authors': ';'.join(authors)})

Or:

with open('output.csv', 'w') as fout:
    fieldnames = ['title', 'author1', 'author2']
    writer = csv.DictWriter(fout, fieldnames=fieldnames)
    writer.writeheader()
    for e in doc.xpath('book'):
        title, authors = e.xpath('title/text()')[0], e.xpath('author/text()')
        # Default to empty strings so books with fewer authors leave blank columns
        author1, author2 = '', ''
        if len(authors) >= 1:
            author1 = authors[0]
        if len(authors) >= 2:
            author2 = authors[1]
        writer.writerow({'title': title, 'author1': author1, 'author2': author2})
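
A dict can hold only one value per key, so csv.DictWriter cannot fill two columns that are both named author. If the header really has to read title,author,author, one option is a plain csv.writer with each row padded to the same length. The sketch below stays with minidom, as in the question, and is only an illustration: book_rows, write_csv and the inline XML_TEXT sample are hypothetical names and data, and the code assumes Python 3.

```python
import csv
import io
from xml.dom.minidom import parseString

# Shortened stand-in for the library XML from the question (assumption).
XML_TEXT = """<library owner="John Franks">
 <book>
  <title>Good Omens</title>
  <author>Neil Gamain</author>
  <author>Terry Pratchett</author>
 </book>
 <book>
  <title>Beginning Python</title>
  <author>Peter Norton, et al</author>
 </book>
</library>"""


def book_rows(xml_text):
    """Return one list per book: [title, author, author, ...]."""
    doc = parseString(xml_text)
    rows = []
    for book in doc.getElementsByTagName('book'):
        title = book.getElementsByTagName('title')[0].firstChild.data
        authors = [a.firstChild.data
                   for a in book.getElementsByTagName('author')]
        rows.append([title] + authors)
    return rows


def write_csv(rows, out):
    """Write rows to `out`, padding short rows with empty author fields."""
    max_authors = max([len(r) - 1 for r in rows] or [0])
    writer = csv.writer(out, lineterminator='\n')
    # Repeat the 'author' header once per author column.
    writer.writerow(['title'] + ['author'] * max_authors)
    for row in rows:
        padding = [''] * (max_authors + 1 - len(row))
        writer.writerow(row + padding)


buf = io.StringIO()
write_csv(book_rows(XML_TEXT), buf)
print(buf.getvalue())
```

To write a real file under Python 3, pass a handle from open('output.csv', 'w', newline='') instead of the StringIO buffer. The number of author columns follows the book with the most authors, so a third co-author would not need any code changes.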

Regarding python - Parsing XML and writing to a CSV file, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/29171937/
