python - How to parse complex XML using Python

Tags: python xml parsing xml-parsing export-to-csv

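I am converting XML files to CSV or a pandas DataFrame. The XML contains various categories, some of which I need and others I do not. Is there an efficient way to pick out information from markup like the sample below? This needs to be done at a relatively large scale (>10,000 documents). For example, I want to get "family-id", "data" and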

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE patent-document
  PUBLIC "-//MXW//DTD patent-document XML//EN" 
"http://www.ir-facility.org/dtds/patents/v1.4/patent-document.dtd">
<patent-document ucid="US-20030137706-A1" country="US" doc-number="20030137706" kind="A1" lang="EN" family-id="10973265" status="new" 
date-produced="20090605" date="20030724">
  <bibliographic-data>
    <publication-reference ucid="US-20030137706-A1" status="new" 
     fvid="76030147">
  <document-id status="new" format="original">
    <country>US</country>
    <doc-number>20030137706</doc-number>
    <kind>A1</kind>
    <date>20030724</date>
  </document-id>
</publication-reference>
<application-reference ucid="US-18203002-A" status="new" is-representative="NO">
  <document-id status="new" format="epo">
    <country>US</country>
    <doc-number>18203002</doc-number>
    <kind>A</kind>
    <date>20021204</date>
  </document-id>
</application-reference>
<priority-claims status="new">
  <priority-claim ucid="HU-0000532-A" status="new">
    <document-id status="new" format="epo">
      <country>HU</country>
      <doc-number>0000532</doc-number>
      <kind>A</kind>
      <date>20000207</date>
    </document-id>
  </priority-claim>
  <priority-claim ucid="HU-0100016-W" status="new">
  </abstract>
  <description load-source="us" status="new" lang="EN">
     <heading>TECHNICAL FIELD </heading>
     <p>[0001] The object of the invention is a method for the holographic 
     recording of data. In the method a hologram containing the date is 
     recorded in a waveguide layer as an interference between an object beam 
     and a reference beam. The object beam is essentially perpendicular to 
     the plane of the hologram, while the reference beam is coupled in the 
     waveguide. There is also proposed an apparatus for performing the 
     method. The apparatus comprises a data storage medium with a waveguide 
     holographic storage layer, and an optical system for writing and reading 
     the holograms. The optical system comprises means for producing an 
     object beam and a reference beam, and imaging the object beam and a 
     reference beam on the storage medium. </p>
     <heading>BACKGROUND ART </heading>
      <p>[0002] Storage systems realised with tapes stand out from other data 
      storage systems regarding their immense storage capacity. Such systems 
      were used to realise the storage of data in the order of Terabytes. 
      This large storage capacity is achieved partly by the storage density, 
      and partly by the length of the storage tapes. The relative space 
      requirements of tapes are small, because they may be wound up into a 
      very small volume. Their disadvantage is the relatively large random 
      access time. </p>

Best answer

I highly recommend the excellent lxml.etree library! It is very fast because it is a wrapper around the C libraries libxml2 and libxslt.

Usage example:

import lxml.etree  

text = '''\
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE patent-document\n  PUBLIC "-//MXW//DTD patent-document XML//EN" 
"http://www.ir-facility.org/dtds/patents/v1.4/patent-document.dtd">
<patent-document ucid="US-20030137706-A1" country="US" doc-number="20030137706" kind="A1" lang="EN" family-id="10973265" status="new" 
date-produced="20090605" date="20030724">
  <bibliographic-data>
    <publication-reference ucid="US-20030137706-A1" status="new" 
     fvid="76030147">
  <document-id status="new" format="original">
    <country>US</country>
    <doc-number>20030137706</doc-number>
    <kind>A1</kind>
    <date>20030724</date>
  </document-id>
</publication-reference>
<application-reference ucid="US-18203002-A" status="new" is-representative="NO">
  <document-id status="new" format="epo">
    <country>US</country>
    <doc-number>18203002</doc-number>
    <kind>A</kind>
    <date>20021204</date>
  </document-id>
</application-reference>
<priority-claims status="new">
  <priority-claim ucid="HU-0000532-A" status="new">
    <document-id status="new" format="epo">
      <country>HU</country>
      <doc-number>0000532</doc-number>
      <kind>A</kind>
      <date>20000207</date>
    </document-id>
  </priority-claim>
  <description load-source="us" status="new" lang="EN">
     <heading>TECHNICAL FIELD </heading>
     <p>[0001] The object of the invention is a method for the holographic 
     recording of data. In the method a hologram containing the date is 
     recorded in a waveguide layer as an interference between an object beam 
     and a reference beam. The object beam is essentially perpendicular to 
     the plane of the hologram, while the reference beam is coupled in the 
     waveguide. There is also proposed an apparatus for performing the 
     method. The apparatus comprises a data storage medium with a waveguide 
     holographic storage layer, and an optical system for writing and reading 
     the holograms. The optical system comprises means for producing an 
     object beam and a reference beam, and imaging the object beam and a 
     reference beam on the storage medium. </p>
     <heading>BACKGROUND ART </heading>
      <p>[0002] Storage systems realised with tapes stand out from other data 
      storage systems regarding their immense storage capacity. Such systems 
      were used to realise the storage of data in the order of Terabytes. 
      This large storage capacity is achieved partly by the storage density, 
      and partly by the length of the storage tapes. The relative space 
      requirements of tapes are small, because they may be wound up into a 
      very small volume. Their disadvantage is the relatively large random 
      access time. </p>
  </description>
</priority-claims>
</bibliographic-data>
</patent-document>
'''.encode('utf-8')  # lxml rejects str input that carries an encoding declaration, so pass bytes
# ^^ the encode step is not needed when the XML is read from a file

doc = lxml.etree.fromstring(text)
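
When the documents live on disk, as with the >10,000 files mentioned in the question, the encode step goes away because the parser reads the file directly. The following is only a minimal sketch under assumptions: the helper name iter_roots, the file names, and the second family-id value are hypothetical, and the try/except falls back to the standard library's xml.etree.ElementTree (which shares the calls used here) in case lxml is not installed.

```python
import tempfile
from pathlib import Path

try:
    from lxml import etree  # fast libxml2-backed parser recommended in the answer
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same calls used below


def iter_roots(folder):
    """Yield (path, root element) for every .xml file in folder, one at a time."""
    for path in sorted(Path(folder).glob("*.xml")):
        # parse() reads the file itself, so no manual encode/decode is needed
        yield path, etree.parse(str(path)).getroot()


# Demo: two tiny hypothetical documents written to a temporary directory
with tempfile.TemporaryDirectory() as folder:
    for name, family in [("a.xml", "10973265"), ("b.xml", "20000001")]:
        Path(folder, name).write_text(
            '<?xml version="1.0" encoding="UTF-8"?>\n'
            f'<patent-document family-id="{family}"/>',
            encoding="utf-8",
        )
    family_ids = [root.get("family-id") for _, root in iter_roots(folder)]

print(family_ids)
```

Because iter_roots is a generator, only one parsed tree is held in memory at a time, which is usually what you want for a large batch.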

Test:

>>> print(doc.xpath('//patent-document/@family-id'))
['10973265']
>>> print(doc.xpath('//patent-document/@date'))
['20030724']
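
To get from lookups like these to the CSV the question asks for, each document can be flattened into one dict per row. This is a rough sketch, not a definitive implementation: the function name extract_record, the chosen columns, and the shortened sample document are assumptions made for illustration, and it sticks to calls that lxml.etree shares with the standard library's xml.etree.ElementTree so that it also runs without lxml.

```python
import csv
import io

try:
    from lxml import etree  # parser recommended in the answer
except ImportError:
    import xml.etree.ElementTree as etree  # stdlib fallback, same calls used below


def extract_record(xml_bytes):
    """Return one flat dict (one CSV row) for a single patent document."""
    root = etree.fromstring(xml_bytes)
    # Collect the text of every <p>, collapsing line breaks and repeated spaces
    paragraphs = [" ".join("".join(p.itertext()).split()) for p in root.iter("p")]
    return {
        "ucid": root.get("ucid"),
        "family_id": root.get("family-id"),
        "date": root.get("date"),
        "application_number": root.findtext(
            ".//application-reference/document-id/doc-number"
        ),
        "description": " ".join(paragraphs),
    }


# Shortened, hypothetical stand-in for the document shown in the answer
sample = b"""<?xml version="1.0" encoding="UTF-8"?>
<patent-document ucid="US-20030137706-A1" family-id="10973265" date="20030724">
  <bibliographic-data>
    <application-reference ucid="US-18203002-A">
      <document-id><doc-number>18203002</doc-number></document-id>
    </application-reference>
  </bibliographic-data>
  <description>
    <p>[0001] First paragraph.</p>
    <p>[0002] Second
       paragraph.</p>
  </description>
</patent-document>"""

record = extract_record(sample)

# Write the row as CSV; a real run would write to a file instead of a buffer
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(record))
writer.writeheader()
writer.writerow(record)
print(buffer.getvalue())
```

If a DataFrame is preferred over CSV, a list of such dicts can be passed straight to pandas.DataFrame.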

Regarding python - How to parse complex XML using Python, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/51028400/
