python - 元素树 iter() 正在跳过随机元素

标签 python xml parsing text elementtree

我试图在 Python 中使用 Element Tree 的 iterparse() 和 iter() 函数来解析 XML 文件。这是 Google 云端硬盘中文件的链接:https://drive.google.com/file/d/0B_S2Z7quow3TMl9yUk51ZzZ5UW8/view?usp=sharing .

XML 文件是法庭案件数据的汇编;它被分解成一系列带有标签“n-document”的元素,每个元素都包含子元素,这些子元素包含有关特定法庭案件的数据。我正在尝试提取所有摘要描述。代码的简化版本如下:

import numpy as np
import pandas as pd
import xml.etree.ElementTree as etree
import re
import csv

for event, elem in etree.iterparse("***fileName***", events=("start", "end")):
    if event == "start":
        if elem.tag == "docket.entry":
            for element in elem.iter():
                print element.tag
                if element.text != None:
                    print element.text
                if element.tail != None:
                    print element.tail
                    print "from tail"
    elem.clear()

问题是在第一种情况下(1613 HARVARD LIMITED PARTNERSHIP V. DISTRICT OF COLUMBIA ET AL),编号为 25 的摘要描述(它们按降序编号)缺少带有标签的元素的文本和尾部“网关.i​​mage.link”。具体来说,这是我得到的输出。一秒钟后我取消了构建并向上滚动到控制台的最顶部:

docket.entry
number.block
number
28
image.block
image.gateway.link
gateway.image.link
date
07/19/2007
docket.description
ORDER GRANTING DEFENDANTS' MOTION TO DISMISS AND DENYING PLAINTIFF'S MOTION FOR LEAVE TO FILE A SECOND AMENDED COMPLAINT. SIGNED BY JUDGE RICHARD W. ROBERTS ON 7/19/07. (LCRWR1, ) (ENTERED: 07/19/2007)
docket.entry
number.block
number
27
image.block
image.gateway.link
gateway.image.link
date
07/19/2007
docket.description
MEMORANDUM OPINION. SIGNED BY JUDGE RICHARD W. ROBERTS ON 7/19/07. (LCRWR1) MODIFIED ON 7/19/2007 (LCRWR1, ). (ENTERED: 07/19/2007)
docket.entry
number.block
number
26
image.block
image.gateway.link
gateway.image.link
date
03/31/2007
docket.description
MEMORANDUM ORDER GRANTING DEFENDANTS' MOTION
image.gateway.link
21
gateway.image.link
21
 TO STAY DISCOVERY PENDING RESOLUTION OF DEFENDANTS' DISPOSITIVE MOTION FILED BY PATRICK J. CANAVAN, PAUL E. WATERS. SIGNED BY JUDGE RICHARD W. ROBERTS ON 3/31/07. (LCRWR1) ADDITIONAL ATTACHMENT(S) ADDED ON 4/3/2007 (LCRWR1, ). (ENTERED: 04/02/2007)
from tail
docket.entry
number.block
number
25
image.block
image.gateway.link
gateway.image.link
date
11/15/2005
docket.description
RESPONSE TO DEFENDANTS' NOTICE OF COURT RULING IN RELATED CASE FILED BY 1613 HARVARD LIMITED PARTNERSHIP. (ATTACHMENTS: #
image.gateway.link
docket.entry
number.block
number
24
image.block
image.gateway.link
gateway.image.link
date
11/14/2005
docket.description
NOTIFICATION OF SUPPLEMENTAL AUTHORITY BY DISTRICT OF COLUMBIA, PATRICK J. CANAVAN, PAUL E. WATERS (ATTACHMENTS: #
image.gateway.link
1
gateway.image.link
1
)(MULLEN, MARTHA) (ENTERED: 11/14/2005)
from tail

在条目号 25(上面显示的输出底部的第二个)下,它说:

25
image.block
image.gateway.link
gateway.image.link
date
11/15/2005
docket.description
RESPONSE TO DEFENDANTS' NOTICE OF COURT RULING IN RELATED CASE FILED BY 1613 HARVARD LIMITED PARTNERSHIP. (ATTACHMENTS: #
image.gateway.link

问题是,如果您查看 XML 文件本身,您会发现还有一个带有标签“gateway.image.link”的元素紧跟在“image.gateway.link”之后,带有文本和尾部内容,但由于某种原因,iter() 函数没有接收到它。奇怪的是,大多数其他摘要描述还包含带有标签“image.gateway.link”的元素,紧接着是带有标签“gateway.image.link”的元素,正如您从条目号 24(以及其余部分)中看到的那样它们),并且 iter() 函数识别那些但不识别这个。这是从我粘贴到上面的链接的 Google Drive 文档中摘录的 XML 代码:

<?xml version="1.0" encoding="UTF-8" ?><n-extract-response>
<docket.entries.block><label>Entry #:</label><label>Date:</label><label>Description:</label><docket.entry><number.block><number>28</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|0450912204;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A1-280450912204" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|0450912204;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>07/19/2007</date><docket.description>ORDER GRANTING DEFENDANTS&apos; MOTION TO DISMISS AND DENYING PLAINTIFF&apos;S MOTION FOR LEAVE TO FILE A SECOND AMENDED COMPLAINT. SIGNED BY JUDGE RICHARD W. ROBERTS ON 7/19/07. (LCRWR1, ) (ENTERED: 07/19/2007)</docket.description></docket.entry><docket.entry><number.block><number>27</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|04501909813;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A2-2704501909813" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|04501909813;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>07/19/2007</date><docket.description>MEMORANDUM OPINION. SIGNED BY JUDGE RICHARD W. ROBERTS ON 7/19/07. (LCRWR1) MODIFIED ON 7/19/2007 (LCRWR1, ). (ENTERED: 07/19/2007)</docket.description></docket.entry><docket.entry><number.block><number>26</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|04501672579;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A4-2604501672579" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|04501672579;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>03/31/2007</date><docket.description>MEMORANDUM ORDER GRANTING DEFENDANTS&apos; MOTION<image.gateway.link casenumber="1:05CV00726" court="DCDCT-DW" image.id="godls|0450561212;court=DCDCT-DW;casenumber=1:05CV00726" item.type="ATTACHMENT" platform="ECF">21</image.gateway.link><gateway.image.link ID="B3-21-0450561212" casenumber="1:05CV00726" court="DCDCT-DW" item.type="ATTACHMENT" key="godls|0450561212;court=DCDCT-DW;casenumber=1:05CV00726" tlr-class="gateway-image-link" ttype="ECF">21</gateway.image.link> TO STAY DISCOVERY PENDING RESOLUTION OF DEFENDANTS&apos; DISPOSITIVE MOTION FILED BY PATRICK J. CANAVAN, PAUL E. WATERS. SIGNED BY JUDGE RICHARD W. ROBERTS ON 3/31/07. (LCRWR1) ADDITIONAL ATTACHMENT(S) ADDED ON 4/3/2007 (LCRWR1, ). (ENTERED: 04/02/2007)</docket.description></docket.entry><docket.entry><number.block><number>25</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|04501577842;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A6-2504501577842" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|04501577842;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>11/15/2005</date><docket.description>RESPONSE TO DEFENDANTS&apos; NOTICE OF COURT RULING IN RELATED CASE FILED BY 1613 HARVARD LIMITED PARTNERSHIP. (ATTACHMENTS: #<image.gateway.link casenumber="1:05CV00726" court="DCDCT-DW" image.id="godls|04511581037;court=DCDCT-DW;casenumber=1:05CV00726" item.type="ATTACHMENT" platform="ECF">1</image.gateway.link><gateway.image.link ID="B5-1-04511581037" casenumber="1:05CV00726" court="DCDCT-DW" item.type="ATTACHMENT" key="godls|04511581037;court=DCDCT-DW;casenumber=1:05CV00726" tlr-class="gateway-image-link" ttype="ECF">1</gateway.image.link> EXHIBIT 1 - NOTICE OF APPEAL)(WISE, RICHARD) (ENTERED: 11/15/2005)</docket.description></docket.entry><docket.entry><number.block><number>24</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|04501579104;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A8-2404501579104" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|04501579104;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>11/14/2005</date><docket.description>NOTIFICATION OF SUPPLEMENTAL AUTHORITY BY DISTRICT OF COLUMBIA, PATRICK J. CANAVAN, PAUL E. WATERS (ATTACHMENTS: #<image.gateway.link casenumber="1:05CV00726" court="DCDCT-DW" image.id="godls|04511577643;court=DCDCT-DW;casenumber=1:05CV00726" item.type="ATTACHMENT" platform="ECF">1</image.gateway.link><gateway.image.link ID="B7-1-04511577643" casenumber="1:05CV00726" court="DCDCT-DW" item.type="ATTACHMENT" key="godls|04511577643;court=DCDCT-DW;casenumber=1:05CV00726" tlr-class="gateway-image-link" ttype="ECF">1</gateway.image.link>)(MULLEN, MARTHA) (ENTERED: 11/14/2005)</docket.description></docket.entry></docket.entries.block>
</n-extract-response>

当我完全按照上面粘贴的方式在此特定摘录上运行我的 Python 脚本时,它会获取缺少的元素。但是当我在整个 XML 文件上运行脚本时,它不会,如前所示。显然,该摘录缺少其上方和下方的许多元素,但我看不出这将如何影响 iter() 函数,因为我没有分解“docket.entry”元素/子元素,并且这就是我代码中的 for 循环每次都应该经历的(我认为)。

问题不仅限于条目号 25——这里和那里还有一些其他提取的摘要描述缺少子元素,但我无法辨别任何模式——我什至无法分辨导致问题的条目号 25 和条目号 24 之间的差异。有人可以帮忙吗?

最佳答案

您正在尝试在开始事件中处理元素的子元素,但 iterparse 的工作方式并不能保证它们已被读取。

documentation对此有说明:

Note:

iterparse() only guarantees that it has seen the “>” character of a starting tag when it emits a “start” event, so the attributes are defined, but the contents of the text and tail attributes are undefined at that point. The same applies to the element children; they may or may not be present.

If you need a fully populated element, look for “end” events instead.

如果你想处理一个元素的 child ,你需要在结束事件上做,否则不能保证元素的内容是可用的。

描述了您获得任何内容的原因 here :

Note:

The tree builder and the event generator are not necessarily synchronized; the latter usually lags behind a bit. This means that when you get a “start” event for an element, the builder may already have filled that element with content. You cannot rely on this, though — a “start” event can only be used to inspect the attributes, not the element content. For more details, see this message.

关于python - 元素树 iter() 正在跳过随机元素,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/31847982/

相关文章:

python - 可视化解析树 - 嵌套列表到 python 中的树

Python MySQLdb 从另一个数据库中选择并插入

android - 如何在 Material 按钮上写多行文本?

c# - 如何根据父节点中的属性删除特定节点及其子节点?

c++ - 字符数组到单个整数

python - 使用 simpleparse EBNF 解析 nuke 脚本

python - 使用 PyTables 存储图像和元数据

python - 根据字典中的值过滤 Pythons Pandas DataFrame

python - 如何检查pandas数据框中的 bool 条件

java - 未定义名为 'myUserDetailsService' 的 bean