python - Trouble retrieving text from XML using ElementTree with tags

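I currently have some code that uses Biopython and NCBI's "Entrez" API to fetch XML strings from PubMed Central. I'm trying to parse the XML with ElementTree to get just the text of the page. I had BeautifulSoup code that did this when I was scraping the lxml data from the site itself, but I'm switching to the NCBI API since scrapers are apparently a no-no. Now that I have the XML from the NCBI API, I find ElementTree extremely unintuitive and could really use some help getting it to work. Of course I've looked at other posts, but most of them deal with namespaces, and in my case I just want to use the XML tags to grab information. Even the ElementTree documentation doesn't cover this (as far as I can tell). Can anyone help me figure out the syntax for getting the information inside certain tags rather than inside certain namespaces?

Here is an example. Note: I'm using Python 3.4.

A small snippet of the XML: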

      <sec sec-type="materials|methods" id="s5">
      <title>Materials and Methods</title>
      <sec id="s5a">
        <title>Overgo design</title>
        <p>In order to screen the saltwater crocodile genomic BAC library described below, four overgo pairs (forward and reverse) were designed (<xref ref-type="table" rid="pone-0114631-t002">Table 2</xref>) using saltwater crocodile sequences of MHC class I and II from previous studies <xref rid="pone.0114631-Jaratlerdsiri1" ref-type="bibr">[40]</xref>, <xref rid="pone.0114631-Jaratlerdsiri3" ref-type="bibr">[42]</xref>. The overgos were designed using OligoSpawn software, with a GC content of 50&#x2013;60% and 36 bp in length (8-bp overlapping) <xref rid="pone.0114631-Zheng1" ref-type="bibr">[77]</xref>. The specificity of the overgos was checked against vertebrate sequences using the basic local alignment search tool (BLAST; <ext-link ext-link-type="uri" xlink:href="http://www.ncbi.nlm.nih.gov/">http://www.ncbi.nlm.nih.gov/</ext-link>).</p>
    <table-wrap id="pone-0114631-t002" orientation="portrait" position="float">
      <object-id pub-id-type="doi">10.1371/journal.pone.0114631.t002</object-id>
      <label>Table 2</label>
      <caption>
        <title>Four pairs of forward and reverse overgos used for BAC library screening of MHC-associated BACs.</title>
      </caption>
      <alternatives>
        <graphic id="pone-0114631-t002-2" xlink:href="pone.0114631.t002"/>
        <table frame="hsides" rules="groups">
          <colgroup span="1">
            <col align="left" span="1"/>
            <col align="center" span="1"/>
          </colgroup>

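For my project, I need all of the text in the "p" tags (not just in this snippet of the XML, but in the entire XML string).

Now, I already know that I can load the whole XML string into an ElementTree object: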

>>> import xml.etree.ElementTree as ET
>>> tree = ET.ElementTree(ET.fromstring(xml_string))
>>> root = ET.fromstring(xml_string)

Now, if I try to get the text using the tag like this:

 >>> text = root.find('p')
 >>> print("".join(text.itertext()))

or

 >>> text = root.get('p').text

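I can't extract the text I want. As far as I can tell, this is because I'm passing the tag "p" as the argument rather than a namespace.

Getting all the text inside the "p" tags of an XML file feels like it should be very simple, but I currently can't do it. Please let me know what I'm missing and how to fix it. Thanks!

--- Edit ---

So now I know I should use this code to get everything inside a "p" tag: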

>>> text = root.find('.//p')
>>> print("".join(text.itertext()))

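Even though I'm using itertext(), it only returns the contents of the first "p" tag and doesn't look at anything else. Does itertext() only iterate within a tag? The documentation seems to suggest that it iterates across all tags as well, so I'm not sure why it returns only one line instead of all the text under all the "p" tags.

--- Final edit ---

I've figured out that itertext() only works within one tag, and find() only returns the first item. To get the full text I want, I have to use findall():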

>>> all_text = root.findall('.//p')
>>> for texts in all_text:
    print("".join(texts.itertext()))

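Best answer

root.get() is the wrong method: it retrieves an attribute of the root tag, not a child tag. root.find() is the right one, since it finds the first matching child tag (or you can use root.findall() to find all matching child tags).

If you want to find not only direct child tags but also indirect ones (as in your example), the expression passed to root.find/root.findall must use the subset of XPath that ElementTree supports (see https://docs.python.org/2/library/xml.etree.elementtree.html#xpath-support ). A bare tag name such as 'p' matches only direct children of the element. In your case, the expression is './/p':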
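To make the difference between get() and find() concrete, here is a minimal sketch. The tiny `<article>` document, its `id` attribute, and the variable names are made up for illustration; they are not the real PubMed Central XML.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature document (an assumption, not the real PMC XML).
xml_string = """
<article id="a1">
  <p>Top-level paragraph.</p>
  <sec id="s1">
    <p>Nested paragraph.</p>
  </sec>
</article>
"""

root = ET.fromstring(xml_string)

# get() reads an attribute of the element it is called on.
root_id = root.get("id")
# <article> has no attribute named "p", so get() returns None.
missing = root.get("p")

# find()/findall() with a bare tag name search direct children only,
# so the <p> nested inside <sec> is not matched here.
first_child_p = root.find("p")
direct_children = root.findall("p")
```

Here `missing` is None, which is why `root.get('p').text` raises an AttributeError, and `direct_children` contains only the top-level paragraph.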

  text = root.find('.//p')
  print("".join(text.itertext()))
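Since find() stops at the first match, collecting the text of every "p" element needs findall() with the same expression. The following is a rough sketch of that idea, assuming a simplified stand-in for the article XML; the helper name `paragraph_texts` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Simplified stand-in for the article XML (an assumption for illustration).
xml_string = """
<sec id="s5">
  <title>Materials and Methods</title>
  <sec id="s5a">
    <title>Overgo design</title>
    <p>Four overgo pairs were designed (<xref rid="t002">Table 2</xref>).</p>
    <p>Specificity was checked with BLAST.</p>
  </sec>
</sec>
"""

root = ET.fromstring(xml_string)


def paragraph_texts(element):
    """Return the full text of every <p> descendant, one string per <p>."""
    # './/p' matches <p> at any depth; itertext() also yields the text of
    # inline children such as <xref>, plus the text that follows them.
    return ["".join(p.itertext()) for p in element.findall(".//p")]


paragraphs = paragraph_texts(root)
for text in paragraphs:
    print(text)
```

One caveat: if a "p" element were nested inside another "p", './/p' would match both, so the inner text would appear twice in the result.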

Regarding "python - Trouble retrieving text from XML using ElementTree with tags", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37554842/
