python - lxml object identifiers appear to be reused while the objects are still live

I am using Python 3.6.8 and lxml-4.3.4 on Ubuntu.

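What I'm after is to break large XML content into fragment files so it is easier to work with, and to keep the source filename and line number of each parsed element so that I can produce useful parse-time error messages. The errors I raise are specific to my application; they apply when the XML itself is well-formed.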

Here are some sample XML fragment files:

one.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<data>
  <one>1</one>
  <one>11</one>
  <one>111</one>
  <one>1111</one>
</data>

two.xml:

<?xml version="1.0" encoding="UTF-8" ?>
<data>
  <two>2</two>
  <two>22</two>
  <two>222</two>
  <two>2222</two>
  <two>22222</two>
  <two>222222</two>
</data>

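My plan is to use lxml to parse each file, then simply stitch the element trees together to get a single root. The rest of my program can then consume the whole tree.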

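If an element's content is invalid for my application, I want to report the fragment file and line number it came from. lxml already tracks the line number, but not the source file, so I want to track that myself. Note that I decided not to try extending lxml's classes, and instead to keep a map from element object identifiers to fragment files, which I hoped would be durable even if lxml refactors its source code.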

from lxml import etree

# Too much data for one source file, so let's define
# fragment files, each of which looks like a stand
# alone XML file w/ header and root <data>...</data>
# to make syntax highlighters happy.
xmlFragmentFiles = ['one.xml', 'two.xml']

# lxml tracks line number for parsed elements, but not
# source filename. Rather than try to extend the deep
# inner classes of the module, let's try keeping a map
# from parsed elements to fragment file they just came
# from.
element2fragment = {}
def AddFragmentFileToETree(element, fragmentFile):
  # The entry we're just about to add.
  print('%s:%s' % (id(element), fragmentFile))
  element2fragment[id(element)] = fragmentFile
  for child in element:
    AddFragmentFileToETree(child, fragmentFile)

# Fabricate a root that we'll stitch each fragment's
# children onto as we parse them.
root = etree.fromstring('<data></data>')
AddFragmentFileToETree(root, 'Programmatic Root')

for filename in xmlFragmentFiles:
  # It doesn't seem to matter whether we create a new
  # parser per fragment, or reuse a single parser.
  parser = etree.XMLParser(remove_comments=True)
  subroot = etree.parse(filename, parser).getroot()  
  for child in subroot:
    root.append(child)
    AddFragmentFileToETree(child, filename)

# Clearly the final desired tree is here, and presumably
# all the subelements we care about are reachable from
# the programmatic root meaning the objects are still
# live, so why did any object identifier get reused?
print(etree.tostring(
  root, encoding=str, pretty_print=True))

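When I run this program, the pretty-printed output shows the entire desired tree, with every distinct element from the fragment files. However, looking at the map entries we inserted, we can clearly see that object identifiers are being reused!?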

140611035114248:Programmatic Root
140611035114056:one.xml <-- see here
140611035114376:one.xml
140611035114440:one.xml
140611035114056:one.xml <-- and here
140611035114312:two.xml
140611035114120:two.xml
140611035114056:two.xml <-- and here
140611035114312:two.xml
140611035114120:two.xml
140611035114056:two.xml <-- and again
<data><one>1</one>
  <one>11</one>
  <one>111</one>
  <one>1111</one>
<two>2</two>
  <two>22</two>   <-- yet all distinct elements still exist
  <two>222</two>
  <two>2222</two>
  <two>22222</two>
  <two>222222</two>
</data>

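Any suggestions about what is going on with these objects? Perhaps I should stay away from lxml, since it is built on a C library? I only switched to lxml for the line-number tracking.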

Best answer

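I decided to go ahead with extending/customizing the parser after all... and found the answer to this original question along the way.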

https://lxml.de/element_classes.html

They warn that the Python Element proxies are stateless:

Element instances are created and garbage collected at need, so there is normally no way to predict when and how often a proxy is created for them.
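To make the quoted behaviour concrete, here is a minimal sketch (the inline XML string and the names `first` and `held` are illustrative assumptions, not part of the original program). It shows that a proxy's identity is only stable while something holds a reference to it:

```python
from lxml import etree

root = etree.fromstring('<data><one>1</one><one>11</one></data>')

# While we hold a reference, lxml hands back the same proxy object
# for the same underlying node.
first = root[0]
print(root[0] is first)  # True while `first` is alive

# Holding every proxy guarantees distinct ids, because Python never
# gives two simultaneously live objects the same id().
held = list(root.iter())
print(len({id(e) for e in held}) == len(held))  # True

# Without a held reference, each access such as root[1] may build a
# fresh proxy, and once that proxy is collected its id() is free to be
# reused by the next one. Whether that happens is not predictable.
```

This matches the output in the question: the ids that repeat belong to proxies that had already been garbage collected by the time the next one was created.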

They go on to say that if you really need them to carry state, you must keep a live reference to each one:

proxy_cache = list(root.iter())

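This worked for me. I had assumed that holding root would be enough, since elements hold live references to their children, but the proxies are evidently created on demand from the real tree maintained in C.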
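For illustration, this is roughly how the question's program might look with the proxies kept alive. It is a sketch under assumptions: the fragments are inlined as strings (`fragments`) so it runs without files on disk, and `describe` is a hypothetical helper, not something lxml provides.

```python
from lxml import etree

# Inline stand-ins for one.xml and two.xml so the sketch runs on its own.
fragments = {
    'one.xml': '<data>\n  <one>1</one>\n  <one>11</one>\n</data>',
    'two.xml': '<data>\n  <two>2</two>\n  <two>22</two>\n  <two>222</two>\n</data>',
}

root = etree.fromstring('<data></data>')
proxy_cache = [root]  # holding each proxy keeps its id() from being reused
element2fragment = {id(root): 'Programmatic Root'}

for filename, text in fragments.items():
    subroot = etree.fromstring(text)
    # Copy the child list first, since append() moves each child
    # out of subroot and into root.
    for child in list(subroot):
        root.append(child)
        for element in child.iter():
            proxy_cache.append(element)
            element2fragment[id(element)] = filename

def describe(element):
    """Return 'file:line' for an element (hypothetical helper)."""
    return '%s:%s' % (element2fragment[id(element)], element.sourceline)

# One map entry per cached proxy means no id() was reused.
print(len(element2fragment) == len(proxy_cache))
print(describe(root[2]))
```

The cache has to live as long as the map does. If `proxy_cache` is dropped, later lookups by `id()` can silently hit entries for different elements again.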

This question and answer come from Stack Overflow: https://stackoverflow.com/questions/57059021/
