python - Extracting a value from an element when a second namespace is used in lxml

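I am able to extract values from elements when a single namespace is in use (lxml on Python 2.7). However, I can't figure out how to extract a value when a second namespace is involved. I want to extract the value at //cc-cpl:MainClosedCaption/Id, but I keep getting lxml.etree.XPathEvalError: Invalid expression. Specifically, the value I am trying to extract from the sample XML is urn:uuid:6ca58b51-9116-4131-8652-feaed20dca0d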

Here is a snippet of the XML (from a Digital Cinema Package):

<?xml version="1.0" encoding="UTF-8"?>
<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
    <Reel>
      <Id>urn:uuid:58cf368f-ed30-40d8-9258-dd7572035b69</Id>
        <MainPicture>
          <Id>urn:uuid:afe91f7a-6451-4b9f-be2e-345f9a28da6d</Id>
        </MainPicture>
        <cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
          <Id>urn:uuid:6ca58b51-9116-4131-8652-feaed20dca0d</Id>
        </cc-cpl:MainClosedCaption>
    </Reel>
</CompositionPlaylist>

Here is a code sample that works:

from lxml import etree
cpl_parse = etree.parse('filename.xml')
pkl_namespace = cpl_parse.xpath('namespace-uri(.)') 
xmluuid =  cpl_parse.xpath('//ns:MainPicture/ns:Id',namespaces={'ns': pkl_namespace})
for i in xmluuid:
    print i.text

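When I try to specify the following XPath instead: //ns:MainClosedCaption/ns:Id - I end up with an error.

When I specify the namespace: pkl_namespace = 'http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#"'

I get the lxml.etree.XPathEvalError: Invalid expression error.

I know this was a silly attempt, but the following produced the same error: '//ns:cc-cpl:MainClosedCaption/ns:cc-cpl:Id'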

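I tried including both namespaces in the dictionary, as shown in this answer: https://stackoverflow.com/a/36227869/2188572. I no longer get an error, but no values are extracted either. Here is my dictionary: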

namespaces = {
    'ns': 'http://www.digicine.com/PROTO-ASDCP-CPL-20040511#',
    'ns2': 'http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#',
}

And my command:

xmluuid =  cpl_parse.xpath('//ns:AssetList/ns2:MainClosedCaption/ns2:Id',namespaces=namespaces)

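I also found "Extracting nested namespace from a xml using lxml", which deals with exactly the same kind of XML I am working with, but that request was to get the namespace URL rather than the element's actual value.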

Edit: Using the method from that answer to extract the namespace, I tried the following, but got the same error:

from lxml import etree
import sys
filename = sys.argv[1]

cpl_parse = etree.parse(filename)
pkl_namespace = etree.QName(cpl_parse.find('.//{*}MainClosedCaption')).namespace
print pkl_namespace
xmluuid =  cpl_parse.xpath('//ns:cc-cpl:MainClosedCaption/ns:cc-cpl:Id',namespaces={'ns': pkl_namespace})
for i in xmluuid:
    print i.text

Here is the full error:

Traceback (most recent call last):
  File "sub.py", line 8, in <module>
    xmluuid =  cpl_parse.xpath('//ns:cc-cpl:MainClosedCaption/ns:cc-cpl:Id',namespaces={'ns': pkl_namespace})
  File "lxml.etree.pyx", line 2115, in lxml.etree._ElementTree.xpath (src/lxml/lxml.etree.c:57654)
  File "xpath.pxi", line 370, in lxml.etree.XPathDocumentEvaluator.__call__ (src/lxml/lxml.etree.c:146564)
  File "xpath.pxi", line 238, in lxml.etree._XPathEvaluatorBase._handle_result (src/lxml/lxml.etree.c:144962)
  File "xpath.pxi", line 224, in lxml.etree._XPathEvaluatorBase._raise_eval_error (src/lxml/lxml.etree.c:144817)
lxml.etree.XPathEvalError: Invalid expression

Best answer

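The Id element inside MainClosedCaption belongs to the 2004 namespace. Only an xmlns="..." attribute changes the default namespace; an attribute of the form xmlns:something="..." merely declares an additional prefix, which applies only to names that use that prefix explicitly.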
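One way to confirm which namespace each element really belongs to is to ask lxml directly. The sketch below is only an illustration: it inlines a trimmed copy of the sample XML as a byte string (an assumption made so it runs on its own; the question reads from a file) and uses etree.QName to split each tag into its namespace and local name. The names XML and element_namespaces are made up for this example.

```python
from lxml import etree

# Trimmed copy of the sample from the question, inlined so the sketch is self-contained.
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
  <Reel>
    <cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
      <Id>urn:uuid:6ca58b51-9116-4131-8652-feaed20dca0d</Id>
    </cc-cpl:MainClosedCaption>
  </Reel>
</CompositionPlaylist>"""

root = etree.fromstring(XML)

# Map local name -> namespace URI for every element in the tree.
# tag=etree.Element restricts iteration to elements (no comments or PIs).
element_namespaces = {}
for el in root.iter(tag=etree.Element):
    qname = etree.QName(el)
    element_namespaces[qname.localname] = qname.namespace
    print(qname.localname, qname.namespace)
```

MainClosedCaption should be reported with the 2007 URI, while its child Id, which carries no prefix, is reported with the 2004 default namespace inherited from CompositionPlaylist.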

Try this:

from lxml import etree
cpl_parse = etree.parse('filename.xml')
xmluuid = cpl_parse.xpath('//proto2007:MainClosedCaption/proto2004:Id', namespaces={
    'proto2004': 'http://www.digicine.com/PROTO-ASDCP-CPL-20040511#',
    'proto2007': 'http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#',
})
for i in xmluuid:
    print(i.text)
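For anyone who wants to try this without a file on disk, here is a self-contained sketch of the same query. It assumes a trimmed copy of the sample XML inlined as a byte string; XML, NAMESPACES, caption_ids and double_prefix_ok are names invented for the example. It also retries the double-prefixed form from the question: an XPath name test may carry at most one prefix, so 'ns:cc-cpl:MainClosedCaption' is not a valid expression whatever 'ns' is mapped to.

```python
from lxml import etree

# Trimmed copy of the sample from the question, inlined so the sketch is self-contained.
XML = b"""<?xml version="1.0" encoding="UTF-8"?>
<CompositionPlaylist xmlns="http://www.digicine.com/PROTO-ASDCP-CPL-20040511#">
  <Reel>
    <MainPicture>
      <Id>urn:uuid:afe91f7a-6451-4b9f-be2e-345f9a28da6d</Id>
    </MainPicture>
    <cc-cpl:MainClosedCaption xmlns:cc-cpl="http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#">
      <Id>urn:uuid:6ca58b51-9116-4131-8652-feaed20dca0d</Id>
    </cc-cpl:MainClosedCaption>
  </Reel>
</CompositionPlaylist>"""

# The prefixes are arbitrary labels used only inside the XPath; the URIs must match the document.
NAMESPACES = {
    'proto2004': 'http://www.digicine.com/PROTO-ASDCP-CPL-20040511#',
    'proto2007': 'http://www.digicine.com/PROTO-ASDCP-CC-CPL-20070926#',
}

root = etree.fromstring(XML)

# Parent is in the 2007 namespace, child Id is in the inherited 2004 default namespace.
matches = root.xpath('//proto2007:MainClosedCaption/proto2004:Id', namespaces=NAMESPACES)
caption_ids = [el.text for el in matches]
print(caption_ids)

# The form tried in the question: two prefixes on one name is not valid XPath.
try:
    root.xpath('//ns:cc-cpl:MainClosedCaption',
               namespaces={'ns': NAMESPACES['proto2007']})
    double_prefix_ok = True
except etree.XPathError as exc:  # base class of XPathEvalError
    double_prefix_ok = False
    print('rejected:', exc)
```

This also suggests why the two-entry dictionary attempt in the question returned nothing rather than failing: the expression was syntactically valid, but it looked for Id in the 2007 namespace, where no such element exists in the sample.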

Regarding "python - Extracting a value from an element when a second namespace is used in lxml", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/37038148/
