python - How do I validate XML with multiple namespaces in Python?

Tags: python xml validation xsd

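I'm trying to write some unit tests in Python 2.7 to validate some extensions I've made to the OAI-PMH schema: http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd

The problem I'm running into involves multiple nested namespaces, and it is caused by this declaration in the XSD above: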

<complexType name="metadataType">
    <annotation>
        <documentation>Metadata must be expressed in XML that complies
        with another XML Schema (namespace=#other). Metadata must be 
        explicitly qualified in the response.</documentation>
    </annotation>
    <sequence>
        <any namespace="##other" processContents="strict"/>
    </sequence>
</complexType>

Here is the code snippet I'm using:

import urllib2

from lxml import etree

query = "http://localhost:8080/OAI-PMH?verb=GetRecord&by_doc_ID=false&metadataPrefix=nsdl_dc&identifier=http://www.purplemath.com/modules/ratio.htm"
schema_file = open("../schemas/OAI/2.0/OAI-PMH.xsd", "r")
schema_doc = etree.parse(schema_file)
oaischema = etree.XMLSchema(schema_doc)

# xml_headers is defined elsewhere in the test module (not shown here).
request = urllib2.Request(query, headers=xml_headers)
response = urllib2.urlopen(request)
body = response.read()
response_doc = etree.fromstring(body)

try:
    oaischema.assertValid(response_doc)
except etree.DocumentInvalid as e:
    # Print the response with line numbers so the reported line can be located.
    for line, text in enumerate(body.split("\n"), 1):
        print("{0}\t{1}".format(line, text))
    print(e)

I end up with the following error:

AssertionError: http://localhost:8080/OAI-PMH?verb=GetRecord&by_doc_ID=false&metadataPrefix=nsdl_dc&identifier=http://www.purplemath.com/modules/ratio.htm
Element '{http://www.openarchives.org/OAI/2.0/oai_dc/}oai_dc': No matching global element declaration available, but demanded by the strict wildcard., line 22

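I understand the error: the schema requires the child element of the metadata element to be strictly validated, which the sample XML does.

I have already written a validator in Java that works, but it would be helpful to have it in Python, since the rest of the solution I'm building is Python based. To make my Java variant work I had to make my DocumentFactory namespace aware; otherwise I got the same error. I have not found any working example in Python that performs this validation correctly.

Does anyone have an idea how I can get an XML document with multiple nested namespaces, like my sample document, to validate with Python?

Here is the sample XML document that I'm trying to validate: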
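The error says the validator has no global element declaration for the element matched by the strict wildcard. Below is a minimal sketch of that situation and of one possible way around it: a wrapper schema that imports both the envelope schema and the payload schema. The namespaces `urn:example:outer` and `urn:example:inner`, the file names, and the `load_schema` helper are hypothetical stand-ins for OAI-PMH.xsd and a metadata schema. The sketch assumes Python 3 with lxml and is not taken from the original post.

```python
import os
import tempfile

from lxml import etree

# Hypothetical stand-in for OAI-PMH.xsd: an element whose content is a
# strict wildcard restricted to other namespaces.
OUTER_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="urn:example:outer" elementFormDefault="qualified">
  <xs:element name="envelope">
    <xs:complexType>
      <xs:sequence>
        <xs:any namespace="##other" processContents="strict"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>"""

# Hypothetical stand-in for a metadata schema such as oai_dc.xsd.
INNER_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
    targetNamespace="urn:example:inner" elementFormDefault="qualified">
  <xs:element name="payload" type="xs:string"/>
</xs:schema>"""

# Wrapper schema with no declarations of its own; it only imports both.
WRAPPER_XSD = """<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:import namespace="urn:example:outer" schemaLocation="outer.xsd"/>
  <xs:import namespace="urn:example:inner" schemaLocation="inner.xsd"/>
</xs:schema>"""

DOC = b"""<o:envelope xmlns:o="urn:example:outer">
  <i:payload xmlns:i="urn:example:inner">hello</i:payload>
</o:envelope>"""


def load_schema(directory, name):
    """Parse a schema from disk so relative schemaLocation values resolve."""
    return etree.XMLSchema(etree.parse(os.path.join(directory, name)))


workdir = tempfile.mkdtemp()
for name, text in [("outer.xsd", OUTER_XSD),
                   ("inner.xsd", INNER_XSD),
                   ("wrapper.xsd", WRAPPER_XSD)]:
    with open(os.path.join(workdir, name), "w", encoding="utf-8") as handle:
        handle.write(text)

doc = etree.fromstring(DOC)

# With only the outer schema, the payload element has no declaration.
outer_only = load_schema(workdir, "outer.xsd")
outer_only_ok = outer_only.validate(doc)
outer_only_errors = [entry.message for entry in outer_only.error_log]

# With the wrapper, declarations from both namespaces are available.
wrapper = load_schema(workdir, "wrapper.xsd")
wrapper_ok = wrapper.validate(doc)

print(outer_only_ok, wrapper_ok)
for message in outer_only_errors:
    print(message)
```

If this carries over to the OAI-PMH case, the wrapper would import OAI-PMH.xsd plus the schema for each metadata namespace that the responses use.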

<?xml version="1.0" encoding="UTF-8"?> 
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/" 
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
     xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/
     http://www.openarchives.org/OAI/2.0/OAI-PMH.xsd">
  <responseDate>2002-02-08T08:55:46Z</responseDate>
  <request verb="GetRecord" identifier="oai:arXiv.org:cs/0112017"
       metadataPrefix="oai_dc">http://arXiv.org/oai2</request>
  <GetRecord>
   <record> 
    <header>
      <identifier>oai:arXiv.org:cs/0112017</identifier> 
      <datestamp>2001-12-14</datestamp>
      <setSpec>cs</setSpec> 
      <setSpec>math</setSpec>
    </header>
    <metadata>
      <oai_dc:dc 
     xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/" 
     xmlns:dc="http://purl.org/dc/elements/1.1/" 
     xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
     xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/ 
     http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
    <dc:title>Using Structural Metadata to Localize Experience of 
          Digital Content</dc:title> 
    <dc:creator>Dushay, Naomi</dc:creator>
    <dc:subject>Digital Libraries</dc:subject> 
    <dc:description>With the increasing technical sophistication of 
        both information consumers and providers, there is 
        increasing demand for more meaningful experiences of digital 
        information. We present a framework that separates digital 
        object experience, or rendering, from digital object storage 
        and manipulation, so the rendering can be tailored to 
        particular communities of users.
    </dc:description> 
    <dc:description>Comment: 23 pages including 2 appendices, 
        8 figures</dc:description> 
    <dc:date>2001-12-14</dc:date>
      </oai_dc:dc>
    </metadata>
  </record>
 </GetRecord>
</OAI-PMH>

Best answer

I found this in lxml's documentation on validation:

>>> schema_root = etree.XML('''\
...   <xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
...     <xsd:element name="a" type="xsd:integer"/>
...   </xsd:schema>
... ''')
>>> schema = etree.XMLSchema(schema_root)

>>> parser = etree.XMLParser(schema = schema)
>>> root = etree.fromstring("<a>5</a>", parser)
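To illustrate what that snippet does: a parser created with `schema=` validates while parsing, so an invalid document raises `etree.XMLSyntaxError` instead of returning a tree. The following is a small sketch for Python 3 with lxml; the `parse_validated` helper is a hypothetical name introduced only for this example.

```python
from lxml import etree

schema_root = etree.XML(b"""
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="a" type="xsd:integer"/>
</xsd:schema>
""")
schema = etree.XMLSchema(schema_root)
parser = etree.XMLParser(schema=schema)


def parse_validated(xml_bytes):
    """Return (root, None) if the input is valid, else (None, error text)."""
    try:
        return etree.fromstring(xml_bytes, parser), None
    except etree.XMLSyntaxError as exc:
        # Schema violations surface as parse errors with this kind of parser.
        return None, str(exc)


good_root, good_error = parse_validated(b"<a>5</a>")
bad_root, bad_error = parse_validated(b"<a>no int</a>")

print(good_root.tag, good_error)
print(bad_root, bad_error)
```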

So perhaps this is what you need? (See the last two lines.)

schema_doc = etree.parse(schema_file)
oaischema = etree.XMLSchema(schema_doc)

request = urllib2.Request(query, headers=xml_headers)
response = urllib2.urlopen(request)
body = response.read()
parser = etree.XMLParser(schema = oaischema)
response_doc = etree.fromstring(body, parser)
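The question's test prints the numbered response body to locate the failing line. As an alternative sketch, the validator's `error_log` already carries a line number and a message for each failure, so a test can report them directly. This assumes Python 3 with lxml; `validation_errors` is a hypothetical helper name, and the tiny schema is reused from the lxml example only for illustration.

```python
from lxml import etree


def validation_errors(schema, xml_bytes):
    """Return (line, message) tuples; the list is empty for a valid document."""
    doc = etree.fromstring(xml_bytes)
    if schema.validate(doc):
        return []
    return [(entry.line, entry.message) for entry in schema.error_log]


schema = etree.XMLSchema(etree.XML(b"""
<xsd:schema xmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:element name="a" type="xsd:integer"/>
</xsd:schema>
"""))

ok_errors = validation_errors(schema, b"<a>5</a>")
bad_errors = validation_errors(schema, b"<a>not a number</a>")

for line, message in bad_errors:
    print("line {0}: {1}".format(line, message))
```

In a unit test, asserting that the returned list is empty gives a failure message that already names the offending lines.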

Regarding "python - How do I validate XML with multiple namespaces in Python?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/5332985/
