performance - Jena: reading a model takes a very long time

I am trying to find out whether a given URL is an ontology (by trying to read it into Jena).

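Normally, unparseable pages (404s, HTML, and so on) raise various exceptions, and valid RDF is parsed by Jena. But some (invalid) documents take 5-10 minutes to parse! There is no high CPU or RAM usage, nothing at all. model.read() simply does not return. (Once I waited a full hour!)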

try {
    Model model = ModelFactory.createOntologyModel();
    model.read("http://dbpedia.org/page/Pizza_Deliverance"); // <- THIS LINE NEVER FINISHES!

    // It is an ontology.
} catch (Exception e) {
    // Jena can't parse it
}

Another code sample that hangs (this time Model.read is fed from an input stream):

// In is an InputStream that holds http://dbpedia.org/page/Pizza_Deliverance
Model model;
try {
  model = ModelFactory.createOntologyModel();
  model.read(in, baseUri); // <- THIS LINE NEVER ENDS.
} catch (Exception e) {
  Logger.error("Error parsing file as ontology: " + baseUri, e);
  return null;
}
return model;

Here is the stack trace (if I wait long enough to get one):

java.lang.NullPointerException: null
at com.hp.hpl.jena.rdf.arp.impl.XMLHandler.endElement(XMLHandler.java:133) ~[jena-core-2.10.0.jar:2.10.0]
at org.apache.xerces.parsers.AbstractSAXParser.endElement(AbstractSAXParser.java:598) ~[xercesImpl-2.11.0.jar:na]
at org.apache.xerces.impl.XMLNamespaceBinder.handleEndElement(XMLNamespaceBinder.java:835) ~[xercesImpl-2.11.0.jar:na]
at org.apache.xerces.impl.XMLNamespaceBinder.endElement(XMLNamespaceBinder.java:599) ~[xercesImpl-2.11.0.jar:na]
at org.apache.xerces.impl.dtd.XMLDTDValidator.endNamespaceScope(XMLDTDValidator.java:2099) ~[xercesImpl-2.11.0.jar:na]
at org.apache.xerces.impl.dtd.XMLDTDValidator.handleEndElement(XMLDTDValidator.java:2050) ~[xercesImpl-2.11.0.jar:na]

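My questions:

Am I missing something, or misusing the API in some way?
Is there another way to check whether a given page (or string) can be parsed as an ontology?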

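Edit: I worked around the problem by running the parse in a separate thread and killing that thread if parsing takes too long. But that is not a real solution.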
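For reference, the thread-with-timeout workaround described in this edit could look roughly like the sketch below. It is only an illustration: the class name TimedParse and the helper callWithTimeout are hypothetical, and the Jena call appears only in a comment so that the snippet runs with the JDK alone. Note that cancelling the future merely interrupts the worker thread; a parser blocked in network I/O may ignore the interrupt, which is why the worker is made a daemon thread.

```java
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class TimedParse {
    /** Runs the task on a daemon worker thread; returns empty on timeout or failure. */
    public static <T> Optional<T> callWithTimeout(Callable<T> task, long timeout, TimeUnit unit) {
        ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
            Thread t = new Thread(r, "timed-parse");
            t.setDaemon(true); // a stuck parse must not keep the JVM alive
            return t;
        });
        Future<T> future = executor.submit(task);
        try {
            return Optional.ofNullable(future.get(timeout, unit));
        } catch (TimeoutException e) {
            future.cancel(true); // interrupts the worker; blocking I/O may ignore it
            return Optional.empty();
        } catch (ExecutionException e) {
            return Optional.empty(); // the task itself threw an exception
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the caller's interrupt status
            future.cancel(true);
            return Optional.empty();
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // In real use the task would wrap the Jena call, for example:
        // () -> { Model m = ModelFactory.createOntologyModel(); m.read(url); return m; }
        Optional<String> fast = callWithTimeout(() -> "parsed", 1, TimeUnit.SECONDS);
        Optional<String> slow = callWithTimeout(() -> {
            Thread.sleep(5_000); // stands in for a parse that hangs
            return "never";
        }, 200, TimeUnit.MILLISECONDS);
        System.out.println(fast.isPresent() + " " + slow.isPresent());
    }
}
```

An empty result here conflates "timed out" with "threw an exception"; if the caller needs to tell them apart, the helper could rethrow or return a small result type instead.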

Edit 2:

I looked into the source code and the stack trace of the long-running code. The time is being spent in org.apache.xerces.parsers.DTDConfiguration.parse(boolean), if that means anything to you.

Best answer

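You are trying to read an HTML page into a Jena model. In other words, you are sending an HTTP request for this URI with the application/rdf+xml media type. (For more on content negotiation in Linked Data, see http://wifo5-03.informatik.uni-mannheim.de/bizer/pub/LinkedDataTutorial/#Terminology) Linked Data resources on the web use a redirect mechanism. The DBpedia redirects may be causing the problem here, for example through an infinite redirect loop, or there may be an issue with the Virtuoso RDF store underlying DBpedia. You should ask about this on the DBpedia mailing list; they can help you.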

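As a suggestion: if you need to check whether a given URI returns a Linked Data resource description, you can send a simple HTTP GET for several different media types, such as application/rdf+xml, text/n3, and so on, and if a response arrives within a specified time, parse the retrieved response with Jena. An example: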

HttpGet get = new HttpGet();
get.setURI(URI.create("http://dbpedia.org/resource/Pizza_Deliverance"));
get.setHeader("Accept", "text/n3");
HttpClient client = new DefaultHttpClient();
HttpResponse response = client.execute(get);
HttpEntity entity = response.getEntity();
System.out.println(EntityUtils.toString(entity));

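This code returns an N3 document describing the Pizza Deliverance resource. If you try it on your http://dbpedia.org/page/Pizza_Deliverance URI with the "application/rdf+xml" media type, you will get an HTTP 406 (Not Acceptable) response. This error probably means that you need to be aware of which type of URI you are requesting.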
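The example above does not set a time limit, although the suggestion relies on one. A minimal sketch of a bounded probe, using only the JDK's HttpURLConnection, might look like this. The class name RdfProbe, the method names, and the list of accepted media types are assumptions made for illustration, not part of Jena or DBpedia. Also note that HttpURLConnection does not follow redirects that switch between http and https, so a real check may need to handle those itself.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class RdfProbe {
    // Media types treated as RDF here; this list is an assumption, extend as needed.
    public static final List<String> RDF_TYPES = Arrays.asList(
            "application/rdf+xml", "text/turtle", "text/n3", "application/n-triples");

    /** True if a Content-Type header value (parameters ignored) names one of RDF_TYPES. */
    public static boolean isRdfContentType(String contentType) {
        if (contentType == null) return false;
        String bare = contentType.split(";", 2)[0].trim().toLowerCase(Locale.ROOT);
        return RDF_TYPES.contains(bare);
    }

    /** Sends a GET with an RDF Accept header and bounded connect/read timeouts. */
    public static boolean looksLikeRdf(String uri, int timeoutMillis) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) new URL(uri).openConnection();
            conn.setConnectTimeout(timeoutMillis);
            conn.setReadTimeout(timeoutMillis);
            conn.setRequestProperty("Accept", String.join(", ", RDF_TYPES));
            int status = conn.getResponseCode(); // e.g. 406 if no RDF variant is offered
            return status == HttpURLConnection.HTTP_OK
                    && isRdfContentType(conn.getContentType());
        } catch (IOException e) {
            return false; // timeout, DNS failure, malformed URL, and so on
        } finally {
            if (conn != null) conn.disconnect();
        }
    }

    public static void main(String[] args) {
        // Results depend on the network and on how the server responds today.
        System.out.println(looksLikeRdf("http://dbpedia.org/resource/Pizza_Deliverance", 5_000));
        System.out.println(looksLikeRdf("http://dbpedia.org/page/Pizza_Deliverance", 5_000));
    }
}
```

Only when the probe succeeds would the response body be handed to Jena for parsing, which keeps HTML pages away from the RDF/XML parser in the first place.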

Regarding "performance - Jena: reading a model takes a very long time", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/15926195/
