Javax XML parser hangs when building from an HTTP input stream

Tags: java html xml parsing well-formed

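I am trying to open an HTTP connection to a website and parse the HTML into an org.w3c.dom.Document. I can open the HTTP connection and print the web page to the console, but if I pass the InputStream to the XML parser, it hangs for about a minute and then outputs this error: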

[Fatal Error] :108:55: Open quote is expected for attribute "{1}" associated with an  element type  "onload".

Code:

private static Document getInputStream(String url) throws IOException, SAXException, ParserConfigurationException
{
  System.out.println(url);
  URL webUrl = new URL(url);
  URLConnection connection = webUrl.openConnection();
  connection.setConnectTimeout(60 * 1000);
  connection.setReadTimeout(60 * 1000);

  InputStream stream = connection.getInputStream();

  DocumentBuilderFactory domFactory = DocumentBuilderFactory.newInstance();
  domFactory.setNamespaceAware(true);
  DocumentBuilder builder = domFactory.newDocumentBuilder();
  Document doc = builder.parse(stream); // This line is hanging
  return doc;
}

Stack trace while paused:

Thread [main] (Suspended)   
    SocketInputStream.socketRead0(FileDescriptor, byte[], int, int, int) line: not available [native method]    
    SocketInputStream.read(byte[], int, int) line: not available    
    BufferedInputStream.fill() line: not available  
    BufferedInputStream.read1(byte[], int, int) line: not available 
    BufferedInputStream.read(byte[], int, int) line: not available  
    HttpClient.parseHTTPHeader(MessageHeader, ProgressSource, HttpURLConnection) line: not available    
    HttpClient.parseHTTP(MessageHeader, ProgressSource, HttpURLConnection) line: not available  
    HttpURLConnection.getInputStream() line: not available  
    XMLEntityManager.setupCurrentEntity(String, XMLInputSource, boolean, boolean) line: not available   
    XMLEntityManager.startEntity(String, XMLInputSource, boolean, boolean) line: not available  
    XMLEntityManager.startDTDEntity(XMLInputSource) line: not available 
    XMLDTDScannerImpl.setInputSource(XMLInputSource) line: not available    
    XMLDocumentScannerImpl$DTDDriver.dispatch(boolean) line: not available  
    XMLDocumentScannerImpl$DTDDriver.next() line: not available 
    XMLDocumentScannerImpl$PrologDriver.next() line: not available  
    XMLNSDocumentScannerImpl(XMLDocumentScannerImpl).next() line: not available 
    XMLNSDocumentScannerImpl.next() line: not available 
    XMLNSDocumentScannerImpl(XMLDocumentFragmentScannerImpl).scanDocument(boolean) line: not available  
    XIncludeAwareParserConfiguration(XML11Configuration).parse(boolean) line: not available 
    XIncludeAwareParserConfiguration(XML11Configuration).parse(XMLInputSource) line: not available  
    DOMParser(XMLParser).parse(XMLInputSource) line: not available  
    DOMParser.parse(InputSource) line: not available    
    DocumentBuilderImpl.parse(InputSource) line: not available  
    DocumentBuilderImpl(DocumentBuilder).parse(InputStream) line: not available 
    MSCommunicator.getInputStream(String) line: 45  
    MSCommunicator.getGamePageFromForum(int, int, int) line: 70 
    MSCommunicator.getGamePageFromForum(int, int) line: 57  
    Game.<init>(int, int) line: 21  
    MSCommunicator.main(String[]) line: 26  

Best answer

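You can't really expect HTML to parse into an XML DOM tree as-is. HTML is not necessarily well-formed XML (the attribute error you are seeing for "onload" is typical of this), so you will probably need to clean it up first. See the answers to this question: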

Reading HTML file to DOM tree using Java
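
A note on the hang itself, separate from the well-formedness problem: the stack trace in the question shows the parser inside XMLEntityManager.startDTDEntity calling HttpURLConnection.getInputStream, which suggests it is downloading the DTD named in the page's DOCTYPE before it reaches the markup. The sketch below is one way to avoid that fetch, assuming the JDK's bundled Xerces-based parser (the feature URI is Xerces-specific, not part of JAXP itself). The class name OfflineDomParseDemo and both sample strings are made up for illustration. It does not make real-world HTML parseable; it only shows that the network wait and the parse error are two different issues.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.StringReader;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical demo class; not taken from the original question.
class OfflineDomParseDemo {

    // Well-formed XHTML whose DOCTYPE points at a remote DTD.
    static final String XHTML_WITH_DOCTYPE =
        "<?xml version=\"1.0\"?>"
        + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
        + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">"
        + "<html xmlns=\"http://www.w3.org/1999/xhtml\">"
        + "<body onload=\"init()\"><p>ok</p></body></html>";

    // Ordinary HTML with an unquoted attribute value: not well-formed XML.
    static final String TAG_SOUP =
        "<html><body onload=init()><p>ok</p></body></html>";

    // Builds a parser that should not open a connection to fetch a DTD.
    static DocumentBuilder newOfflineBuilder() throws ParserConfigurationException {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        // Xerces-specific feature: do not load the external DTD when not validating.
        factory.setFeature(
            "http://apache.org/xml/features/nonvalidating/load-external-dtd", false);
        DocumentBuilder builder = factory.newDocumentBuilder();
        // Second line of defence: resolve any external entity to empty content.
        builder.setEntityResolver(
            (publicId, systemId) -> new InputSource(new StringReader("")));
        return builder;
    }

    static Document parse(String markup)
            throws ParserConfigurationException, SAXException, IOException {
        byte[] bytes = markup.getBytes(StandardCharsets.UTF_8);
        return newOfflineBuilder().parse(new ByteArrayInputStream(bytes));
    }

    // Returns the local name of the root element of well-formed markup.
    static String rootName(String markup) {
        try {
            return parse(markup).getDocumentElement().getLocalName();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Returns false when the parser rejects the markup with a SAXException.
    static boolean isWellFormed(String markup) {
        try {
            parse(markup);
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootName(XHTML_WITH_DOCTYPE));
        System.out.println(isWellFormed(TAG_SOUP));
    }
}
```

With the DTD fetch disabled, the XHTML sample parses without touching the network, while the second sample is still rejected because of the unquoted onload value. For pages like that, an HTML-aware cleaner or parser, as suggested in the linked question, is still needed before building a DOM.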

Source: a similar question on Stack Overflow about the Javax XML parser hanging when building from an HTTP input stream: https://stackoverflow.com/questions/12890653/
