java - How to build an HTML org.w3c.dom.Document?

The documentation of the Document interface describes it as follows:

The Document interface represents the entire HTML or XML document.

javax.xml.parsers.DocumentBuilder builds XML Documents. However, I cannot find a way to build a Document that is an HTML Document!

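I want an HTML Document because I am trying to build a document that I then pass to a library that expects an HTML Document. The library uses Document#getElementsByTagName(String tagname) in a case-insensitive manner, which is fine for HTML but not for XML.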

I have looked around and found nothing. Questions such as "How to convert an Html source of a webpage into org.w3c.dom.Document in java?" do not actually have an answer.

Best answer

You appear to have two explicit requirements:

You need to represent HTML as an org.w3c.dom.Document.

You need Document#getElementsByTagName(String tagname) to operate in a case-insensitive manner.
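To make the second requirement concrete, here is a minimal sketch of the default behavior of a plain XML Document produced by the JDK's DocumentBuilder. The class and method names (CaseSensitivityDemo, buildDoc, count) are made up for this illustration.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

public class CaseSensitivityDemo {

    /** Builds a tiny XML Document containing a single lowercase title element. */
    public static Document buildDoc() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            doc.appendChild(html);
            Element title = doc.createElement("title");
            title.setTextContent("My Title");
            html.appendChild(title);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Counts the elements whose tag name matches exactly, as XML semantics require. */
    public static int count(Document doc, String tagName) {
        return doc.getElementsByTagName(tagName).getLength();
    }

    public static void main(String[] args) {
        Document doc = buildDoc();
        System.out.println("title: " + count(doc, "title")); // prints "title: 1"
        System.out.println("TITLE: " + count(doc, "TITLE")); // prints "TITLE: 0"
    }
}
```

The uppercase lookup finds nothing, which is exactly the behavior a library written for HTML documents would trip over.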

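If you are trying to process HTML with org.w3c.dom.Document, then I assume you are working with some form of XHTML, because an XML API such as DOM expects well-formed XML. HTML is not necessarily well-formed XML, but XHTML is. Even if you are working with HTML, you will have to do some pre-processing to make sure it is well-formed XML before trying to run it through an XML parser. It may well be easier to first parse the HTML with an HTML parser such as jsoup, and then build your org.w3c.dom.Document by traversing the tree the HTML parser produced (an org.jsoup.nodes.Document in the case of jsoup).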
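The well-formedness point can be checked with nothing but the JDK's XML parser. The sketch below feeds it the same fragment twice, once with an HTML-style unclosed br and once with the XHTML-style self-closing form. The class name WellFormednessDemo and the helper parsesAsXml are hypothetical names for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

public class WellFormednessDemo {

    /** Returns true if the markup parses as well-formed XML, false if the parser rejects it. */
    public static boolean parsesAsXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            builder.parse(new InputSource(new StringReader(markup)));
            return true;
        } catch (SAXException e) {
            // Raised for markup that is not well-formed XML, e.g. an unclosed <br>.
            return false;
        } catch (Exception e) {
            // ParserConfigurationException or IOException: not expected for in-memory input.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body><p>one<br>two</p></body></html>";
        String xhtml = "<html><body><p>one<br/>two</p></body></html>";
        System.out.println("HTML parses:  " + parsesAsXml(html));  // prints "HTML parses:  false"
        System.out.println("XHTML parses: " + parsesAsXml(xhtml)); // prints "XHTML parses: true"
    }
}
```

The JDK's default error handler may also write a "[Fatal Error]" line to standard error for the rejected input.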

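There is an org.w3c.dom.html.HTMLDocument interface, which extends org.w3c.dom.Document. The only implementation I found was in Xerces-J (2.11.0), in the form of org.apache.html.dom.HTMLDocumentImpl. At first this looks promising, but on closer examination there are some problems.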

1. There is no clear, "clean" way to obtain an instance of an object that implements the org.w3c.dom.html.HTMLDocument interface.

With Xerces we would normally get a Document object using a DocumentBuilder in the following manner:

DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
DocumentBuilder builder = factory.newDocumentBuilder();
Document doc = builder.newDocument();
//or doc = builder.parse(xmlFile) if parsing from a file

Or using the DOMImplementation variety:

DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
DOMImplementationLS impl = (DOMImplementationLS)registry.getDOMImplementation("LS");
LSParser lsParser = impl.createLSParser(DOMImplementationLS.MODE_SYNCHRONOUS, null);
Document document = lsParser.parseURI("myFile.xml");

In both cases, we use purely org.w3c.dom.* interfaces to obtain a Document object.

The closest equivalent I found for HTMLDocument is something like this:

HTMLDOMImplementation htmlDocImpl = HTMLDOMImplementationImpl.getHTMLDOMImplementation();
HTMLDocument htmlDoc = htmlDocImpl.createHTMLDocument("My Title");

This requires us to directly instantiate internal implementation classes, making our implementation dependent on Xerces.

(Note: I also saw that Xerces has an internal HTMLBuilder (which implements the deprecated DocumentHandler) that can supposedly generate an HTMLDocument using a SAX parser, but I didn't bother looking into it.)

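2. org.w3c.dom.html.HTMLDocument does not generate proper XHTML.

Although you can search the HTMLDocument tree using getElementsByTagName(String tagname) in a case-insensitive manner, all element names are stored internally in ALL CAPS. But XHTML element and attribute names are supposed to be in all lowercase. (This could be worked around by traversing the entire document tree and using Document's renameNode() method to change all of the elements' names to lowercase.)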
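The renameNode() workaround mentioned in the parenthesis could look roughly like the sketch below. It is written against a plain Document from the JDK's DocumentBuilder rather than against an actual HTMLDocument, so whether the Xerces HTML classes behave the same way is an assumption I have not verified. The names LowercaseElementNames, lowercase and upperCaseSample are invented for this example, and the sketch assumes unprefixed element names (it does not touch attributes).

```java
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

public class LowercaseElementNames {

    /** Recursively renames every element at or below the given node to lowercase. */
    public static void lowercase(Document doc, Node node) {
        if (node.getNodeType() == Node.ELEMENT_NODE) {
            String lower = node.getNodeName().toLowerCase(Locale.ROOT);
            // renameNode may hand back a different node object, so keep working with its result.
            node = doc.renameNode(node, node.getNamespaceURI(), lower);
        }
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember the sibling before the child is renamed
            lowercase(doc, child);
            child = next;
        }
    }

    /** Builds a sample document whose element names are all uppercase. */
    public static Document upperCaseSample() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("HTML");
            Element body = doc.createElement("BODY");
            Element p = doc.createElement("P");
            doc.appendChild(html);
            html.appendChild(body);
            body.appendChild(p);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = upperCaseSample();
        lowercase(doc, doc); // start at the Document node itself
        System.out.println(doc.getDocumentElement().getNodeName());
    }
}
```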

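Additionally, an XHTML document is supposed to have a proper DOCTYPE declaration and an xmlns declaration for the XHTML namespace. There does not appear to be a straightforward way to set those in an HTMLDocument (unless you do some fiddling with the internal Xerces implementations).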
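For comparison, a plain org.w3c.dom.Document (not an HTMLDocument) can be given both the DOCTYPE and the XHTML namespace through the standard DOMImplementation interface alone. This is only a sketch of that contrast; the class name XhtmlSkeleton is hypothetical.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;

public class XhtmlSkeleton {

    public static final String XHTML_NS = "http://www.w3.org/1999/xhtml";

    /** Creates an empty XHTML 1.0 Strict document: a doctype plus a namespaced html root. */
    public static Document create() {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DOMImplementation impl = factory.newDocumentBuilder().getDOMImplementation();
            // Creating the doctype node only records the identifiers; nothing is downloaded here.
            DocumentType doctype = impl.createDocumentType("html",
                    "-//W3C//DTD XHTML 1.0 Strict//EN",
                    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd");
            return impl.createDocument(XHTML_NS, "html", doctype);
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = create();
        System.out.println(doc.getDoctype().getPublicId());
        System.out.println(doc.getDocumentElement().getNamespaceURI());
    }
}
```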

3. org.w3c.dom.html.HTMLDocument has little documentation, and the Xerces implementation of the interface appears to be incomplete.

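I did not scour the entire Internet, but the only documentation I found for HTMLDocument was the previously linked JavaDocs and the comments in the source code of the internal Xerces implementation. In those comments I also found notes that several different parts of the interface were not implemented. (Side note: I really get the impression that the org.w3c.dom.html.HTMLDocument interface itself is not really used by anyone and is perhaps incomplete itself.)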

For these reasons, I think it is better to avoid org.w3c.dom.html.HTMLDocument and do what we can with org.w3c.dom.Document. What can we do?

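One approach is to extend org.apache.xerces.dom.DocumentImpl (which extends org.apache.xerces.dom.CoreDocumentImpl, which implements org.w3c.dom.Document). This approach does not require much code, but it still makes our implementation dependent on Xerces, since we are extending DocumentImpl. In our MyHTMLDocumentImpl, we simply convert all tag names to lowercase on element creation and searching. This allows the use of Document#getElementsByTagName(String tagname) in a case-insensitive manner.

MyHTMLDocumentImpl: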

import org.apache.xerces.dom.DocumentImpl;
import org.apache.xerces.dom.DocumentTypeImpl;
import org.w3c.dom.DOMException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentType;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

//a base class somewhere in the hierarchy implements org.w3c.dom.Document
public class MyHTMLDocumentImpl extends DocumentImpl {

    private static final long serialVersionUID = 1658286253541962623L;


    /**
     * Creates a Document with basic elements required to meet
     * the <a href="http://www.w3.org/TR/xhtml1/#strict">XHTML standards</a>.
     * <pre>
     * {@code
     * <?xml version="1.0" encoding="UTF-8"?>
     * <!DOCTYPE html 
     *     PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" 
     *     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
     * <html xmlns="http://www.w3.org/1999/xhtml">
     *     <head>
     *         <title>My Title</title>
     *     </head>
     *     <body/>
     * </html>
     * }
     * </pre>
     * 
     * @param title desired text content for title tag. If null, no text will be added.
     * @return basic HTML Document. 
     */
    public static Document makeBasicHtmlDoc(String title) {
        Document htmlDoc = new MyHTMLDocumentImpl();
        DocumentType docType = new DocumentTypeImpl(null, "html",
                "-//W3C//DTD XHTML 1.0 Strict//EN",
                "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd");
        htmlDoc.appendChild(docType);
        Element htmlElement = htmlDoc.createElementNS("http://www.w3.org/1999/xhtml", "html");
        htmlDoc.appendChild(htmlElement);
        Element headElement = htmlDoc.createElement("head");
        htmlElement.appendChild(headElement);
        Element titleElement = htmlDoc.createElement("title");
        if(title != null)
            titleElement.setTextContent(title);
        headElement.appendChild(titleElement);
        Element bodyElement = htmlDoc.createElement("body");
        htmlElement.appendChild(bodyElement);

        return htmlDoc;
    }

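    /**
     * This method allows us to create our
     * MyHTMLDocumentImpl from an existing Document.
     * Note: the cloned elements keep the case of their original names;
     * only elements created through this document are lowercased.
     */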
    public static Document createFrom(Document doc) {
        Document htmlDoc = new MyHTMLDocumentImpl();
        DocumentType originDocType = doc.getDoctype();
        if(originDocType != null) {
            DocumentType docType = new DocumentTypeImpl(null, originDocType.getName(),
                    originDocType.getPublicId(),
                    originDocType.getSystemId());
            htmlDoc.appendChild(docType);
        }
        Node docElement = doc.getDocumentElement();
        if(docElement != null) {
            Node copiedDocElement = docElement.cloneNode(true);
            htmlDoc.adoptNode(copiedDocElement);
            htmlDoc.appendChild(copiedDocElement);
        }
        return htmlDoc;
    }

    private MyHTMLDocumentImpl() {
        super();
    }

    // Locale.ROOT keeps the lowercasing independent of the JVM's default locale
    // (under a Turkish locale, "TITLE".toLowerCase() produces a dotless i).
    @Override
    public Element createElement(String tagName) throws DOMException {
        return super.createElement(tagName.toLowerCase(java.util.Locale.ROOT));
    }

    @Override
    public Element createElementNS(String namespaceURI, String qualifiedName) throws DOMException {
        return super.createElementNS(namespaceURI, qualifiedName.toLowerCase(java.util.Locale.ROOT));
    }

    @Override
    public NodeList getElementsByTagName(String tagname) {
        return super.getElementsByTagName(tagname.toLowerCase(java.util.Locale.ROOT));
    }

    @Override
    public NodeList getElementsByTagNameNS(String namespaceURI, String localName) {
        return super.getElementsByTagNameNS(namespaceURI, localName.toLowerCase(java.util.Locale.ROOT));
    }

    @Override
    public Node renameNode(Node n, String namespaceURI, String qualifiedName) throws DOMException {
        return super.renameNode(n, namespaceURI, qualifiedName.toLowerCase(java.util.Locale.ROOT));
    }
}

Tester:

import java.io.File;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.OutputStream;

import org.w3c.dom.DOMConfiguration;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.w3c.dom.bootstrap.DOMImplementationRegistry;
import org.w3c.dom.ls.DOMImplementationLS;
import org.w3c.dom.ls.LSOutput;
import org.w3c.dom.ls.LSSerializer;


public class HTMLDocumentTest {

    private final static int P_ELEMENT_NUM = 3;

    public static void main(String[] args) //I'm throwing all my exceptions here to shorten the example, but obviously you should handle them appropriately.
            throws ClassNotFoundException, InstantiationException, IllegalAccessException, ClassCastException, IOException {

        Document htmlDoc = MyHTMLDocumentImpl.makeBasicHtmlDoc("My Title");

        //populate the html doc with some example content
        Element bodyElement = (Element) htmlDoc.getElementsByTagName("body").item(0);
        for(int i = 0; i < P_ELEMENT_NUM; ++i) {
            Element pElement = htmlDoc.createElement("p");
            String id = Integer.toString(i+1);
            pElement.setAttribute("id", "anId"+id);
            pElement.setTextContent("Here is some text"+id+".");
            bodyElement.appendChild(pElement);
        }

        //get the title element in a case insensitive manner.
        NodeList titleNodeList = htmlDoc.getElementsByTagName("tItLe");
        for(int i = 0; i < titleNodeList.getLength(); ++i)
            System.out.println(titleNodeList.item(i).getTextContent());

        System.out.println();

        {//get all p elements searching with lowercase
            NodeList pNodeList = htmlDoc.getElementsByTagName("p");
            for(int i = 0; i < pNodeList.getLength(); ++i) {
                System.out.println(pNodeList.item(i).getTextContent());
            }
        }

        System.out.println();

        {//get all p elements searching with uppercase
            NodeList pNodeList = htmlDoc.getElementsByTagName("P");
            for(int i = 0; i < pNodeList.getLength(); ++i) {
                System.out.println(pNodeList.item(i).getTextContent());
            }
        }

        System.out.println();

        //to serialize
        DOMImplementationRegistry registry = DOMImplementationRegistry.newInstance();
        DOMImplementationLS domImplLS = (DOMImplementationLS) registry.getDOMImplementation("LS");

        LSSerializer lsSerializer = domImplLS.createLSSerializer();
        DOMConfiguration domConfig = lsSerializer.getDomConfig();
        domConfig.setParameter("format-pretty-print", true);  //if you want it pretty and indented

        LSOutput lsOutput = domImplLS.createLSOutput();
        lsOutput.setEncoding("UTF-8");

        //to write to file
        try (OutputStream os = new FileOutputStream(new File("myFile.html"))) {
            lsOutput.setByteStream(os);
            lsSerializer.write(htmlDoc, lsOutput);
        }

        //to print to screen
        System.out.println(lsSerializer.writeToString(htmlDoc)); 
    }

}

Output:

My Title

Here is some text1.
Here is some text2.
Here is some text3.

Here is some text1.
Here is some text2.
Here is some text3.

<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>My Title</title>
    </head>
    <body>
        <p id="anId1">Here is some text1.</p>
        <p id="anId2">Here is some text2.</p>
        <p id="anId3">Here is some text3.</p>
    </body>
</html>

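Another approach, similar to the one above, is to create a Document wrapper that wraps a Document object and implements the Document interface itself. This requires more code than the "extend DocumentImpl" approach, but it is "cleaner" because we do not have to care about the particular Document implementation. The extra code for this approach is not difficult; it is just a bit tedious to provide all of those wrapper implementations for the Document methods. I have not worked this out completely yet and there may be some problems, but if it works, this is the general idea: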

public class MyHTMLDocumentWrapper implements Document {

    private Document doc;

    public MyHTMLDocumentWrapper(Document doc) {
        //...
        this.doc = doc;
        //...
    }

    //...
}
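If writing every delegate method by hand is too tedious, one possible shortcut is a java.lang.reflect.Proxy that implements Document and forwards each call to the wrapped instance. To be clear, this swaps the hand-written wrapper for a dynamic proxy, it is not the wrapper class sketched above, and I have only tried it against the JDK's built-in DOM. The names LowercasingDocumentProxy, wrap and newEmptyDocument are made up for the example.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;

public class LowercasingDocumentProxy {

    // Document methods whose last argument is a tag name that should be lowercased.
    private static final List<String> TAG_NAME_METHODS = Arrays.asList(
            "createElement", "createElementNS",
            "getElementsByTagName", "getElementsByTagNameNS");

    /** Wraps any Document so that tag names are lowercased on creation and on lookup. */
    public static Document wrap(Document target) {
        InvocationHandler handler = (proxy, method, args) -> {
            if (args != null && TAG_NAME_METHODS.contains(method.getName())) {
                int last = args.length - 1;
                if (args[last] instanceof String) {
                    args[last] = ((String) args[last]).toLowerCase(Locale.ROOT);
                }
            }
            try {
                return method.invoke(target, args);
            } catch (InvocationTargetException e) {
                throw e.getCause(); // surface the original DOMException, not the reflection wrapper
            }
        };
        return (Document) Proxy.newProxyInstance(
                LowercasingDocumentProxy.class.getClassLoader(),
                new Class<?>[] { Document.class },
                handler);
    }

    /** Convenience helper for the demo: an empty Document from the default JDK parser. */
    public static Document newEmptyDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = wrap(newEmptyDocument());
        doc.appendChild(doc.createElement("BODY"));
        System.out.println(doc.getDocumentElement().getNodeName());
        System.out.println(doc.getElementsByTagName("BoDy").getLength());
    }
}
```

The same caveats as for a hand-written wrapper apply, plus a few more: elements returned by the proxy are the underlying implementation's own nodes, so Element#getElementsByTagName is not intercepted and getOwnerDocument() returns the unwrapped document. Code that casts the Document to an implementation class (a serializer, for instance) will likely need the unwrapped instance.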

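Whether it is org.w3c.dom.html.HTMLDocument, one of the approaches I mentioned above, or something else, maybe these suggestions will help give you an idea of how to proceed.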

Edit:

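In my parsing tests, when I tried to parse the following XHTML file, Xerces would hang in the entity management class while trying to open an http connection. Why, I don't know, especially since I tested on a local html file with no entities. (Maybe it has something to do with the DOCTYPE or the namespace?) This is the document: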

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC 
    "-//W3C//DTD XHTML 1.0 Strict//EN"
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>My Title</title>
    </head>
    <body>
        <p id="anId1">Here is some text1.</p>
        <p id="anId2">Here is some text2.</p>
        <p id="anId3">Here is some text3.</p>
    </body>
</html>
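A plausible explanation for the hang, though I have not confirmed it for this exact setup, is that the parser tries to download the external DTD named in the DOCTYPE from www.w3.org even when it is not validating. If that is the cause, switching off the Xerces feature http://apache.org/xml/features/nonvalidating/load-external-dtd should avoid the network access. The sketch below assumes a Xerces-based parser such as the one bundled with the JDK; the class name OfflineXhtmlParser is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class OfflineXhtmlParser {

    private static final String LOAD_EXTERNAL_DTD =
            "http://apache.org/xml/features/nonvalidating/load-external-dtd";

    /** Parses XHTML text without fetching the DTD referenced by its DOCTYPE. */
    public static Document parse(String xhtml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            // Xerces-specific feature; other parser implementations may reject it.
            factory.setFeature(LOAD_EXTERNAL_DTD, false);
            DocumentBuilder builder = factory.newDocumentBuilder();
            return builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        String xhtml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
                + "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\"\n"
                + "    \"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n"
                + "<html xmlns=\"http://www.w3.org/1999/xhtml\">\n"
                + "<head><title>My Title</title></head>\n"
                + "<body><p id=\"anId1\">Here is some text1.</p></body>\n"
                + "</html>";
        Document doc = parse(xhtml);
        System.out.println(doc.getElementsByTagName("title").item(0).getTextContent());
    }
}
```

One trade-off: with the DTD skipped, named entities that XHTML defines there (such as &nbsp;) are no longer declared, so a document that uses them would fail to parse.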

Regarding "java - How to build an HTML org.w3c.dom.Document?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/29041855/
