android - ContentHandler makes callbacks in the wrong order

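I am using a ContentHandler to parse custom HTML with CSS styles. The problem is that the ContentHandler behaves strangely when I parse HTML containing a UL tag: it calls startElement(), then endElement(), and only then characters(), so the text arrives after the element has already been closed.

Here is my HTML: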

<html>
<head>
<META http-equiv="Content-Type" content="text/html; charset=UTF-8">
<style>ul.ul1{list-style-type:image;}
</style>
</head>
<body>
<ul class="ul1">List</ul>
<ul class="ul2">List</ul>
</body>
</html>

Here is the sample code I use to test the parser:

public class ContentHandler implements org.xml.sax.ContentHandler {
    public ContentHandler() {
    }

    public Spanned getResult() {
        // Result building is omitted in this sample; placeholder so the class compiles.
        return null;
    }

    @Override
    public void setDocumentLocator(Locator locator) {
    }

    @Override
    public void startDocument() throws SAXException {
    }

    @Override
    public void endDocument() throws SAXException {
    }

    @Override
    public void startPrefixMapping(String prefix, String uri) throws SAXException {
    }

    @Override
    public void endPrefixMapping(String prefix) throws SAXException {
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
        Log.d("html_parser", "start " + localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) throws SAXException {
        Log.d("html_parser", "end " + localName);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        String bodyText = new String(ch, start, length);
        Log.d("html_parser", bodyText);
    }

    @Override
    public void ignorableWhitespace(char[] ch, int start, int length) throws SAXException {
    }

    @Override
    public void processingInstruction(String target, String data) throws SAXException {
    }

    @Override
    public void skippedEntity(String name) throws SAXException {
    }
}

And the LogCat output:

02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ start html
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ start head
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ start meta
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ end meta
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ start style
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ ul.ul1{list-style-type:image;}
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ end style
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ end head
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ start body
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ start ul
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ end ul
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ List
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ start ul
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ end ul
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ List
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ end body
02-13 13:18:41.555  13211-13211/com.example D/html_parser﹕ end html

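Note that HTML without UL tags is parsed correctly. Also note that org.ccil.cowan.tagsoup.jaxp.SAXParserImpl is used for parsing.

Best answer

I tested your problem and found a few interesting things. You are using a SAX parser to parse HTML, and HTML differs from XML in many ways; for example, a tag may sometimes be left open. org.ccil.cowan.tagsoup.jaxp.SAXParserImpl is what lets us parse HTML anyway. This parser also wraps the content in some additional tags: https://github.com/websdotcom/tagsoup#what-tagsoup-does. Look at the HTML in the SaxTest code below. If you give the content the correct structure (li items inside the ul), it is processed normally. So I think this looks like a bug in the TagSoup library.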
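For comparison, here is a minimal sketch that feeds the same ul fragment to the JDK's strict XML parser instead of TagSoup. The class name StrictSaxOrder and the "start"/"text"/"end" labels are made up for this illustration. With well-formed input and a plain XML parser, the text should be reported between the start and end callbacks, which suggests the reordering comes from TagSoup's HTML clean-up rather than from the ContentHandler contract itself.

```java
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

import javax.xml.parsers.SAXParserFactory;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration; not part of the original post.
class StrictSaxOrder {

    /** Parses well-formed XML with the JDK parser and records callbacks in arrival order. */
    static List<String> parse(String xml) {
        final List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                // The default factory is not namespace-aware, so qName carries the tag name.
                events.add("start " + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("end " + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    events.add("text " + text);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            // Wrap checked parser exceptions to keep the sketch short.
            throw new IllegalStateException("parsing failed", e);
        }
        return events;
    }

    public static void main(String[] args) {
        // Prints the callbacks in the order the strict parser delivered them.
        System.out.println(parse("<ul class=\"ul1\">List</ul>"));
    }
}
```

This only works for well-formed markup; the full HTML from the question (with its unclosed META tag) would be rejected by a strict XML parser, which is the reason for using TagSoup in the first place.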

import android.test.AndroidTestCase;
import android.util.Log;

import org.ccil.cowan.tagsoup.jaxp.SAXParserImpl;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

import javax.xml.parsers.SAXParser;

/**
 * Created by kulik on 1/5/14.
 */
public class SaxTest extends AndroidTestCase {
    private static final String TAG = "SaxTest";

    public void testSax() {
        String testString = "<!DOCTYPE html>\n" +
                "<html>\n" +
                "<head>\n" +
                "<META http-equiv=\"Content-Type\" content=\"text/html; charset=UTF-8\">\n" +
                "<style>ul.ul1{list-style-type:image;}\n" +
                "</style>\n" +
                "</head>\n" +
                "<body>\n" +
                "<ul class=\"ul1\">List</ul>\n" +
                "<ul class=\"ul2\">" +
                "<li> li1</li>\n" +
                "<li> li2</li>\n" +
                "</ul>" +
                "</body>\n" +
                "</html>";

        Reader reader = new StringReader(testString);
        try {
            SAXParser sp = SAXParserImpl.newInstance(null);
            XMLReader xr = sp.getXMLReader();

            DefaultHandler myHandler = new ContentHandler();
            xr.setContentHandler(myHandler);
            xr.parse(new InputSource(reader));
        } catch (SAXException e) {
            Log.e(TAG, "", e);
        } catch (IOException e) {
            Log.e(TAG, "", e);
        }
    }

    public class ContentHandler extends DefaultHandler  {

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) throws SAXException {
            Log.d("html_parser", "start " + localName);
        }

        @Override
        public void endElement(String uri, String localName, String qName) throws SAXException {
            Log.d("html_parser", "end " + localName);
        }

        @Override
        public void characters(char[] ch, int start, int length) throws SAXException {
            String bodyText = new String(ch, start, length);
            Log.d("html_parser", bodyText);
        }
    }
}

And the log:

  D/html_parser﹕ start html
  D/html_parser﹕ start head
  D/html_parser﹕ start meta
  D/html_parser﹕ end meta
  D/html_parser﹕ start style
  D/html_parser﹕ ul.ul1{list-style-type:image;}
  D/html_parser﹕ end style
  D/html_parser﹕ end head
  D/html_parser﹕ start body
  D/html_parser﹕ start ul
  D/html_parser﹕ end ul
  D/html_parser﹕ List
  D/html_parser﹕ start ul
  D/html_parser﹕ start li
  D/html_parser﹕ li1
  D/html_parser﹕ end li
  D/html_parser﹕ start li
  D/html_parser﹕ li2
  D/html_parser﹕ end li
  D/html_parser﹕ end ul
  D/html_parser﹕ end body
  D/html_parser﹕ end html

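So you can implement your handler to catch this case, because I think it only affects tags that contain bare text and no child tags (here, a ul without any li). It may be because: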

The semantics of TagSoup are as far as practical those of actual HTML browsers. In particular, never, never will it throw any sort of syntax error: the TagSoup motto is "Just Keep On Truckin'". But there's much, much more. For example, if the first tag is LI, it will supply the application with enclosing HTML, BODY, and UL tags. Why UL? Because that's what browsers assume in this situation. For the same reason, overlapping tags are correctly restarted wheneve.......

http://home.ccil.org/~cowan/XML/tagsoup/

You can also ask the TagSoup team.

Regarding "android - ContentHandler makes callbacks in the wrong order", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/21752923/
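The idea of catching this case in the handler could be sketched roughly as follows. This is only a heuristic under stated assumptions: the class name ReorderingHandler, the event labels, and the rule itself (hold back the "end ul" of a ul that produced no text and no children, and attach the next non-blank text to it) are made up for this illustration and are not TagSoup behaviour or API. It would misattribute text that legitimately follows a truly empty ul.

```java
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

import java.util.ArrayList;
import java.util.List;

// Hypothetical workaround handler; name and heuristic are assumptions.
class ReorderingHandler extends DefaultHandler {
    private final List<String> events = new ArrayList<>();
    private boolean ulIsEmpty;     // the most recently opened ul has seen no text or child yet
    private boolean pendingUlEnd;  // an "end ul" is being held back

    /** Returns the recorded events, emitting any held-back "end ul" first. */
    List<String> getEvents() {
        flushPendingEnd();
        return events;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        flushPendingEnd();
        events.add("start " + localName);
        ulIsEmpty = "ul".equals(localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        flushPendingEnd();
        if ("ul".equals(localName) && ulIsEmpty) {
            pendingUlEnd = true;   // hold back: the text may be delivered next
        } else {
            events.add("end " + localName);
        }
        ulIsEmpty = false;
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        String text = new String(ch, start, length).trim();
        if (!text.isEmpty()) {
            // Recorded before the held-back "end ul", so it lands inside the ul.
            events.add("text " + text);
            ulIsEmpty = false;
        }
        // Whitespace-only text is dropped in this sketch and just releases the held-back end.
        flushPendingEnd();
    }

    private void flushPendingEnd() {
        if (pendingUlEnd) {
            events.add("end ul");
            pendingUlEnd = false;
        }
    }

    public static void main(String[] args) {
        // Replays the callback order from the log: start ul, end ul, then the text.
        ReorderingHandler handler = new ReorderingHandler();
        Attributes none = new AttributesImpl();
        handler.startElement("", "ul", "ul", none);
        handler.endElement("", "ul", "ul");
        handler.characters("List".toCharArray(), 0, 4);
        System.out.println(handler.getEvents());
    }
}
```

In a real handler the recorded strings would be replaced by whatever the application does with each event (for example building the Spanned result); the list is used here only to make the reordering visible.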
