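java - How do I write a custom ContentHandler with Apache Tika?

I want to use Apache Tika to extract the text inside certain tags of an HTML file, for example <dt> and <dd>.

So I am writing a custom ContentHandler that should pull the information out of these tags.

My custom ContentHandler code is shown below. It is not finished yet, but it already fails to work as expected: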

import java.util.HashMap;
import java.util.Map;

import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Extends DefaultHandler (which implements ContentHandler) so the class compiles
// without having to spell out every callback of the interface.
public class TableContentHandler extends DefaultHandler {

    // key = abbreviation
    // value = information / description for abbreviation
    private Map<String, String> abbreviations = new HashMap<String, String>();

    // current abbreviation
    private String abbreviation = null;

    // <dd> element contains abbreviation. So this boolean variable will be set when
    // <dd> element is found
    private boolean ddElementStarted = false;

    // this method is not giving contents within <dd> and </dd> tags
    @Override
    public void characters(char[] chars, int start, int length) throws SAXException {
        if (ddElementStarted) {
            System.out.println("chars found...");
        }
    }

    // set boolean ddElementStarted to true to indicate that content handler found
    // <dd> element
    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) throws SAXException {
        if (localName.equalsIgnoreCase("dd")) {
            ddElementStarted = true;
        }
    }
}

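My assumption here is that once the content handler enters startElement() with the element name dd, I set ddElementStarted = true, and then I pick up the content between <dd> and </dd> by checking that flag in characters().

In characters() I check whether ddElementStarted is true, expecting the chars array to hold the content between <dd> and </dd>, but it does not work :(

I would like to know:

Am I heading in the right direction?
Is this the right way to parse HTML with Tika, or is there another way?
Should I choose a different HTML parsing API, such as JSoup? I only need the information from a few tags and am not interested in the rest of the HTML page.
Is there a way to specify XPath expressions in Apache Tika? I could not find this in the book Tika in Action.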

Best answer

The simple solution is Jsoup. It makes it easy to get the value inside any tag, so there is no need to write a new ContentHandler; just parse the HTML with JSoup.
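
If you would rather stay with a SAX-style handler, the pieces missing from the question's class appear to be an endElement() callback and use of the offset and length arguments of characters(). The following is only a minimal sketch under some assumptions: the class name DefinitionListHandler and the helper extract() are hypothetical, each <dt> is assumed to be followed by its <dd>, and the JDK's own SAX parser is fed a well-formed XHTML string as a stand-in for Tika's HTML parser so that the example runs without extra dependencies. With Tika, the same handler object would instead be passed to the parser's parse() call.

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: pairs each <dt> term with the text of the <dd> that follows it.
class DefinitionListHandler extends DefaultHandler {

    private final Map<String, String> definitions = new LinkedHashMap<>();
    private final StringBuilder text = new StringBuilder();
    private String currentTerm;   // text of the most recent <dt>
    private boolean capturing;    // true while inside a <dt> or <dd>

    private static boolean isTermOrDescription(String name) {
        return "dt".equalsIgnoreCase(name) || "dd".equalsIgnoreCase(name);
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (isTermOrDescription(localName)) {
            text.setLength(0);    // start a fresh buffer for this element
            capturing = true;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // May be called several times per element, so append instead of assigning.
        if (capturing) {
            text.append(ch, start, length);
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("dt".equalsIgnoreCase(localName)) {
            currentTerm = text.toString().trim();
            capturing = false;
        } else if ("dd".equalsIgnoreCase(localName)) {
            if (currentTerm != null) {
                definitions.put(currentTerm, text.toString().trim());
            }
            capturing = false;
        }
    }

    // Assumption: input is well-formed XHTML; the JDK parser stands in for Tika here.
    static Map<String, String> extract(String xhtml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true);   // so that localName is populated
        DefinitionListHandler handler = new DefinitionListHandler();
        factory.newSAXParser().parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.definitions;
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<dl><dt>API</dt><dd>Application Programming Interface</dd></dl>";
        System.out.println(extract(xhtml));
    }
}
```

Buffering in a StringBuilder matters because a SAX parser is free to deliver the text of one element in several characters() calls, and the chars array is only valid between start and start + length.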

This question and its answer come from Stack Overflow: https://stackoverflow.com/questions/19297218/
