java - Mozilla parser for screen scraping

Tags: java dom parsing screen-scraping mozilla

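I'm writing an application that takes the HTML of a page, extracts certain elements from it (tables, for example), and returns the HTML code of those elements. I'm trying to do this in Java with the Mozilla parser to make navigating the page easier, but I'm having trouble extracting the HTML code I need.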

Maybe my whole approach (that is, using the Mozilla parser) is wrong, so if there is a better solution I'm open to suggestions.

String html = /* whatever the page's HTML is */;

MozillaParser p = /* instantiate parser */;


// pass in html to parse which creates a dom object
Document d = p.parse(html);

// get a list of all the form elements in the page
NodeList l =  d.getElementsByTagName("form");

// iterate through all forms
for(int i = 0; i < l.getLength(); i++){

    // get a form
    Node n = l.item(i);

    // Print the HTML code for just this form.
    // This is the part I haven't figured out.
    // I made up the innerHTML method, but that's
    // the end result I want: a way to see
    // the HTML code for a particular node.
    System.out.println( n.innerHTML() );
}

Best answer

The Mozilla parser seems like overkill here. I've used Jericho with some success for the kind of thing you're doing.
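On the narrower question of printing the markup for a single node: if the parser hands back a standard org.w3c.dom.Document, as the code in the question suggests, one option is to serialize the node with the JDK's own javax.xml.transform classes. The following is only a sketch under that assumption. The class name NodeMarkup and the helpers outerHtml and parseWellFormed are made up for illustration, and the JDK's DocumentBuilder stands in for the Mozilla parser here, so the sample input has to be well-formed markup.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NodeMarkup {

    // Hypothetical helper: serializes one DOM node and its subtree to a string,
    // using an identity Transformer (no stylesheet).
    public static String outerHtml(Node node) {
        try {
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            t.setOutputProperty(OutputKeys.METHOD, "html");
            t.setOutputProperty(OutputKeys.INDENT, "no");
            StringWriter out = new StringWriter();
            t.transform(new DOMSource(node), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalStateException("Could not serialize node", e);
        }
    }

    // Stand-in for the HTML parser (an assumption for this sketch): the JDK's
    // XML parser only accepts well-formed markup, unlike a real HTML parser.
    public static Document parseWellFormed(String markup) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(markup)));
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse markup", e);
        }
    }

    public static void main(String[] args) {
        String html = "<html><body>"
                + "<form action=\"/a\"><input name=\"q\"/></form>"
                + "<p>text</p>"
                + "</body></html>";
        Document d = parseWellFormed(html);

        // Same loop as in the question, with the made-up innerHTML()
        // replaced by the serialization helper.
        NodeList forms = d.getElementsByTagName("form");
        for (int i = 0; i < forms.getLength(); i++) {
            System.out.println(outerHtml(forms.item(i)));
        }
    }
}
```

The serializer writes the node from its DOM representation, so the result may differ from the original source text in details such as attribute quoting or how empty elements are written. To get only the content of an element rather than the element itself, the same helper could be applied to each child node in turn.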

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/1494668/
