java - Is there a library for Java similar to lxml or nokogiri?

Tags: java screen-scraping


I want to do some screen scraping, preferably with CSS selectors rather than XPath. Is there a Java library similar to the ones available in Ruby or Python?

Best answer

There are many screen-scraping libraries written in Java. To name a few:

  • TagSoup - a SAX-compliant parser written in Java that, instead of parsing well-formed or valid XML, parses HTML as it is found in the wild: nasty and brutish, though quite often far from short. TagSoup is designed for people who have to process this stuff using some semblance of a rational application design. By providing a SAX interface, it allows standard XML tools to be applied to even the worst HTML.
  • Jericho HTML Parser - Jericho HTML Parser is a simple but powerful Java library allowing analysis and manipulation of parts of an HTML document, including some common server-side tags, while reproducing verbatim any unrecognised or invalid HTML. It also provides high-level HTML form manipulation functions. It is neither an event nor tree based parser, but rather uses a combination of simple text search, efficient tag recognition and a tag position cache. The text of the whole source document is first loaded into memory, and then only the relevant segments are searched for the relevant characters of each search operation.
  • HTML Cleaner - HtmlCleaner reorders individual elements and produces well-formed XML from dirty HTML. It follows rules similar to those most web browsers use to create the document object model. A user may provide a custom tag and rule set for tag filtering and balancing.
  • NekoHTML - NekoHTML is a simple HTML scanner and tag balancer that enables application programmers to parse HTML documents and access the information using standard XML interfaces. The parser can scan HTML files and "fix up" many common mistakes that human (and computer) authors make in writing HTML documents. NekoHTML adds missing parent elements; automatically closes elements with optional end tags; and can handle mismatched inline element tags.
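
To illustrate what "standard XML tools" means for the SAX-style parsers above, here is a minimal sketch of a SAX handler that collects link targets. It uses only the JDK's built-in parser, which requires well-formed input; the class and method names (`SaxLinkCollector`, `collectLinks`) are made up for this example. The assumption is that a lenient reader such as TagSoup's `org.ccil.cowan.tagsoup.Parser` could be passed in place of the JDK reader to handle real-world HTML, since it exposes the same `XMLReader` interface.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class SaxLinkCollector {

    /** Records the href attribute of every <a> element the reader reports. */
    static class LinkHandler extends DefaultHandler {
        final List<String> links = new ArrayList<>();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if ("a".equalsIgnoreCase(qName)) {
                String href = atts.getValue("href");
                if (href != null) {
                    links.add(href);
                }
            }
        }
    }

    /** Parses the markup with the given reader and returns the collected hrefs. */
    static List<String> collectLinks(XMLReader reader, String markup) throws Exception {
        LinkHandler handler = new LinkHandler();
        reader.setContentHandler(handler);
        reader.parse(new InputSource(new StringReader(markup)));
        return handler.links;
    }

    public static void main(String[] args) throws Exception {
        // JDK reader: strict, so the sample below is well-formed on purpose.
        // Assumption: a lenient XMLReader (e.g. TagSoup's) could be swapped in here.
        XMLReader reader = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
        String xhtml = "<html><body><p><a href=\"/one\">One</a> "
                + "<a href=\"/two\">Two</a></p></body></html>";
        System.out.println(collectLinks(reader, xhtml)); // prints [/one, /two]
    }
}
```

The handler only depends on the SAX interfaces, so the choice of parser stays a one-line decision in the calling code.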

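There are many more HTML Screen Scraping Tools written in Java. But as I mentioned in this previous answer, these are IMO the best options for dealing with any kind of content (they cope with all kinds of garbage). That may not be a concern for you, though.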

Just in case, maybe have a look at the thread Nokogiri pure Java status.

Update: A new project, jsoup, has been released (2010-01-31). It provides a selector-syntax to find elements. For details, see its website and/or this answer from its author.
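
Since the question asks for CSS selectors, here is a minimal sketch of what that looks like with jsoup. It assumes the jsoup jar is on the classpath; the class name `JsoupSelectorExample`, the helper `titles`, and the sample markup are invented for illustration and are not taken from the jsoup site.

```java
import java.util.ArrayList;
import java.util.List;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupSelectorExample {

    /** Returns "text -> href" for each link inside an h2 that is a direct child of div.post. */
    static List<String> titles(String html) {
        Document doc = Jsoup.parse(html);
        List<String> out = new ArrayList<>();
        // CSS selector: child combinator (>), descendant combinator (space), attribute test ([href])
        for (Element link : doc.select("div.post > h2 a[href]")) {
            out.add(link.text() + " -> " + link.attr("href"));
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sample page; the third link sits outside the h2 and is not matched.
        String html = "<div class=\"post\"><h2><a href=\"/a\">First</a></h2></div>"
                + "<div class=\"post\"><h2><a href=\"/b\">Second</a></h2>"
                + "<a href=\"/c\">ignored</a></div>";
        for (String line : titles(html)) {
            System.out.println(line);
        }
    }
}
```

For a live page, `Jsoup.connect(url).get()` returns the same `Document` type, so the selector code would stay unchanged.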

This question and its answer come from Stack Overflow: https://stackoverflow.com/questions/2122779/
