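html - At what point does using handler functions improve HTML parsing efficiency?

Tags: html xml r

This question is about R. It is also tagged [xml] and [html] in case users of those tags have any input on the question.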


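With the XML package, I have always assumed that using handler functions, which act on the HTML document while it is being built at the C level, would improve overall efficiency. However, I have been trying for a while to find a situation where that actually holds.

I suspect I may not be looking at this in the right context (perhaps handlers are more useful on larger, more deeply nested documents?). In any case, here is what I have tried.

Take the following two examples.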


library(XML)
library(microbenchmark)
u <- "http://www.baseball-reference.com"

Example 1: Get the attributes of all nodes named "input" (the search form elements)

withHandler1 <- function() {
    h <- function() {
        input <- character()
        list(input = function(node, ...) {
            input <<- c(input, list(xmlAttrs(node, ...)))
            node
        },
            value = function() input)
    }
    h1 <- h()
    htmlParse(u, handler = h1)
    h1$value()
}

withoutHandler1 <- function() {
    xmlApply(htmlParse(u)["//input"], xmlAttrs)
}

identical(withHandler1(), withoutHandler1())
# [1] TRUE

microbenchmark(withHandler1(), withoutHandler1(), times = 25L)
# Unit: milliseconds
#              expr      min       lq     mean   median       uq     max neval cld
#    withHandler1() 944.6507 1001.419 1051.602 1020.347 1097.073 1315.23    25   a
# withoutHandler1() 964.6079 1006.799 1040.905 1039.993 1069.029 1126.49    25   a

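OK, that is a very basic example, but the timings are almost identical, and I suspect they would converge if I ran it the default 100 times.


Example 2: Get a subset of the attributes of all nodes named "input"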

withHandler2  <- function() {    
    searchBoxHandler <- function(attr = character()) {
        input <- character()
        list(input = function(node, ...) {
            input <<- c(input, list(
                if(identical(attr, character())) xmlAttrs(node, ...)
                else vapply(attr[attr %in% names(xmlAttrs(node))],
                    xmlGetAttr, "", node = node)
            ))
            node
        },
            value = function() input)
    }
    h1 <- searchBoxHandler(attr = c("id", "type"))
    htmlParse(u, handler = h1)
    h1$value()
}    

withoutHandler2 <- function() {
    xmlApply(htmlParse(u)["//input"], function(x) {
        ## Note: match() used only to return identical objects
        xmlAttrs(x)[na.omit(match(c("id", "type"), names(xmlAttrs(x))))]
    })
}

identical(withHandler2(), withoutHandler2())
# [1] TRUE

microbenchmark(withHandler2(), withoutHandler2(), times = 25L)
# Unit: milliseconds
#              expr      min        lq     mean   median       uq      max neval cld
#    withHandler2() 966.0951 1010.3940 1129.360 1038.206 1119.642 2075.070    25   a
# withoutHandler2() 962.8655  999.4754 1166.231 1046.204 1118.661 2385.782    25   a

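Again, very basic, and again the timings are nearly the same.


So my question is: why use handler functions at all? For these examples, writing the handlers turned out to be wasted effort. Are there specific operations that are expensive enough that using handler functions while parsing HTML would give a significant improvement in speed and efficiency?

Best answer

Quoting the Wikipedia article on XML, "Programming interfaces" section: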

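Existing APIs for XML processing tend to fall into these categories:

  1. Stream-oriented APIs accessible from a programming language, for example SAX and StAX.
  2. Tree-traversal APIs accessible from a programming language, for example DOM.
  3. XML data binding, which provides an automated translation between an XML document and programming-language objects.
  4. Declarative transformation languages such as XSLT and XQuery.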

Stream-oriented facilities require less memory and, for certain tasks which are based on a linear traversal of an XML document, are faster and simpler than other alternatives. Tree-traversal and data-binding APIs typically require the use of much more memory, but are often found more convenient for use by programmers; some include declarative retrieval of document components via the use of XPath expressions. XSLT is designed for declarative description of XML document transformations, and has been widely implemented both in server-side packages and Web browsers. XQuery overlaps XSLT in its functionality, but is designed more for searching of large XML databases.

Clearly, performance is not the only thing to consider. For example:

SAX is fast and efficient to implement, but difficult to use for extracting information at random from the XML, since it tends to burden the application author with keeping track of what part of the document is being processed. It is better suited to situations in which certain types of information are always handled the same way, no matter where they occur in the document.
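To make the stream-oriented style concrete, here is a minimal sketch using `xmlEventParse()` from the XML package. The sample document `doc_txt` and the helper name `make_input_collector` are made up for illustration and do not come from the original post. The input is deliberately well-formed, because `xmlEventParse()` is an XML event parser rather than an HTML one. No tree is built; the callback only sees one opening tag at a time and has to keep its own state.

```r
library(XML)

# Hypothetical, well-formed sample document (not from the original post).
doc_txt <- '<html><body>
  <form><input id="q" type="text"/><input id="go" type="submit"/></form>
  <p>no inputs here</p>
</body></html>'

# Closure that keeps its own state between callbacks.
make_input_collector <- function() {
    found <- list()
    list(
        # Called once per opening tag; attrs holds the tag's attributes
        # as a named character vector.
        startElement = function(name, attrs, ...) {
            if (name == "input") found <<- c(found, list(attrs))
        },
        value = function() found
    )
}

h <- make_input_collector()
# Only the callback itself is handed to the parser.
xmlEventParse(doc_txt, handlers = list(startElement = h$startElement),
              asText = TRUE)
sax_result <- h$value()
```

Note how the bookkeeping (`found`) lives entirely in the application code, which is the burden the quoted passage describes.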

On the other hand:

The Document Object Model (DOM) is an interface-oriented application programming interface that allows for navigation of the entire document as if it were a tree of node objects representing the document's contents. A DOM document can be created by a parser, or can be generated manually by users (with limitations). Data types in DOM nodes are abstract; implementations provide their own programming language-specific bindings. DOM implementations tend to be memory intensive, as they generally require the entire document to be loaded into memory and constructed as a tree of objects before access is allowed.
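For contrast, a tree-based sketch with the same package might look like the following. The sample string is again hypothetical. The whole document is parsed into an in-memory tree first, and only then queried; the payoff is that any node can be reached from any other, for example walking from each `input` up to its parent.

```r
library(XML)

# Hypothetical sample page (not from the original post).
doc_txt <- paste0(
    '<html><body><form>',
    '<input id="q" type="text"><input id="go" type="submit">',
    '</form></body></html>'
)

# Build the complete tree in memory, then query it afterwards.
doc <- htmlParse(doc_txt, asText = TRUE)
inputs <- getNodeSet(doc, "//input")
dom_result <- lapply(inputs, xmlAttrs)

# Random access: navigate from each matched node to its parent element.
parent_names <- vapply(inputs, function(n) xmlName(xmlParent(n)), "")
```

The convenience comes at the cost the quotation mentions: the full tree has to exist before the first query runs.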

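To sum up:

Your example is not a real-world case. The data can be much larger, and only then will the circumstances determine which interface is the best one to use.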
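One way to explore this, sketched here under stated assumptions, is to generate a synthetic page of adjustable size and time both approaches on in-memory text, so that download time does not enter the measurement. The helpers `make_page`, `count_with_handler`, and `count_with_xpath` are hypothetical names introduced for this sketch; the handler form (`input = function(node, ...)`) is the one used in the question. No timings are claimed here, and results will vary by machine and document.

```r
library(XML)

# Hypothetical synthetic page: n repeated forms, built in memory.
make_page <- function(n) {
    forms <- sprintf('<form><input id="i%d" type="text"/></form>', seq_len(n))
    paste0("<html><body>", paste(forms, collapse = ""), "</body></html>")
}

# Count <input> nodes with a handler that fires during parsing.
count_with_handler <- function(txt) {
    n <- 0L
    h <- list(input = function(node, ...) {
        n <<- n + 1L
        node
    })
    htmlParse(txt, asText = TRUE, handlers = h)
    n
}

# Count <input> nodes by parsing first and querying the tree afterwards.
count_with_xpath <- function(txt) {
    length(getNodeSet(htmlParse(txt, asText = TRUE), "//input"))
}

page <- make_page(2000L)

# Both approaches should agree before their timings are compared.
stopifnot(count_with_handler(page) == count_with_xpath(page))

system.time(count_with_handler(page))
system.time(count_with_xpath(page))
```

Increasing the argument to `make_page()` gives a rough sense of how each approach scales with document size.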

Regarding "html - At what point does using handler functions improve HTML parsing efficiency?", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/29043280/
