jquery - Any ideas for a jQuery equivalent of the Readability code? (or: building the best heuristic to find the main text using jQuery)

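http://lab.arc90.com/experiments/readability/ is a very handy tool for viewing cluttered newspaper, journal, and blog pages in a very readable way. It does this by applying some heuristics to find the relevant main text of a web page. Its source code is also available at http://lab.arc90.com/experiments/readability/js/readability.js

Some of my colleagues drew my attention to it while I was struggling with jQuery to grab the "main text" of arbitrary newspaper, journal, blog, and similar sites. My current heuristic (and its jQuery implementation) looks like this (it runs inside a Firefox Jetpack package):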

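// MIN_TEXT_LENGTH, results, and wholeText are assumed to be defined earlier in the Jetpack code.
$(doc).find("div > p").each(function (index) {
    var textStr = $(this).text();

    /*
     We need the pieces of text that are long and in natural language,
     and not some JS code snippets
    */
    // indexOf returns -1 when the substring is absent, so compare against -1
    // (the original "<= 0" also accepted text that starts with "<script").
    if (textStr.length > MIN_TEXT_LENGTH && textStr.indexOf("<script") === -1) {
        console.log(index);
        console.log(textStr.length);
        console.log(textStr);
        $(this).attr("id", "clozefox_paragraph_" + index);
        results.push(index);

        wholeText = wholeText + " " + textStr;
    }
});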

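So this is like "grab the paragraphs inside DIVs and check for irrelevant strings such as 'script'". I have tried it, and most of the time it does grab the main text of web articles, but I would like a better heuristic, or a better (maybe even shorter?) jQuery selection mechanism.

Do you have a better suggestion?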

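PS: Maybe "find the innermost DIVs (that is, DIVs that have no child element of DIV type) and grab only their <p> elements" would be a better heuristic for my current purposes, but I could not figure out how to express this in jQuery.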
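The "innermost DIV" idea from the PS can probably be expressed with jQuery's `:has()` and `:not()` selectors. The following is only a sketch: the function name `collectInnermostParagraphs` and its `minTextLength` parameter are hypothetical, and jQuery is passed in as `$` so that the function does not depend on a global.

```javascript
// Matches <p> elements that are direct children of a DIV containing no other DIV.
// :has() is a jQuery selector extension, so this string is meant for jQuery,
// not necessarily for native querySelectorAll.
var INNERMOST_P_SELECTOR = "div:not(:has(div)) > p";

// Hypothetical helper: returns the text of every sufficiently long paragraph
// found in an innermost DIV of `doc`.
function collectInnermostParagraphs($, doc, minTextLength) {
  var texts = [];
  $(doc).find(INNERMOST_P_SELECTOR).each(function () {
    var text = $(this).text();
    if (text.length > minTextLength) {
      texts.push(text);
    }
  });
  return texts;
}
```

Inside the Jetpack code above this could be called as `collectInnermostParagraphs($, doc, MIN_TEXT_LENGTH)`. Note that pages which wrap every paragraph's container in further layout DIVs may still need an extra filter on top of this selector.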

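Best answer

This is usually done by analyzing the "link density" of the elements on a page. The higher an element's link density, the less likely it is to be content. This is a good place to start thinking about content extraction techniques and algorithms: http://www.quora.com/Whats-the-best-method-to-extract-article-text-from-HTML-documents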
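As a rough illustration of the link density idea, the sketch below assumes link density means the share of an element's text characters that sit inside `<a>` tags. The names `linkDensity`, `elementLinkDensity`, and `isLikelyContent`, and the threshold passed to it, are hypothetical and not taken from the answer; the jQuery wrapper assumes jQuery is loaded as the global `$`.

```javascript
// Pure helper: fraction of text characters that belong to links (0 to 1).
function linkDensity(totalTextLength, linkTextLength) {
  if (totalTextLength <= 0) {
    return 0; // an empty element has no meaningful density
  }
  return linkTextLength / totalTextLength;
}

// jQuery wrapper: measures the link density of one wrapped element.
function elementLinkDensity($el) {
  var totalLength = $el.text().length;
  var linkLength = 0;
  $el.find("a").each(function () {
    linkLength += $(this).text().length;
  });
  return linkDensity(totalLength, linkLength);
}

// Hypothetical cutoff check: elements above maxDensity are treated as
// navigation or other non-content blocks.
function isLikelyContent(density, maxDensity) {
  return density <= maxDensity;
}
```

A paragraph loop like the one in the question could then skip any element for which `isLikelyContent(elementLinkDensity($(this)), 0.3)` is false; the value 0.3 is only a starting guess and would need tuning against real pages.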

A similar question can be found on Stack Overflow: https://stackoverflow.com/questions/1947199/
