html - Scraping the largest block of text from an HTML document

Tags: html screen-scraping text-extraction html-content-extraction

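I am working on an algorithm that, given an HTML file, tries to pick out the parent element it thinks is most likely to contain the majority of the page's content text. For example, it would pick the div "content" in the following HTML: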

<html>
   <body>
      <div id="header">This is the header we don't care about</div>
      <div id="content">This is the <b>Main Page</b> content.  it is the
      longest block of text in this document and should be chosen as
      most likely being the important page content.</div>
   </body>
</html>

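I have come up with a few ideas, such as traversing the HTML document tree down to its leaves and adding up the length of the text, then only looking at what other text a parent has if the parent gives us more content than its children do.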
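To make that idea concrete, here is a rough sketch using only Python's standard library. It is one possible reading of the heuristic, not a tested solution: it builds a small element tree, sums the text length under each element, and walks down from the root for as long as a single child holds more than half of its parent's text. The names `Node`, `TreeBuilder`, `pick_content_node`, `find_main_content` and the `dominance` threshold of 0.5 are all made up for illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so we must not descend into them.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose text is code or styling, not page content.
SKIPPED_TAGS = {"script", "style"}


class Node:
    """A minimal element node that only tracks what the heuristic needs."""

    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []
        self.own_text_len = 0  # characters sitting directly inside this element

    def total_text_len(self):
        # Own text plus the text of every descendant (recomputed on each call).
        return self.own_text_len + sum(c.total_text_len() for c in self.children)


class TreeBuilder(HTMLParser):
    """Builds a Node tree while the standard-library parser streams tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("[document]", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this tag; ignore stray closers.
        node = self.current
        while node is not None and node.tag != tag:
            node = node.parent
        if node is not None and node.parent is not None:
            self.current = node.parent

    def handle_data(self, data):
        if self.current.tag not in SKIPPED_TAGS:
            # Collapse whitespace so indentation does not count as content.
            self.current.own_text_len += len(" ".join(data.split()))


def pick_content_node(root, dominance=0.5):
    """Descend while one child holds more than `dominance` of the parent's text."""
    node = root
    while node.children:
        best = max(node.children, key=lambda c: c.total_text_len())
        if best.total_text_len() <= dominance * node.total_text_len():
            break  # the parent contributes too much itself, so keep the parent
        node = best
    return node


def find_main_content(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()  # flush any buffered text
    return pick_content_node(builder.root)


sample = """<html><body>
<div id="header">This is the header we don't care about</div>
<div id="content">This is the <b>Main Page</b> content.  it is the
longest block of text in this document and should be chosen as
most likely being the important page content.</div>
</body></html>"""

node = find_main_content(sample)
print(node.tag, node.attrs.get("id"))
```

The descent stops at the div rather than at the `<b>` inside it, because the bold text is only a small share of the div's total. Real pages with navigation lists, comments, or text spread evenly across sibling containers would likely need a smarter score than raw character counts.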

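Has anyone tried something like this, or does anyone know of an algorithm that could be applied? It doesn't have to be rock-solid, but as long as it can guess the container that holds most of the page's content text (an article or a blog post, for example), that would be awesome.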

Best answer

One word: Boilerpipe

A similar question about html - Scraping the largest block of text from an HTML document can be found on Stack Overflow: https://stackoverflow.com/questions/289468/