algorithm - Top-down or bottom-up approach for searching elements in an HTML DOM document?

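Suppose I am using a recursive loop to resiliently discover and locate DOM elements, and it has to work across a website's semi-structured, semi-uniform HTML DOM documents.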

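For example, when scraping a link from a website, I may find slight variations in its xpath location. Resilience is needed so that crawling can continue flexibly without interruption.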

1) I know that I want a link located in a certain region of the page that is distinguishable from the rest (e.g. menus, footer, header).

2) It's distinguishable since it appears to be inside a table and a paragraph or container.

3) There can be an acceptable level of unexpected parents or children before this desired link mentioned in 1) but I don't know what. More unexpected elements would mean departure from 1).

4) Identifying via element's id and class or any other unique attribute value is not desired.

I think the following xpath sums it up:

//p/table/tr/td/a

On some pages the xpath varies, but it still matches the desired link from 1):

//p/div/table/tr/td/a or //p/div/span/span/table/tr/td/b/a

I use indentation to simulate each loop iteration.

(Should I use plural or singular? children vs. child, parents vs. parent. I think singular makes sense, since the concern here is the immediate parent or child.)

Top-down search:

how many p's are there ?
 how many of these p's have table as child ? If none, search the next sub level.
   how many of these table's have tr as child ? If none, search the next sub level.
     how many of these tr have td as child ? If none, search the next sub level.
      how many of these td have a as child ?
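
The top-down walk above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the page is well-formed XHTML so that the standard library's `xml.etree.ElementTree` can parse it, and the names `descend`, `find_links_top_down`, and `budget` are hypothetical. `budget` caps the total number of unexpected elements stepped through on one branch, which is one possible reading of "search next sub level".

```python
import xml.etree.ElementTree as ET

def descend(node, expected, budget):
    """Follow the `expected` tag names downward from `node`.

    An element with an unexpected tag is stepped through at the cost of one
    unit of `budget`; a branch is dropped once the budget is used up.
    """
    if not expected:                      # every expected tag was matched
        yield node
        return
    for child in node:
        if child.tag == expected[0]:
            yield from descend(child, expected[1:], budget)
        elif budget > 0:                  # "search next sub level"
            yield from descend(child, expected, budget - 1)

def find_links_top_down(root, path=("p", "table", "tr", "td", "a"), budget=2):
    found = []
    for start in root.iter(path[0]):      # "how many p's are there ?"
        for link in descend(start, path[1:], budget):
            if link not in found:         # nested <p> elements could repeat a hit
                found.append(link)
    return found

page = ET.fromstring(
    "<html><body>"
    "<p><div><table><tr><td><a href='wanted'>x</a></td></tr></table></div></p>"
    "<div><a href='menu'>m</a></div>"
    "</body></html>"
)
print([a.get("href") for a in find_links_top_down(page)])  # ['wanted']
```

Note that every `p` subtree is explored up to the budget even when it contains no matching link at all, which is the wasted work a top-down search risks.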

Bottom-up search:

how many a's are there ?
 how many of these a's have td as parent ? If none, look up to the next super level.
  how many of these td have tr as parent ? If none, look up to the next super level.
   how many of these tr have table as parent ? If none, look up to the next super level.
    how many of these table have p as a parent ? If none, look up to the next super level.

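Does it matter whether the search is top-down or bottom-up? I feel that top-down is useless and inefficient if the desired anchor link turns out not to be there at the end of the loop.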

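I think I would also measure how many unexpected parents or children are found in each iteration of the loop and compare that with a preset constant I'm comfortable with, e.g. no more than 2. If 3 or more unexpected parent or child iterations occur before the desired anchor link is found, I would assume it is not what I'm looking for.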
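
A minimal sketch of the bottom-up walk combined with such a cap might look like this. It assumes well-formed XHTML input (parsed with the standard library's `xml.etree.ElementTree`) and counts unexpected ancestors in total across the whole path rather than per level; the names `find_links_bottom_up` and `max_unexpected` are made up for illustration.

```python
import xml.etree.ElementTree as ET

def find_links_bottom_up(root, expected=("td", "tr", "table", "p"), max_unexpected=2):
    """Return <a> elements whose ancestors contain `expected` in order,
    tolerating at most `max_unexpected` other ancestors along the way."""
    # ElementTree has no parent pointers, so build a child -> parent map once.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    matches = []
    for link in root.iter("a"):           # "how many a's are there ?"
        node, idx, unexpected = parent_of.get(link), 0, 0
        while node is not None and idx < len(expected) and unexpected <= max_unexpected:
            if node.tag == expected[idx]:
                idx += 1                  # the expected ancestor was found
            else:
                unexpected += 1           # "look up to the next super level"
            node = parent_of.get(node)
        if idx == len(expected) and unexpected <= max_unexpected:
            matches.append(link)
    return matches

page = ET.fromstring(
    "<html><body>"
    "<p><div><table><tr><td><a href='wanted'>x</a></td></tr></table></div></p>"
    "<div><a href='menu'>m</a></div>"
    "</body></html>"
)
print([a.get("href") for a in find_links_bottom_up(page)])  # ['wanted']
```

Here a link outside the target region is abandoned after at most three ancestor steps. Under this total-count reading, the variant `//p/div/span/span/table/tr/td/b/a` contains four unexpected elements, so it would need `max_unexpected=4` to be accepted.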

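Is this the right approach? It's just something I came up with. I apologize if the question is unclear; I did my best. I'd love to get some input on this algorithm.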

Best answer

It seems you want something like this:

//p//table//a
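
As a rough illustration, this loose expression can be tried with Python's standard library, assuming the page is well-formed XHTML. `xml.etree.ElementTree` supports only a subset of XPath, and a path evaluated on an element has to be relative, hence the leading `.`; the sample markup is made up.

```python
import xml.etree.ElementTree as ET

page = ET.fromstring(
    "<html><body>"
    "<p><div><table><tr><td><a href='wanted'>x</a></td></tr></table></div></p>"
    "<div><a href='menu'>m</a></div>"
    "</body></html>"
)
# Any <a> below a <table> below a <p>, with any number of elements in between.
links = page.findall(".//p//table//a")
print([a.get("href") for a in links])  # ['wanted']
```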

If you have a limit on the number of intermediate elements in the path, say no more than 2, then the above would be modified to:

//p[not(ancestor::*[3])]
      //table[ancestor::*[1][self::p] or ancestor::*[2][self::p]]
               /tr/td//a[ancestor::*[1][self::td] or ancestor::*[2][self::td]]

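This selects all a elements whose parent or grandparent is a td, whose parent is a tr, whose parent is a table, whose parent or grandparent is a p that has fewer than 3 ancestor elements.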

Regarding "algorithm - Top-down or bottom-up approach for searching elements in an HTML DOM document?", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/4567066/
