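Suppose I am using a recursive loop to resiliently discover and locate DOM elements, and it has to work across a website's semi-structured, semi-uniform HTML DOM documents.
For example, when crawling a link on a website, I come across slight variations in its XPath location. The search needs to be resilient so that crawling can continue flexibly and without interruption.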
1) I know that I want a link located in a certain region of the page that is distinguishable from the rest (e.g. menus, footer, header, etc.).
2) It is distinguishable because it appears inside a table and a paragraph or container.
3) There can be an acceptable number of unexpected parents or children before the desired link mentioned in 1), but I don't know what they will be. More unexpected elements would mean a departure from 1).
4) Identifying it via the element's id, class, or any other unique attribute value is not desired.
I figured the XPath below should sum it up:
//p/table/tr/td/a
On some pages the XPath varies, but the link still qualifies as the desired link from 1):
//p/div/table/tr/td/a
or
//p/div/span/span/table/tr/td/b/a
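I used indentation to simulate each loop iteration.
(Should I use plural or singular: children vs. child, parents vs. parent? I think singular makes sense, since the concern here is the immediate parent or child.)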
Top-down search:
How many p's are there?
  How many of these p's have a table as a child? If none, search the next sublevel.
    How many of these tables have a tr as a child? If none, search the next sublevel.
      How many of these tr's have a td as a child? If none, search the next sublevel.
        How many of these td's have an a as a child?
Bottom-up search:
How many a's are there?
  How many of these a's have a td as a parent? If none, look up to the next superlevel.
    How many of these td's have a tr as a parent? If none, look up to the next superlevel.
      How many of these tr's have a table as a parent? If none, look up to the next superlevel.
        How many of these tables have a p as a parent? If none, look up to the next superlevel.
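Does it matter whether the search is top-down or bottom-up? I feel that top-down is wasteful and inefficient if the desired anchor link turns out not to be found by the end of the loop.
I think I would also count how many unexpected parents or children are discovered in each iteration of the loop and compare that against a preset constant I am comfortable with, e.g. no more than 2. If 3 or more unexpected parent or child iterations occur before my desired anchor link is discovered, I would assume it is not what I am looking for.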
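A minimal sketch of the bottom-up walk with a tolerance budget, in Python with only the standard library's xml.etree.ElementTree. It assumes the page has already been parsed into a well-formed tree (real-world HTML usually needs an HTML parser first). The names find_links, expected, and max_unexpected are hypothetical, chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def find_links(root, expected=("td", "tr", "table", "p"), max_unexpected=2):
    """Bottom-up search: return the <a> elements whose ancestors contain the
    tags in `expected` in order (nearest first), with at most
    `max_unexpected` other elements in between.

    All names and defaults here are illustrative assumptions.
    """
    # ElementTree keeps no parent pointers, so build a child -> parent map once.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    matches = []
    for link in root.iter("a"):
        remaining = list(expected)
        unexpected = 0
        node = parent_of.get(link)
        # Climb until the chain is complete, the budget is spent,
        # or there is no further ancestor.
        while node is not None and remaining and unexpected <= max_unexpected:
            if node.tag == remaining[0]:
                remaining.pop(0)   # the next expected ancestor was found
            else:
                unexpected += 1    # an element that is not part of the chain
            node = parent_of.get(node)
        if not remaining:
            matches.append(link)
    return matches


page = ET.fromstring(
    "<html><body>"
    "<p><div><table><tr><td><a href='/wanted'>wanted</a></td></tr></table></div></p>"
    "<div id='footer'><a href='/about'>about</a></div>"
    "</body></html>"
)
print([a.get("href") for a in find_links(page)])  # prints ['/wanted']
```

Starting from the candidate links means each a element is visited once and abandoned as soon as the budget is exceeded, which is one possible reason to prefer the bottom-up direction here.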
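Is this the correct approach? It is just something I came up with off the top of my head. I apologize if this question is not clear; I have tried my best. I would love to get some input on this algorithm.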
Best answer
It seems you want something like this:
//p//table//a
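To see how loose this expression is, here is a small sketch using Python's standard xml.etree.ElementTree, whose limited XPath support covers the descendant step //. The sample markup and the helper name loose_matches are made up for illustration, and the path starts with . because ElementTree only evaluates paths relative to an element.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed fragments mirroring the three paths from the question.
SAMPLES = {
    "direct": "<body><p><table><tr><td><a href='/1'/></td></tr></table></p></body>",
    "one div": "<body><p><div><table><tr><td><a href='/2'/></td></tr></table></div></p></body>",
    "spans and b": (
        "<body><p><div><span><span><table><tr><td><b><a href='/3'/></b></td></tr>"
        "</table></span></span></div></p></body>"
    ),
    "footer": "<body><div><a href='/about'/></div></body>",
}


def loose_matches(markup):
    """Return the href of every <a> selected by the unbounded //p//table//a."""
    root = ET.fromstring(markup)
    return [a.get("href") for a in root.findall(".//p//table//a")]


for name, markup in SAMPLES.items():
    print(name, loose_matches(markup))
```

All three variants from the question are selected and the footer link is not, but a link nested arbitrarily deep below the table would be selected as well, since // places no limit on the number of intermediate elements.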
If you have a limit on the number of intermediate elements in the path, say no more than 2, then the above would be modified to:
//p[not(ancestor::*[3])]
//table[ancestor::*[1][self::p] or ancestor::*[2][self::p]]
/tr/td//a[ancestor::*[1][self::td] or ancestor::*[2][self::td]]
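This selects all a elements whose parent or grandparent is a td, whose parent is a tr, whose parent is a table, whose parent or grandparent is a p that has fewer than 3 ancestor elements.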
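The standard library's ElementTree does not support the ancestor axis, so the following is a hand-written restatement of the conditions in the sentence above in plain Python, not an evaluation of the XPath itself. It is only a sketch: ancestors, is_bounded_match, and select_links are hypothetical helper names, and the result may differ from the XPath in corner cases such as nested p or td elements. Note that "parent or grandparent" allows at most one intermediate element at each flexible step.

```python
import xml.etree.ElementTree as ET


def ancestors(node, parent_of):
    """Return the element ancestors of `node`, nearest first."""
    chain = []
    while node in parent_of:
        node = parent_of[node]
        chain.append(node)
    return chain


def is_bounded_match(link, parent_of):
    """Check the chain a <- td <- tr <- table <- p with one optional gap
    between a and td, and one between table and p."""
    chain = ancestors(link, parent_of)
    for i, node in enumerate(chain[:2]):           # parent or grandparent is td
        if node.tag != "td":
            continue
        above = chain[i + 1:]
        if [n.tag for n in above[:2]] != ["tr", "table"]:
            continue                               # td -> tr -> table must be direct
        for candidate in above[2:4]:               # table's parent or grandparent is p
            if candidate.tag == "p" and len(ancestors(candidate, parent_of)) < 3:
                return True                        # p has fewer than 3 ancestors
    return False


def select_links(markup):
    """Parse well-formed markup and return the href of every matching <a>."""
    root = ET.fromstring(markup)
    parent_of = {child: parent for parent in root.iter() for child in parent}
    return [a.get("href") for a in root.iter("a") if is_bounded_match(a, parent_of)]


print(select_links(
    "<html><body><p><div><table><tr><td><b><a href='/ok'>ok</a></b></td></tr>"
    "</table></div></p></body></html>"
))  # prints ['/ok']
```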
Regarding "algorithm - Top-down or bottom-up approach for searching for elements in an HTML DOM document?", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/4567066/