python - XPath: Find HTML element by *plain* text

Please note: a more precise version of this question, with an appropriate answer, can be found here.

I want to use the Selenium Python bindings to find elements with a given text on a web page. For example, suppose I have the following HTML:

<html>
    <head>...</head>
    <body>
        <someElement>This can be found</someElement>
        <someOtherElement>This can <em>not</em> be found</someOtherElement>
    </body>
</html>

I need to search by text, and I am able to find <someElement> using the following XPath:

//*[contains(text(), 'This can be found')]

I am looking for a similar XPath that would let me find <someOtherElement> using the plain text "This can not be found". The following does not work:

//*[contains(text(), 'This can not be found')]

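I understand that this is because the nested em element "interrupts" the text flow of "This can not be found". Is it possible with XPath to somehow ignore nesting like this?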

Best answer

You can use //*[contains(., 'This can not be found')].

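The context node . is converted to its string value (the concatenation of all of its descendant text nodes) before being compared with "This can not be found".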
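
To make the difference concrete, here is a minimal sketch in plain Python. It does not run XPath or Selenium; it uses the standard library's xml.etree.ElementTree to imitate what the two expressions look at, on the assumption that the engine follows XPath 1.0 rules, where a node-set passed to contains() is reduced to its first node. The variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# The <someOtherElement> fragment from the question.
elem = ET.fromstring(
    "<someOtherElement>This can <em>not</em> be found</someOtherElement>"
)

# Text before the first child element. Under XPath 1.0 rules this is the
# only piece that contains(text(), ...) ends up comparing against.
first_text = elem.text

# Concatenation of every descendant text node, i.e. the string value of '.'.
string_value = "".join(elem.itertext())

print(repr(first_text))                         # 'This can '
print(repr(string_value))                       # 'This can not be found'
print("This can not be found" in first_text)    # False
print("This can not be found" in string_value)  # True
```

The first text node stops at the <em> tag, which is why the text() version fails, while the string value spans the nested element.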

Be careful though: because you are using //*, it will match ALL enclosing elements that contain this string.

In your example, it would match:

<someOtherElement>
and <body>
and <html>!

You can restrict this by targeting a specific element tag or a specific part of the document (a <table> or <div> with a known id or class).
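
The over-matching and the effect of narrowing by tag can be checked with a small sketch. Again this is plain Python with xml.etree.ElementTree standing in for the XPath engine, not the Selenium API itself; string_value is a hypothetical helper, and the two list comprehensions are only rough equivalents of //*[contains(., ...)] and //someOtherElement[contains(., ...)].

```python
import xml.etree.ElementTree as ET

DOC = """<html>
    <head>...</head>
    <body>
        <someElement>This can be found</someElement>
        <someOtherElement>This can <em>not</em> be found</someOtherElement>
    </body>
</html>"""

NEEDLE = "This can not be found"
root = ET.fromstring(DOC)


def string_value(elem):
    """Hypothetical helper: all descendant text joined, like the XPath string value."""
    return "".join(elem.itertext())


# Rough equivalent of //*[contains(., NEEDLE)]: test every element in document order.
all_matches = [e.tag for e in root.iter() if NEEDLE in string_value(e)]

# Rough equivalent of //someOtherElement[contains(., NEEDLE)]: one tag only.
narrowed = [e.tag for e in root.iter("someOtherElement") if NEEDLE in string_value(e)]

print(all_matches)  # ['html', 'body', 'someOtherElement']
print(narrowed)     # ['someOtherElement']
```

Every ancestor of the target matches because its string value includes the text of all of its descendants; restricting the tag removes those ancestors from consideration.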

Edit, for the OP's question in the comments about how to find the most deeply nested element that matches the text condition:

The accepted answer here suggests //*[count(ancestor::*) = max(//*/count(ancestor::*))] to select the most deeply nested element. I think it is XPath 2.0 only.

Combined with your substring condition, I was able to test it here with this document

<html>
<head>...</head>
<body>
    <someElement>This can be found</someElement>
    <nested>
        <someOtherElement>This can <em>not</em> be found most nested</someOtherElement>
    </nested>
    <someOtherElement>This can <em>not</em> be found</someOtherElement>
</body>
</html>

and this XPath 2.0 expression

//*[contains(., 'This can not be found')]
   [count(ancestor::*) = max(//*/count(./*[contains(., 'This can not be found')]/ancestor::*))]

And it matches the element containing "This can not be found most nested".

There is probably a more elegant way to do this.
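
For readers who cannot rely on XPath 2.0, the same idea can be sketched in plain Python: collect every element whose string value contains the text, record how many ancestors each one has, and keep those at the maximum depth. This mirrors the intent of the expression above rather than translating it literally, and it uses xml.etree.ElementTree instead of Selenium; matches_with_depth is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

DOC = """<html>
<head>...</head>
<body>
    <someElement>This can be found</someElement>
    <nested>
        <someOtherElement>This can <em>not</em> be found most nested</someOtherElement>
    </nested>
    <someOtherElement>This can <em>not</em> be found</someOtherElement>
</body>
</html>"""

NEEDLE = "This can not be found"


def matches_with_depth(elem, depth=0):
    """Yield (depth, element) for each element whose string value contains NEEDLE.

    depth plays the role of count(ancestor::*): the root element has depth 0.
    """
    if NEEDLE in "".join(elem.itertext()):
        yield depth, elem
    for child in elem:
        yield from matches_with_depth(child, depth + 1)


found = list(matches_with_depth(ET.fromstring(DOC)))
max_depth = max(depth for depth, _ in found)

# Keep only the matches that sit at the greatest depth.
deepest = ["".join(e.itertext()) for depth, e in found if depth == max_depth]

print(max_depth)  # 3
print(deepest)    # ['This can not be found most nested']
```

Note that this keeps only the globally deepest matches, so the second <someOtherElement>, which sits one level higher, is dropped just as it is by the XPath 2.0 expression.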

Regarding "python - XPath: Find HTML element by *plain* text", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/18655765/
