xpath - How to parse the following HTML to get all the text before each "br" tag

I have the following HTML code:

    <td class="role" style=""><a href="/wiki/Chairman">Chairman</a> of <a href="/wiki/Microsoft">Microsoft</a><br />
    <a href="/wiki/Chairman">Chairman</a> of <a href="/wiki/Corbis">Corbis</a><br />
    Co-Chair of the <a href="/wiki/Bill_%26_Melinda_Gates_Foundation">Bill &amp; Melinda   Gates Foundation</a><br />
    <a href="/wiki/Creative_Director" title="Creative Director" class="mw-redirect">Director</a> of <a href="/wiki/Berkshire_Hathaway">Berkshire Hathaway</a><br/>
    <a href="/wiki/CEO" class="mw-redirect" title="CEO">CEO</a> of <a  href="/wiki/Cascade_Investment">Cascade Investment</a></td>

Semantically, the td element above contains five lines separated by "<br/>", and I want to get those five lines:

Chairman of Microsoft

Chairman of Corbis

Co-Chair of the Bill & Melinda Gates Foundation

Director of Berkshire Hathaway

CEO of Cascade Investment

Currently, my solution is to first get all the br elements inside this td, like this:

    br_value = td_node.select('.//br')

Then, for each item in br_value, I use the following code to get all the text:

    for br_item in br_value:
        one_item = br_item.select('.//preceding-sibling::*/text()').extract()

In this case, I get lines like these:

Chairman Microsoft

Chairman Corbis

Bill & Melinda Gates Foundation

Director Berkshire Hathaway

CEO Cascade Investment

Compared with the text I want, these lines are missing "of" and some other words.

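This is because "preceding-sibling" only returns sibling tags; it cannot return the text that belongs directly to the parent tag, such as the "of" in this example.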
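To see the effect in isolation, here is a small sketch that uses Python's standard-library xml.etree.ElementTree rather than the Scrapy selector from the question (an assumption, made so the snippet runs without extra packages). The fragment is a shortened, well-formed copy of the first line of the td. Looking only at the child elements, much as `*` does, drops the " of " that sits between the two links, while walking every piece of text keeps it.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the first line of the td from the question.
fragment = (
    '<td class="role"><a href="/wiki/Chairman">Chairman</a> of '
    '<a href="/wiki/Microsoft">Microsoft</a><br /></td>'
)
td = ET.fromstring(fragment)

# Element-only view: the text inside each child element, skipping <br>.
elements_only = [child.text for child in td if child.tag != "br"]

# Every piece of text under td, including text between the elements.
all_text = "".join(td.itertext())

print(elements_only)  # ['Chairman', 'Microsoft']
print(all_text)       # Chairman of Microsoft
```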

Does anyone here know how to extract the complete text of each line separated by the br tags?

Thanks

Best answer

Use this XPath query:

    //div[@id='???']/descendant-or-self::*[not(ancestor-or-self::script or ancestor-or-self::noscript or ancestor-or-self::style)]/text()

That is, to select text only from the current node and all of its descendant nodes, use a query like this: ./descendant-or-self::*/text()

Or shorter (thanks to Empo): .//text()
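Note that .//text() applied to the td returns one flat list of text nodes, so the br boundaries still have to be restored to get five separate lines. One possible way to do that is sketched below. It swaps in the standard-library xml.etree.ElementTree for Scrapy so the snippet is self-contained, and it works here only because this particular fragment happens to be well-formed XML. The helper name `lines_split_by_br` is hypothetical, and whitespace is collapsed as an assumption about the desired output.

```python
import xml.etree.ElementTree as ET

# The td from the question (leading indentation removed).
TD_HTML = """<td class="role" style=""><a href="/wiki/Chairman">Chairman</a> of <a href="/wiki/Microsoft">Microsoft</a><br />
<a href="/wiki/Chairman">Chairman</a> of <a href="/wiki/Corbis">Corbis</a><br />
Co-Chair of the <a href="/wiki/Bill_%26_Melinda_Gates_Foundation">Bill &amp; Melinda   Gates Foundation</a><br />
<a href="/wiki/Creative_Director" title="Creative Director" class="mw-redirect">Director</a> of <a href="/wiki/Berkshire_Hathaway">Berkshire Hathaway</a><br/>
<a href="/wiki/CEO" class="mw-redirect" title="CEO">CEO</a> of <a  href="/wiki/Cascade_Investment">Cascade Investment</a></td>"""


def lines_split_by_br(node):
    """Hypothetical helper: group all text under `node` into lines, breaking at each <br>."""
    lines, current = [], [node.text or ""]
    for child in node:
        if child.tag == "br":
            # A <br> closes the current line and starts a new one.
            lines.append(current)
            current = []
        else:
            # Text inside the child element (and its descendants).
            current.append("".join(child.itertext()))
        # Text that follows the child but belongs to the parent, e.g. " of ".
        current.append(child.tail or "")
    lines.append(current)
    # Collapse runs of whitespace and drop lines that end up empty.
    joined = (" ".join("".join(parts).split()) for parts in lines)
    return [line for line in joined if line]


td = ET.fromstring(TD_HTML)
for line in lines_split_by_br(td):
    print(line)
```

For the td above this prints the five lines from the question, starting with "Chairman of Microsoft" and ending with "CEO of Cascade Investment". Real-world HTML is often not well-formed XML, in which case an HTML-aware parser would be needed instead of ElementTree.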

Regarding "xpath - How to parse the following HTML to get all the text before each "br" tag", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/7289376/
