xml - XPath query to find all untagged text throughout an HTML document

Given the following HTML, is there an XPath query that extracts all tagged and untagged text between two <h2> tags? (I am using the RSelenium package in RStudio.)

<html>
    <h2 id="section1" class="article">Heading 1</h2>
    <h3 id="section1.1" class="article">Subheading 1</h3>
    <p id="para001"  class="article section clear">
           Paragraph text 1.</p> 
    <div id="formula1" class="formula">...<img />...</div>
           Untagged text 1.
    <sub>  Subscripted text. </sub>
           Untagged text 2. 
    <em>   Emphasized text. </em>
           Untagged text 3.
    <span id="bib"> Bibliography text. </span>
           Untagged text 4.
    <p id="para002" class="article section clear">
           Paragraph text 2.</p>
    <h3 id="section1.2" class="article">Subheading 2</h3>
    <p id="para003" class="article section clear">
           Paragraph 3 text.</p>
    <h3 id="section1.3" class="article">Subheading 3</h3>
    <p id="para004" class="article section clear">
           Paragraph 4 text.</p>
    <h2 id="section2" class="article">Heading 2</h2>       
</html>

I am trying to come up with a query that would return:

Paragraph text 1.
Untagged text 1.
Subscripted text.
Untagged text 2. 
Emphasized text.
Untagged text 3.
Bibliography text.
Untagged text 4.
Paragraph text 2.
Paragraph 3 text.
Paragraph 4 text.

What I have tried so far is,

//p[preceding-sibling::h2[@id='section1'] 
    and following-sibling::h2[@id='section2'] 
    and descendant::node()]

which returns,

Paragraph text 1.
Paragraph text 2.
Paragraph 3 text.
Paragraph 4 text.

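I tried the solution from this question, but my problem is a bit more complex. I tried adding following-sibling::text()[1], but it does not pull the untagged text. If there is no good XPath solution, I would welcome alternatives such as CSS selectors.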

Best answer

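First, you do not want to filter on p elements only (that is what the p, the third character of your expression, does); you want every element after section 1 and before section 2. Second, you are looking for all descendants of the elements between those two headings.

So: find all elements that have both a preceding-sibling::h2[@id='section1'] and a following-sibling::h2[@id='section2']: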

//*[preceding-sibling::h2[@id='section1'] and following-sibling::h2[@id='section2']]
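
As a rough illustration of what that predicate selects, here is a minimal sketch in Python using only the standard library. ElementTree's XPath subset has no sibling axes, so the "after section1, before section2" test is written by hand. The function name elements_between, the shortened document, and the trailing para999 paragraph are assumptions made for the example, not part of the original question.

```python
import xml.etree.ElementTree as ET

# Shortened version of the example document. It is well-formed, so an XML
# parser accepts it. The para999 paragraph is added here only to show that
# content after the second heading is excluded.
DOC = """<html>
    <h2 id="section1">Heading 1</h2>
    <h3 id="section1.1">Subheading 1</h3>
    <p id="para001">Paragraph text 1.</p>
    Untagged text 1.
    <sub>Subscripted text.</sub>
    <h2 id="section2">Heading 2</h2>
    <p id="para999">After section 2.</p>
</html>"""


def elements_between(parent, start_id, end_id):
    """Return the child elements of `parent` that come after the child with
    id `start_id` and before the child with id `end_id`."""
    children = list(parent)
    ids = [child.get("id") for child in children]
    start = ids.index(start_id)  # raises ValueError if the id is missing
    end = ids.index(end_id)
    return children[start + 1:end]


root = ET.fromstring(DOC)
selected = elements_between(root, "section1", "section2")
print([el.tag for el in selected])  # ['h3', 'p', 'sub']
```

Only elements appear in the result. The line "Untagged text 1." is not an element, so an element-selecting step never returns it by itself.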

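Then select every text() node beneath any of those elements:

//*[preceding-sibling::h2[@id='section1'] and following-sibling::h2[@id='section2']]//text()

Note that this reaches only text nested inside those elements. The untagged text in the example consists of text nodes that are siblings of the headings, not descendants of the selected elements, so it needs a second branch joined with the union operator:

//*[preceding-sibling::h2[@id='section1'] and following-sibling::h2[@id='section2']]//text()
  | //text()[preceding-sibling::h2[@id='section1'] and following-sibling::h2[@id='section2']]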
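
Since the question also welcomes alternatives to XPath, the following is a minimal sketch of the same idea in Python's standard library, assuming the markup is well-formed XML as in the example. It shows the difference between text nested inside an element and text sitting between elements: ElementTree stores the latter in the .tail of the preceding element. The names text_between and skip_tags are hypothetical, and skipping h3 and div is an assumption based on the wanted output, which lists neither the subheadings nor the formula.

```python
import xml.etree.ElementTree as ET

# The first part of the example document from the question.
DOC = """<html>
    <h2 id="section1" class="article">Heading 1</h2>
    <h3 id="section1.1" class="article">Subheading 1</h3>
    <p id="para001" class="article section clear">
           Paragraph text 1.</p>
    <div id="formula1" class="formula">...<img />...</div>
           Untagged text 1.
    <sub>  Subscripted text. </sub>
           Untagged text 2.
    <em>   Emphasized text. </em>
           Untagged text 3.
    <span id="bib"> Bibliography text. </span>
           Untagged text 4.
    <p id="para002" class="article section clear">
           Paragraph text 2.</p>
    <h2 id="section2" class="article">Heading 2</h2>
</html>"""


def text_between(parent, start_id, end_id, skip_tags=()):
    """Collect tagged and untagged text located between two child elements
    of `parent`, identified by their id attributes."""
    pieces = []
    inside = False
    for child in parent:
        if child.get("id") == end_id:
            break
        if inside and child.tag not in skip_tags:
            # Text nested anywhere inside the element.
            pieces.extend(child.itertext())
        if child.get("id") == start_id:
            inside = True
        if inside and child.tail:
            # Untagged text that follows the element. It is kept even when
            # the element itself is skipped, because it lies outside it.
            pieces.append(child.tail)
    # Drop whitespace-only pieces and collapse runs of whitespace.
    return [" ".join(p.split()) for p in pieces if p.strip()]


root = ET.fromstring(DOC)
for line in text_between(root, "section1", "section2", skip_tags={"h3", "div"}):
    print(line)
```

With this input the loop prints the paragraph, subscript, emphasis, and bibliography text interleaved with the four untagged lines, in document order. Real-world HTML is often not well-formed, in which case an HTML-aware parser would be needed before a walk like this can run.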

Regarding xml - XPath query to find all untagged text throughout an HTML document, a similar question was found on Stack Overflow: https://stackoverflow.com/questions/35205168/
