php - How do I get everything between two HTML tags? (With XPath?)

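Edit: I've added a solution that works in this case.

I want to extract a table from a page, and I'd (probably) like to do it with DOMDocument and XPath. If you have a better idea, though, let me know.

My first attempt was this (obviously wrong, because it stops at the first closing table tag):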

<?php 
    $tableStart = strpos($source, '<table class="schedule"');
    $tableEnd   = strpos($source, '</table>', $tableStart);
    $rawTable   = substr($source, $tableStart, ($tableEnd - $tableStart));
?>

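I thought this could be solved with DOMDocument and/or XPath.

In the end I want everything between the tags (in this case the <table> tags) and the tags themselves. So all of the HTML, not just the values (e.g. not just 'Value' but '<td>Value</td>'). And there is one catch...

The table contains other tables. So if you just search for the end of the table (the '</table>' tag), you will probably get the wrong tag.
The opening tag carries a class you can use to identify it (classname = 'schedule').

Is this possible?

This is the (simplified) source I want to extract from another website (I also want the HTML tags, not just the values, so the whole table with the class 'schedule'):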

<table class="schedule">
    <table class="annoying nested table">
        Lots of table rows, etc.
    </table> <-- The problematic tag...
    <table class="annoying nested table">
        Lots of table rows, etc.
    </table> <-- The problematic tag...
    <table class="annoying nested table">
        Lots of table rows, etc.
    </table> <-- a problematic tag...

    This could even be variable content. =O =S

</table>

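Best answer

First, note that XPath is based on the XML Infoset, an XML model in which there are no "start tags" and "end tags", only nodes.

So an XPath expression should not be expected to select "tags"; it selects nodes.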
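
Because XPath selects nodes rather than tags, the nested closing tags from the question stop being a problem once the page is parsed: the outer table is a single element node, and serializing it brings its whole subtree along. Below is a minimal PHP sketch of that idea. The sample markup and the helper name `extractOuterHtml` are made up for illustration, and it assumes the page source is already available as a string.

```php
<?php
// Hypothetical helper: returns the outer HTML of the first node matching
// $query, or null when nothing matches.
function extractOuterHtml(string $source, string $query): ?string
{
    $doc = new DOMDocument();
    // Real-world pages are rarely valid; collect parser warnings quietly.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($source);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $nodes = $xpath->query($query);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }
    // saveHTML() with a node argument serializes that node and its descendants.
    $html = $doc->saveHTML($nodes->item(0));
    return $html === false ? null : $html;
}

// Made-up sample: a "schedule" table that contains another table.
$source = '<html><body><p>intro</p>'
    . '<table class="schedule"><tr><td>'
    . '<table class="nested"><tr><td>Value</td></tr></table>'
    . '</td></tr></table>'
    . '<p>outro</p></body></html>';

$rawTable = extractOuterHtml($source, '//table[@class = "schedule"]');
echo $rawTable, PHP_EOL;
```

The result contains both the outer and the nested closing table tags and nothing after the outer table. Note that `@class = "schedule"` only matches when the attribute is exactly that string; a table with several class names would need a different predicate.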

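With that in mind, I interpret the question as:

I want to get the set of all elements between a given "start"
element and a given "end" element, including the start and end elements themselves.

In XPath 2.0 this is conveniently done with the standard intersect operator.

In XPath 1.0 (which I assume you are using) it is not as easy. The solution is to use the Kayessian formula (after @Michael Kay) for node-set intersection:

The intersection of two node-sets $ns1 and $ns2 is selected by evaluating the following XPath expression: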

$ns1[count(.|$ns2) = count($ns2)]
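
The formula works because a node already in $ns2 does not enlarge $ns2 when united with it, so the two counts stay equal only for nodes present in both sets. PHP's DOMXPath evaluates XPath 1.0 and, as far as I know, has no way to bind variables such as $ns1, so in practice the two expressions get pasted into the formula as text. Here is a small sketch with a made-up document and a hypothetical helper `intersectXPath`; both expressions are assumed to be absolute paths, since the second one is re-evaluated inside the predicate.

```php
<?php
// Hypothetical helper: pastes two absolute XPath 1.0 node-set expressions
// into the Kayessian intersection formula.
function intersectXPath(string $ns1, string $ns2): string
{
    return sprintf('(%1$s)[count(. | (%2$s)) = count(%2$s)]', $ns1, $ns2);
}

// Made-up document with five items numbered 1 to 5.
$doc = new DOMDocument();
$doc->loadXML(
    '<list><item n="1"/><item n="2"/><item n="3"/><item n="4"/><item n="5"/></list>'
);
$xpath = new DOMXPath($doc);

// Items with n > 1, intersected with items with n < 4.
$expr = intersectXPath('//item[@n > 1]', '//item[@n < 4]');

$found = [];
foreach ($xpath->query($expr) as $item) {
    $found[] = $item->getAttribute('n');
}
echo implode(',', $found), PHP_EOL; // prints "2,3"
```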

Suppose we have the following XML document (you never provided one):

<html>
    <body>
        <table>
            <tr valign="top">
                <td>
                    <table class="target">
                        <tr>
                            <td>Other Node</td>
                            <td>Other Node</td>
                            <td>Starting Node</td>
                            <td>Inner Node</td>
                            <td>Inner Node</td>
                            <td>Inner Node</td>
                            <td>Ending Node</td>
                            <td>Other Node</td>
                            <td>Other Node</td>
                            <td>Other Node</td>
                        </tr>
                    </table>
                </td>
            </tr>
        </table>
    </body>
</html>

The start element is selected by:

//table[@class = 'target']
         //td[. = 'Starting Node']

The end element is selected by:

//table[@class = 'target']
         //td[. = 'Ending Node']

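To get all the wanted nodes, we intersect the following two sets:

The set consisting of the start element and all following elements (we name this $vFollowing).
The set consisting of the end element and all preceding elements (we name this $vPreceding).

These are selected by the following XPath expressions, respectively:

$vFollowing:

$vStartNode | $vStartNode/following::*

$vPreceding:

$vEndNode | $vEndNode/preceding::*

Now we can simply apply the Kayessian formula to the node-sets $vFollowing and $vPreceding: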

       $vFollowing
          [count(.|$vPreceding)
          =
           count($vPreceding)
          ]

All that remains is to replace every variable with its respective expression.
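
In PHP that substitution can be done with plain string interpolation before handing the result to DOMXPath. The following is only a sketch: it uses a shortened copy of the sample document (one "Other Node" on each side) and PHP variable names chosen for illustration, while the start and end expressions are the ones given above.

```php
<?php
// Shortened copy of the sample document.
$xml = <<<'XML'
<html><body><table><tr valign="top"><td>
  <table class="target"><tr>
    <td>Other Node</td>
    <td>Starting Node</td>
    <td>Inner Node</td>
    <td>Inner Node</td>
    <td>Inner Node</td>
    <td>Ending Node</td>
    <td>Other Node</td>
  </tr></table>
</td></tr></table></body></html>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);
$xpath = new DOMXPath($doc);

// Start and end elements, as selected in the answer.
$start = "//table[@class = 'target']//td[. = 'Starting Node']";
$end   = "//table[@class = 'target']//td[. = 'Ending Node']";

// $vFollowing and $vPreceding with the variables written out.
$following = "($start | $start/following::*)";
$preceding = "($end | $end/preceding::*)";

// Kayessian intersection of the two sets.
$expr = "{$following}[count(. | $preceding) = count($preceding)]";

$result = [];
foreach ($xpath->query($expr) as $node) {
    $result[] = $doc->saveXML($node); // outer XML of each selected element
}
echo implode(PHP_EOL, $result), PHP_EOL;
```

This prints the starting cell, the three inner cells, and the ending cell, one per line. The fully expanded expression repeats the start and end paths several times, so on large documents it may be worth selecting the two boundary nodes first and walking their siblings in PHP instead.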

XSLT-based verification:

<xsl:stylesheet version="1.0"
 xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
 <xsl:output omit-xml-declaration="yes" indent="yes"/>
 <xsl:strip-space elements="*"/>

 <xsl:variable name="vStartNode" select=
 "//table[@class = 'target']//td[. = 'Starting Node']"/>

 <xsl:variable name="vEndNode" select=
 "//table[@class = 'target']//td[. = 'Ending Node']"/>

 <xsl:variable name="vFollowing" select=
 "$vStartNode | $vStartNode/following::*"/>

 <xsl:variable name="vPreceding" select=
 "$vEndNode | $vEndNode/preceding::*"/>

 <xsl:template match="/">
      <xsl:copy-of select=
          "$vFollowing
              [count(.|$vPreceding)
              =
               count($vPreceding)
              ]"/>
 </xsl:template>
</xsl:stylesheet>

When applied to the XML document above, the XPath expression is evaluated and the wanted, correct node-set is output:

<td>Starting Node</td>
<td>Inner Node</td>
<td>Inner Node</td>
<td>Inner Node</td>
<td>Ending Node</td>

This is about php - How do I get everything between two HTML tags? (With XPath?); a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/8950582/
