html - Ignoring tags in XPath with Html Agility Pack

I am using the following code to parse HTML tables from an HTML file into a DataSet:

Public Function GetDataSet(html As String) As DataSet
Dim ds As DataSet = New DataSet
Dim htmldoc As New HtmlAgilityPack.HtmlDocument
htmldoc.LoadHtml(html)
Dim tables = htmldoc.DocumentNode.SelectNodes("//table/tr") _
                                 .GroupBy(Function(x) x.ParentNode)
For i As Integer = 0 To tables.Count - 1
    Dim rows = tables(i).ToList()
    ds.Tables.Add(String.Format("Table {0}", i))
    Dim headers = rows(0).Elements("th").Select(Function(x) x.InnerText.Trim).ToList()
    For Each Hr In headers
        ds.Tables(i).Columns.Add(Hr)
    Next
    For j As Integer = 1 To rows.Count - 1
        Dim row = rows(j)
        Dim dr = row.Elements("td").Select(Function(x) x.InnerText.Trim).ToArray()
        ds.Tables(i).Rows.Add(dr)
    Next
Next
Return ds
End Function

This works well. However, if another tag sits between the <table> tag and its <tr> tags, that table is not parsed.

A simple example:

<html>
<head><title>Test</title></head>
<body>
    <div>Contents:</div>
    <table>
        <tr>
            <th>Column1</th> <th>Column2</th>
        </tr>
        <tr>
            <td>1</td> <td>11</td>
        </tr>
        <tr>
            <td>2</td> <td>22</td>
        </tr>
    </table>
    <table>
       <tbody>
        <tr>
            <th>Column1</th> <th>Column2</th> <th>Column3</th>
        </tr>
        <tr>
            <td>a</td> <td>aa</td> <td>aaa</td>
        </tr>
        <tr>
            <td>b</td> <td>bb</td> <td>bbb</td>
        </tr>
       </tbody>
    </table>
    <table>
       <div>
        <tr>
            <th>Column1</th> <th>Column2</th> <th>Column3</th>
        </tr>
        <tr>
            <td>a</td> <td>aa</td> <td>aaa</td>
        </tr>
        <tr>
            <td>b</td> <td>bb</td> <td>bbb</td>
        </tr>
       </div>
    </table>
</body>
</html>

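In this example, only the first table is parsed.

My question is how to ignore any tags between the <table> tag and the <tr> tags in the following lines of code: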

Dim tables = htmldoc.DocumentNode.SelectNodes("//table/tr") _
                             .GroupBy(Function(x) x.ParentNode)

so that all of the tables are parsed.

Best answer

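You can use // to select from all descendants rather than only direct children:

Dim rows = htmldoc.DocumentNode.SelectNodes("//table//tr")

Given your requirement, it is also better to group the results by the nearest <table> ancestor instead of the parent node. The parent of a <tr> may be a <tbody> or <thead>, and the rows need to be grouped per table: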

Dim tables = htmldoc.DocumentNode.SelectNodes("//table//tr") _
                    .GroupBy(Function(x) x.Ancestors("table").First())
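For completeness, here is a minimal sketch of how the question's function might look with this change applied. The module name HtmlTableParser and the function name GetDataSetFromAllTables are hypothetical, and the null check is an addition that neither the question nor the answer contains. It assumes, as the question's code does, that the first row of each table holds the <th> headers, that header names are unique within a table, and that no data row has more cells than there are headers.

```vb
Imports System.Data
Imports System.Linq
Imports HtmlAgilityPack

Public Module HtmlTableParser
    ' Hypothetical helper: parses every <table> in the HTML into its own DataTable.
    Public Function GetDataSetFromAllTables(html As String) As DataSet
        Dim ds As New DataSet()
        Dim htmldoc As New HtmlDocument()
        htmldoc.LoadHtml(html)

        ' //table//tr matches rows at any depth below a table,
        ' so wrappers such as <tbody> or <div> are skipped over.
        Dim allRows = htmldoc.DocumentNode.SelectNodes("//table//tr")

        ' SelectNodes returns Nothing when no node matches (added safeguard).
        If allRows Is Nothing Then Return ds

        ' Group each row under its nearest enclosing <table>.
        Dim tables = allRows.GroupBy(Function(x) x.Ancestors("table").First()).ToList()

        For i As Integer = 0 To tables.Count - 1
            Dim rows = tables(i).ToList()
            Dim table = ds.Tables.Add(String.Format("Table {0}", i))

            ' Assumption: the first row carries the column headers.
            For Each header In rows(0).Elements("th")
                table.Columns.Add(header.InnerText.Trim())
            Next

            ' Every remaining row becomes one data row.
            For j As Integer = 1 To rows.Count - 1
                Dim cells = rows(j).Elements("td").Select(Function(x) x.InnerText.Trim()).ToArray()
                table.Rows.Add(cells)
            Next
        Next

        Return ds
    End Function
End Module
```

Because Ancestors("table") walks upward from the row, First() picks the closest table, so rows inside a nested table should be grouped with the inner table rather than the outer one.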

This question about ignoring tags in XPath with Html Agility Pack comes from a similar question found on Stack Overflow: https://stackoverflow.com/questions/40349532/
