excel - Scraping through changing class names

Tags: excel vba web-scraping

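I am trying to extract the name, address, role, status, appointment date, and resignation date (if any) from a web page; a sample of the markup is below.
The problem is that the number of directors can differ from company to company, and I am not sure how to determine the total number of directors (class appointment-1) = x so that I can loop over them.
HTML code: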

<div class="appointments-list">
    <div class="appointment-1">
        <h2 class="heading-medium">
            <span id="officer-name-1">
                <a href="/officers/Oo16GI3lS3HEgrIR-kCpmLYbDWw/appointments"  onclick="javascript:_paq.push(['trackGoal', 5]);">BUCKSEY, Nicholas</a>
            </span>
        </h2>
        <dl>
            <dt id="officer-address-field-1">Correspondence address</dt>
            <dd class="data" id="officer-address-value-1">
1 St James&#39;s Square, London, SW1Y 4PD                    </dd>
        </dl>
        <div class="grid-row">
            <dl class="column-quarter">
                <dt>Role
                       <span id="officer-status-tag-1" class="status-tag font-xsmall">Active</span>
                </dt>
                <dd id="officer-role-1" class="data">
                    Secretary
                </dd>
            </dl>
            <dl class="column-quarter">
                <dt>Appointed on</dt>
                <dd id="officer-appointed-on-1" class="data">
                    1 June 2020
                </dd>
            </dl>
        </div> 
        <div class="grid-row"></div> 
        <div class="grid-row"></div> 
        <div class="grid-row"></div> 
    </div>
    <div class="appointment-2">
        <h2 class="heading-medium heading-with-border">
            <span id="officer-name-2">
                <a href="/officers/IND_i3_G7Gqq3ZzC3P0rXYbUcNU/appointments"  onclick="javascript:_paq.push(['trackGoal', 5]);">MATHEWS, Benedict John Spurway</a>
            </span>
        </h2>

    <dl>
        <dt id="officer-address-field-2">Correspondence address</dt>
        <dd class="data" id="officer-address-value-2">
1 St James&#39;s Square, London, SW1Y 4PD                    </dd>
    </dl>

    <div class="grid-row">
        <dl class="column-quarter">
            <dt>Role
                   <span id="officer-status-tag-2" class="status-tag font-xsmall">Active</span>
            </dt>
            <dd id="officer-role-2" class="data">
                Secretary
            </dd>
        </dl>
        <dl class="column-quarter">
            <dt>Appointed on</dt>
            <dd id="officer-appointed-on-2" class="data">
                7 May 2019
            </dd>
        </dl>
    </div> 

    <div class="grid-row"></div> 
    <div class="grid-row"></div> 
    <div class="grid-row"></div> 
</div>
VBA code: I am trying to use querySelectorAll but cannot "identify" the correct class ID.
Sub ChangeTab()
    Set ie = CreateObject("InternetExplorer.Application")
    ie.Visible = True
    ie.navigate "https://find-and-update.company-information.service.gov.uk/company/00102498/officers"

    Do While ie.readyState <> 4: DoEvents: Loop
    
    'Application.Wait (Now + TimeValue("0:00:02"))
    ' Dim i As Long, secNumberNodeList As Object, secNumberNode As Object
 
    Set secNumberNodeList = ie.Document.querySelectorAll("appointments-list")
 
    For Each sc In secNumberNodeList 
        Debug.Print sc.getElementById("officer-name-1")
        Debug.Print sc.getElementById("officer-address-value-1")
        Debug.Print sc.getElementById("officer-status-tag-1")
        Debug.Print sc.getElementById("officer-appointed-on-1")
        Debug.Print sc.getElementById("officer-appointed-on-1")
        Debug.Print sc.getElementById("officer-resigned-on-16")
    Next
End Sub

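Best answer

This is one reliable way to do it. I used XMLHttpRequest instead of IE. I have tried to show how to use a loop to access the content of every container. To parse the other fields you are interested in, define them inside the loop in the same way.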

Option Explicit
Sub GetInformation()
    Const URL = "https://find-and-update.company-information.service.gov.uk/company/00102498/officers"
    Dim Http As Object, Html As HTMLDocument, I&
    Dim HtmlDoc As HTMLDocument, sName$, sAddress$

    Set Html = New HTMLDocument
    Set HtmlDoc = New HTMLDocument
    Set Http = CreateObject("MSXML2.XMLHTTP")

    ' Fetch the page synchronously and load the markup into an in-memory document
    With Http
        .Open "GET", URL, False
        .send
        Html.body.innerHTML = .responseText
    End With

    ' Select every direct child of .appointments-list whose class starts with "appointment-"
    With Html.querySelectorAll(".appointments-list > [class^='appointment-']")
        For I = 0 To .Length - 1
            ' Copy the current container into its own document so the selectors below stay scoped to it
            HtmlDoc.body.innerHTML = .Item(I).outerHTML
            sName = HtmlDoc.querySelector("h2 > span > a").innerText
            sAddress = HtmlDoc.querySelector(".data[id^='officer-address-value-']").innerText
            Debug.Print sName, sAddress
        Next I
    End With
End Sub
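
A note on the attempt in the question: querySelectorAll("appointments-list") has no leading dot, so the string is read as a tag name rather than a class and matches nothing. In addition, getElementById is a method of the document, not of an individual element. The counting step on its own can be sketched as follows, assuming the Microsoft HTML Object Library reference is set. CountAppointments is a hypothetical helper name and is not part of the original answer.

```vba
Option Explicit

' Hypothetical helper: returns the number of officer containers in a loaded document.
' Assumes each container is a direct child of .appointments-list and that its
' class attribute starts with "appointment-", as in the markup from the question.
Function CountAppointments(ByVal Doc As HTMLDocument) As Long
    ' ".appointments-list" (with the dot) is a class selector;
    ' [class^='appointment-'] matches appointment-1, appointment-2, and so on.
    CountAppointments = Doc.querySelectorAll(".appointments-list > [class^='appointment-']").Length
End Function
```

It could be called as Debug.Print CountAppointments(Html) with the document built from the XMLHTTP response, or with ie.Document from the question's code, provided the page is rendered in a document mode that supports querySelectorAll.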
References to add in order to run the GetInformation script above:
1. Microsoft XML, v6.0
2. Microsoft HTML Object Library
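
The remaining fields from the question can be read with the same looping technique. The following is only a sketch under stated assumptions: the id prefixes officer-role-, officer-status-tag- and officer-appointed-on- come from the markup in the question, while officer-resigned-on- is assumed from the id used in the question's VBA and may be absent for active officers. NodeText and GetOfficerDetails are hypothetical names introduced for illustration.

```vba
Option Explicit

' Hypothetical helper: text of the first element matching Selector,
' or an empty string when nothing matches (for example, no resignation date).
Function NodeText(ByVal Doc As HTMLDocument, ByVal Selector As String) As String
    Dim Node As Object
    Set Node = Doc.querySelector(Selector)
    If Not Node Is Nothing Then NodeText = Trim$(Node.innerText)
End Function

Sub GetOfficerDetails()
    Const URL = "https://find-and-update.company-information.service.gov.uk/company/00102498/officers"
    Dim Http As Object, Html As HTMLDocument, Card As HTMLDocument, I As Long

    Set Html = New HTMLDocument
    Set Card = New HTMLDocument
    Set Http = CreateObject("MSXML2.XMLHTTP")

    ' Load the page markup into an in-memory document
    Http.Open "GET", URL, False
    Http.send
    Html.body.innerHTML = Http.responseText

    With Html.querySelectorAll(".appointments-list > [class^='appointment-']")
        For I = 0 To .Length - 1
            ' Scope the selectors to one officer by copying the container into Card
            Card.body.innerHTML = .Item(I).outerHTML
            Debug.Print NodeText(Card, "h2 > span > a"), _
                        NodeText(Card, "[id^='officer-address-value-']"), _
                        NodeText(Card, "[id^='officer-role-']"), _
                        NodeText(Card, "[id^='officer-status-tag-']"), _
                        NodeText(Card, "[id^='officer-appointed-on-']"), _
                        NodeText(Card, "[id^='officer-resigned-on-']")
        Next I
    End With
End Sub
```

Routing every lookup through NodeText keeps the loop from raising an error when an optional field such as the resignation date is missing, because querySelector returns Nothing in that case.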

Regarding excel - Scraping through changing class names, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/66834718/