html - Scraping innerHTML from a website using VBA

Tags: html vba excel web-scraping

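I'm trying to declare an array of nodes (that part is not the problem) and then scrape the innerHTML of two child nodes inside each element of the array. Taking SE as an example (using the IE object approach), suppose I want to scrape the title and excerpt of each question on the home page. There is an array of nodes (class name: "question-summary").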

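Each of those has two child nodes: the title (class name: "question-hyperlink") and the excerpt (class name: "excerpt"). The code I'm using is as follows: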

Sub Scraper()
Dim ie As Object
Dim doc As Object, oQuestionShells As Object, oQuestionTitle As Object, oQuestion As Object, oElement As Object
Dim QuestionShell As String, QuestionTitle As String, Question As String, sURL As String

Set ie = CreateObject("internetexplorer.application")
sURL = "https://stackoverflow.com/questions/tagged/excel-formula"

QuestionShell = "question-summary"
QuestionTitle = "question-hyperlink"
Question = "excerpt"

With ie
    .Visible = False
    .Navigate sURL
End With

Set doc = ie.Document 'Stepping through so doc is getting assigned (READY_STATE = 4)

Set oQuestionShells = doc.getElementsByClassName(QuestionShell)

For Each oElement In oQuestionShells
    Set oQuestionTitle = oElement.getElementByClassName(QuestionTitle) 'Assigning this object causes an "Object doesn't support this property or method"
    Set oQuestion = oElement.getElementByClassName(Question) 'Assigning this object causes an "Object doesn't support this property or method"
    Debug.Print oQuestionTitle.innerHTML
    Debug.Print oQuestion.innerHTML
Next

End Sub

Best answer

getElementByClassName is not a method.

You can only use getElementsByClassName (note the plural in the method name), which returns an IHTMLElementCollection.

Using Object in place of IHTMLElementCollection is fine, but you still have to access a specific element of the collection by supplying an index.

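Assume that each oElement (that is, each question-summary node) contains exactly one element with the class question-hyperlink and one with the class excerpt. Then you can simply call getElementsByClassName and append (0) to take the first element of the returned collection.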

So the correction to your code is:

Set oQuestionTitle = oElement.getElementsByClassName(QuestionTitle)(0)
Set oQuestion = oElement.getElementsByClassName(Question)(0)

Full working code (with a few updates, namely using Option Explicit and waiting for the page to load):

Option Explicit

Sub Scraper()

    Dim ie As Object
    Dim doc As Object, oQuestionShells As Object, oQuestionTitle As Object, oQuestion As Object, oElement As Object
    Dim QuestionShell As String, QuestionTitle As String, Question As String, sURL As String

    Set ie = CreateObject("internetexplorer.application")
    sURL = "https://stackoverflow.com/questions/tagged/excel-formula"

    QuestionShell = "question-summary"
    QuestionTitle = "question-hyperlink"
    Question = "excerpt"

    With ie
        .Visible = True
        .Navigate sURL
        Do
            DoEvents
        Loop While .ReadyState < 4 Or .Busy
    End With

    Set doc = ie.Document

    Set oQuestionShells = doc.getElementsByClassName(QuestionShell)

    For Each oElement In oQuestionShells
        'Debug.Print TypeName(oElement)

        Set oQuestionTitle = oElement.getElementsByClassName(QuestionTitle)(0)
        Set oQuestion = oElement.getElementsByClassName(Question)(0)

        Debug.Print oQuestionTitle.innerHTML
        Debug.Print oQuestion.innerHTML
    Next

    ie.Quit

End Sub
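
The (0) index relies on the assumption that every question-summary node really contains both child elements. As a minimal sketch of a more defensive variant, the helper below checks the length of the returned collection before indexing it. The function name FirstInnerHtmlByClass and its parameters are hypothetical and not part of the original answer; it assumes the same late-bound IE document objects used above.

```vba
' Hypothetical helper (not from the original answer): returns the innerHTML of
' the first descendant of parentNode that has the given class name, or an
' empty string when no such descendant exists.
Function FirstInnerHtmlByClass(ByVal parentNode As Object, ByVal className As String) As String
    Dim matches As Object

    ' getElementsByClassName always returns a collection, possibly empty
    Set matches = parentNode.getElementsByClassName(className)

    If matches.Length > 0 Then
        ' Safe to index: at least one matching element was found
        FirstInnerHtmlByClass = matches(0).innerHTML
    Else
        ' No match: return an empty string instead of failing on a missing element
        FirstInnerHtmlByClass = vbNullString
    End If
End Function
```

Inside the For Each loop, the two Set statements and Debug.Print lines could then be replaced with Debug.Print FirstInnerHtmlByClass(oElement, QuestionTitle) and Debug.Print FirstInnerHtmlByClass(oElement, Question), so a summary that lacks one of the children prints an empty line rather than stopping the macro.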

Regarding "html - Scraping innerHTML from a website using VBA", a similar question was found on Stack Overflow: https://stackoverflow.com/questions/44238865/
