html - 使用 VBA 从网站提取数据时遇到问题

标签 html excel vba web-scraping

我需要从以下网站提取 COCOA London Dec20 和 Mar21 收盘价 https://www.mrci.com/ohlc/ohlc-all.php Screenshot for the website

我为此编写了以下代码,但它抛出错误,请帮助:

Sub extract()

Dim appIE As Object

Set appIE = CreateObject("internetexplorer.application")

With appIE

    .Navigate "https://www.mrci.com/ohlc/ohlc-all.php"

    .Visible = False

End With

Do While appIE.Busy

    DoEvents

Loop

Set allRowOfData = appIE.document.getElementsByClassName("strat").getElementsByTagName("tbody")(183)

Dim myValue As String: myValue = allRowOfData.Cells(5).innerHTML

appIE.Quit

Set appIE = Nothing

lastrow = Sheets("Sheet2").Cells(Rows.Count, "A").End(xlUp).Row + 1

Range("A" & lastrow).Value = myValue

End Sub
  • 代码抛出以下错误:

Run-Time Error - 438 Object doesn't support the property or method.

最佳答案

您的错误是因为您正在集合上调用文档/节点方法。

Set allRowOfData = appIE.document.getElementsByClassName("strat").getElementsByTagName("tbody")(183)

getElementsByClassName("strat")返回一个集合,您需要对其进行索引,然后使用方法 getElementsByTagName .

例如

Set allRowOfData = appIE.document.getElementsByClassName("strat")(0).getElementsByTagName("tbody")(183)

总结方法:

由于不同的数据 futures排列在一长串表行(tr)节点中,需要确定trs的正确 block 。由领先future header ( th ),检查后续兄弟 trs为右mmmyy然后从该行中提取右列 ( td ) 值。一个人需要在下一个开始时停止future block ,或同级 block trs 的末尾;以先到者为准。


tl;博士;

HTML 并不适合快速识别正确的 trs ;使用的索引越多,程序就越脆弱。以下确实有假设,但更稳健。

我使用 XMLHTTP request因为使用浏览器的费用是不必要的。您有一个感兴趣的 future 变量和所有标题 ths收集者:className使用 css 类选择器,进入 nodeList循环直到目标 future被发现。这会将您置于感兴趣的第一行的开头。各种mmmyy然后在后续行中。我通过使用辅助函数检查表标题并与定义的目标标题名称进行比较来确定适当的列索引。该函数返回找到 header 的相应索引,如果未找到则返回 -1。

现在,我有一本字典,其中包含 mmmyy兴趣时期。我循环所有tds直到我进入下一部分(即下一个 tr ,其标题( th )作为其 FirstChild )。我检查每个 FirstChild连续,如果 mmmyy找到的值在字典中我用适当的列值更新字典。

在循环中,我的工作级别低于 HTMLDocument所以,为了利用querySelectorAll ,我转储当前nextNode.NextSibling.OuterHTML成为代理人 HTMLDocument多变的;然后我可以访问querySelectorAll再次,可以选择合适的td按索引。我需要将传输的html包装在<TABLE><TD></TABLE>中让 HTML 解析器不提示并获得正确的标签 # td元素。非常肯定的是,如果我有时间重新审视这一点,我可能会加强这一点。

最后,我将字典值写到工作表中。


VBA:

Option Explicit

Public Sub GetCocoaClosePrices()
    Dim html As MSHTML.HTMLDocument, targetPeriods As Object, targetFuture As String, targetColumnName As String

    targetFuture = "London Cocoa(LCE)"
    targetColumnName = "Close"
    Set targetPeriods = CreateObject("Scripting.Dictionary")
    Set html = New MSHTML.HTMLDocument

    targetPeriods.Add "Dec20", "Not found"
    targetPeriods.Add "Mar21", "Not found"

    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://www.mrci.com/ohlc/ohlc-all.php", False
        .Send
        html.body.innerHTML = .responseText
    End With

    Dim tableHeaders As Object, targetColumnNumber As Long

    Set tableHeaders = html.querySelectorAll("tr ~ tr ~ tr .colhead")
    targetColumnNumber = GetTargetColumnNumber(tableHeaders, targetColumnName)

    If targetColumnNumber = -1 Then Exit Sub

    Set targetPeriods = GetUpdatedDictionary(targetPeriods, html, targetFuture, targetColumnNumber)

    With ThisWorkbook.Worksheets(1)
        .Cells(1, 1).Resize(1, targetPeriods.Count) = targetPeriods.keys
        .Cells(2, 1).Resize(1, targetPeriods.Count) = targetPeriods.items
    End With
End Sub

Public Function GetUpdatedDictionary(ByRef targetPeriods As Object, ByVal html As HTMLDocument, ByVal targetFuture As String, ByVal targetColumnNumber As Long) As Object
    Dim html2 As MSHTML.HTMLDocument, firstChild As Object, i As Long
    Dim nextNode As Object, headerNodes As Object

    Set headerNodes = html.querySelectorAll(".note1")
    Set html2 = New MSHTML.HTMLDocument

    For i = 0 To headerNodes.Length - 1
        If headerNodes.Item(i).innerText = targetFuture Then 'find the right target future header
            Set nextNode = headerNodes.Item(i).ParentNode 'move up to the parent tr node

            Do 'walk the adjacent tr nodes
                Set nextNode = nextNode.NextSibling
                Set firstChild = nextNode.firstChild

                If nextNode Is Nothing Then
                    Set GetUpdatedDictionary = targetPeriods
                    Exit Function 'exit if no next section
                End If

                html2.body.innerHTML = "<TABLE><TD>" & nextNode.outerHTML & "</TABLE>"

                If targetPeriods.Exists(firstChild.innerText) Then
                    targetPeriods(firstChild.innerText) = html2.querySelectorAll("td").Item(targetColumnNumber).innerText
                End If
            Loop While firstChild.tagName <> "TH" 'stop at next section i present

        End If
    Next
    Set GetUpdatedDictionary = targetPeriods
End Function

Public Function GetTargetColumnNumber(ByVal nodeList As Object, ByVal targetColumnName As String) As Long
    Dim i As Long

    For i = 0 To nodeList.Length - 1
        If nodeList.Item(i).innerText = targetColumnName Then
            GetTargetColumnNumber = i + 1 'to account for th
            Exit Function
        End If
    Next
    GetTargetColumnNumber = -1
End Function

阅读:

  1. css selectors
  2. document.querySelectorAll
  3. Node.nextSibling
  4. Node.parentNode

引用文献(VBE>工具>引用文献):

  1. Microsoft HTML 对象库

关于html - 使用 VBA 从网站提取数据时遇到问题,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/59109387/

相关文章:

javascript - 使用剪贴板 API 时 chrome 中的非法构造函数?

excel - 如何在范围内找到特定颜色,然后如果单元格为 = "",则将值设为 0 并在单元格中保持相同的颜色

javascript - Node : smart JSON conversion to Excel file

Excel 发现无法读取的内容 - 数据验证

xml - 从 powerpoint 幻灯片中检索 Excel 图表数据(以编程方式)

html - 将文本换行至浏览器宽度或指定宽度(以较小者为准)

javascript - 导致 'Maximum call stack size exceeded' 的表单提交

html - 正确使用gumbo-parser来迭代并找到我需要的东西?

vba - 错误处理 VBA(如果 Err raise 则跳过此 block )

java - 无法获取 Apache POI 来进行 Excel 阅读