VBA: getElementsByClassName, and double quotes missing from the HTML source

Tags: vba, excel, getelementsbyclassname

I scrape a few websites for fun, using VBA as the tool. I use XMLHTTP and HTMLDocument because they are faster than InternetExplorer.Application.

Public Sub XMLhtmlDocumentHTMLSourceScraper()

    Dim XMLHTTPReq As Object
    Dim htmlDoc As HTMLDocument

    Dim postURL As String

    postURL = "http://foodffs.tumblr.com/archive/2015/11"

        Set XMLHTTPReq = New MSXML2.XMLHTTP

        With XMLHTTPReq
            .Open "GET", postURL, False
            .Send
        End With

        Set htmlDoc = New HTMLDocument
        With htmlDoc
            .body.innerHTML = XMLHTTPReq.responseText
        End With

        i = 0

        Set varTemp = htmlDoc.getElementsByClassName("post_glass post_micro_glass")

        For Each vr In varTemp
            ' *1: the next line is what I used to investigate this issue
            Cells(1, 1) = vr.outerHTML
            Set varTemp2 = vr.getElementsByTagName("SPAN class=post_date")
            Cells(i + 1, 3) = varTemp2.Item(0).innerText
            ' the next line raises run-time error 438
            Set varTemp2 = vr.getElementsByClassName("hover_inner")
            Cells(i + 1, 4) = varTemp2.innerText

            i = i + 1

        Next vr
End Sub

I investigated the problem with the line marked *1. Cells(1, 1) showed me the following:

<DIV class="post_glass post_micro_glass" title=""><A class=hover title="" href="http://foodffs.tumblr.com/post/134291668251/sugar-free-low-carb-coffee-ricotta-mousse-really" target=_blank>
<DIV class=hover_inner><SPAN class=post_date>...............

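As you can see, the class attributes have lost their double quotes. Only the class on the first element still has them. I really don't know why this happens.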

(I can work around this with getElementsByTagName("span"), but I would prefer to select by class.)

Best answer

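The getElementsByClassName method is not treated as a method of an individual element, only of the parent HTMLDocument. If you want to use it to locate elements inside a DIV element, you need to create a sub-HTMLDocument made up of that particular DIV element's .outerHtml.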

Public Sub XMLhtmlDocumentHTMLSourceScraper()

    ' Early binding: needs references to "Microsoft HTML Object Library" and
    ' to a Microsoft XML library that provides MSXML2.XMLHTTP.
    Dim xmlHTTPReq As New MSXML2.XMLHTTP
    Dim htmlDOC As New HTMLDocument, divSUBDOC As New HTMLDocument
    Dim iDIV As Long, iSPN As Long
    Dim postURL As String, nr As Long

    postURL = "http://foodffs.tumblr.com/archive/2015/11"

    ' Synchronous GET (third argument False = not asynchronous)
    With xmlHTTPReq
        .Open "GET", postURL, False
        .Send
    End With

    ' Load the response into an in-memory HTML document
    With htmlDOC
        .body.innerHTML = xmlHTTPReq.responseText
    End With

    With htmlDOC
        For iDIV = 0 To .getElementsByClassName("post_glass post_micro_glass").Length - 1
            nr = Sheet1.Cells(Rows.Count, 3).End(xlUp).Offset(1, 0).Row
            With .getElementsByClassName("post_glass post_micro_glass")(iDIV)
                'method 1 - run through multiples in a collection
                For iSPN = 0 To .getElementsByTagName("span").Length - 1
                    With .getElementsByTagName("span")(iSPN)
                        Select Case LCase(.className)
                            Case "post_date"
                                Sheet1.Cells(nr, 3) = .innerText
                            Case "post_notes"
                                Sheet1.Cells(nr, 4) = .innerText
                            Case Else
                                'do nothing
                        End Select
                    End With
                Next iSPN
                'method 2 - create a sub-HTML doc to facilitate getting els by classname
                divSUBDOC.body.innerHTML = .outerHTML  'only the HTML from this DIV
                With divSUBDOC
                    If CBool(.getElementsByClassName("hover_inner").Length) Then 'there is at least 1
                        'use the first
                        Sheet1.Cells(nr, 5) = .getElementsByClassName("hover_inner")(0).innerText
                    End If
                End With
            End With
        Next iDIV
    End With

End Sub
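
Method 1 above filters SPAN elements by their className. The same idea can be generalized so that no sub-document is needed. The following is only a sketch and not part of the original answer: ElementsByClass and HasClassToken are hypothetical helper names, and it assumes the class names in the markup are separated by plain spaces.

```vba
' Hypothetical helper: returns the descendants of parentEl whose class
' attribute contains classToken. Parameters are late-bound (As Object), so
' the helper does not depend on how the element variable was declared.
Public Function ElementsByClass(ByVal parentEl As Object, _
                                ByVal classToken As String) As Collection
    Dim hits As New Collection
    Dim el As Object

    ' "*" asks for every descendant element of parentEl
    For Each el In parentEl.getElementsByTagName("*")
        If HasClassToken(el.className, classToken) Then hits.Add el
    Next el

    Set ElementsByClass = hits
End Function

' True when classToken is one of the space-separated names in classAttr.
' The comparison ignores case, in the same spirit as the LCase call in method 1.
Private Function HasClassToken(ByVal classAttr As String, _
                               ByVal classToken As String) As Boolean
    HasClassToken = InStr(1, " " & classAttr & " ", _
                          " " & classToken & " ", vbTextCompare) > 0
End Function
```

Called on one of the post_glass DIV elements with "hover_inner", this would stand in for the divSUBDOC step of method 2. The result is a VBA Collection, so it is 1-based; check its Count before reading item 1.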

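While the other .getElementsByXXXX methods can readily retrieve a collection from inside another element, the getElementsByClassName method needs to work on what it believes is an HTMLDocument as a whole, even if you have to fool it into thinking so.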
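
On the quoting itself: as far as I can tell, the unquoted class=hover_inner in the dump is only how the older MSHTML engine writes markup back out through outerHTML and innerHTML. Attribute values without spaces lose their quotes, which would explain why "post_glass post_micro_glass" kept them. It should not affect the parsed class value. A quick way to check, as a sketch with made-up markup (not part of the original answer):

```vba
Public Sub ShowClassSurvivesSerialization()
    Dim doc As Object
    Set doc = CreateObject("htmlfile")   ' late-bound HTML document

    ' Made-up markup for illustration; the quotes are present in the input
    doc.body.innerHTML = "<div class=""hover_inner"">" & _
                         "<span class=""post_date"">Nov 2015</span></div>"

    ' Serialized markup: may come back with unquoted attribute values
    Debug.Print doc.body.innerHTML

    ' Parsed attribute value: read it from the element, not from the text
    Debug.Print doc.getElementsByTagName("span").Item(0).className
End Sub
```

If the second line printed in the Immediate window is the plain class name, the missing quotes are cosmetic and the 438 error has to come from the method call rather than from the markup.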

Source: Stack Overflow, https://stackoverflow.com/questions/34302502/
