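I am scraping this URL, https://accessgudid.nlm.nih.gov/devices/10806378034350, with the VBA script below.
I want the lot or batch, serial number, and expiration date information, which shows as "Yes" or "No" in the HTML in the picture below.
How do I return the "Yes" or "No" information?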
Sub GetInnerInformation(HTMLPage As MSHTML.HTMLDocument)
    Dim HTMLResult As MSHTML.IHTMLElement
    Dim HTMLResults As MSHTML.IHTMLElementCollection
    Dim HTMLRow As MSHTML.IHTMLElement
    Dim HTMLCell As MSHTML.IHTMLElement
    Dim RowNum As Long, ColNum As Integer

    Set HTMLResults = HTMLPage.getElementsByClassName("device-attribute")
    For Each HTMLResult In HTMLResults
        If (HTMLResult.innerText Like "*Lot*") = True Then
            Debug.Print HTMLResult.innerText, HTMLResult.outerText, HTMLResult.innerHTML
        End If
    Next HTMLResult
End Sub
In my Immediate window I get:
Lot or Batch Number: Lot or Batch Number: Lot or Batch Number:
So there is no mention of the "Yes" or "No" that appears in the HTML.
Best answer
HTML parser:
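You can use a CSS attribute = value selector with the contains operator (*) to target the span that precedes the div of interest. Then use parentElement to climb to the shared parent and NextSibling to move across to the div of interest. From there, getElementsByTagName returns the label nodes, and you can loop over that nodeList to write out the required information. To get the value associated with each label, you again need NextSibling, because the values sit alongside the labels and br children inside the parent div rather than inside the labels themselves.
I use xmlhttp to issue the request, which is faster than opening a browser.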
Option Explicit

Public Sub WriteOutYesNos()
    ' Requires a reference to Microsoft HTML Object Library
    Dim html As MSHTML.HTMLDocument
    Set html = New MSHTML.HTMLDocument

    ' Fetch the page synchronously and load it into the parser
    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://accessgudid.nlm.nih.gov/devices/10806378034350", False
        .send
        html.body.innerHTML = .responseText
    End With

    Dim nodes As Object, i As Long
    ' span -> shared parent -> sibling div -> the labels inside that div
    Set nodes = html.querySelector("[title*='A production identifier (PI) is a variable']").parentElement.NextSibling.getElementsByTagName("LABEL")

    ' The last two labels are skipped (hence Length - 3)
    For i = 0 To nodes.Length - 3
        With ActiveSheet
            .Cells(i + 1, 1) = nodes(i).innerText
            ' The value is the text node that follows the label
            .Cells(i + 1, 2) = nodes(i).NextSibling.NodeValue
        End With
    Next
End Sub
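To see why innerText on the label alone never returns "Yes" or "No", here is a minimal offline sketch. The markup in SAMPLE is hypothetical, simplified from what the traversal above implies, and LabelValue and DemoLabelValue are illustrative names, not part of the original answer.

```vba
Option Explicit

' Requires a reference to Microsoft HTML Object Library.
' Returns the text that follows the i-th <label> in an HTML fragment.
Public Function LabelValue(ByVal fragment As String, ByVal i As Long) As String
    Dim doc As MSHTML.HTMLDocument, labels As Object
    Set doc = New MSHTML.HTMLDocument
    doc.body.innerHTML = fragment
    Set labels = doc.getElementsByTagName("label")
    ' The value is a text node next to the label, not a child of it,
    ' so it has to be read from NextSibling rather than from innerText.
    LabelValue = Trim$(labels(i).NextSibling.NodeValue)
End Function

Public Sub DemoLabelValue()
    ' Hypothetical markup: each label is followed by its value and a <br>
    Const SAMPLE As String = "<div><label>Lot or Batch Number:</label> Yes<br>" & _
                             "<label>Serial Number:</label> No<br></div>"
    Debug.Print LabelValue(SAMPLE, 0), LabelValue(SAMPLE, 1)
End Sub
```

If the live page wraps the value in its own element instead of leaving it as bare text, NodeValue would come back Null and you would read innerText from that sibling element instead.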
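JSON parser:
The data is also available as JSON, which means you can handle it with a JSON parser. I use jsonconverter.bas as the JSON parser for the response. Download the raw code from the link given in the original answer and add it to a standard module named JsonConverter. You then need to go to VBE > Tools > References and add a reference to Microsoft Scripting Runtime. Remove the Attribute line at the top of the copied code.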
Option Explicit

Public Sub WriteOutYesNos()
    Dim json As Object, ws As Worksheet, results(), i As Long

    Set ws = ThisWorkbook.Worksheets("Sheet1")
    ' Field names to read from the parsed response
    results = Array("lotBatch", "serialNumber", "manufacturingDate")

    With CreateObject("MSXML2.XMLHTTP")
        .Open "GET", "https://accessgudid.nlm.nih.gov/devices/10806378034350.json", False
        .send
        Set json = JsonConverter.ParseJson(.responseText)
    End With

    With ws
        For i = LBound(results) To UBound(results)
            .Cells(i + 1, 1) = results(i)
            ' Map the JSON boolean to the Yes/No shown on the page
            .Cells(i + 1, 2).Value = IIf(json(results(i)), "Yes", "No")
        Next
    End With
End Sub
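The code above reads the flags from the top level of the parsed object. The layout of the live response is not shown here, so if a lookup comes back empty, a recursive search can tell you where a field actually sits. This is only a sketch: FindKeyPath is a hypothetical helper, it assumes the JsonConverter module described above is present, and it skips JSON arrays.

```vba
Option Explicit

' Returns a slash-separated path to the first occurrence of key
' inside a parsed JSON object, or "" if the key is not found.
Public Function FindKeyPath(ByVal node As Object, ByVal key As String, _
                            Optional ByVal prefix As String = "") As String
    Dim k As Variant, found As String
    ' JSON objects are parsed into Dictionary instances; arrays are skipped here
    If TypeName(node) <> "Dictionary" Then Exit Function
    For Each k In node.Keys
        If k = key Then
            FindKeyPath = prefix & k
            Exit Function
        ElseIf IsObject(node(k)) Then
            found = FindKeyPath(node(k), key, prefix & k & "/")
            If Len(found) > 0 Then
                FindKeyPath = found
                Exit Function
            End If
        End If
    Next k
End Function

Public Sub DemoFindKeyPath()
    ' Hypothetical payload; the nesting is an assumption for illustration
    Dim json As Object
    Set json = JsonConverter.ParseJson("{""a"":{""lotBatch"":true,""serialNumber"":false}}")
    Debug.Print FindKeyPath(json, "lotBatch")
End Sub
```

Once you know the path, you can index into the nested dictionaries before the loop, for example json("a") in the made-up payload above.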
XML parser:
The results are also available as XML, which you can parse with an XML parser as long as you handle the default namespace correctly:
Option Explicit

Public Sub WriteOutYesNos()
    Dim xmlDoc As Object, ws As Worksheet, results(), i As Long

    Set xmlDoc = CreateObject("MSXML2.DOMDocument")
    Set ws = ThisWorkbook.Worksheets("Sheet1")
    results = Array("lotBatch", "serialNumber", "manufacturingDate")

    With xmlDoc
        .validateOnParse = True
        .setProperty "SelectionLanguage", "XPath"
        ' Bind the document's default namespace to the prefix i for use in XPath
        .setProperty "SelectionNamespaces", "xmlns:i='http://www.fda.gov/cdrh/gudid'"
        .async = False
        If Not .Load("https://accessgudid.nlm.nih.gov/devices/10806378034350.xml") Then
            Err.Raise .parseError.ErrorCode, , .parseError.reason
            Exit Sub
        End If
    End With

    With ws
        For i = LBound(results) To UBound(results)
            .Cells(i + 1, 1) = results(i)
            ' The node text is "true" or "false", which IIf coerces to a Boolean
            .Cells(i + 1, 2).Value = IIf(xmlDoc.SelectSingleNode("//i:" & results(i)).Text, "Yes", "No")
        Next
    End With
End Sub
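The namespace handling is the part that usually trips people up, so here is a small offline sketch of it. The XML in SAMPLE is a hypothetical, trimmed-down payload (only the namespace URI comes from the code above), and NodeTextNs and DemoNamespace are illustrative names.

```vba
Option Explicit

' Returns the text of the first element named localName in namespace nsUri,
' or "" if there is no such element.
Public Function NodeTextNs(ByVal xml As String, ByVal nsUri As String, _
                           ByVal localName As String) As String
    Dim doc As Object, node As Object
    Set doc = CreateObject("MSXML2.DOMDocument")
    doc.async = False
    doc.setProperty "SelectionLanguage", "XPath"
    ' XPath has no notion of a default namespace, so bind the URI to a prefix
    doc.setProperty "SelectionNamespaces", "xmlns:i='" & nsUri & "'"
    If Not doc.LoadXML(xml) Then Err.Raise vbObjectError + 1, , doc.parseError.reason
    ' An unprefixed //localName would only match elements in no namespace
    Set node = doc.SelectSingleNode("//i:" & localName)
    If Not node Is Nothing Then NodeTextNs = node.Text
End Function

Public Sub DemoNamespace()
    Const NS As String = "http://www.fda.gov/cdrh/gudid"
    ' Hypothetical payload; the element nesting is an assumption
    Const SAMPLE As String = "<gudid xmlns='" & NS & "'><device>" & _
                             "<lotBatch>true</lotBatch></device></gudid>"
    Debug.Print NodeTextNs(SAMPLE, NS, "lotBatch")
End Sub
```

The prefix i is arbitrary; it only has to match between SelectionNamespaces and the XPath expression.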
Source: a similar question on Stack Overflow, "html - Web scraping in VBA, certain HTML information unable to refer to it": https://stackoverflow.com/questions/58810189/