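I have a file that I want to extract dates from. It is an HTML source file, so it is full of code and phrases I don't need. I need to extract every instance of a date that is wrapped in a specific HTML tag:
abbr title="((this is the text I need))" data-utime="
What is the easiest way to achieve this?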
Best answer
If you're using Excel VBA, set a reference (Tools > References) to the MSHTML library, which is listed in the References dialog as Microsoft HTML Object Library.
Sub ScrapeDateAbbr()
    Dim hDoc As MSHTML.HTMLDocument
    Dim hElem As MSHTML.HTMLGenericElement
    Dim sFile As String, lFile As Long
    Dim sHtml As String

    'read in the file
    lFile = FreeFile
    sFile = "C:/Users/dick/Documents/My Dropbox/Excel/Testabbr.html"
    Open sFile For Input As lFile
    sHtml = Input$(LOF(lFile), lFile)
    'release the file handle once the contents are in memory
    Close lFile

    'put into an htmldocument object
    Set hDoc = New MSHTML.HTMLDocument
    hDoc.body.innerHTML = sHtml

    'loop through abbr tags
    For Each hElem In hDoc.getElementsByTagName("abbr")
        'only those that have a data-utime attribute
        If Len(hElem.getAttribute("data-utime")) > 0 Then
            'get the title attribute
            Debug.Print hElem.getAttribute("title")
        End If
    Next hElem
End Sub
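Debug.Print only writes to the Immediate window. If the dates should end up on a worksheet instead, the same loop can fill a Collection that is then written to a column. The following is a minimal sketch rather than part of the original answer: the names GetAbbrTitles and WriteTitlesToSheet are hypothetical, the element variable is declared As Object to sidestep possible type mismatches, and writing to column A of the active sheet is an assumption.

```vba
' Hypothetical helper: returns the title of every <abbr> that carries a
' data-utime attribute. Needs the Microsoft HTML Object Library reference.
Function GetAbbrTitles(ByVal sHtml As String) As Collection
    Dim hDoc As MSHTML.HTMLDocument
    Dim hElem As Object
    Dim cTitles As Collection

    Set cTitles = New Collection
    Set hDoc = New MSHTML.HTMLDocument
    hDoc.body.innerHTML = sHtml

    For Each hElem In hDoc.getElementsByTagName("abbr")
        ' getAttribute may return Null when the attribute is absent;
        ' appending "" coerces Null to an empty string before Len is applied
        If Len(hElem.getAttribute("data-utime") & "") > 0 Then
            cTitles.Add CStr(hElem.getAttribute("title") & "")
        End If
    Next hElem

    Set GetAbbrTitles = cTitles
End Function

' Hypothetical caller: writes one title per row into column A (assumed target).
Sub WriteTitlesToSheet(ByVal sHtml As String)
    Dim cTitles As Collection
    Dim i As Long

    Set cTitles = GetAbbrTitles(sHtml)
    For i = 1 To cTitles.Count
        ActiveSheet.Cells(i, 1).Value = cTitles(i)
    Next i
End Sub
```

Calling WriteTitlesToSheet sHtml once the file contents have been read would take the place of the Debug.Print loop.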
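I assumed the file is local because you called it a source file. If you need to download it first, you will need a second reference, to MSXML, and this code: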
Sub ScrapeDateAbbrDownload()
    Dim xHttp As MSXML2.XMLHTTP
    Dim hDoc As MSHTML.HTMLDocument
    Dim hElem As MSHTML.HTMLGenericElement

    Set xHttp = New MSXML2.XMLHTTP
    xHttp.Open "GET", "file:///C:/Users/dick/Documents/My%20Dropbox/Excel/Testabbr.html"
    xHttp.send

    'wait until the request has completed
    Do
        DoEvents
    Loop Until xHttp.readyState = 4

    'put into an htmldocument object
    Set hDoc = New MSHTML.HTMLDocument
    hDoc.body.innerHTML = xHttp.responseText

    'loop through abbr tags
    For Each hElem In hDoc.getElementsByTagName("abbr")
        'only those that have a data-utime attribute
        If Len(hElem.getAttribute("data-utime")) > 0 Then
            'get the title attribute
            Debug.Print hElem.getAttribute("title")
        End If
    Next hElem
End Sub
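If adding library references is not an option, a regular expression can pull out the same values, at the cost of robustness. The sketch below swaps in a different technique from the DOM approach above and is not part of the original answer. It assumes the attributes appear in exactly the order shown in the question (title immediately followed by data-utime, with or without a space between them), and it uses a late-bound VBScript.RegExp object; the function name GetAbbrTitlesRegex is hypothetical.

```vba
' Hypothetical regex-based alternative; no library references required.
Function GetAbbrTitlesRegex(ByVal sHtml As String) As Collection
    Dim oRe As Object
    Dim oMatch As Object
    Dim cTitles As Collection

    Set cTitles = New Collection
    Set oRe = CreateObject("VBScript.RegExp")
    With oRe
        .Global = True        ' find every occurrence, not just the first
        .IgnoreCase = True    ' tag and attribute names may be in any case
        ' Effective pattern: <abbr\s+title="([^"]*)"\s*data-utime=
        .Pattern = "<abbr\s+title=""([^""]*)""\s*data-utime="
    End With

    For Each oMatch In oRe.Execute(sHtml)
        ' SubMatches(0) holds the text captured between the title quotes
        cTitles.Add oMatch.SubMatches(0)
    Next oMatch

    Set GetAbbrTitlesRegex = cTitles
End Function
```

This version will miss tags whose attributes are reordered or single-quoted, which is why the DOM-based loop is generally the safer choice for arbitrary HTML.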
Regarding excel - scraping text from a file within an HTML tag, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/9758107/