vba - Excel VBA macro: Scraping data from site table that spans multiple pages

Tags: vba excel web-scraping

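Thanks in advance for your help. I'm running Windows 8.1 with the latest IE/Chrome browsers and the latest Excel. I'm trying to write an Excel macro that pulls data from StackOverflow ( https://stackoverflow.com/tags ). Specifically, I'm trying to pull the date (the date the macro is run), the tag name, the tag count, and the tag's brief description. I have it working for the first page of the table, but not for the remaining pages (there are currently 1132 pages). Right now it also overwrites the data every time the macro runs, and I'm not sure how to make it look for the next empty cell before writing. Finally, I'm trying to make it run automatically once a week.

I'd really appreciate any help here. The problems are:

  1. Pulling data from the web table beyond the first page
  2. Making it write the scraped data to the next empty row instead of overwriting
  3. Making the macro run automatically once a week

The code (so far) is below. Thanks!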

Enum READYSTATE
READYSTATE_UNINITIALIZED = 0
READYSTATE_LOADING = 1
READYSTATE_LOADED = 2
READYSTATE_INTERACTIVE = 3
READYSTATE_COMPLETE = 4
End Enum

Sub ImportStackOverflowData()
  'to refer to the running copy of Internet Explorer
  Dim ie As InternetExplorer
  'to refer to the HTML document returned
  Dim html As HTMLDocument
  'open Internet Explorer in memory, and go to website
  Set ie = New InternetExplorer
  ie.Visible = False
  ie.navigate "http://stackoverflow.com/tags"

  'Wait until IE is done loading page
  Do While ie.READYSTATE <> READYSTATE_COMPLETE
    Application.StatusBar = "Trying to go to StackOverflow ..."
    DoEvents
  Loop

  'show text of HTML document returned
  Set html = ie.document

  'close down IE and reset status bar
  Set ie = Nothing
  Application.StatusBar = ""

  'clear old data out and put titles in
  'Cells.Clear
  'put heading across the top of row 3
  Range("A3").Value = "Date Pulled"
  Range("B3").Value = "Keyword"
  Range("C3").Value = "# Of Tags"
  'Range("C3").Value = "Asked This Week"
  Range("D3").Value = "Description"

  Dim TagList As IHTMLElement
  Dim Tags As IHTMLElementCollection
  Dim Tag As IHTMLElement
  Dim RowNumber As Long
  Dim TagFields As IHTMLElementCollection
  Dim TagField As IHTMLElement
  Dim Keyword As String
  Dim NumberOfTags As String
  'Dim AskedThisWeek As String
  Dim TagDescription As String
  'Dim QuestionFieldLinks As IHTMLElementCollection
  Dim TodaysDate As Date

  Set TagList = html.getElementById("tags-browser")
  Set Tags = html.getElementsByClassName("tag-cell")
  RowNumber = 4

  For Each Tag In Tags
    'if this is the tag containing the details, process it
    If Tag.className = "tag-cell" Then
      'get a list of all of the elements inside this tag cell,
      'and loop over them
      Set TagFields = Tag.all

      For Each TagField In TagFields
        'if this is the keyword, store it
        If TagField.className = "post-tag" Then
          'store the text value
          Keyword = TagField.innerText
          Cells(RowNumber, 2).Value = TagField.innerText
        End If

        If TagField.className = "item-multiplier-count" Then
          'store the integer for number of tags
          NumberOfTags = TagField.innerText
          'NumberOfTags = Replace(NumberOfTags, "x", "")
          Cells(RowNumber, 3).Value = Trim(NumberOfTags)
        End If

        If TagField.className = "excerpt" Then
          TagDescription = TagField.innerText
          Cells(RowNumber, 4).Value = TagField.innerText
        End If

        'use the Date function directly; round-tripping through a formatted string is locale-dependent
        TodaysDate = Date
        Cells(RowNumber, 1).Value = TodaysDate

      Next TagField

      'go on to next row of worksheet
      RowNumber = RowNumber + 1
    End If
  Next

  Set html = Nothing

  'do some final formatting
  Range("A3").CurrentRegion.WrapText = False
  Range("A3").CurrentRegion.EntireColumn.AutoFit
  Range("A1:C1").EntireColumn.HorizontalAlignment = xlCenter
  Range("A1:D1").Merge
  Range("A1").Value = "StackOverflow Tag Trends"
  Range("A1").Font.Bold = True
  Application.StatusBar = ""
  MsgBox "Done!"
End Sub

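Best answer

There's no need to scrape Stack Overflow when it makes the underlying data available to you through tools like the Data Explorer. Using this query in the Data Explorer should get you the results you need: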

select t.TagName, t.Count, p.Body
 from Tags t inner join Posts p
 on t.ExcerptPostId = p.Id
 order by t.count desc;

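The permalink to that query is here. The "Download CSV" option that appears after the query runs is probably the easiest way to get the data into Excel. If you want to automate that part, the direct link to the CSV download of the results is here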
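
A rough sketch of how the downloaded CSV could be appended below the existing rows instead of overwriting them, and then re-run weekly, might look like the following. It rests on several assumptions: the CSV has already been saved to a hypothetical local path (`CSV_PATH`), its columns are TagName, Count, Body in the order the query selects them, and the worksheet keeps the question's layout (headings in row 3, the date in column A). The procedure names `AppendTagCsv` and `WeeklyImport` are illustrative only.

```vba
Option Explicit

' Assumption: hypothetical location where the "Download CSV" result was saved.
Private Const CSV_PATH As String = "C:\Temp\QueryResults.csv"

' Appends the CSV rows below the last used row instead of overwriting.
Sub AppendTagCsv()
    Dim src As Workbook
    Dim srcData As Range
    Dim dest As Worksheet
    Dim nextRow As Long
    Dim rowCount As Long

    ' Assumption: the data lives on the first sheet of the macro workbook.
    Set dest = ThisWorkbook.Worksheets(1)

    ' Excel parses the CSV into columns when it opens the file.
    Set src = Workbooks.Open(Filename:=CSV_PATH, ReadOnly:=True)
    Set srcData = src.Worksheets(1).Range("A1").CurrentRegion

    ' Everything except the header row.
    rowCount = srcData.Rows.Count - 1

    If rowCount > 0 Then
        ' First empty row below the last filled cell in column B (Keyword).
        nextRow = dest.Cells(dest.Rows.Count, "B").End(xlUp).Row + 1
        ' Never write above row 4; row 3 holds the headings.
        If nextRow < 4 Then nextRow = 4

        ' Column A: the date the data was pulled.
        dest.Cells(nextRow, "A").Resize(rowCount, 1).Value = Date
        ' Columns B:D: TagName, Count, Body.
        srcData.Offset(1, 0).Resize(rowCount, 3).Copy _
            Destination:=dest.Cells(nextRow, "B")
    End If

    src.Close SaveChanges:=False
End Sub

' Runs the import, then re-schedules itself for seven days later.
Sub WeeklyImport()
    AppendTagCsv
    Application.OnTime EarliestTime:=Now + 7, Procedure:="WeeklyImport"
End Sub
```

Running `WeeklyImport` once starts the cycle. `Application.OnTime` only fires while the Excel session stays open, so for an unattended weekly run it may be more practical to have Windows Task Scheduler open the workbook and call `AppendTagCsv` from the `Workbook_Open` event.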

Regarding vba - Excel VBA macro: Scraping data from site table that spans multiple pages, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/26664701/
