xml - Loading a table from Wikipedia into R

I am trying to load the table of Supreme Court justices into R from the following URL: https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States

I am using the following code:

scotusURL <- "https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States"
scotusData <- getURL(scotusURL, ssl.verifypeer = FALSE)
scotusDoc <- htmlParse(scotusData)
scotusData <- scotusDoc['//table[@class="wikitable"]']
scotusTable <- readHTMLTable(scotusData[[1]], stringsAsFactors = FALSE)

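R returns scotusTable as NULL. The goal is to get a data.frame in R that I can use to make a ggplot of SCOTUS justices' tenure on the court. I previously had this script producing a great plot, but something on the page changed after the recent decisions and the script no longer runs. I went through the HTML on Wikipedia to try to find the change, but I am not a web developer, so whatever breaks my script is not immediately apparent to me.

Also, is there a way in R to cache the data from this page so that I am not hitting the URL all the time? That seems like the ideal way to avoid this problem in the future. Thanks for your help.

By the way, SCOTUS is an ongoing hobby/side project of mine, so if there is a better data source than Wikipedia, I am all ears.

Edit: Sorry, I should have listed my dependencies. I am using the XML, plyr, RCurl, data.table, and ggplot2 libraries.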

Best answer

If you don't mind using a different package, you can try the "rvest" package.

library(rvest)    
scotusURL <- "https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States"

Option 1: Grab the tables from the page and use the html_table function to extract the one you are interested in.

# read_html() is used here because html() was deprecated in its favor
temp <- scotusURL %>% 
  read_html() %>%
  html_nodes("table")

html_table(temp[1]) ## Just the "legend" table
html_table(temp[2]) ## The table you're interested in
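
Because the position of a table on the page can shift when the article is edited, selecting by CSS class rather than by index may be somewhat more robust. The following is a minimal sketch, assuming the target table carries the `wikitable` class; the inline HTML and the "Justice A"/"Justice B" rows are made-up stand-ins for the real page so that the example runs without network access.

```r
library(rvest)

# Hypothetical stand-in for the downloaded page:
# a legend table followed by the data table of interest.
page <- read_html('
<html><body>
<table><tr><td>Legend</td></tr></table>
<table class="wikitable">
  <tr><th>Justice</th><th>Start</th></tr>
  <tr><td>Justice A</td><td>1801</td></tr>
  <tr><td>Justice B</td><td>1835</td></tr>
</table>
</body></html>')

# Select only tables whose class list includes "wikitable".
tables <- page %>% html_nodes("table.wikitable")

# Passing a single node (note the [[ ]]) yields one table, not a list.
justices <- html_table(tables[[1]])
print(justices)
```

With the real page, `page` would instead come from `read_html(scotusURL)`, and it is worth checking `names(justices)` before plotting in case more than one table matches.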

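Option 2: Inspect the table element and copy its XPath to read that table directly (right-click, Inspect Element, scroll to the relevant "table" tag, right-click it, and choose "Copy XPath").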
```
# read_html() is used here because html() was deprecated in its favor
scotusURL %>% 
  read_html() %>% 
  html_nodes(xpath = '//*[@id="mw-content-text"]/table[2]') %>% 
  html_table()
```
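
One possible reason an exact-match XPath such as the `//table[@class="wikitable"]` in the question can come back empty is that the table's class attribute holds more than one value (for example `wikitable sortable`). That is an assumption about the page, not something confirmed here. The sketch below uses made-up inline HTML to show the difference between an exact match and a token match on the class attribute.

```r
library(rvest)

# Hypothetical page fragment: the table has two classes, not just "wikitable".
page <- read_html('
<html><body>
<table class="wikitable sortable">
  <tr><th>Justice</th></tr>
  <tr><td>Justice A</td></tr>
</table>
</body></html>')

# Exact string comparison: "wikitable sortable" is not equal to "wikitable".
exact <- html_nodes(page, xpath = '//table[@class="wikitable"]')

# Token match: pad the class list with spaces and look for " wikitable ".
loose <- html_nodes(
  page,
  xpath = '//table[contains(concat(" ", normalize-space(@class), " "), " wikitable ")]'
)

print(length(exact))
print(length(loose))
```

The token-match form is longer, but it avoids accidentally matching a class such as `wikitable-extra`, which a plain `contains(@class, "wikitable")` would also pick up.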

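Another option I like is to load the data into a Google spreadsheet and read it with the "googlesheets" package.

In Google Drive, create a new spreadsheet named "Supreme Court". In the first worksheet, enter: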

=importhtml("https://en.wikipedia.org/wiki/List_of_Justices_of_the_Supreme_Court_of_the_United_States", "table", 2)

This automatically scrapes the table into your Google spreadsheet.

From there, you can do the following in R:

library(googlesheets)
SC <- gs_title("Supreme Court")
gs_read(SC)
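
On the caching part of the question, one simple approach is to save the scraped data frame to a local file and only hit the URL when that file is missing. The following is a minimal sketch using base R's `saveRDS` and `readRDS`; the helper name `read_cached` and the cache file name are hypothetical, and the fetch step is passed in as a function so that any of the scraping options above could be plugged in.

```r
# Hypothetical helper: return the cached object if the file exists,
# otherwise call fetch(), store its result on disk, and return it.
read_cached <- function(cache_file, fetch) {
  if (file.exists(cache_file)) {
    return(readRDS(cache_file))
  }
  result <- fetch()
  saveRDS(result, cache_file)
  result
}

# Usage sketch (commented out because it needs network access and rvest):
# scotus <- read_cached("scotus.rds", function() {
#   html_table(html_nodes(read_html(scotusURL), "table")[[2]])
# })
```

Deleting the cache file forces a fresh download on the next call, which is a reasonable way to pick up changes to the page on your own schedule.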

Regarding xml - Loading a table from Wikipedia into R, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/31176709/
