html - Finding the absolute HTML path for a given relative href using R

Tags: html r hyperlink rvest

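I am new to HTML, but I am writing a script to download all PDF files that a given web page links to (for fun, and to avoid tedious manual work). I cannot find where in the HTML document I should look for the information needed to complete the relative paths. I know it must be possible, because my web browser manages to do it.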

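Example: I am trying to scrape the lecture notes linked from this page on ocw.mit.edu. Whether I look at the raw HTML or read the href attribute of the <a> nodes with the R package rvest, I only get relative paths: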

library(rvest)
url <- paste0("https://ocw.mit.edu/courses/",
  "electrical-engineering-and-computer-science/",
  "6-006-introduction-to-algorithms-fall-2011/lecture-notes/")

# Read webpage and extract all links
links_all <- read_html(url)  %>% 
  html_nodes("a") %>%
  html_attr("href")

# Keep only hrefs ending in "pdf" (note: tolower() also lowercases the returned values)
links_pdf <- grep("pdf$", tolower(links_all), value = TRUE)
links_pdf[1]
# [1] "/courses/electrical-engineering-and-computer-science/6-006-introduction-to-algorithms-fall-2011/lecture-videos/mit6_006f11_lec01.pdf"

Best answer

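The simplest solution I have found so far is the url_absolute(x, base) function from the xml2 package. For the base argument, use the URL of the page you retrieved the source from.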

This seems less error-prone than trying to extract the base URL from the address with a regular expression.
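As a rough sketch of how this might look with the question's example: the snippet below resolves relative hrefs against the page URL with xml2::url_absolute(). The links_all vector here is a hard-coded stand-in for the scraped hrefs (an assumption, so that the sketch runs without network access), and its second and third entries are hypothetical. The filter uses grepl() with ignore.case = TRUE instead of tolower(), so the original case of each path is kept.

```r
library(xml2)

# URL of the page the links were scraped from (taken from the question)
base_url <- paste0("https://ocw.mit.edu/courses/",
  "electrical-engineering-and-computer-science/",
  "6-006-introduction-to-algorithms-fall-2011/lecture-notes/")

# Stand-in for the result of html_attr("href"); the first entry is the
# path shown in the question, the other two are hypothetical examples
links_all <- c(
  paste0("/courses/electrical-engineering-and-computer-science/",
    "6-006-introduction-to-algorithms-fall-2011/",
    "lecture-videos/mit6_006f11_lec01.pdf"),
  "/about/",
  "notes.PDF"
)

# Keep only hrefs ending in ".pdf", matching case-insensitively
# while leaving the hrefs themselves unchanged
links_pdf <- links_all[grepl("\\.pdf$", links_all, ignore.case = TRUE)]

# Resolve each relative href against the page URL
links_abs <- url_absolute(links_pdf, base_url)
print(links_abs)
```

The resulting absolute URLs could then be passed one at a time to something like download.file(), which needs a full URL rather than a site-relative path.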

Source: the original question on Stack Overflow, https://stackoverflow.com/questions/47498461/
