r - Accepting cookies with R to download a PDF file

I'm getting stuck on cookies when trying to download a PDF.

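For example, if I have a DOI for a PDF document on the Archaeology Data Service, it resolves to this landing page, which has an embedded link in it to this pdf, but that link really redirects to this other link.
library(httr) will handle resolving the DOI, and we can use library(XML) to extract the pdf URL from the landing page. But I'm stuck at getting the PDF itself.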

If I do this:

download.file("http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf", destfile = "tmp.pdf")

then I receive an HTML file that is the same as the page at http://archaeologydataservice.ac.uk/myads/
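A quick way to confirm that a download returned the cookie page rather than the document is to look at the first bytes of the saved file, since a PDF begins with the characters `%PDF-`. The following is a minimal sketch in base R; the helper name `is_pdf` is hypothetical and not part of the question.

```r
# Hypothetical helper: TRUE when the file starts with the PDF magic bytes.
is_pdf <- function(path) {
  # Read at most the first five bytes of the file
  header <- readBin(path, what = "raw", n = 5L)
  identical(header, charToRaw("%PDF-"))
}

# Demonstration with a file that has a .pdf name but HTML content,
# similar to what download.file() saved as tmp.pdf above
html_file <- tempfile(fileext = ".pdf")
writeLines("<?xml version='1.0' encoding='UTF-8' ?>", html_file)
is_pdf(html_file)
```

Running `is_pdf("tmp.pdf")` after the `download.file()` call would show whether the server sent the document or the terms page.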

Trying the answer at How to use R to download a zipped file from a SSL page that requires cookies led me to this:

library(httr)

terms <- "http://archaeologydataservice.ac.uk/myads/copyrights"
download <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload"
values <- list(agree = "yes", t = "arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf")

# Accept the terms on the form,
# generating the appropriate cookies

POST(terms, body = values)
GET(download, query = values)

# Actually download the file (this will take a while)

resp <- GET(download, query = values)

# write the content of the download to a binary file

writeBin(content(resp, "raw"), "c:/temp/thefile.zip")

But after the POST and GET calls I just get the HTML of the same cookie page that I got with download.file:

> GET(download, query = values)
Response [http://archaeologydataservice.ac.uk/myads/copyrights?from=2f6172636869766544532f61726368697665446f776e6c6f61643f61677265653d79657326743d617263682d313335322d3125324664697373656d696e6174696f6e2532467064662532464479666564253246474c34343030342e706466]
  Date: 2016-01-06 00:35
  Status: 200
  Content-Type: text/html;charset=UTF-8
  Size: 21 kB
<?xml version='1.0' encoding='UTF-8' ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "h...
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
        <head>
            <meta http-equiv="Content-Type" content="text/html; c...


            <title>Archaeology Data Service:  myADS</title>

            <link href="http://archaeologydataservice.ac.uk/css/u...
...
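The `from=` value in the response URL above looks like a hex-encoded copy of the original request. Decoding it is one way to check that the server bounced the download request to the copyrights page instead of serving the file. This is a sketch in base R; `hex_to_string` is a hypothetical helper, and reading the parameter as hex-encoded text is an assumption based on how it looks.

```r
# Hypothetical helper: turn a string of hex digit pairs into text.
hex_to_string <- function(hex) {
  # Split into two-character pairs, one pair per byte
  starts <- seq(1, nchar(hex), by = 2)
  pairs <- substring(hex, starts, starts + 1)
  rawToChar(as.raw(strtoi(pairs, base = 16L)))
}

# The `from` query parameter copied from the response URL above
from <- paste0(
  "2f6172636869766544532f61726368697665446f776e6c6f61643f",
  "61677265653d79657326743d617263682d313335322d31253246",
  "64697373656d696e6174696f6e253246706466253246",
  "4479666564253246474c34343030342e706466"
)

# Hex-decode first, then undo the percent-encoding of the slashes
original_request <- utils::URLdecode(hex_to_string(from))
print(original_request)
```

The decoded text is the `/archiveDS/archiveDownload` path with the `agree` and `t` query values from the GET call, which suggests the POST did not set whatever the site checks before releasing the file.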

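Looking at http://archaeologydataservice.ac.uk/about/Cookies it seems that the cookie situation on this site is fairly complicated. This kind of cookie complexity seems not to be unusual for UK data providers: automating the login to the uk data service website in R with RCurl or httr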

How can I use R to get past the cookies on this website?

Best answer

Your request on rOpenSci has been heard!

There is a lot of javascript between those pages, which makes it somewhat annoying to try to decipher via httr + rvest. Try RSelenium. This worked on OS X 10.11.2 with R 3.2.3 and Firefox loaded.

library(RSelenium)

# check if a server is present, if not, get a server
checkForServer()

# get the server going
startServer()

dir.create("~/justcreateddir")
setwd("~/justcreateddir")

# we need PDFs to download instead of display in-browser
prefs <- makeFirefoxProfile(list(
  `browser.download.folderList` = as.integer(2),
  `browser.download.dir` = getwd(),
  `pdfjs.disabled` = TRUE,
  `plugin.scan.plid.all` = FALSE,
  `plugin.scan.Acrobat` = "99.0",
  `browser.helperApps.neverAsk.saveToDisk` = 'application/pdf'
))
# get a browser going
dr <- remoteDriver$new(extraCapabilities=prefs)
dr$open()

# go to the page with the PDF
dr$navigate("http://archaeologydataservice.ac.uk/archives/view/greylit/details.cfm?id=17755")

# find the PDF link and "hit ENTER"
pdf_elem <- dr$findElement(using="css selector", "a.dlb3")
pdf_elem$sendKeysToElement(list("\uE007"))

# find the ACCEPT button and "hit ENTER"
# that will save the PDF to the default downloads directory
accept_elem <- dr$findElement(using="css selector", "a[id$='agreeButton']")
accept_elem$sendKeysToElement(list("\uE007"))

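Now wait for the download to complete. The R console will not be busy while it downloads, so it is easy to accidentally close the session before the download has finished.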
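One way to avoid closing the session too early is to poll the download directory until a PDF appears and no partial download remains. The sketch below uses base R only. The function name `wait_for_download` and its arguments are hypothetical, and it assumes Firefox marks in-progress downloads with a `.part` file in the same directory.

```r
# Hypothetical helper: block until a finished download shows up in `dir`.
wait_for_download <- function(dir, pattern = "\\.pdf$", timeout = 120, poll = 1) {
  deadline <- Sys.time() + timeout
  repeat {
    # In-progress Firefox downloads are assumed to leave a .part file
    partial <- list.files(dir, pattern = "\\.part$")
    done <- list.files(dir, pattern = pattern, ignore.case = TRUE,
                       full.names = TRUE)
    if (length(done) > 0 && length(partial) == 0) {
      return(done)
    }
    if (Sys.time() >= deadline) {
      stop("Timed out waiting for a download in ", dir)
    }
    Sys.sleep(poll)
  }
}
```

With the Firefox profile above, which saves into the working directory, calling `wait_for_download(getwd())` before closing the session should return the path of the saved PDF or stop with an error after the timeout.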

# close the session
dr$close()

A similar question to "r - Accepting cookies with R to download a PDF file" can be found on Stack Overflow: https://stackoverflow.com/questions/34623816/
