I've run into cookies while trying to download a PDF.
For example, if I have a DOI for a PDF document on the Archaeology Data Service, it resolves to this landing page with an embedded link in it to this pdf, which really redirects to this other link. library(httr)
will handle resolving the DOI, and we can use library(XML)
to extract the pdf URL from the landing page. But I'm stuck on getting the PDF itself.
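That extraction step could be sketched roughly like this (an illustrative, base-R-only helper, not code from the post; `extract_pdf_url` is a made-up name, and it assumes the landing page's download anchor carries class "dlb3", the selector used on the ADS detail pages; a real scraper would use library(XML)'s htmlParse() and xpathSApply() rather than a regex):

```r
# Hypothetical helper: pull the href of the first <a class="dlb3">
# link out of landing-page HTML, using only base R string functions.
extract_pdf_url <- function(html_text) {
  m <- regmatches(html_text,
                  regexpr('<a class="dlb3" href="[^"]+"', html_text))
  if (length(m) == 0) return(NA_character_)
  sub('.*href="([^"]+)".*', "\\1", m)
}

# offline demonstration on a toy snippet
snippet <- '<html><body><a class="dlb3" href="archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf">Download</a></body></html>'
extract_pdf_url(snippet)
```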
If I do this:
download.file("http://archaeologydataservice.ac.uk/archiveDS/archiveDownload?t=arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf", destfile = "tmp.pdf")
then I get an HTML file, the same one served at http://archaeologydataservice.ac.uk/myads/
Trying the answer at How to use R to download a zipped file from a SSL page that requires cookies led me to this:
library(httr)
terms <- "http://archaeologydataservice.ac.uk/myads/copyrights"
download <- "http://archaeologydataservice.ac.uk/archiveDS/archiveDownload"
values <- list(agree = "yes", t = "arch-1352-1/dissemination/pdf/Dyfed/GL44004.pdf")
# Accept the terms on the form,
# generating the appropriate cookies
POST(terms, body = values)
GET(download, query = values)
# Actually download the file (this will take a while)
resp <- GET(download, query = values)
# write the content of the download to a binary file
writeBin(content(resp, "raw"), "c:/temp/thefile.zip")
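One way to see at a glance whether a request like the one above returned the actual PDF or the consent page is to check the first bytes of the response body: every PDF file starts with the magic bytes "%PDF". A small illustrative check (`is_pdf` is a made-up helper, not part of httr):

```r
# Check whether a raw response body is actually a PDF:
# real PDFs start with the four magic bytes "%PDF".
is_pdf <- function(raw_bytes) {
  length(raw_bytes) >= 4 && identical(rawToChar(raw_bytes[1:4]), "%PDF")
}

is_pdf(charToRaw("%PDF-1.4 ..."))         # what a real PDF body starts with
is_pdf(charToRaw("<?xml version='1.0'"))  # what the consent page starts with
```

With httr this would be `is_pdf(content(resp, "raw"))` before calling writeBin().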
But after the POST
and GET
functions I just get the HTML of the same cookie page I got with download.file
:
> GET(download, query = values)
Response [http://archaeologydataservice.ac.uk/myads/copyrights?from=2f6172636869766544532f61726368697665446f776e6c6f61643f61677265653d79657326743d617263682d313335322d3125324664697373656d696e6174696f6e2532467064662532464479666564253246474c34343030342e706466]
Date: 2016-01-06 00:35
Status: 200
Content-Type: text/html;charset=UTF-8
Size: 21 kB
<?xml version='1.0' encoding='UTF-8' ?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "h...
<html xmlns="http://www.w3.org/1999/xhtml" lang="en">
<head>
<meta http-equiv="Content-Type" content="text/html; c...
<title>Archaeology Data Service: myADS</title>
<link href="http://archaeologydataservice.ac.uk/css/u...
...
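Incidentally, the long from= parameter in the redirect URL above is just the originally requested download path, hex-encoded. Decoding it (a base-R sketch; `hex_decode` is a made-up name) confirms the site bounced the download request to the copyright page:

```r
# Decode a string of hex digit pairs back into text.
hex_decode <- function(h) {
  bytes <- substring(h, seq(1, nchar(h), by = 2), seq(2, nchar(h), by = 2))
  rawToChar(as.raw(strtoi(bytes, base = 16L)))
}

# the start of the from= parameter in the Response URL above
hex_decode("2f6172636869766544532f61726368697665446f776e6c6f61643f61677265653d796573")
# "/archiveDS/archiveDownload?agree=yes"
```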
Looking at http://archaeologydataservice.ac.uk/about/Cookies, it appears the cookie situation on this site is complicated. This kind of cookie complexity seems not uncommon for UK data providers: automating the login to the uk data service website in R with RCurl or httr
How can I use R to get past the cookies on this website?
Best Answer
Your plea on rOpenSci was heard!
There's a lot of javascript between those pages, which makes it somewhat annoying to try to decipher with httr
+ rvest
. Try RSelenium
. The following worked on OS X 10.11.2 with R 3.2.3 and Firefox loaded.
library(RSelenium)
# check if a server is present; if not, get a server
checkForServer()
# get the server going
startServer()
dir.create("~/justcreateddir")
setwd("~/justcreateddir")
# we need PDFs to download instead of display in-browser
prefs <- makeFirefoxProfile(list(
`browser.download.folderList` = as.integer(2),
`browser.download.dir` = getwd(),
`pdfjs.disabled` = TRUE,
`plugin.scan.plid.all` = FALSE,
`plugin.scan.Acrobat` = "99.0",
`browser.helperApps.neverAsk.saveToDisk` = 'application/pdf'
))
# get a browser going
dr <- remoteDriver$new(extraCapabilities=prefs)
dr$open()
# go to the page with the PDF
dr$navigate("http://archaeologydataservice.ac.uk/archives/view/greylit/details.cfm?id=17755")
# find the PDF link and "hit ENTER"
pdf_elem <- dr$findElement(using="css selector", "a.dlb3")
pdf_elem$sendKeysToElement(list("\uE007"))
# find the ACCEPT button and "hit ENTER"
# that will save the PDF to the default downloads directory
accept_elem <- dr$findElement(using="css selector", "a[id$='agreeButton']")
accept_elem$sendKeysToElement(list("\uE007"))
Now wait for the download to complete. The R console won't be busy while the download runs, so it's easy to accidentally close the session before the download has finished.
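To avoid closing the session too early, one option is to poll the download directory until the file shows up and is non-empty (an illustrative sketch; `wait_for_file` is a made-up name, and the filename is assumed from the t= parameter above):

```r
# Poll until `path` exists and is non-empty, or the timeout (seconds) passes.
wait_for_file <- function(path, timeout = 300, poll = 1) {
  deadline <- Sys.time() + timeout
  while (Sys.time() < deadline) {
    if (file.exists(path) && file.info(path)$size > 0) return(TRUE)
    Sys.sleep(poll)
  }
  FALSE
}
```

For example, `wait_for_file(file.path(getwd(), "GL44004.pdf"))` before closing the session.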
# close the session
dr$close()
On r - accepting cookies to download a PDF file with R, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/34623816/