xml - RCurl fails to download URL content

Tags: xml r rcurl rvest

The page fails to download. This is the error I get:

Error in which(value == defs) : 
  argument "code" is missing, with no default

Here is my code:

require(RCurl)
require(XML)

ok <- "http://www.okcupid.com/match?filter1=0,34&filter2=2,22,40&filter3=3,5&filter4=5,3600&filter5=9,486&filter6=1,1&locid=4265540&lquery=San%20Francisco,%20California&timekey=1&matchOrderBy=MATCH&custom_search=0&fromWhoOnline=0&mygender=m&update_prefs=1&sort_type=0&sa=1&using_saved_search=&count=50"

okc <- getURL(ok, encoding="UTF-8") #Download the page
okcHTML <- htmlParse(okc, asText = TRUE, encoding = "utf-8")

Best answer

If you're willing to live on the bleeding edge of the Hadleyverse, rvest handles this just fine:

library(rvest)

ok_search <- "https://www.okcupid.com/match?filter1=0,34&filter2=2,22,40&filter3=3,5&filter4=5,3600&filter5=9,486&filter6=1,1&locid=4265540&lquery=San%20Francisco,%20California&timekey=1&matchOrderBy=MATCH&custom_search=0&fromWhoOnline=0&mygender=m&update_prefs=1&sort_type=0&sa=1&using_saved_search=&count=50"

pg <- html_session(ok_search)
pg %>% html_nodes("div.profile_info") %>% html_text()

##  [1] "  phenombom   32·San Francisco, CA  "        "  sylvea   24·San Francisco, CA  "          
##  [3] "  haafu   40·San Francisco, CA  "            "  Rebamania   31·San Francisco, CA  "       
##  [5] "  Brilikedacheese   26·San Francisco, CA  "  "  cloudhunteress   23·San Francisco, CA  "  
##  [7] "  Lizzieisdizzy   28·San Francisco, CA  "    "  liddybird80   34·San Francisco, CA  "     
##  [9] "  wander_found   32·San Francisco, CA  "     "  Crunchyisinabox   31·San Francisco, CA  " 
...

I'll look into why plain RCurl (which rvest wraps) doesn't work.

Update

Going one level deeper and using httr (another RCurl abstraction):

library(httr)
library(XML)

res <- GET(ok_search)
ok_html <- content(res, as="parsed")
xpathSApply(ok_html, "//div[@class='profile_info']", xmlValue)

This returns the same result as above, so it works fine too.

Update / Solved

library(RCurl)
library(XML)

okc <- getURL(ok, followlocation = TRUE)  # follow the 302 redirect to the https URL
ok_html <- htmlParse(okc, asText = TRUE)  # parse the downloaded HTML string
xpathSApply(ok_html, "//div[@class='profile_info']", xmlValue)

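You need to add followlocation=TRUE. The original URL returns a 302 response (the server is sending a redirect), and RCurl does not follow redirects by default, whereas httr and rvest appear to set that option by default.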
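To confirm that a redirect really was followed, one option is to reuse a curl handle and inspect it afterwards. The following is a minimal sketch, not part of the original answer: the helper name fetch_following_redirects and the max_redirects argument are made up for illustration, and it assumes getCurlInfo() reports the response.code, effective.url and redirect.count fields for the handle.

```r
library(RCurl)

# Hypothetical helper (not part of RCurl): fetch a URL, follow redirects,
# and report where the request ended up.
fetch_following_redirects <- function(url, max_redirects = 5L) {
  handle <- getCurlHandle()
  body <- getURL(url, curl = handle,
                 followlocation = TRUE,      # follow 3xx Location headers
                 maxredirs = max_redirects)  # cap the length of the redirect chain
  info <- getCurlInfo(handle)                # transfer details recorded on the handle
  list(body = body,
       status = info$response.code,
       final_url = info$effective.url,
       redirects = info$redirect.count)
}
```

Calling fetch_following_redirects(ok) with the URL from the question should then give a result whose final_url starts with https:// if the redirect was followed.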

You can pass verbose=TRUE to getURL to see the request and the response headers as console messages:

## * Adding handle: conn: 0x114ade000
## * Adding handle: send: 0
## * Adding handle: recv: 0
## * Curl_addHandleToPipeline: length: 1
## * - Conn 12 (0x114ade000) send_pipe: 1, recv_pipe: 0
## * About to connect() to www.okcupid.com port 80 (#12)
## *   Trying 198.41.209.131...
## * Connected to www.okcupid.com (198.41.209.131) port 80 (#12)
## > GET /match?filter1=0,34&filter2=2,22,40&filter3=3,5&filter4=5,3600&filter5=9,486&filter6=1,1&locid=4265540&lquery=San%20Francisco,%20California&timekey=1&matchOrderBy=MATCH&custom_search=0&fromWhoOnline=0&mygender=m&update_prefs=1&sort_type=0&sa=1&using_saved_search=&count=50 HTTP/1.1
## User-Agent: curl/7.30.0 Rcurl/1.95.4.3
## Host: www.okcupid.com
## Accept: */*
## 
## < HTTP/1.1 302
## < Date: Mon, 20 Oct 2014 20:07:12 GMT
## < Content-Type: application/octet-stream
## < Transfer-Encoding: chunked
## < Connection: keep-alive
## < Set-Cookie: __cfduid=d0d55f2c9c990d97b0d02dba7148881741413835631999; expires=Mon, 23-Dec-2019 23:50:00 GMT; path=/; domain=.okcupid.com; HttpOnly
## < X-OKWS-Version: OKWS/3.1.30.2
## < Location: https://www.okcupid.com/match?filter1=0,34&filter2=2,22,40&filter3=3,5&filter4=5,3600&filter5=9,486&filter6=1,1&locid=4265540&lquery=San%20Francisco,%20California&timekey=1&matchOrderBy=MATCH&custom_search=0&fromWhoOnline=0&mygender=m&update_prefs=1&sort_type=0&sa=1&using_saved_search=&count=50
## < P3P: CP="NOI CURa ADMa DEVa TAIa OUR BUS IND UNI COM NAV INT", policyref="http://www.okcupid.com/w3c/p3p.xml"
## < X-XSS-Protection: 1; mode=block
## < Set-Cookie: guest=10834912674894888479; Expires=Tue, 20 Oct 2015 20:07:12 GMT; Path=/; Domain=okcupid.com; HttpOnly
## * Server cloudflare-nginx is not blacklisted
## < Server: cloudflare-nginx
## < CF-RAY: 17c7d71bf1880412-EWR
## < 
## * Connection #12 to host www.okcupid.com left intact
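The two lines that matter in this log are the status line (302) and the Location header. As a rough illustration of how to read them, here is a small base R sketch; parse_redirect is a hypothetical helper, and the header lines are copied from the log above with the "## < " prefix removed and the Location query string shortened for readability.

```r
# Header lines taken from the verbose log (prefix stripped, Location shortened).
header_lines <- c(
  "HTTP/1.1 302",
  "Content-Type: application/octet-stream",
  "Location: https://www.okcupid.com/match?count=50",
  "Server: cloudflare-nginx"
)

# Hypothetical helper: extract the status code and redirect target
# from raw response header lines.
parse_redirect <- function(lines) {
  # The first line is the status line, e.g. "HTTP/1.1 302".
  status <- as.integer(sub("^HTTP/[0-9.]+ +([0-9]{3}).*$", "\\1", lines[1]))
  loc_line <- grep("^Location:", lines, ignore.case = TRUE, value = TRUE)
  location <- if (length(loc_line) > 0) {
    sub("^Location: *", "", loc_line[1], ignore.case = TRUE)
  } else {
    NA_character_
  }
  list(status = status,
       is_redirect = status %in% c(301L, 302L, 303L, 307L, 308L),
       location = location)
}

info <- parse_redirect(header_lines)
```

Here info$is_redirect is TRUE and info$location holds the https URL, which is the address that followlocation=TRUE makes RCurl request next.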

This is very useful when debugging problems like this one. You can also pass verbose() to the httr and rvest URL retrieval functions.
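A minimal sketch of the httr variant, assuming the response object exposes the redirect chain through its all_headers element; trace_redirects is a hypothetical name used only for this example.

```r
library(httr)

# Hypothetical helper: request a URL with verbose logging and
# summarise the redirect chain that httr followed.
trace_redirects <- function(url) {
  res <- GET(url, verbose())  # prints request and response headers to the console
  list(final_status = status_code(res),
       final_url = res$url,
       # one status code per response in the chain, e.g. a 302 followed by a 200
       chain = vapply(res$all_headers,
                      function(h) as.integer(h$status),
                      integer(1)))
}
```

Running trace_redirects(ok) on the http URL from the question should print a log similar to the RCurl one and return the chain of status codes.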

Regarding "xml - RCurl fails to download URL content", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/26473611/
