r - Check whether URLs "exist" in R

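I am trying to check whether a large list of URLs "exist" in R. Please let me know if you can help!

My goal: I am trying to check whether URLs from the Psychology Today online therapist directory exist. I have a data frame containing many possible URLs from this directory. Some of them exist, but some do not. When a URL does not exist, it returns to a generic Psychology Today page.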

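For example, this URL exists: "https://www.psychologytoday.com/us/therapists/new-york/a?page=10". It is the tenth page of New York therapists whose last names start with "A". There are at least 10 pages of New York therapists whose names start with "A", so the page exists.

However, this URL does not exist: "https://www.psychologytoday.com/us/therapists/new-york/a?page=119". There are not 119 pages of New York therapists whose last names start with "A". As a result, the Psychology Today website redirects you to a generic page: "https://www.psychologytoday.com/us/therapists/new-york/a".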

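My ultimate goal is to get a complete list of all pages that exist for New York therapists whose last names start with "A" (I will then repeat this for the other letters, and so on).

Previous post on this topic: there is an earlier StackOverflow post on this topic (Check if URL exists in R), and I have implemented the solutions from that post. However, every solution from that post incorrectly reports that the specific URLs I am interested in do not exist, even though they do!

My code: I have tried the code below to check whether these URLs exist. Both solutions are taken from the earlier post on this topic (linked above). However, both of them tell me that URLs that do exist on Psychology Today do not exist. I do not know why!

Load packages: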

### Load packages and set user agent
pacman::p_load(dplyr, tidyr, stringr, tidyverse, RCurl, pingr)

# Set alternative user agent globally for whole session
options(HTTPUserAgent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36")

# Check user agent string again
options("HTTPUserAgent")

Keep only the "real" URLs: RCurl solution

url.exists("https://www.psychologytoday.com/us/therapists/new-york/a?page=3")

Result: this solution returns "FALSE", even though this page does exist!

Keep only the "real" directory page URLs: solution from comment #1 of the StackExchange post

### Function for checking if URLs are "real"
  # From StackOverflow: https://stackoverflow.com/questions/52911812/check-if-url-exists-in-r
#' @param x a single URL
#' @param non_2xx_return_value what to do if the site exists but the
#'        HTTP status code is not in the `2xx` range. Default is to return `FALSE`.
#' @param quiet if `FALSE` (the default), a warning message is displayed every
#'        time the `non_2xx_return_value` condition arises.
#' @param ... other params (`timeout()` would be a good one) passed directly
#'        to `httr::HEAD()` and/or `httr::GET()`
url_exists <- function(x, non_2xx_return_value = FALSE, quiet = FALSE,...) {

  suppressPackageStartupMessages({
    require("httr", quietly = FALSE, warn.conflicts = FALSE)
  })

  # you don't need these two functions if you're already using `purrr`
  # but `purrr` is a heavyweight compiled package that introduces
  # many other "tidyverse" dependencies and this doesn't.

  capture_error <- function(code, otherwise = NULL, quiet = TRUE) {
    tryCatch(
      list(result = code, error = NULL),
      error = function(e) {
        if (!quiet)
          message("Error: ", e$message)

        list(result = otherwise, error = e)
      },
      interrupt = function(e) {
        stop("Terminated by user", call. = FALSE)
      }
    )
  }

  safely <- function(.f, otherwise = NULL, quiet = TRUE) {
    function(...) capture_error(.f(...), otherwise, quiet)
  }

  sHEAD <- safely(httr::HEAD)
  sGET <- safely(httr::GET)

  # Try HEAD first since it's lightweight
  res <- sHEAD(x, ...)

  if (is.null(res$result) || 
      ((httr::status_code(res$result) %/% 200) != 1)) {

    res <- sGET(x, ...)

    if (is.null(res$result)) return(NA) # or whatever you want to return on "hard" errors

    if (((httr::status_code(res$result) %/% 200) != 1)) {
      if (!quiet) warning(sprintf("Requests for [%s] responded but without an HTTP status code in the 200-299 range", x))
      return(non_2xx_return_value)
    }

    return(TRUE)

  } else {
    return(TRUE)
  }

}

### Create URL list
some_urls <- c("https://www.psychologytoday.com/us/therapists/new-york/a?page=10", # Exists
               "https://www.psychologytoday.com/us/therapists/new-york/a?page=4", # Exists
               "https://www.psychologytoday.com/us/therapists/new-york/a?page=140", # Does not exist
               "https://www.psychologytoday.com/us/therapists/new-york/a?page=3" # Exists
)

### Check if URLs exist
data.frame(
  exists = sapply(some_urls, url_exists, USE.NAMES = FALSE),
  some_urls,
  stringsAsFactors = FALSE
) %>% dplyr::as_tibble() %>% print()  # tbl_df() is deprecated since dplyr 1.0.0

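Result: this solution returns "FALSE" for every URL, even though 3 of the 4 URLs do exist!

Please let me know if you have any suggestions! I would greatly appreciate any advice or recommendations. Thank you!

Best answer

Both solutions are based on libcurl. The default user agent of httr includes the versions of libcurl, r-curl, and httr. You can check it in verbose mode: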

> httr::HEAD(some_urls[1], httr::verbose())
-> HEAD /us/therapists/new-york/a?page=10 HTTP/2
-> Host: www.psychologytoday.com
-> user-agent: libcurl/7.68.0 r-curl/4.3.2 httr/1.4.3    <<<<<<<<< Here is the problem. I think the site disallows webscraping. You need to check the related robots.txt file(s).
-> accept-encoding: deflate, gzip, br
-> cookie: summary_id=62e1a40279e4c
-> accept: application/json, text/xml, application/xml, */*
-> 
<- HTTP/2 403 
<- date: Wed, 27 Jul 2022 20:56:28 GMT
<- content-type: text/html; charset=iso-8859-1
<- server: Apache/2.4.53 (Amazon)
<- 
Response [https://www.psychologytoday.com/us/therapists/new-york/a?page=10]
  Date: 2022-07-27 20:56
  Status: 403
  Content-Type: text/html; charset=iso-8859-1
<EMPTY BODY>

You can set the user agent header for each function call. I do not know of a way to do this through a global option in this case:

> user_agent <- httr::user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36")
> httr::HEAD(some_urls[1], user_agent, httr::verbose())

-> HEAD /us/therapists/new-york/a?page=10 HTTP/2
-> Host: www.psychologytoday.com
-> user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36
-> accept-encoding: deflate, gzip, br
-> cookie: summary_id=62e1a40279e4c
-> accept: application/json, text/xml, application/xml, */*
-> 
<- HTTP/2 200 
<- date: Wed, 27 Jul 2022 21:01:07 GMT
<- content-type: text/html; charset=utf-8
<- server: Apache/2.4.54 (Amazon)
<- x-powered-by: PHP/7.0.33
<- content-language: en-US
<- x-frame-options: SAMEORIGIN
<- expires: Wed, 27 Jul 2022 22:01:07 GMT
<- cache-control: private, max-age=3600
<- last-modified: Wed, 27 Jul 2022 21:01:07 GMT
<- set-cookie: search-language=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; secure; HttpOnly

NOTE: bunch of set-cookie deleted here

<- set-cookie: search-language=deleted; expires=Thu, 01-Jan-1970 00:00:01 GMT; Max-Age=0; path=/; secure; HttpOnly
<- via: 1.1 ZZ
<- 
Response [https://www.psychologytoday.com/us/therapists/new-york/a?page=10]
  Date: 2022-07-27 21:01
  Status: 200
  Content-Type: text/html; charset=utf-8
<EMPTY BODY>
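For a session-wide setting, httr does offer `set_config()`, which may be what is wanted here. The `options(HTTPUserAgent = ...)` call in the question is read by base R's own download functions, whereas httr sends its own default user agent unless told otherwise. The following is a minimal sketch that has not been tested against this site; `ua_string` is just an illustrative variable name.

```r
# Browser-like user agent string, copied from the call above.
ua_string <- paste(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
  "AppleWebKit/537.36 (KHTML, like Gecko)",
  "Chrome/99.0.4844.84 Safari/537.36"
)

if (requireNamespace("httr", quietly = TRUE)) {
  # Applies to every subsequent httr request in this R session.
  httr::set_config(httr::user_agent(ua_string))
}

# Later calls no longer need the user agent argument, for example:
# httr::status_code(httr::HEAD(some_urls[1]))
# httr::reset_config() restores httr's defaults.
```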

Note: I did not investigate RCurl's url.exists. You need to make sure, one way or another, that it uses the right user agent string.

In short, without verbose:
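One thing that could be tried for RCurl is passing the curl option `useragent` through the `.opts` argument of `url.exists()`. This is an untested sketch: the wrapper name `check_with_rcurl()` is hypothetical, and whether the site then answers the request successfully is an assumption.

```r
# Browser-like user agent string, same value as in the httr example.
ua_string <- paste(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
  "AppleWebKit/537.36 (KHTML, like Gecko)",
  "Chrome/99.0.4844.84 Safari/537.36"
)

# Hypothetical wrapper: forwards the user agent to libcurl via RCurl.
check_with_rcurl <- function(url, ua = ua_string) {
  RCurl::url.exists(url, .opts = list(useragent = ua))
}

# Usage (needs network access and the RCurl package):
# check_with_rcurl("https://www.psychologytoday.com/us/therapists/new-york/a?page=3")
```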

> user_agent <- httr::user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36")
> (httr::status_code(httr::HEAD(some_urls[1], user_agent)) %/% 200) == 1
[1] TRUE
>

I think you can write your own solution from here.

Regarding r - Check whether URLs "exist" in R, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/73142972/
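One caveat when building on this: httr follows redirects by default, so a page number that is out of range may still come back with status 200 after the site redirects to the first page, as the question describes. A minimal sketch of a checker that also compares the final URL with the requested one follows. The helper names `is_2xx()`, `same_url()`, and `page_exists()` are hypothetical, and the sketch assumes the site keeps redirecting out-of-range pages in the way the question reports.

```r
# Browser-like user agent string, copied from the answer.
ua_string <- paste(
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
  "AppleWebKit/537.36 (KHTML, like Gecko)",
  "Chrome/99.0.4844.84 Safari/537.36"
)

# TRUE only for status codes 200-299.
# (status %/% 200 == 1 would also accept 300-399.)
is_2xx <- function(status) {
  !is.na(status) & status >= 200 & status < 300
}

# TRUE when the final URL equals the requested one,
# ignoring a trailing slash.
same_url <- function(requested, final) {
  identical(sub("/+$", "", requested), sub("/+$", "", final))
}

# Returns TRUE, FALSE, or NA (request failed, e.g. timeout).
page_exists <- function(url, ...) {
  resp <- tryCatch(
    httr::HEAD(url, httr::user_agent(ua_string), ...),
    error = function(e) NULL
  )
  if (is.null(resp)) return(NA)
  # resp$url holds the URL that was actually reached after redirects.
  is_2xx(httr::status_code(resp)) && same_url(url, resp$url)
}

# Usage (needs network access and the httr package):
# vapply(some_urls, page_exists, logical(1), USE.NAMES = FALSE)
```

If the site answers HEAD requests without redirecting, the same comparison could be tried with `httr::GET()` instead.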
