r - How to get rid of the error: Tibble columns must have compatible sizes?

Tags: r rvest tibble

A community member helped me write the following code:

library(rvest)
library(tidyverse)

get_articles <- function(n_articles) {
  page <- paste0("https://www.theroot.com/news/criminal-justice",
                 "?startIndex=",
                 n_articles) %>%
    read_html()
  
  tibble(
    title = page %>%
      html_elements(".aoiLP .js_link") %>%
      html_text2(),
    author = page %>%
      html_elements(".llHfhX .js_link , .permalink-bylineprop") %>%
      html_text2(),
    date = page %>%
      html_elements(".js_meta-time") %>%
      html_text2(),
    url = page %>%
      html_elements(".aoiLP .js_link") %>%
      html_attr("href")
  )
}

df <- map_dfr(seq(0, 200, by = 20), get_articles)

But when I try to run it, I get the following error:

! Tibble columns must have compatible sizes.
• Size 20: Existing data.
• Size 21: Column author.
ℹ Only values of size one are recycled.

I have searched here for solutions but could not make much sense of them. Any help would be appreciated.

Best answer

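Because author in your code returns every author found at the url, and some articles have more than one author, the function returns more authors than articles (21 authors for 20 articles in the error above). Every column of a data frame or tibble must have the same number of elements.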
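A quick way to confirm this kind of mismatch is to compare the lengths returned by each selector before building the tibble. The sketch below does that on a small hypothetical page built with minimal_html(); the class names .title and .byline and the author names are made up for illustration and are not the selectors used on the real site.

```r
library(rvest)

# Hypothetical stand-in for a listing page:
# two articles, the second one with two authors.
page <- minimal_html('
  <div class="article"><h2 class="title">First</h2>
    <span class="byline">Ann</span></div>
  <div class="article"><h2 class="title">Second</h2>
    <span class="byline">Bob</span><span class="byline">Cy</span></div>
')

titles  <- page %>% html_elements(".title")  %>% html_text2()
authors <- page %>% html_elements(".byline") %>% html_text2()

# Differing lengths mean these vectors cannot be used
# as columns of the same tibble as they are.
length(titles)   # 2
length(authors)  # 3
```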

For example, this throws a similar error:

tibble::tibble(url = 1:3, author = 1:4)
#> Error: Tibble columns must have compatible sizes.
#> * Size 3: Existing data.
#> * Size 4: Column `author`.
#> i Only values of size one are recycled.
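The last line of that message refers to tibble's recycling rule. As a minimal sketch with made-up values (none of them come from the scraped site), a length-one value is recycled to fit the other columns, and a list column can hold a different number of authors in each row while the column itself still has one element per article.

```r
library(tibble)

# A length-one value is recycled to match the other columns.
ok <- tibble(url = 1:3, source = "theroot")

# A list column has exactly one element per row, but each element
# may hold any number of values (hypothetical author names here).
authors_per_article <- list("A", c("B", "C"), character(0))
nested <- tibble(url = 1:3, author = authors_per_article)

nrow(ok)
lengths(nested$author)
```

A list column like this can later be spread out with unnest_wider() or unnest_longer() from tidyr, depending on whether one row per article or one row per author is wanted.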

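One option is to push the retrieval of the author names to a later step, when you read the content of each article. Note that the 10th url links to a video with no article body, so it returns no content.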

library(rvest)
library(tidyverse)


get_articles <- function(n_articles) {
  page <- paste0("https://www.theroot.com/news/criminal-justice",
                 "?startIndex=",
                 n_articles) %>%
    read_html()
  
  tibble(
    title = page %>%
      html_elements(".aoiLP .js_link") %>%
      html_text2(),
    date = page %>%
      html_elements(".js_meta-time") %>%
      html_text2(),
    url = page %>%
      html_elements(".aoiLP .js_link") %>%
      html_attr("href")
  )
}

#df <- map_dfr(seq(0, 200, by = 20), get_articles)
df <- map_dfr(0, get_articles) #small example


df %>%
  slice(1:10) %>% # subset 10 rows for example
  mutate(html = map(url, read_html),
         content = map(html, ~ .x %>%
                         html_elements(".bOfvBY") %>%
                         html_text2() %>%
                         paste(collapse = ",")),
         author = map(html, ~ .x %>%
                        html_elements(".llHfhX .js_link , .permalink-bylineprop") %>%
                        html_text2() %>%
                        set_names(paste0('author', 1:length(.))) #name the elements, which will become column names
                      )
         ) %>%
  unnest(content) %>%
  unnest_wider(author)
#> # A tibble: 10 x 7
#>    title          date    url            html  content         author1  author2 
#>    <chr>          <chr>   <chr>          <lis> <chr>           <chr>    <chr>   
#>  1 "US Soldier S~ Today ~ https://www.t~ <xml~ "A US soldier ~ Kalyn W~ <NA>    
#>  2 "South Caroli~ Yester~ https://www.t~ <xml~ "On Tuesday, a~ Jessica~ <NA>    
#>  3 "Abortion is ~ Tuesda~ https://www.t~ <xml~ "Abortion is o~ Jessica~ <NA>    
#>  4 "Pennsylvania~ 9/02/2~ https://www.t~ <xml~ "Pennsylvania ~ Kalyn W~ <NA>    
#>  5 "UN Committee~ 9/02/2~ https://www.t~ <xml~ "The devolving~ Jessica~ <NA>    
#>  6 "DA Fani Will~ 8/30/2~ https://www.t~ <xml~ "There continu~ Murjani~ <NA>    
#>  7 "How to Prote~ 8/30/2~ https://www.t~ <xml~ "The decision ~ Jessica~ <NA>    
#>  8 "26 Alleged G~ 8/29/2~ https://www.t~ <xml~ "Twenty-six pe~ Keith R~ <NA>    
#>  9 "Judge Angere~ 8/29/2~ https://www.t~ <xml~ "Sullivan Walt~ Kalyn W~ <NA>    
#> 10 "Small Town H~ 8/27/2~ https://www.t~ <xml~ ""              Kalyn W~ Adriano~

Created on 2022-09-08 by the reprex package (v2.0.0)
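If re-reading every article is too slow, another possible approach is to scope the selectors to each article's container on the listing page. html_element() (singular) returns exactly one result per parent node, so the columns line up, and multiple authors can be collapsed into a single string. This is only a sketch on hypothetical HTML: the class names .article, .title and .byline are assumptions, and it presumes the real page wraps each article in its own container, which would need to be checked in the browser's inspector.

```r
library(rvest)
library(tibble)
library(purrr)

# Hypothetical listing page; the real site's markup will differ.
page <- minimal_html('
  <div class="article"><a class="title" href="/a">First</a>
    <span class="byline">Ann</span></div>
  <div class="article"><a class="title" href="/b">Second</a>
    <span class="byline">Bob</span><span class="byline">Cy</span></div>
')

# One node per article.
cards <- page %>% html_elements(".article")

articles <- tibble(
  # html_element() yields one value per card (NA if nothing matches).
  title  = cards %>% html_element(".title") %>% html_text2(),
  url    = cards %>% html_element(".title") %>% html_attr("href"),
  # Collect all bylines inside each card and join them into one string.
  author = map_chr(cards, ~ .x %>%
                     html_elements(".byline") %>%
                     html_text2() %>%
                     paste(collapse = ", "))
)

articles
```

Here every column has one entry per article, so the size error cannot occur, at the cost of storing co-authors in a single comma-separated string.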

This question, "r - How to get rid of the error: Tibble columns must have compatible sizes?", corresponds to a similar question found on Stack Overflow: https://stackoverflow.com/questions/73628499/
