I'm trying to scrape the date, title, and review text from IMDB using the following loop:
library(rvest)
library(dplyr)
library(stringr)
library(tidyverse)
ID <- 4633694
data <- lapply(paste0('http://www.imdb.com/title/tt', ID, '/reviews?filter=prolific', 1:20),
               function(url) {
                 url %>% read_html() %>%
                   html_nodes(".review-date,.rating-other-user-rating,.title,.show-more__control") %>%
                   html_text() %>%
                   gsub('[\r\n\t]', '', .)
               })
This gives me 20 pages of review data in the following format, with the same pattern repeating:
col1
1 10/10
2 If this was..
3 14 December 2018
4 I have to say, and no...
5
6
7 10/10
8 Stan Lee Is Smiling Right Now...
9 17 December 2018
10 A movie worthy of...
11
12
13 10/10
14 the most visually stunning film I've ever seen...
15 20 December 2018
16 There's hardly anything...
17
18
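I'd like to know whether there is a way to transpose every 4 rows into separate columns, so that each attribute lines up in its own column, like this: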
   Date              Rating  Title                Review
1. 14 December 2018  10/10   If this was..        I have to...
2. 17 December 2018  10/10   Stan Lee Is...       A movie worthy...
3. 20 December 2018  10/10   the most visually..  There's hardly anything...
Best answer
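# x is assumed to be a data frame whose col1 holds the scraped strings: drop empty rows, join with ':', and start a new line before each rating
text_data <- gsub('\\b(\\d+/\\d+)\\b', '\n\\1', paste(grep('\\w', x$col1, value = TRUE), collapse = ':'))
# Parse the ':'-separated text; the ':' left at the end of each line yields the empty V5 column
read.csv(text = text_data, header = FALSE, sep = ":", strip.white = TRUE, fill = TRUE, stringsAsFactors = FALSE)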
V1 V2 V3 V4 V5
1 10/10 If this was.. 14 December 2018 I have to say, and no... NA
2 10/10 Stan Lee Is Smiling Right Now... 17 December 2018 A movie worthy of... NA
3 10/10 the most visually stunning film I've ever seen... 20 December 2018 There's hardly anything... NA
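The answer above relies on ':' never appearing inside a title or review, and on no review text containing something that looks like a rating (for example "9/10"). As an alternative sketch, not the answerer's method, the same reshaping can be done in base R by dropping the empty rows and filling a 4-column matrix row by row. The data frame `x` below is a hypothetical stand-in built from the sample rows in the question, and the names `vals` and `reviews` are illustrative only. It assumes every review contributes exactly four non-empty fields in the order rating, title, date, review.

```r
# Hypothetical stand-in for the scraped column shown in the question
x <- data.frame(col1 = c(
  "10/10", "If this was..", "14 December 2018", "I have to say, and no...", "", "",
  "10/10", "Stan Lee Is Smiling Right Now...", "17 December 2018", "A movie worthy of...", "", "",
  "10/10", "the most visually stunning film I've ever seen...", "20 December 2018",
  "There's hardly anything...", "", ""
), stringsAsFactors = FALSE)

# Keep only the entries that contain at least one word character
vals <- x$col1[grepl("\\w", x$col1)]

# Assumption: each review yields exactly 4 fields; stop early if that does not hold
stopifnot(length(vals) %% 4 == 0)

# Fill a 4-column matrix row by row, so each review becomes one row
reviews <- as.data.frame(matrix(vals, ncol = 4, byrow = TRUE),
                         stringsAsFactors = FALSE)
names(reviews) <- c("Rating", "Title", "Date", "Review")

# Reorder the columns to match the layout requested in the question
reviews <- reviews[, c("Date", "Rating", "Title", "Review")]
print(reviews)
```

If some reviews have no rating, the four-field assumption breaks and the rows will shift, so in that case it is probably safer to extract each field per review node while scraping rather than reshaping afterwards.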
Source: "r - Transpose every 4 rows into 4 separate columns" on Stack Overflow: https://stackoverflow.com/questions/54730453/