r - 如何使用 R 在我的数据中找到最常见的序列?

标签 r sequence zoo rollapply

我正在尝试弄清楚如何使用 rollapply 函数(来自 Zoo 包)在数据集中查找最常见字符串的序列,但我还需要对某些变量进行分组(例如日期、行等)

在我继续之前,值得注意的是这个查询建立在我之前发布在这里的一个问题上:How can I find most common sequences (of strings) in my data using Tableau?

那里提供的解决方案非常有效,但我现在想将它应用于不同的数据集,这带来了一些新的挑战!这是我在这个新数据集中使用的数据示例:

structure(list(Title = c("Dragons' Den", "One Hot Summer", "Keeping Faith", 
"Cuckoo", "Match of the Day", "Sportscene", "Sportscene", "The Irish League Show", 
"Match of the Day", "EastEnders", "Dragons' Den", "Fake or Fortune?", 
"Asian Provocateur", "In The Flesh", "Two Pints of Lager and a Packet of Crisps", 
"Travels in Trumpland with Ed Balls", "Hidden", "Train Surfing Wars: A Matter of Life and Death", 
"Bollywood: The World's Biggest Film Industry", "One Hot Summer", 
"Asian Provocateur", "In The Flesh", "Two Pints of Lager and a Packet of Crisps", 
"Travels in Trumpland with Ed Balls", "EastEnders", "Match of the Day", 
"Dragons' Den", "The Next Step", "Doctor Who Series 11 Trailer", 
"Doctor Who", "Doctor Who", "Doctor Who", "Picnic at Hanging Rock", 
"Sylvia", "Keeping Faith", "Cardinal: Blackfly Season", "Picnic at Hanging Rock", 
"Age Before Beauty", "One Hot Summer", "Stewart Lee's Comedy Vehicle", 
"Asian Provocateur", "In The Flesh", "Two Pints of Lager and a Packet of Crisps", 
"Travels in Trumpland with Ed Balls", "EastEnders", "Age Before Beauty", 
"Holby City", "Who Do You Think You Are?", "Louis Theroux: Dark States", 
"Louis Theroux: Dark States", "Louis Theroux", "Louis Theroux's Weird Weekends", 
"Picnic at Hanging Rock", "Sylvia", "Keeping Faith", "Cardinal: Blackfly Season"
), Programme_Genre = c("Entertainment", "Documentary", "Drama", 
"New SeriesComedy", "Sport", "Sport", "Sport", "Sport", "Sport", 
"Drama", "Entertainment", "Documentary", "Comedy", "Drama", "Comedy", 
"Documentary", "Crime Drama", "Documentary", "Documentary", "Documentary", 
"Comedy", "Drama", "Comedy", "Documentary", "Drama", "Sport", 
"Entertainment", "CBBC", "Sci-Fi", "Sci-Fi", "Sci-Fi", "Sci-Fi", 
"Drama", "Film", "Drama", "Crime Drama", "On Now", "Drama", "Documentary", 
"Comedy", "Comedy", "Drama", "Comedy", "Documentary", "Drama", 
"Drama", "Drama", "History", "Documentary", "Documentary", "Documentary", 
"Archive", "Drama", "Film", "Drama", "Crime Drama"), Programme_Category = c("Featured", 
"Featured", "Featured", "Featured", "This Weekend's Football", 
"This Weekend's Football", "This Weekend's Football", "This Weekend's Football", 
"Most Popular", "Most Popular", "Most Popular", "Most Popular", 
"Box Sets", "Box Sets", "Box Sets", "Box Sets", "Featured", "Featured", 
"Featured", "Featured", "Box Sets", "Box Sets", "Box Sets", "Box Sets", 
"Most Popular", "Most Popular", "Most Popular", "Most Popular", 
"Doctor Who S1-S10", "Doctor Who S1-S10", "Doctor Who S1-S10", 
"Doctor Who S1-S10", "Drama", "Drama", "Drama", "Drama", "Featured", 
"Featured", "Featured", "Featured", "Box Sets", "Box Sets", "Box Sets", 
"Box Sets", "Most Popular", "Most Popular", "Most Popular", "Most Popular", 
"Louis Theroux", "Louis Theroux", "Louis Theroux", "Louis Theroux", 
"Drama", "Drama", "Drama", "Drama"), date = c("13/08/2018", "13/08/2018", 
"13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", 
"13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", 
"13/08/2018", "13/08/2018", "13/08/2018", "13/08/2018", "14/08/2018", 
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", 
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", 
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", 
"14/08/2018", "14/08/2018", "14/08/2018", "14/08/2018", "15/08/2018", 
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", 
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", 
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018", 
"15/08/2018", "15/08/2018", "15/08/2018", "15/08/2018"), column = c("1", 
"2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", 
"3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", 
"4", "1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4", 
"1", "2", "3", "4", "1", "2", "3", "4", "1", "2", "3", "4", "1", 
"2", "3", "4"), row = c("1", "1", "1", "1", "2", "2", "2", "2", 
"3", "3", "3", "3", "4", "4", "4", "4", "1", "1", "1", "1", "2", 
"2", "2", "2", "3", "3", "3", "3", "4", "4", "4", "4", "5", "5", 
"5", "5", "1", "1", "1", "1", "2", "2", "2", "2", "3", "3", "3", 
"3", "4", "4", "4", "4", "5", "5", "5", "5")), row.names = c(NA, 
-56L), class = "data.frame") 

抱歉,我不太确定共享数据的最佳做法。希望以上工作。它应该看起来像这样:

   Title            Programme_Genre     Programme_Category  date         column row
1   Dragons Den     Entertainment       Featured            13/08/2018      1   1
2  One Hot Summer   Documentary         Featured            13/08/2018      2   1
3  Keeping Faith    Drama               Featured            13/08/2018      3   1
4  Cuckoo           New Series Comedy   Featured            13/08/2018      4   1
5  Match of the Day Sport               This Weekends...    13/08/2018      1   2
6  Sportscene       Sport               This Weekends...    13/08/2018      2   2

我想做的是使用 rollapply 函数,类似于我在上一个问题中建议的方式(参见上面的链接),但仅用于查找出现在同一日期和跨度的序列一定范围的列。例如,我想知道最常见的流派序列(“Programme_Genre”)是什么,但我只希望 rollapply 函数在每个日期的每一行的第 1-4 列中执行此操作。我确定我没有很好地解释这一点(我不是来自数据科学背景,以防你没有猜到)所以我很乐意在必要时详细说明。提前致谢!

最佳答案

使用 tidyverse、zoo 和 lubridate,尝试:

library(tidyverse)
library(zoo)
library(lubridate)

df %>% 
  mutate(date = lubridate::dmy(date)) %>% # Optional. Properly parses date as Date class. Makes sorting easier.
  filter(column <= 4) %>% # Step 1. Exclude observations with `column` values above 4.
  group_split(row, date) %>% # Step 2. Splits the DF into smaller DFs representing row and date groups.
  # Step 3 (below). Loops the solution to the previous question, gets a DF, and assigns the date and row signals to each observation.
  map_df(.x = . ,
         .f = ~(rollapply(data = .x$Programme_Genre , 3, c) %>% 
                  as_tibble() %>% 
                  mutate(date = unique(.x$date), row = unique(.x$row)))) %>% 
  group_by_all() %>% 
  tally() %>% 
  arrange(date, row, n)

    # A tibble: 26 x 6
# Groups:   V1, V2, V3, date [26]
   V1            V2            V3               date       row       n
   <chr>         <chr>         <chr>            <date>     <chr> <int>
 1 Documentary   Drama         New SeriesComedy 2018-08-13 1         1
 2 Entertainment Documentary   Drama            2018-08-13 1         1
 3 Sport         Sport         Sport            2018-08-13 2         2
 4 Drama         Entertainment Documentary      2018-08-13 3         1
 5 Sport         Drama         Entertainment    2018-08-13 3         1
 6 Comedy        Drama         Comedy           2018-08-13 4         1
 7 Drama         Comedy        Documentary      2018-08-13 4         1
 8 Crime Drama   Documentary   Documentary      2018-08-14 1         1
 9 Documentary   Documentary   Documentary      2018-08-14 1         1
10 Comedy        Drama         Comedy           2018-08-14 2         1
# ... with 16 more rows

关于r - 如何使用 R 在我的数据中找到最常见的序列?,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/67218140/

相关文章:

r - 如何避免类名与包 "loaded via a namespace (and not attached)"(qdap&openssl)冲突

r - 如何使用列表指定排序顺序?

python - 根据序列中缺失的数字拆分列表

R,用零替换 NA

r - 根据日期计算数据表中的前几行

r - 在 levelplot 中使用布局函数

r - 应用程序内部的函数找不到在应用程序启动时加载的对象

c++ - 纠正cout执行

javascript - 如何在javascript中从左到右从字符串序列中选择数字整数

r - 在 ifelse() 语句内部和外部运行一行时的不同输出