r - In R, grouping sports data that has away-team and home-team columns - a common frustration

Tags: r dplyr

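I work with sports data in R a lot, and I keep running into the same problem with dplyr::group_by() when I try to compute summary statistics. I have the following data frame, which contains projected points for each match of the World Cup group stage: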

dput(worldcup.df)
structure(list(teamA_name = c("Russia", "Egypt", "Morocco", "Portugal", 
"France", "Argentina", "Peru", "Croatia", "Costa Rica", "Germany", 
"Brazil", "Sweden", "Belgium", "Tunisia", "Colombia", "Poland", 
"Russia", "Portugal", "Uruguay", "Iran", "Denmark", "France", 
"Argentina", "Brazil", "Nigeria", "Serbia", "Belgium", "Korea Republic", 
"Germany", "England", "Japan", "Poland", "Uruguay", "Saudi Arabia", 
"Iran", "Spain", "Denmark", "Australia", "Nigeria", "Iceland", 
"Mexico", "Korea Republic", "Serbia", "Switzerland", "Japan", 
"Senegal", "Panama", "England"), teamB_name = c("Saudi Arabia", 
"Uruguay", "Iran", "Spain", "Australia", "Iceland", "Denmark", 
"Nigeria", "Serbia", "Mexico", "Switzerland", "Korea Republic", 
"Panama", "England", "Japan", "Senegal", "Egypt", "Morocco", 
"Saudi Arabia", "Spain", "Australia", "Peru", "Croatia", "Costa Rica", 
"Iceland", "Switzerland", "Tunisia", "Mexico", "Sweden", "Panama", 
"Senegal", "Colombia", "Russia", "Egypt", "Portugal", "Morocco", 
"France", "Peru", "Argentina", "Croatia", "Sweden", "Germany", 
"Brazil", "Costa Rica", "Poland", "Colombia", "Tunisia", "Belgium"
), epA = c(1.64, 0.7051, 1.1294, 1.1116, 2.1962, 1.984, 1.5765, 
1.865, 1.2845, 2.0889, 2.1384, 1.5034, 2.1706, 0.5859, 2.1741, 
1.6272, 1.4941, 2.1482, 2.2089, 0.635, 1.7694, 1.6016, 1.7816, 
2.4745, 1.0762, 1.0326, 2.198, 1.0414, 2.2583, 2.198, 1.1264, 
1.0471, 1.9565, 1.2201, 0.8364, 2.3633, 0.9337, 0.7922, 0.5665, 
1.1593, 1.5544, 0.4698, 0.4331, 1.7843, 0.8872, 0.8157, 1.3932, 
1.3932), epB = c(1.094, 2.0809, 1.6016, 1.6204, 0.6098, 0.787, 
1.1535, 0.89, 1.4405, 0.6981, 0.6576, 1.2226, 0.6304, 2.2251, 
0.6279, 1.1058, 1.2319, 0.6488, 0.5991, 2.165, 0.9756, 1.1294, 
0.9644, 0.3895, 1.6588, 1.7064, 0.608, 1.6966, 0.5597, 0.608, 
1.6046, 1.6909, 0.8105, 1.5069, 1.9266, 0.4757, 1.8163, 1.9778, 
2.2495, 1.5697, 1.1746, 2.3712, 2.4179, 0.9617, 1.8688, 1.9503, 
1.3308, 1.3308)), .Names = c("teamA_name", "teamB_name", "epA", 
"epB"), class = "data.frame", row.names = c(NA, -48L))

head(worldcup.df)
  teamA_name   teamB_name    epA    epB
1     Russia Saudi Arabia 1.6400 1.0940
2      Egypt      Uruguay 0.7051 2.0809
3    Morocco         Iran 1.1294 1.6016
4   Portugal        Spain 1.1116 1.6204
5     France    Australia 2.1962 0.6098
6  Argentina      Iceland 1.9840 0.7870

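I have computed epA and epB as the expected points for team A and team B in each match, and now I want to do a group_by() to compute the total expected points for each of the 32 teams. What I have historically done is something like this: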
asAgroupby = worldcup.df %>% 
  dplyr::group_by(teamA_name) %>%
  dplyr::summarise(totPts = sum(epA))

asBgroupby = worldcup.df %>% 
  dplyr::group_by(teamB_name) %>%
  dplyr::summarise(totPts = sum(epB))

outputdf = asAgroupby %>%
  dplyr::left_join(asBgroupby, by = c('teamA_name'='teamB_name')) %>%
  dplyr::mutate(totPts = totPts.x + totPts.y) %>%
  dplyr::select(-one_of(c('totPts.x', 'totPts.y')))

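Two separate group_by() calls, one for each of the teamA and teamB columns, followed by a left_join, then summing the columns and dropping the extra ones... yuck. And this is about as simple as the problem gets: exactly 4 columns (2 identifier columns and 2 stat columns). Since a lot of sports data has separate columns for the home and away teams, this is a common problem.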

I feel like I need one data frame with twice as many rows and half as many columns, so that I can do a single group_by(). Any help is appreciated, thanks!

Edit: worldcup.df is built by a long chain of dplyr functions joined with %>%. Bonus points if this can be done without creating a new variable, and instead just:
worldcup.df %>%
  ... 

Best answer

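Here is a tidyverse workflow that works by reshaping the data into long format. It keeps track of which teams played in the same match (game_id) and whether each was team A or team B, in case that is useful. (To be fair, this is the same basic idea as @Emil's; it is just a different workflow for getting there.)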

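library(tidyverse)  # loads dplyr, tidyr and stringr

worldcup.long <- worldcup.df %>% 
  as_tibble() %>%                      # as_data_frame() is deprecated in favour of as_tibble()
  mutate(game_id = row_number()) %>%   # one id per match
  gather(key, value, -game_id) %>%     # stack all four columns; values become character
  mutate(
    AB = str_extract(key, "A|B"),      # which side: "A" or "B"
    key = str_extract(key, "team|ep")  # which variable: "team" or "ep"
  ) %>%
  spread(key, value, convert = TRUE)   # one row per match and side; convert = TRUE restores numeric ep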

outputdf <- worldcup.long %>%
  group_by(team) %>%
  summarize(totPts = sum(ep))
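
The same long-format idea can also be written as a single pipe, which is what the question's edit asks for. The sketch below is not part of the original answer. It assumes tidyr >= 1.0.0 (for pivot_longer() and the ".value" sentinel) and dplyr >= 1.0.0 (for rename_with()). The data frame worldcup.sub is a stand-in introduced here for illustration: it holds only the six Group A matches copied from the question's data.

```r
library(dplyr)
library(tidyr)

# Illustrative subset (hypothetical name): the six Group A matches
# copied from the question's worldcup.df.
worldcup.sub <- data.frame(
  teamA_name = c("Russia", "Egypt", "Russia", "Uruguay", "Uruguay", "Saudi Arabia"),
  teamB_name = c("Saudi Arabia", "Uruguay", "Egypt", "Saudi Arabia", "Russia", "Egypt"),
  epA = c(1.64, 0.7051, 1.4941, 2.2089, 1.9565, 1.2201),
  epB = c(1.094, 2.0809, 1.2319, 0.5991, 0.8105, 1.5069),
  stringsAsFactors = FALSE
)

outputdf <- worldcup.sub %>%
  # Drop the "_name" suffix so every column is <variable><side>: teamA, teamB, epA, epB
  rename_with(~ sub("_name$", "", .x)) %>%
  # Keep track of which match each row came from
  mutate(game_id = row_number()) %>%
  # Split each column name into variable and side; ".value" makes the
  # variable part ("team", "ep") become output columns, so types are kept
  pivot_longer(
    cols = -game_id,
    names_to = c(".value", "side"),
    names_pattern = "(team|ep)(A|B)"
  ) %>%
  # Two rows per match now, so one group_by() covers both sides
  group_by(team) %>%
  summarise(totPts = sum(ep))
```

On this subset the result has one row per team. Russia's total, for example, is 1.64 + 1.4941 + 0.8105 = 3.9446. Because ".value" keeps team and ep in separate columns, ep is never coerced to character, so no convert step is needed. Swapping worldcup.sub for worldcup.df should extend this to all 32 teams.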

This post is based on a similar question found on Stack Overflow: https://stackoverflow.com/questions/50707247/
