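I often work with sports data in R, and I keep running into the same problem with dplyr::group_by() when I try to compute summary statistics. I have the following data frame, which contains the expected points for each match in the World Cup group stage: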
dput(worldcup.df)
structure(list(teamA_name = c("Russia", "Egypt", "Morocco", "Portugal",
"France", "Argentina", "Peru", "Croatia", "Costa Rica", "Germany",
"Brazil", "Sweden", "Belgium", "Tunisia", "Colombia", "Poland",
"Russia", "Portugal", "Uruguay", "Iran", "Denmark", "France",
"Argentina", "Brazil", "Nigeria", "Serbia", "Belgium", "Korea Republic",
"Germany", "England", "Japan", "Poland", "Uruguay", "Saudi Arabia",
"Iran", "Spain", "Denmark", "Australia", "Nigeria", "Iceland",
"Mexico", "Korea Republic", "Serbia", "Switzerland", "Japan",
"Senegal", "Panama", "England"), teamB_name = c("Saudi Arabia",
"Uruguay", "Iran", "Spain", "Australia", "Iceland", "Denmark",
"Nigeria", "Serbia", "Mexico", "Switzerland", "Korea Republic",
"Panama", "England", "Japan", "Senegal", "Egypt", "Morocco",
"Saudi Arabia", "Spain", "Australia", "Peru", "Croatia", "Costa Rica",
"Iceland", "Switzerland", "Tunisia", "Mexico", "Sweden", "Panama",
"Senegal", "Colombia", "Russia", "Egypt", "Portugal", "Morocco",
"France", "Peru", "Argentina", "Croatia", "Sweden", "Germany",
"Brazil", "Costa Rica", "Poland", "Colombia", "Tunisia", "Belgium"
), epA = c(1.64, 0.7051, 1.1294, 1.1116, 2.1962, 1.984, 1.5765,
1.865, 1.2845, 2.0889, 2.1384, 1.5034, 2.1706, 0.5859, 2.1741,
1.6272, 1.4941, 2.1482, 2.2089, 0.635, 1.7694, 1.6016, 1.7816,
2.4745, 1.0762, 1.0326, 2.198, 1.0414, 2.2583, 2.198, 1.1264,
1.0471, 1.9565, 1.2201, 0.8364, 2.3633, 0.9337, 0.7922, 0.5665,
1.1593, 1.5544, 0.4698, 0.4331, 1.7843, 0.8872, 0.8157, 1.3932,
1.3932), epB = c(1.094, 2.0809, 1.6016, 1.6204, 0.6098, 0.787,
1.1535, 0.89, 1.4405, 0.6981, 0.6576, 1.2226, 0.6304, 2.2251,
0.6279, 1.1058, 1.2319, 0.6488, 0.5991, 2.165, 0.9756, 1.1294,
0.9644, 0.3895, 1.6588, 1.7064, 0.608, 1.6966, 0.5597, 0.608,
1.6046, 1.6909, 0.8105, 1.5069, 1.9266, 0.4757, 1.8163, 1.9778,
2.2495, 1.5697, 1.1746, 2.3712, 2.4179, 0.9617, 1.8688, 1.9503,
1.3308, 1.3308)), .Names = c("teamA_name", "teamB_name", "epA",
"epB"), class = "data.frame", row.names = c(NA, -48L))
head(worldcup.df)
  teamA_name   teamB_name    epA    epB
1     Russia Saudi Arabia 1.6400 1.0940
2      Egypt      Uruguay 0.7051 2.0809
3    Morocco         Iran 1.1294 1.6016
4   Portugal        Spain 1.1116 1.6204
5     France    Australia 2.1962 0.6098
6  Argentina      Iceland 1.9840 0.7870
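I have computed epA and epB as the expected points for team A and team B in each match, and now I want to use group_by() to compute the total expected points for each of the 32 teams. What I have historically done is this: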
asAgroupby = worldcup.df %>%
  dplyr::group_by(teamA_name) %>%
  dplyr::summarise(totPts = sum(epA))

asBgroupby = worldcup.df %>%
  dplyr::group_by(teamB_name) %>%
  dplyr::summarise(totPts = sum(epB))

outputdf = asAgroupby %>%
  dplyr::left_join(asBgroupby, by = c('teamA_name'='teamB_name')) %>%
  dplyr::mutate(totPts = totPts.x + totPts.y) %>%
  dplyr::select(-one_of(c('totPts.x', 'totPts.y')))
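Two separate group_by() calls, one for each of the teamA and teamB columns, then a left_join, then summing the columns and dropping the extras... yikes. And this is about as simple as the problem gets: exactly 4 columns (2 identifier columns and 2 stat columns). Since a lot of sports data has home-team and away-team columns, this is a common problem.
I feel like I need one data frame with twice as many rows and half as many columns, so that I can do a single group_by(). Any help is appreciated, thanks!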
Edit: worldcup.df is itself built by a long %>% chain of dplyr functions. Bonus points if this can be done without creating new variables, i.e. with just:
worldcup.df %>%
...
Best answer
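Here is a tidyverse workflow that works by reshaping the data into long format. It keeps track of who played in the same match (game_id) and whether they were team A or team B, in case that is useful. (To be fair, this is the same basic idea as @Emil's, just a different workflow for implementing it.)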
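library(tidyverse)  # attaches dplyr, tidyr, stringr and tibble

worldcup.long <- worldcup.df %>%
  as_tibble() %>%                       # as_data_frame() is deprecated
  mutate(game_id = row_number()) %>%    # one id per match
  gather(key, value, -game_id) %>%      # long format; value is coerced to character
  mutate(
    AB = str_extract(key, "A|B"),       # which side: "A" or "B"
    key = str_extract(key, "team|ep")   # which variable: team name or expected points
  ) %>%
  spread(key, value, convert = TRUE)    # one row per team per match; ep is numeric again

outputdf <- worldcup.long %>%
  group_by(team) %>%
  summarize(totPts = sum(ep))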
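As a possible variation on the same long-format idea, here is a minimal sketch that does the reshape and the summary in a single pipe with tidyr::pivot_longer(), assuming tidyr 1.0.0 or later is installed. The data frame worldcup.sample is a hypothetical four-match subset of the question's data (rows 1, 2, 17 and 33), used only to keep the example self-contained. The rename() step is my own addition so that the column names split cleanly into a variable part and a side part.

```r
library(dplyr)
library(tidyr)

# Hypothetical stand-in for worldcup.df: rows 1, 2, 17 and 33 of the question's data
worldcup.sample <- data.frame(
  teamA_name = c("Russia", "Egypt", "Russia", "Uruguay"),
  teamB_name = c("Saudi Arabia", "Uruguay", "Egypt", "Russia"),
  epA = c(1.64, 0.7051, 1.4941, 1.9565),
  epB = c(1.094, 2.0809, 1.2319, 0.8105),
  stringsAsFactors = FALSE
)

outputdf <- worldcup.sample %>%
  # Rename so every column is <variable><side>: teamA, teamB, epA, epB
  rename(teamA = teamA_name, teamB = teamB_name) %>%
  # ".value" keeps `team` and `ep` as separate output columns with their own
  # types; the second capture group becomes the AB column
  pivot_longer(
    cols = everything(),
    names_to = c(".value", "AB"),
    names_pattern = "^(team|ep)(A|B)$"
  ) %>%
  # Now there is one row per team per match, so a single group_by() is enough
  group_by(team) %>%
  summarise(totPts = sum(ep))

print(outputdf)
```

Because there is no join, a team that only ever appears in one of the two team columns is still counted. On the real data, replacing worldcup.sample with worldcup.df (or appending these steps to the pipe that builds it) should avoid creating any intermediate variables.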
Regarding "r - In R, grouping sports data with away and home teams - a common frustration", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/50707247/