I have a dataset with several categorical variables:
library(tidyverse)
# tibble() replaces the deprecated data_frame()
data <- tibble(
  HomeTeam = c("Team1", "Team2", "Team3", "Team4", "Team2", "Team2", "Team4",
               "Team3", "Team2", "Team1", "Team3", "Team2"),
  AwayTeam = c("Team2", "Team1", "Team4", "Team3", "Team1", "Team4", "Team1",
               "Team2", "Team3", "Team3", "Team4", "Team1"),
  HomeScore = c(10, 5, 12, 18, 17, 19, 23, 17, 34, 19, 8, 3),
  AwayScore = c(4, 16, 9, 19, 16, 4, 8, 21, 6, 5, 9, 17),
  Venue = c("Ground1", "Ground2", "Ground3", "Ground3", "Ground1", "Ground2",
            "Ground1", "Ground3", "Ground2", "Ground3", "Ground4", "Ground2"))
Essentially, I want to summarise "HomeTeam" and "AwayTeam" by count into a new table, like this:
  HomeTeam NumberOfGamesHome NumberOfGamesAway
  <chr>                <int>             <int>
1 Team1                    2                 4
2 Team2                    5                 2
3 Team3                    3                 3
4 Team4                    2                 3
My current approach needs two separate grouping steps, followed by a join of the two tables:
HomeTeamCount <- data %>%
  group_by(HomeTeam) %>%
  summarise(NumberOfGamesHome = n())
AwayTeamCount <- data %>%
  group_by(AwayTeam) %>%
  summarise(NumberOfGamesAway = n())
Desired <- left_join(HomeTeamCount, AwayTeamCount,
                     by = c("HomeTeam" = "AwayTeam"))
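In my actual dataset I have a large number of categorical variables, and following the approach above seems laborious and inefficient.
Is there a way to use dplyr to group_by several categorical variables and produce the desired output? Or perhaps data.table?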
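Best answer
Here is one possibility: use gather to reshape the data from wide to long, then group by team and sum up the number of home and away games.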
library(tidyverse)
data %>%
  gather(key, Team) %>%
  group_by(Team) %>%
  summarise(
    NumberOfGamesHome = sum(key == "HomeTeam"),
    NumberOfGamesAway = sum(key == "AwayTeam"))
## A tibble: 4 x 3
#  Team  NumberOfGamesHome NumberOfGamesAway
#  <chr>             <int>             <int>
#1 Team1                 2                 4
#2 Team2                 5                 2
#3 Team3                 3                 3
#4 Team4                 2                 3
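As a side note, tidyr's documentation now marks gather() as superseded by pivot_longer(). Below is a minimal sketch of the same idea with the newer verbs, assuming tidyr 1.0.0 or later is installed. The names games and team_counts are only illustrative, and the data is cut down to the two team columns from the question.

```r
library(dplyr)
library(tidyr)

# Same team columns as in the question; the other columns are omitted for brevity
games <- tibble(
  HomeTeam = c("Team1", "Team2", "Team3", "Team4", "Team2", "Team2", "Team4",
               "Team3", "Team2", "Team1", "Team3", "Team2"),
  AwayTeam = c("Team2", "Team1", "Team4", "Team3", "Team1", "Team4", "Team1",
               "Team2", "Team3", "Team3", "Team4", "Team1")
)

team_counts <- games %>%
  # Stack the two team columns: one row per (game, role) pair
  pivot_longer(c(HomeTeam, AwayTeam), names_to = "key", values_to = "Team") %>%
  # Count rows per team and role
  count(Team, key) %>%
  # One column per role; a team that never appears in a role gets 0
  pivot_wider(names_from = key, values_from = n, values_fill = list(n = 0L)) %>%
  rename(NumberOfGamesHome = HomeTeam, NumberOfGamesAway = AwayTeam)

print(team_counts)
```

With many categorical columns, count() and pivot_wider() do not need one sum(...) line per column; only the column selection in pivot_longer() and the final rename() would have to be adjusted.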
Update
To reflect your updated sample data with the additional columns, you can do:
data %>%
  gather(key, Team, HomeTeam, AwayTeam) %>%
  group_by(Team) %>%
  summarise(
    NumberOfGamesHome = sum(key == "HomeTeam"),
    NumberOfGamesAway = sum(key == "AwayTeam"))
## A tibble: 4 x 3
#  Team  NumberOfGamesHome NumberOfGamesAway
#  <chr>             <int>             <int>
#1 Team1                 2                 4
#2 Team2                 5                 2
#3 Team3                 3                 3
#4 Team4                 2                 3
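Since the question also mentions data.table, here is a rough sketch of the same reshape-then-count idea using melt() and dcast(). It is not part of the original answer; the names dt, long and counts are illustrative, and it assumes the data.table package is installed.

```r
library(data.table)

# Team and venue columns from the question; the score columns are omitted for brevity
dt <- data.table(
  HomeTeam = c("Team1", "Team2", "Team3", "Team4", "Team2", "Team2", "Team4",
               "Team3", "Team2", "Team1", "Team3", "Team2"),
  AwayTeam = c("Team2", "Team1", "Team4", "Team3", "Team1", "Team4", "Team1",
               "Team2", "Team3", "Team3", "Team4", "Team1"),
  Venue = c("Ground1", "Ground2", "Ground3", "Ground3", "Ground1", "Ground2",
            "Ground1", "Ground3", "Ground2", "Ground3", "Ground4", "Ground2")
)

# Wide to long: only the two team columns are stacked, Venue stays as an id column
long <- melt(dt, measure.vars = c("HomeTeam", "AwayTeam"),
             variable.name = "key", value.name = "Team")

# Long to wide again, counting rows per team and role
counts <- dcast(long, Team ~ key, fun.aggregate = length, value.var = "Team")
setnames(counts, c("HomeTeam", "AwayTeam"),
         c("NumberOfGamesHome", "NumberOfGamesAway"))

print(counts)
```

Listing the columns in measure.vars plays the same role as naming HomeTeam and AwayTeam in gather(): the remaining columns are kept out of the count.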
Regarding "r - counting across multiple grouping variables with dplyr", the original question is on Stack Overflow: https://stackoverflow.com/questions/53455485/