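I want to tidy some data that, unfortunately, has two column headers set in its first two rows:
Row 1 (header): is actually the type of measure (e.g. estimate, standard error, upper bound, lower bound).
Row 2 (also a header): is the year of the measure.
Is there any way to use gather() or some other method to tidy this data?
Also, when a measure is repeated (e.g. Rank, Rank.1), it should really just read Rank and differ only by year. Is there a way to fix this?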
  Country_Territory WBCode Estimate  StdErr NumSrc    Rank   Lower
1              Year   <NA>  1996.00 1996.00   1996 1996.00 1996.00
2           Andorra    ADO     1.32    0.48      1   87.10   72.04
3       Afghanistan    AFG    -1.29    0.34      2    4.30    0.00
4            Angola    AGO    -1.17    0.26      4    9.68    0.54
    Upper Estimate.1 StdErr.1 NumSrc.1  Rank.1 Lower.1 Upper.1
1 1996.00    1998.00  1998.00     1998 1998.00 1998.00 1998.00
2   96.77       1.38     0.46        1   89.18   74.74   96.91
3   27.42      -1.18     0.33        2    9.79    0.00   31.44
4   27.42      -1.41     0.21        6    1.55    0.00   13.40
Here is the code to enter my example data:
df <- data.frame(stringsAsFactors = FALSE,
  Country_Territory = c("Year", "Andorra", "Afghanistan", "Angola"),
  WBCode = c(NA, "ADO", "AFG", "AGO"),
  Estimate = c(1996, 1.32, -1.29, -1.17),
  StdErr = c(1996, 0.48, 0.34, 0.26),
  NumSrc = c(1996, 1, 2, 4),
  Rank = c(1996, 87.1, 4.3, 9.68),
  Lower = c(1996, 72.04, 0, 0.54),
  Upper = c(1996, 96.77, 27.42, 27.42),
  Estimate = c(1998, 1.38, -1.18, -1.41),
  StdErr = c(1998, 0.46, 0.33, 0.21),
  NumSrc = c(1998, 1, 2, 6),
  Rank = c(1998, 89.18, 9.79, 1.55),
  Lower = c(1998, 74.74, 0, 0),
  Upper = c(1998, 96.91, 31.44, 13.4)
)
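As a side note on the Rank / Rank.1 naming: the suffixes most likely come from R itself rather than from the source file, because data.frame() (like read.csv()) makes duplicate column names unique by default via check.names = TRUE. A minimal sketch, using a hypothetical two-column data frame called demo, shows the effect and one way to strip the suffix again:

```r
# Hypothetical mini example: two columns that are both called Rank
demo <- data.frame(Rank = c(1996, 87.1), Rank = c(1998, 89.18))

# check.names = TRUE (the default) has renamed the second column
names(demo)

# Drop a trailing ".<digits>" to recover the measure name
measure <- sub("\\.[0-9]+$", "", names(demo))
measure
```

The cleaned names are no longer unique, so they are better kept in a separate vector (or in a key column after reshaping) than assigned back as column names.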
For example, this attempt:
df %>% gather(key = measure, value = number, 3:14)
does not give me what I want:
  Country_Territory WBCode  measure  number
1              Year   <NA> Estimate 1996.00
2           Andorra    ADO Estimate    1.32
3       Afghanistan    AFG Estimate   -1.29
4            Angola    AGO Estimate   -1.17
5              Year   <NA>   StdErr 1996.00
6           Andorra    ADO   StdErr    0.48
because the years end up mixed in with Country_Territory.
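One way to see what is missing here: the Year row is really a lookup from column name to year, not an observation. A minimal base-R sketch, using a hypothetical cut-down of the data called small (only the two Rank columns), pulls that row out as a named vector and keeps the real observations separately:

```r
# Hypothetical cut-down of the sample data: one country, two Rank columns
small <- data.frame(
  Country_Territory = c("Year", "Andorra"),
  WBCode = c(NA, "ADO"),
  Rank   = c(1996, 87.1),
  Rank   = c(1998, 89.18),   # renamed to Rank.1 by data.frame()
  stringsAsFactors = FALSE
)

is_year_row <- small$Country_Territory == "Year"

# Named vector: column name -> year
year_lookup <- unlist(small[is_year_row, -(1:2)])

# The remaining rows are the actual observations
obs <- small[!is_year_row, ]
```

Any reshaping approach then has to attach year_lookup to the measure columns rather than gather the Year row along with the countries.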
Best answer
Here is one option:
library(tidyverse)

# get unique Year values and create column names (to add later)
df %>%
  filter(Country_Territory == "Year") %>%
  gather() %>%
  filter(value != "Year" & !is.na(value)) %>%
  pull(value) %>%
  unique() %>%
  paste0("Year_", .) -> col_years

# reshape data (excluding the Year row)
df %>%
  filter(Country_Territory != "Year") %>%
  gather(key, y, -Country_Territory, -WBCode) %>%
  # split e.g. "Rank.1" into measure = "Rank", v = "1";
  # unsuffixed names get v = NA (separate() warns about the missing piece)
  separate(key, c("measure", "v")) %>%
  group_by(v = ifelse(is.na(v), 0, v)) %>%
  nest() -> df_info

# join the per-year pieces side by side, then apply the Year_ column names
reduce(df_info$data, function(x, y) left_join(x, y, by = c("Country_Territory", "WBCode", "measure"))) %>%
  setNames(c("Country_Territory", "WBCode", "measure", col_years))
# # A tibble: 18 x 5
#    Country_Territory WBCode measure  Year_1996 Year_1998
#    <chr>             <chr>  <chr>        <dbl>     <dbl>
#  1 Andorra           ADO    Estimate      1.32      1.38
#  2 Afghanistan       AFG    Estimate     -1.29     -1.18
#  3 Angola            AGO    Estimate     -1.17     -1.41
#  4 Andorra           ADO    StdErr        0.48      0.46
#  5 Afghanistan       AFG    StdErr        0.34      0.33
#  6 Angola            AGO    StdErr        0.26      0.21
#  7 Andorra           ADO    NumSrc        1         1
#  8 Afghanistan       AFG    NumSrc        2         2
#  9 Angola            AGO    NumSrc        4         6
# 10 Andorra           ADO    Rank         87.1      89.2
# 11 Afghanistan       AFG    Rank          4.3       9.79
# 12 Angola            AGO    Rank          9.68      1.55
# 13 Andorra           ADO    Lower        72.0      74.7
# 14 Afghanistan       AFG    Lower         0         0
# 15 Angola            AGO    Lower         0.54      0
# 16 Andorra           ADO    Upper        96.8      96.9
# 17 Afghanistan       AFG    Upper        27.4      31.4
# 18 Angola            AGO    Upper        27.4      13.4
If you save the result of the reduce() step as df_upd, you can reshape it a little further so that Year becomes a column:
df_upd %>%
  gather(Year, value, -Country_Territory, -WBCode, -measure) %>%
  separate(Year, c("y", "Year"), convert = TRUE) %>%
  select(-y)
# # A tibble: 36 x 5
#    Country_Territory WBCode measure   Year value
#    <chr>             <chr>  <chr>    <int> <dbl>
#  1 Andorra           ADO    Estimate  1996  1.32
#  2 Afghanistan       AFG    Estimate  1996 -1.29
#  3 Angola            AGO    Estimate  1996 -1.17
#  4 Andorra           ADO    StdErr    1996  0.48
#  5 Afghanistan       AFG    StdErr    1996  0.34
#  6 Angola            AGO    StdErr    1996  0.26
#  7 Andorra           ADO    NumSrc    1996  1
#  8 Afghanistan       AFG    NumSrc    1996  2
#  9 Angola            AGO    NumSrc    1996  4
# 10 Andorra           ADO    Rank      1996 87.1
# # ... with 26 more rows
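With tidyr 1.0.0 or later, a similar long result can probably be reached more directly by treating the Year row as a lookup table and joining it back on. The sketch below is an alternative to the code above, not a restatement of it; the names years, df_long and df_wide are illustrative. The final step goes one stage further and spreads the measures into their own columns, so each row is one country in one year.

```r
library(dplyr)
library(tidyr)

# Same sample data as in the question (duplicate names become Estimate.1, ...)
df <- data.frame(stringsAsFactors = FALSE,
  Country_Territory = c("Year", "Andorra", "Afghanistan", "Angola"),
  WBCode = c(NA, "ADO", "AFG", "AGO"),
  Estimate = c(1996, 1.32, -1.29, -1.17),
  StdErr = c(1996, 0.48, 0.34, 0.26),
  NumSrc = c(1996, 1, 2, 4),
  Rank = c(1996, 87.1, 4.3, 9.68),
  Lower = c(1996, 72.04, 0, 0.54),
  Upper = c(1996, 96.77, 27.42, 27.42),
  Estimate = c(1998, 1.38, -1.18, -1.41),
  StdErr = c(1998, 0.46, 0.33, 0.21),
  NumSrc = c(1998, 1, 2, 6),
  Rank = c(1998, 89.18, 9.79, 1.55),
  Lower = c(1998, 74.74, 0, 0),
  Upper = c(1998, 96.91, 31.44, 13.4)
)

# Lookup table: one row per data column, with the year taken from the "Year" row
years <- df %>%
  filter(Country_Territory == "Year") %>%
  select(-Country_Territory, -WBCode) %>%
  pivot_longer(everything(), names_to = "key", values_to = "Year")

# Long form: one row per country, measure and year
df_long <- df %>%
  filter(Country_Territory != "Year") %>%
  pivot_longer(-c(Country_Territory, WBCode), names_to = "key", values_to = "value") %>%
  left_join(years, by = "key") %>%
  mutate(measure = sub("\\.[0-9]+$", "", key)) %>%   # "Rank.1" -> "Rank"
  select(Country_Territory, WBCode, measure, Year, value)

# One column per measure: one row per country and year
df_wide <- df_long %>%
  pivot_wider(names_from = measure, values_from = value)
```

Because the year comes from the data rather than from the position of the suffix, this should keep working if more year blocks (Rank.2, Rank.3, ...) are added, as long as every column has an entry in the Year row.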
Regarding R: using gather() to tidy data with two column headers, we found a similar question on Stack Overflow: https://stackoverflow.com/questions/52699732/