我正在寻找一种方法,在 id
组内,计算数据 datatbl
中 TF
值偏移的唯一出现次数。
我想从 TF
在 1
和 0
或 o
和1
。计数将存储在新变量 PM##
中,以便 PM##
保存 TF
中的每个唯一移位,在加号和减号。下面的 MWE 导致了晚上 7 点的结果,但我的生产数据可以有 15 个或更多类次。如果 TF
值在 NA
之间没有变化,我想将其标记为 0
。
这个问题类似于a question I previously asked ,但关于 TF
独立的最后一部分是新的。两者 Uwe和 Psidom使用 data.table
here 为最初的问题提供了优雅的答案并使用 tidyverse
here . after conferencing with Uwe ,我发布了我的问题的这个稍微修改过的版本。
If this question violates any SO policies please let me know and I'll be happy to reopen my initial question or append this an bounty-issue.
用一个最小的工作示例 来说明我的问题。我有这样的数据,
我有什么,
# install.packages(c("tidyverse"), dependencies = TRUE)
library(tibble)
tbl <- tibble(id = c(rep(10L, 17L), rep(0L, 13L), rep(1L, 10L)),
TF = c(NA, NA, 0, NA, 0, NA, 1, 1, 1, 1, 1, NA, 1, 0, 1, 0, 1, NA, 0L, NA, 0L,
0L, 1L, 1L, 1L, 0L, 0L, NA, NA, 0L, NA, 0L, 0L, 0L, 1L, 1L, 1L, 0L, NA, 1L))
tbl %>% print(n=18)
#> # A tibble: 40 x 2
#> id TF
#> <int> <dbl>
#> 1 10 NA
#> 2 10 NA
#> 3 10 0
#> 4 10 NA
#> 5 10 0
#> 6 10 NA
#> 7 10 1
#> 8 10 1
#> 9 10 1
#> 10 10 1
#> 11 10 1
#> 12 10 NA
#> 13 10 1
#> 14 10 0
#> 15 10 1
#> 16 10 0
#> 17 10 1
#> 18 0 NA
#> # ... with 22 more rows
我想得到什么,
tblPM <- structure(list(id = c(10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L,
10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 10L, 0L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 1L,
1L, 1L, 1L), TF = c(NA, NA, 0, NA, 0, NA, 1, 1, 1, 1, 1,
NA, 1, 0, 1, 0, 1, NA, 0, NA, 0, 0, 1, 1, 1, 0, 0,
NA, NA, 0, NA, 0, 0, 0, 1, 1, 1, 0, NA, 1), PM01 = c(NA,
NA, 0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, 0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -3L,
-2L, -1L, 1L, 2L, 3L, NA, NA, NA), PM02 = c(NA, NA, NA, NA, 0L,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -2L,
-1L, 1L, 2L, 3L, NA, NA, NA, NA, NA, NA, NA, NA, NA, -3L, -2L,
-1L, 1L, NA, NA), PM03 = c(NA, NA, NA, NA, NA, NA, 0L, 0L, 0L,
0L, 0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -3L, -2L,
-1L, 1L, 2L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
0L), PM04 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
-1L, 1L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, 0L, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), PM05 = c(NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, -1L, 1L, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA), PM06 = c(NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, -1L, 1L, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA), PM07 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, -1L, 1L, NA, NA, NA, NA, NA, NA, NA,
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA
)), .Names = c("id", "TF", "PM01", "PM02", "PM03", "PM04", "PM05",
"PM06", "PM07"), class = c("tbl_df",
"tbl", "data.frame"), row.names = c(NA, -40L
))
tblPM %>% print(n=18)
#> # A tibble: 40 x 9
#> id TF PM01 PM02 PM03 PM04 PM05 PM06 PM07
#> <int> <dbl> <int> <int> <int> <int> <int> <int> <int>
#> 1 10 NA NA NA NA NA NA NA NA
#> 2 10 NA NA NA NA NA NA NA NA
#> 3 10 0 0 NA NA NA NA NA NA
#> 4 10 NA NA NA NA NA NA NA NA
#> 5 10 0 NA 0 NA NA NA NA NA
#> 6 10 NA NA NA NA NA NA NA NA
#> 7 10 1 NA NA 0 NA NA NA NA
#> 8 10 1 NA NA 0 NA NA NA NA
#> 9 10 1 NA NA 0 NA NA NA NA
#> 10 10 1 NA NA 0 NA NA NA NA
#> 11 10 1 NA NA 0 NA NA NA NA
#> 12 10 NA NA NA NA NA NA NA NA
#> 13 10 1 NA NA NA -1 NA NA NA
#> 14 10 0 NA NA NA 1 -1 NA NA
#> 15 10 1 NA NA NA NA 1 -1 NA
#> 16 10 0 NA NA NA NA NA 1 -1
#> 17 10 1 NA NA NA NA NA NA 1
#> 18 0 NA NA NA NA NA NA NA NA
#> # ... with 22 more rows
identical([some solution], tblPM)
#> [1] TRUE
更新 w/microbenchmark
2018-01-24 14:20:18Z,
感谢 Fierr 和 Chris 花时间梳理逻辑并提交答案。启发了我 this setup我已经计算了他们功能的小型微基准比较。我将 Fierr 的答案放入函数
tidyverse_Fierr()并将 Chris 的答案放入
dt_Chris()`(如果有人想要确切的函数,请告诉我,我会添加他们在这里。
经过一些小的调整后,当与 tblPM
匹配时,它们都相同,即
identical(tblPM, tidyverse_Fierr(tbl))
#> [1] TRUE
identical(tblPM, dt_Chris(tbl))
#> [1] TRUE
现在进行快速微基准测试,
df_test <- bind_rows(rep(list(tbl), 111))
microbenchmark::microbenchmark(tidyverse_Fierr(df_test), dt_Chris(df_test), times = 3*1)
#> Unit: milliseconds
#> expr min mean median uq max neval cld
#> tidyverse_Fierr(df_test) 19503.366 20171.268 20080.99 20505.219 20929.4489 3 b
#> dt_Chris(df_test) 199.165 233.924 203.72 251.304 298.8887 3 a
有趣的是,tidy_method 在这个 kinda similar comparison 中出现得更快.
最佳答案
这是一个脚本方法 - 考虑到每个案例的自定义处理量(TF = NA,uniqueN(TF)= 1,uniqueN(TF)= 2,我认为这可能比 dplyr 链更容易实现. 应该相当快,因为它都是基于 data.table 的。愿意接受有关如何改进的建议!
随着所需 PM 列数量的增加,这将自动扩展 - 正如我在下面评论的那样,我建议去掉列中的 0 前缀,因为在某些情况下你可能会达到 10^2..n会撞到 PM001 的列。
library(data.table)
tbl3 <- data.table(id = c(rep(10L, 17L), rep(0L, 13L), rep(1L, 10L)),
TF = c(NA, NA, 0L, NA, 0L, NA, 1L, 1L, 1L, 1L, 1L, NA, 1L, 0L, 1L, 0L, 1L, NA, 0L, NA, 0L,
0L, 1L, 1L, 1L, 0L, 0L, NA, NA, 0L, NA, 0L, 0L, 0L, 1L, 1L, 1L, 0L, NA, 1L))
# create index to untimately join back to
tbl3[, row_idx := .I]
# all transformations on a replicated data.table
tbl3_tmp <- copy(tbl3)
# identify where the NA breaks occur - this splits each id into subgroups (id_group)
tbl3_tmp[, P_TF := shift(TF, 1, "lag", fill = NA), by = .(id)]
tbl3_tmp[, TF_break := is.na(TF) | is.na(P_TF)]
tbl3_tmp[, id_group := cumsum(TF_break), by = .(id)]
tbl3_tmp[, `:=`(TF_break = NULL, P_TF = NULL)] # above can be consolidated to one line which would make this line unneccesary - expanded for easier understanding
tbl3_tmp <- tbl3_tmp[!is.na(TF)] # NA rows can be safely ignored now - these will be all NA, and will be handled with the left join below
# find where subpatterns exist (runs of 0..1 or 1..0)
tbl3_tmp[, subpattern_break := TF != shift(TF, 1, "lag", fill = NA), by = .(id, id_group)]
tbl3_tmp[, subbreaks := sum(subpattern_break, na.rm = TRUE), by = .(id, id_group)] # if there are no breaks, we need to treat separately
# two cases: zero subbreaks and multiple subbreaks.
tbl3_zeros <- tbl3_tmp[subbreaks == 0]
tbl3_nonzeros <- tbl3_tmp[subbreaks > 0]
# for 1+ subbreaks, we need to double the rows - this allows us to easily create the PM_field both "forwards" and "backwards"
tbl3_nonzeros[is.na(subpattern_break), subpattern_break := TRUE]
tbl3_nonzeros[, subbreak_index := cumsum(subpattern_break), by = .(id, id_group)]
tbl3_nonzeros <- rbindlist(list(tbl3_nonzeros,tbl3_nonzeros), idcol = "base") # double the row
tbl3_nonzeros[base == 1 & subbreak_index %% 2 == 1, subbreak_index := subbreak_index + 1L] # round to nearest even
tbl3_nonzeros[base == 2 & subbreak_index %% 2 == 0, subbreak_index := subbreak_index + 1L] # round to nearest odd
# this creates an index when the subbreak starts - allows us to sequence PM properly
tbl3_nonzeros[,subbreak_start := min(row_idx), by = .(id, id_group, subbreak_index)]
# exclude the ends if there is only one unique TF value - might be able to get this to one line
tbl3_nonzeros[, TF_count := uniqueN(TF), by = .(id, id_group, subbreak_index)]
tbl3_nonzeros <- tbl3_nonzeros[TF_count > 1]
# create a 1..N column, subtract the index where the break occurs ,then add 1 to all 0+ values.
tbl3_nonzeros[,PM_field := 1:.N, by = .(id, id_group, subbreak_index)]
tbl3_nonzeros[, PM_field := PM_field - PM_field[which(diff(TF)!=0)[1]+1], by = .(id, id_group, subbreak_index)]
tbl3_nonzeros[PM_field >= 0, PM_field := PM_field + 1L] # base 1 after the break
# create subbreaks for zero groups
tbl3_zeros[,subbreak_start := min(row_idx), by = .(id, id_group)]
# bring zero and non zero case together
tbl3_zeros <- tbl3_zeros[, .(id, id_group, subbreak_start,row_idx = row_idx, PM_field = 0L)]
tbl3_nonzeros <- tbl3_nonzeros[,.(id, id_group, subbreak_start, row_idx, PM_field)]
tbl3_tmp <- rbindlist(list(tbl3_zeros, tbl3_nonzeros))
# Create header
tbl3_tmp <- tbl3_tmp[order(subbreak_start, PM_field)]
tbl3_tmp[, PM_header := paste0("PM0",cumsum(c(1,diff(subbreak_start)!=0)),sep = ""), by = .(id)] # I would remove 0 in PM0 here (kept for identical check)- inefficient to check if this will be 1, 2, 3 etc digits This could also be solved with; `paste0("PM", sprintf("%02d", cumsum(c(1, diff(subbreak_start) != 0))))`
# long to wide
tbl3_tmp <- dcast(tbl3_tmp, row_idx ~ PM_header, value.var = "PM_field", fun.aggregate = sum, fill = NA)
# merge back to initial dataframe
tblPM_frombase <- merge(tbl3, tbl3_tmp, by = "row_idx", all.x = TRUE)[, row_idx := NULL]
identical(tblPM, tblPM_frombase)
[1] TRUE
关于r - 在组内计算值变化前后的值,为每个独特的转变生成新变量,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/48306565/