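I have a df with three columns: the first contains unique IDs, the second numeric values, and the last dates and times in POSIXct format. See below.
I tried to provide the first 10 observations using dput; I hope this works. Note that I had to remove .internal.selfref etc.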
> dput(bp2s[1:10])
structure(list(HADM_ID = c(100210L, 100210L, 100210L, 100210L,
100210L, 100210L, 100210L, 100210L, 100210L, 100210L), VALUE = c(112L,
120L, 121L, 112L, 106L, 109L, 80L, 89L, 85L, 99L), time = structure(c(5976682620,
5976684000, 5976687600, 5976691200, 5976694800, 5976698400, 5976785280,
5976788400, 5976790200, 5976792000), tzone = "", class = c("POSIXct",
"POSIXt"))), row.names = c(NA, -10L), class = c("data.table",
"data.frame"))
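Now I want to check each ID separately and see whether VALUE stays below 100 for at least half an hour at any point. The result should be stored as a logical value in a new column ("BELOW"): TRUE if VALUE stays < 100 for half an hour or longer, FALSE otherwise. The flag should be placed in the row where the run starts. The result should look like this: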
HADM_ID VALUE TIME BELOW
1: 100210 92 2159-05-26 08:40:00 TRUE
2: 100210 98 2159-05-26 09:00:00 FALSE
3: 100210 105 2159-05-26 09:12:00 FALSE
4: 100889 92 2166-08-15 14:50:00 FALSE
5: 100889 98 2166-08-15 15:00:00 FALSE
6: 100889 101 2166-08-15 15:15:00 FALSE
7: 100520 89 2133-02-03 14:15:00 TRUE
8: 100520 102 2133-02-03 15:15:00 FALSE
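Using the updated tidyverse proposal: the time difference works, and the first time the value stays < 100 for 30+ minutes is detected correctly (observation 7). Note, however, that the next two observations (8 and 9) are also < 100 and stay < 100, yet they are marked FALSE.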
HADM_ID VALUE time time_diff grp below
<int> <int> <dttm> <drtn> <int> <lgl>
1 100210 112 2159-05-24 15:37:00 23 mins 1 FALSE
2 100210 120 2159-05-24 16:00:00 60 mins 2 FALSE
3 100210 121 2159-05-24 17:00:00 60 mins 3 FALSE
4 100210 112 2159-05-24 18:00:00 60 mins 4 FALSE
5 100210 106 2159-05-24 19:00:00 60 mins 5 FALSE
6 100210 109 2159-05-24 20:00:00 1448 mins 6 FALSE
7 100210 80 2159-05-25 20:08:00 52 mins 6 TRUE
8 100210 89 2159-05-25 21:00:00 30 mins 6 FALSE #should be TRUE since Value stays under 100 for next 30 min
9 100210 85 2159-05-25 21:30:00 30 mins 6 FALSE #should be TRUE see above
10 100210 99 2159-05-25 22:00:00 30 mins 6 FALSE
Using the updated data.table solution I get the following result, which looks accurate.
HADM_ID VALUE time timenext grp BELOW
1: 100210 112 2159-05-24 15:37:00 2159-05-24 15:37:00 1 FALSE
2: 100210 120 2159-05-24 16:00:00 2159-05-24 16:00:00 1 FALSE
3: 100210 121 2159-05-24 17:00:00 2159-05-24 17:00:00 1 FALSE
4: 100210 112 2159-05-24 18:00:00 2159-05-24 18:00:00 1 FALSE
5: 100210 106 2159-05-24 19:00:00 2159-05-24 19:00:00 1 FALSE
6: 100210 109 2159-05-24 20:00:00 2159-05-24 20:00:00 1 FALSE
7: 100210 80 2159-05-25 20:08:00 2159-05-25 21:00:00 2 TRUE
8: 100210 89 2159-05-25 21:00:00 2159-05-25 21:30:00 2 TRUE
9: 100210 85 2159-05-25 21:30:00 2159-05-25 22:00:00 2 TRUE
10: 100210 99 2159-05-25 22:00:00 2159-05-25 22:30:00 2 TRUE
11: 100210 89 2159-05-25 22:30:00 2159-05-25 23:00:00 2 FALSE
12: 100210 102 2159-05-25 23:00:00 2159-05-25 23:00:00 3 FALSE
Regards
Best answer
Try this data.table solution:
library(data.table)
bp <- setDT(structure(list(HADM_ID = c(100210L, 100210L, 100210L, 100889L, 100889L, 100889L, 100520L, 100520L), VALUE = c(92L, 98L, 105L, 92L, 98L, 101L, 89L, 102L), time = structure(c(5976852000, 5976853200, 5976853920, 6204797100, 6204798000, 6204798900, 5146744500, 5146748100), class = c("POSIXct", "POSIXt"), tzone = "")), class = c("data.table", "data.frame"), row.names = c(NA, -8L)))
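# timenext: time of the next observation within the same ID (the last row keeps its own time)
bp[, timenext := shift(time, type = "lead", fill = time[.N]),
   by = HADM_ID
# grp: a new group starts at every observation with VALUE >= 100
][, grp := cumsum(VALUE >= 100),
   by = .(HADM_ID)
# BELOW: more than 30 minutes (strict `>`) lie between this row and the end of its group
][, BELOW := as.numeric(max(timenext, na.rm = TRUE) - time, units = "mins") > 30,
   by = .(HADM_ID, grp)
# drop the helper columns
][, c("timenext", "grp") := NULL ]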
# HADM_ID VALUE time BELOW
# <int> <int> <POSc> <lgcl>
# 1: 100210 92 2159-05-26 08:40:00 TRUE
# 2: 100210 98 2159-05-26 09:00:00 FALSE
# 3: 100210 105 2159-05-26 09:12:00 FALSE
# 4: 100889 92 2166-08-15 14:45:00 FALSE
# 5: 100889 98 2166-08-15 15:00:00 FALSE
# 6: 100889 101 2166-08-15 15:15:00 FALSE
# 7: 100520 89 2133-02-03 14:15:00 TRUE
# 8: 100520 102 2133-02-03 15:15:00 FALSE
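To see why only the first row of an ID is flagged, it can help to keep the helper columns instead of dropping them. The following is a minimal sketch rather than part of the original answer: the table `toy` and the column `mins_left` are hypothetical names, and the times are built in UTC (an assumption) so that the result does not depend on the local time zone.

```r
library(data.table)

# Hypothetical toy data: a single ID with the same spacing as ID 100210 above
toy <- data.table(
  HADM_ID = 1L,
  VALUE   = c(92L, 98L, 105L),
  time    = as.POSIXct(c("2159-05-26 08:40:00",
                         "2159-05-26 09:00:00",
                         "2159-05-26 09:12:00"), tz = "UTC")
)

# Time of the next observation per ID; the last row keeps its own time
toy[, timenext := shift(time, type = "lead", fill = time[.N]), by = HADM_ID]

# Group counter that increases at every observation with VALUE >= 100
toy[, grp := cumsum(VALUE >= 100), by = HADM_ID]

# Minutes from each row to the latest `timenext` in its group
toy[, mins_left := as.numeric(max(timenext) - time, units = "mins"),
    by = .(HADM_ID, grp)]

print(toy)
```

Here `mins_left` comes out as 32, 12 and 0 minutes, so only the first row passes the `> 30` comparison.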
Updated for the new data:
bp <- setDT(structure(list(HADM_ID = c(100210L, 100210L, 100210L, 100210L, 100210L, 100210L, 100210L, 100210L, 100210L, 100210L), VALUE = c(112L, 120L, 121L, 112L, 106L, 109L, 80L, 89L, 85L, 99L), time = structure(c(5976704220, 5976705600, 5976709200, 5976712800, 5976716400, 5976720000, 5976806880, 5976810000, 5976811800, 5976813600), class = c("POSIXct", "POSIXt"), tzone = "")), class = c("data.table", "data.frame"), row.names = c(NA, -10L)))
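# timenext: rows with VALUE >= 100 keep their own time; the others take the time of the next observation
bp[, timenext := fifelse(VALUE >= 100, time, shift(time, type = "lead", fill = time[.N])),
   by = HADM_ID
# grp: run id that changes whenever VALUE crosses the 100 threshold
][, grp := rleid(VALUE < 100), by = .(HADM_ID)
# BELOW: the value is below 100 and the run lasts more than 30 minutes (strict `>`) from this row
][, BELOW := VALUE < 100 & as.numeric(max(timenext, na.rm = TRUE) - time, units = "mins") > 30,
   by = .(HADM_ID, grp)
]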
# HADM_ID VALUE time timenext grp BELOW
# <int> <int> <POSc> <POSc> <int> <lgcl>
# 1: 100210 112 2159-05-24 15:37:00 2159-05-24 15:37:00 1 FALSE
# 2: 100210 120 2159-05-24 16:00:00 2159-05-24 16:00:00 1 FALSE
# 3: 100210 121 2159-05-24 17:00:00 2159-05-24 17:00:00 1 FALSE
# 4: 100210 112 2159-05-24 18:00:00 2159-05-24 18:00:00 1 FALSE
# 5: 100210 106 2159-05-24 19:00:00 2159-05-24 19:00:00 1 FALSE
# 6: 100210 109 2159-05-24 20:00:00 2159-05-24 20:00:00 1 FALSE
# 7: 100210 80 2159-05-25 20:08:00 2159-05-25 21:00:00 2 TRUE
# 8: 100210 89 2159-05-25 21:00:00 2159-05-25 21:30:00 2 TRUE
# 9: 100210 85 2159-05-25 21:30:00 2159-05-25 22:00:00 2 FALSE
# 10: 100210 99 2159-05-25 22:00:00 2159-05-25 22:00:00 2 FALSE
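The question asks for "at least half an hour", while the code above uses a strict `> 30`, which is why row 9 (exactly 30 minutes to the end of its run) is FALSE. The sketch below wraps the updated approach in a function so both readings can be compared. It is an illustration under stated assumptions, not the answerer's code: `flag_below`, `threshold`, `min_minutes`, `inclusive`, `mins_left` and `bp10` are hypothetical names, and the times are entered in UTC to keep the result independent of the local time zone.

```r
library(data.table)

# Hypothetical wrapper around the updated solution.
# inclusive = FALSE reproduces the strict `> 30` above; TRUE uses `>=` ("at least").
flag_below <- function(dt, threshold = 100, min_minutes = 30, inclusive = FALSE) {
  out <- copy(dt)  # work on a copy so the input is not modified by reference
  out[, timenext := fifelse(VALUE >= threshold, time,
                            shift(time, type = "lead", fill = time[.N])),
      by = HADM_ID]
  out[, grp := rleid(VALUE < threshold), by = HADM_ID]
  out[, mins_left := as.numeric(max(timenext) - time, units = "mins"),
      by = .(HADM_ID, grp)]
  out[, BELOW := VALUE < threshold &
        (if (inclusive) mins_left >= min_minutes else mins_left > min_minutes)]
  out[, c("timenext", "grp", "mins_left") := NULL]
  out[]
}

# The ten observations from the update, with times taken from the printed output
bp10 <- data.table(
  HADM_ID = 100210L,
  VALUE   = c(112L, 120L, 121L, 112L, 106L, 109L, 80L, 89L, 85L, 99L),
  time    = as.POSIXct(c("2159-05-24 15:37:00", "2159-05-24 16:00:00",
                         "2159-05-24 17:00:00", "2159-05-24 18:00:00",
                         "2159-05-24 19:00:00", "2159-05-24 20:00:00",
                         "2159-05-25 20:08:00", "2159-05-25 21:00:00",
                         "2159-05-25 21:30:00", "2159-05-25 22:00:00"), tz = "UTC")
)

strict_flags    <- flag_below(bp10)$BELOW
inclusive_flags <- flag_below(bp10, inclusive = TRUE)$BELOW
```

With the strict comparison, rows 7 and 8 are TRUE, as in the output above. With `inclusive = TRUE`, row 9 becomes TRUE as well. Row 10 stays FALSE in both cases because no later observation exists to extend its run.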
Original question on Stack Overflow (r - checking values in a column over a specific time span using timestamps from another column in a data frame): https://stackoverflow.com/questions/69087944/