r - Conditionally merge 2 data frames using df1's "hour" and "min" within df2's datetimes

Tags: r dataframe dplyr data.table non-equi-join

I have a data frame df.sample like this

id <- c("A","A","A","A","A","A","A","A","A","A","A")
date <- c("2018-11-12","2018-11-12","2018-11-12","2018-11-12","2018-11-12",
          "2018-11-12","2018-11-12","2018-11-14","2018-11-14","2018-11-14",
          "2018-11-12")
hour <- c(8,8,9,9,13,13,16,6,7,19,7)
min <- c(47,59,6,18,22,36,12,32,12,21,47)
value <- c(70,70,86,86,86,74,81,77,79,83,91)
df.sample <- data.frame(id,date,hour,min,value,stringsAsFactors = F) 
df.sample$date <- as.Date(df.sample$date,format="%Y-%m-%d")

I have another data frame df.state like this

id <- c("A","A","A")
starttime <- c("2018-11-12 08:59:00","2018-11-14 06:24:17","2018-11-15 09:17:00")
endtime <- c("2018-11-12 15:57:00","2018-11-14 17:22:16","2018-11-15 12:17:32")
state <- c("Pass","Pass","Pass")

df.state <- data.frame(id,starttime,endtime,state,stringsAsFactors = F) 
df.state$starttime <- as.POSIXct(df.state$starttime,format="%Y-%m-%d %H:%M:%S")
df.state$endtime <- as.POSIXct(df.state$endtime,format="%Y-%m-%d %H:%M:%S")

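I am trying to merge these two data frames based on a condition.

If the hour and min in df.sample fall within the starttime and endtime of a row in df.state (for the same id), then merge state = Pass into df.sample.

For example, row 2 of df.sample has hour = 8 and min = 59. Because that time falls within the df.state interval that begins at starttime = 2018-11-12 08:59:00, the value Pass is added.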
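
The condition above can be spelled out in base R as a small sketch. It rebuilds a timestamp from the date, hour and minute of the first two rows of df.sample and tests it against the first interval of df.state. The names sample_time and in_window are illustrative only, and the fixed tz = "UTC" is an assumption made so that both sides are compared in the same time zone.

```r
# First two rows of df.sample: 08:47 and 08:59 on 2018-11-12
date   <- c("2018-11-12", "2018-11-12")
hour   <- c(8, 8)
minute <- c(47, 59)

# Assumption: every timestamp is in one time zone (UTC here)
sample_time <- as.POSIXct(sprintf("%s %02d:%02d:00", date, hour, minute),
                          tz = "UTC")

# First interval of df.state
starttime <- as.POSIXct("2018-11-12 08:59:00", tz = "UTC")
endtime   <- as.POSIXct("2018-11-12 15:57:00", tz = "UTC")

# TRUE where the sample time lies inside [starttime, endtime]
in_window <- sample_time >= starttime & sample_time <= endtime
in_window
```

Both bounds are inclusive, which is why the 08:59 row counts as a match.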

This is my expected output

   id       date hour min value state
    A 2018-11-12    8  47    70      
    A 2018-11-12    8  59    70  Pass
    A 2018-11-12    9   6    86  Pass
    A 2018-11-12    9  18    86  Pass
    A 2018-11-12   13  22    86  Pass
    A 2018-11-12   13  36    74  Pass
    A 2018-11-12   16  12    81      
    A 2018-11-14    6  32    77  Pass
    A 2018-11-14    7  12    79  Pass
    A 2018-11-14   19  21    83      
    A 2018-11-12    7  47    91      

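I was able to merge the two data frames like this, but I could not work out how to look up the hour and min of df.sample within the starttime and endtime of df.state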

library(tidyverse)
df.sample <- df.sample %>%
  left_join(df.state)

Can anyone point me in the right direction?

Best answer

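If you happen to have large data frames, a non-equi join from the data.table package will be faster and easier (the original answer links a benchmark and a video in support of this):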

library(data.table)

## convert both data.frames to data.tables by reference
setDT(df.sample)
setDT(df.state) 

## create a `time` column in df.sample 
df.sample[, time := as.POSIXct(paste0(date, " ", hour, ":", min, ":00"))]
## change column order
setcolorder(df.sample, c("id", "time"))

# join by id and time within start & end time limits
# "x." is used so we can refer to the column in other data.table explicitly
df.state[df.sample, .(id, time, date, hour, min, value, state = x.state), 
         on = .(id, starttime <= time, endtime >= time)]
#>     id                time       date hour min value state
#>  1:  A 2018-11-12 08:47:00 2018-11-12    8  47    70  <NA>
#>  2:  A 2018-11-12 08:59:00 2018-11-12    8  59    70  Pass
#>  3:  A 2018-11-12 09:06:00 2018-11-12    9   6    86  Pass
#>  4:  A 2018-11-12 09:18:00 2018-11-12    9  18    86  Pass
#>  5:  A 2018-11-12 13:22:00 2018-11-12   13  22    86  Pass
#>  6:  A 2018-11-12 13:36:00 2018-11-12   13  36    74  Pass
#>  7:  A 2018-11-12 16:12:00 2018-11-12   16  12    81  <NA>
#>  8:  A 2018-11-14 06:32:00 2018-11-14    6  32    77  Pass
#>  9:  A 2018-11-14 07:12:00 2018-11-14    7  12    79  Pass
#> 10:  A 2018-11-14 19:21:00 2018-11-14   19  21    83  <NA>
#> 11:  A 2018-11-12 07:47:00 2018-11-12    7  47    91  <NA>

### remove NA
df.state[df.sample, .(id, time, date, hour, min, value, state = x.state), 
         on = .(id, starttime <= time, endtime >= time), nomatch = 0L]
#>    id                time       date hour min value state
#> 1:  A 2018-11-12 08:59:00 2018-11-12    8  59    70  Pass
#> 2:  A 2018-11-12 09:06:00 2018-11-12    9   6    86  Pass
#> 3:  A 2018-11-12 09:18:00 2018-11-12    9  18    86  Pass
#> 4:  A 2018-11-12 13:22:00 2018-11-12   13  22    86  Pass
#> 5:  A 2018-11-12 13:36:00 2018-11-12   13  36    74  Pass
#> 6:  A 2018-11-14 06:32:00 2018-11-14    6  32    77  Pass
#> 7:  A 2018-11-14 07:12:00 2018-11-14    7  12    79  Pass
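
If the goal is to keep df.sample in the shape of the expected output (the original columns plus state), one possible variation is an update join that adds state to df.sample by reference. This is only a sketch, not part of the original answer: it uses a reduced copy of the data, and tz = "UTC" is an assumption made so that both tables share one time zone.

```r
library(data.table)

# Reduced copy of the question's data (5 of the 11 sample rows)
df.sample <- data.table(
  id    = "A",
  date  = as.Date(c("2018-11-12", "2018-11-12", "2018-11-12",
                    "2018-11-14", "2018-11-14")),
  hour  = c(8, 8, 16, 6, 19),
  min   = c(47, 59, 12, 32, 21),
  value = c(70, 70, 81, 77, 83)
)
df.state <- data.table(
  id        = "A",
  starttime = as.POSIXct(c("2018-11-12 08:59:00", "2018-11-14 06:24:17"), tz = "UTC"),
  endtime   = as.POSIXct(c("2018-11-12 15:57:00", "2018-11-14 17:22:16"), tz = "UTC"),
  state     = "Pass"
)

# Helper timestamp built from date, hour and min
df.sample[, time := as.POSIXct(paste0(date, " ", hour, ":", min, ":00"), tz = "UTC")]

# Update join: for each df.state row, find the df.sample rows whose time lies
# inside [starttime, endtime] and copy state across; "i." refers to df.state
df.sample[df.state, state := i.state,
          on = .(id, time >= starttime, time <= endtime)]

# Drop the helper column again
df.sample[, time := NULL]
df.sample
```

Rows without a matching interval keep NA in state, and df.sample is modified in place, so no reassignment is needed.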

Created on 2019-05-23 by the reprex package (v0.3.0)
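
Since the question started from a dplyr left_join, the same range condition can also be sketched with dplyr's join_by() helper. This is an alternative to the data.table approach, not part of the original answer, and it assumes dplyr 1.1.0 or later, where join_by() and its between() helper are available. The data is a reduced copy of the question's, and tz = "UTC" is an assumption made for reproducibility.

```r
library(dplyr)

# Reduced copy of the question's data (5 of the 11 sample rows)
df.sample <- data.frame(
  id    = "A",
  date  = as.Date(c("2018-11-12", "2018-11-12", "2018-11-12",
                    "2018-11-14", "2018-11-14")),
  hour  = c(8, 8, 16, 6, 19),
  min   = c(47, 59, 12, 32, 21),
  value = c(70, 70, 81, 77, 83)
)
df.state <- data.frame(
  id        = "A",
  starttime = as.POSIXct(c("2018-11-12 08:59:00", "2018-11-14 06:24:17"), tz = "UTC"),
  endtime   = as.POSIXct(c("2018-11-12 15:57:00", "2018-11-14 17:22:16"), tz = "UTC"),
  state     = "Pass"
)

result <- df.sample %>%
  # Helper timestamp built from date, hour and min
  mutate(time = as.POSIXct(paste0(date, " ", hour, ":", min, ":00"), tz = "UTC")) %>%
  # Match on id, and on time lying inside [starttime, endtime] (inclusive)
  left_join(df.state, by = join_by(id, between(time, starttime, endtime))) %>%
  # Keep only the columns of the expected output
  select(id, date, hour, min, value, state)

result
```

Because this is a left join, every row of df.sample is kept and unmatched rows get NA in state.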

Source: the original question on Stack Overflow: https://stackoverflow.com/questions/56281178/
