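mysql - How can I use R or SQL to calculate the difference for each group ID?

Tags: mysql r

I'm looking for a way to calculate the time difference for each group ID. Here is part of my data: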

ID   road  beginTime  endTime   Mon  Tue  Wed  Thu  Fri  Sat
666  757   9:00 AM    11:45 AM                           S
555  758   1:55 PM    3:45 PM   M         W
555  759   10:40 AM   12:30 PM  M         W
555  760   4:00 PM    5:50 PM        Tue       R
444  761   3:00 PM    4:25 PM        Tue       R
444  762   4:30 PM    7:15 PM   M
444  763   12:50 PM   2:40 PM                       Fri
444  764   10:40 AM   11:35 AM       Tue       R
222  765   11:45 AM   2:30 PM   M         W
222  766   6:00 PM    9:40 PM                  R
333  767   8:30 AM    11:15 AM  M         W
333  768   8:30 AM    11:15 AM       Tue       R
333  769   1:25 PM    2:50 PM        Tue       R
333  770   11:45 AM   1:10 PM   M         W

Output of dput():

structure(list(ID = c(666L, 555L, 555L, 555L, 444L, 444L, 444L, 
444L, 222L, 222L, 333L, 333L, 333L, 333L), road = 757:770, beginTime = structure(c(11L, 
2L, 3L, 7L, 6L, 8L, 5L, 3L, 4L, 9L, 10L, 10L, 1L, 4L), .Label = c("1:25 PM", 
"1:55 PM", "10:40 AM", "11:45 AM", "12:50 PM", "3:00 PM", "4:00 PM", 
"4:30 PM", "6:00 PM", "8:30 AM", "9:00 AM"), class = "factor"), 
    endTime = structure(c(4L, 9L, 5L, 11L, 10L, 12L, 7L, 3L, 
    6L, 13L, 2L, 2L, 8L, 1L), .Label = c("1:10 PM", "11:15 AM", 
    "11:35 AM", "11:45 AM", "12:30 PM", "2:30 PM", "2:40 PM", 
    "2:50 PM", "3:45 PM", "4:25 PM", "5:50 PM", "7:15 PM", "9:40 PM"
    ), class = "factor"), Mon = structure(c(1L, 2L, 2L, 1L, 1L, 
    2L, 1L, 1L, 2L, 1L, 2L, 1L, 1L, 2L), .Label = c("", "M"), class = "factor"), 
    Tue = structure(c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 1L, 
    1L, 2L, 2L, 1L), .Label = c("", "Tue"), class = "factor"), 
    Wed = structure(c(1L, 2L, 2L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 
    2L, 1L, 1L, 2L), .Label = c("", "W"), class = "factor"), 
    Thu = structure(c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 1L, 2L, 
    1L, 2L, 2L, 1L), .Label = c("", "R"), class = "factor"), 
    Fri = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L), .Label = c("", "Fri"), class = "factor"), 
    Sat = structure(c(2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
    1L, 1L, 1L, 1L), .Label = c("", "S"), class = "factor")), .Names = c("ID", 
"road", "beginTime", "endTime", "Mon", "Tue", "Wed", "Thu", "Fri", 
"Sat"), class = "data.frame", row.names = c(NA, -14L))

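Each ID drives on different roads (road) at different times of day (beginTime, endTime). I want to calculate the waiting (non-driving) time for each ID. For example, ID=555 drives on Monday and Wednesday. Its first slot runs from 10:40 AM to 12:30 PM. It then waits 1.41 hours before starting another slot from 1:55 PM to 3:45 PM. That 1.41-hour wait is exactly what I need. This ID also has a waiting time when it drives on Tuesday and Thursday. ID=666 drives only one slot, on Saturday, so its waiting time is 0. The difficulty with my data is that each ID has different slots on each day. Any suggestions? Thank you very much!

Best answer

Using the "long" format I mentioned in the comments makes things a bit easier.

First, I'll clean up your data a little: convert the factors to strings, then the strings to times (df is the data from your dput output above):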

library(dplyr)
# small helper function
astime <- function(x) as.POSIXct(x, format = "%I:%M %p")
df2 <- df %>%
  # across() (dplyr >= 1.0.0) replaces the deprecated mutate_each()/funs()
  mutate(across(beginTime:Sat, as.character)) %>%
  mutate(across(c(beginTime, endTime), astime))
head(df2)
#    ID road           beginTime             endTime Mon Tue Wed Thu Fri Sat
# 1 666  757 2016-06-21 09:00:00 2016-06-21 11:45:00                       S
# 2 555  758 2016-06-21 13:55:00 2016-06-21 15:45:00   M       W            
# 3 555  759 2016-06-21 10:40:00 2016-06-21 12:30:00   M       W            
# 4 555  760 2016-06-21 16:00:00 2016-06-21 17:50:00     Tue       R        
# 5 444  761 2016-06-21 15:00:00 2016-06-21 16:25:00     Tue       R        
# 6 444  762 2016-06-21 16:30:00 2016-06-21 19:15:00   M                    
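
The date in the output above is simply the day the code was run: when the format string contains no date, the parser fills in the current one. As a minimal sketch, assuming you would rather have reproducible values, the hypothetical helper `astime_fixed` below (not part of the original answer) pins both the date and the time zone. Note that `%p` is locale-dependent, so this assumes a locale that understands "AM"/"PM".

```r
# Hypothetical helper: parse 12-hour clock strings against a fixed date in UTC,
# so the result does not depend on when or where the code runs.
astime_fixed <- function(x, day = "1970-01-01") {
  as.POSIXct(paste(day, x), format = "%Y-%m-%d %I:%M %p", tz = "UTC")
}

# The three clock times from the ID 555 example in the question
slots <- astime_fixed(c("10:40 AM", "12:30 PM", "1:55 PM"))

# Gap between the end of the first slot (12:30 PM) and the start of the next (1:55 PM)
gap <- difftime(slots[3], slots[2], units = "mins")
print(gap)
```

Only differences between times on the same day are used later, so the choice of date does not affect the result.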

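(Don't worry that the dates are all wrong; they should be ignored.) Now I'll convert from wide to long format and drop the rows where the day is an empty string: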

library(tidyr)
df3 <- df2 %>%
  gather(day, ign, Mon:Sat) %>%
  filter(ign != "") %>%
  select(-ign)
head(df3)
#    ID road           beginTime             endTime day
# 1 555  758 2016-06-21 13:55:00 2016-06-21 15:45:00 Mon
# 2 555  759 2016-06-21 10:40:00 2016-06-21 12:30:00 Mon
# 3 444  762 2016-06-21 16:30:00 2016-06-21 19:15:00 Mon
# 4 222  765 2016-06-21 11:45:00 2016-06-21 14:30:00 Mon
# 5 333  767 2016-06-21 08:30:00 2016-06-21 11:15:00 Mon
# 6 333  770 2016-06-21 11:45:00 2016-06-21 13:10:00 Mon
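
To make the wide-to-long step less opaque, here is a rough base-R sketch of the same idea on a small made-up data frame (`wide`, `day_cols`, and `long` are illustrative names, and only four day columns are included): for each day column, keep the rows that are flagged for that day and record the column name as the day.

```r
# Toy data in the same wide layout: one column per day, "" meaning "not driving"
wide <- data.frame(
  ID   = c(555L, 555L, 666L),
  road = c(758L, 760L, 757L),
  Mon  = c("M", "", ""),
  Tue  = c("", "Tue", ""),
  Wed  = c("W", "", ""),
  Sat  = c("", "", "S"),
  stringsAsFactors = FALSE
)
day_cols <- c("Mon", "Tue", "Wed", "Sat")

# Build one block of rows per day column, then stack the blocks
long <- do.call(rbind, lapply(day_cols, function(d) {
  keep <- wide[[d]] != ""                      # rows flagged for this day
  data.frame(wide[keep, c("ID", "road")],
             day = rep(d, sum(keep)),
             stringsAsFactors = FALSE)
}))
rownames(long) <- NULL
print(long)
```

A road driven on two days (758 here) ends up as two rows, one per day, which is what lets the later grouping by ID and day work.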

Now I'll group them and calculate the waiting time:

df4 <- df3 %>%
  arrange(ID, day, beginTime) %>%
  group_by(ID, day) %>%
  mutate(
    waitTime = difftime(beginTime, dplyr::lag(endTime, default = beginTime[1]), units='secs')
  )
head(df4)
# Source: local data frame [6 x 6]
# Groups: ID, day [5]
#      ID  road           beginTime             endTime   day       waitTime
#   <int> <int>              <time>              <time> <chr> <S3: difftime>
# 1   222   765 2016-06-21 11:45:00 2016-06-21 14:30:00   Mon         0 secs
# 2   222   766 2016-06-21 18:00:00 2016-06-21 21:40:00   Thu         0 secs
# 3   222   765 2016-06-21 11:45:00 2016-06-21 14:30:00   Wed         0 secs
# 4   333   767 2016-06-21 08:30:00 2016-06-21 11:15:00   Mon         0 secs
# 5   333   770 2016-06-21 11:45:00 2016-06-21 13:10:00   Mon      1800 secs
# 6   333   768 2016-06-21 08:30:00 2016-06-21 11:15:00   Thu         0 secs
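
The key idea in the step above is "previous slot's end within the same ID and day, with the first slot of each group compared against its own start". A minimal base-R sketch of that logic, assuming times are stored as minutes since midnight in a made-up data frame `trips` (not the original data structure), might look like this:

```r
# Toy data: times as minutes since midnight (e.g. 10:40 AM = 10 * 60 + 40 = 640)
trips <- data.frame(
  ID    = c(555L, 555L, 666L),
  day   = c("Mon", "Mon", "Sat"),
  begin = c(835, 640, 540),   # 1:55 PM, 10:40 AM, 9:00 AM
  end   = c(945, 750, 705),   # 3:45 PM, 12:30 PM, 11:45 AM
  stringsAsFactors = FALSE
)

# Sort so that consecutive rows within a group are consecutive slots
trips <- trips[order(trips$ID, trips$day, trips$begin), ]

# End time of the row above; the first row of each (ID, day) group
# falls back to its own begin time, which yields a wait of 0
prev_end <- c(NA, head(trips$end, -1))
first_in_group <- !duplicated(trips[c("ID", "day")])
prev_end[first_in_group] <- trips$begin[first_in_group]

trips$wait <- trips$begin - prev_end   # waiting time in minutes
print(trips)
```

Sorting before taking the lag matters: without it, the 1:55 PM slot would be compared against nothing and the 10:40 AM slot against the wrong row.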

You can easily filter down to the cases where someone actually waited:

df4 %>%
  filter(waitTime > 0)
# Source: local data frame [8 x 6]
# Groups: ID, day [8]
#      ID  road           beginTime             endTime   day       waitTime
#   <int> <int>              <time>              <time> <chr> <S3: difftime>
# 1   333   770 2016-06-21 11:45:00 2016-06-21 13:10:00   Mon      1800 secs
# 2   333   769 2016-06-21 13:25:00 2016-06-21 14:50:00   Thu      7800 secs
# 3   333   769 2016-06-21 13:25:00 2016-06-21 14:50:00   Tue      7800 secs
# 4   333   770 2016-06-21 11:45:00 2016-06-21 13:10:00   Wed      1800 secs
# 5   444   761 2016-06-21 15:00:00 2016-06-21 16:25:00   Thu     12300 secs
# 6   444   761 2016-06-21 15:00:00 2016-06-21 16:25:00   Tue     12300 secs
# 7   555   758 2016-06-21 13:55:00 2016-06-21 15:45:00   Mon      5100 secs
# 8   555   758 2016-06-21 13:55:00 2016-06-21 15:45:00   Wed      5100 secs

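Here you can see that ID 555 from your example has a break of 5100 seconds (85 minutes, roughly the 1.41 hours quoted in the question) on Monday and Wednesday, while ID 666 has no waiting time.

Source: a similar question on Stack Overflow about calculating the difference for each group ID with R or SQL: https://stackoverflow.com/questions/37952113/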
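
If hours or a per-ID total are more convenient than seconds, the conversion is a simple division by 3600. The sketch below re-enters the non-zero waits from the filtered output above by hand into a hypothetical data frame `waits`, and sums them per ID with base R; whether a weekly total is what you want is an assumption on my part.

```r
# Non-zero waits copied from the filtered output (seconds)
waits <- data.frame(
  ID       = c(333L, 333L, 333L, 333L, 444L, 444L, 555L, 555L),
  day      = c("Mon", "Thu", "Tue", "Wed", "Thu", "Tue", "Mon", "Wed"),
  waitSecs = c(1800, 7800, 7800, 1800, 12300, 12300, 5100, 5100),
  stringsAsFactors = FALSE
)

# Convert seconds to hours
waits$waitHours <- waits$waitSecs / 3600

# Total waiting time per ID across all days, in hours
total_hours <- tapply(waits$waitHours, waits$ID, sum)
print(round(total_hours, 2))
```

IDs that never wait, such as 666, do not appear in this summary because their rows were filtered out beforehand.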
