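I wrote this loop to extract the names of the elements of a vector that fall within each time interval (bin). I'm wondering whether I'm missing a faster way to do this. I want to add a randomization step for vectors of length 1000, so I'd rather not rely on a loop.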
mydata <- structure(c(1199.91666666667, 1200.5, 1204.63333333333, 1205.5,
1206.3, 1208.73333333333, 1209.06666666667, 1209.93333333333,
1210.98333333333, 1214.56666666667, 1216.06666666667, 1216.63333333333,
1216.91666666667, 1219.13333333333, 1221.35, 1221.51666666667,
1225.35, 1225.53333333333, 1225.96666666667, 1227.61666666667,
1228.91666666667, 1230.31666666667, 1233.53333333333, 1235.8,
1237.51666666667, 1239.41666666667, 1241.6, 1247.08333333333,
1247.45, 1252.7, 1253.26666666667), .Names = c("B", "A", "B",
"E", "A", "A", "B", "G", "G", "C", "A", "D", "E", "B", "B", "E",
"E", "G", "F", "A", "C", "A", "F", "B", "A", "F", "F", "G", "F",
"G", "F"))
mydata
B A B E A A B G G C A D E B B E E
1199.917 1200.500 1204.633 1205.500 1206.300 1208.733 1209.067 1209.933 1210.983 1214.567 1216.067 1216.633 1216.917 1219.133 1221.350 1221.517 1225.350
G F A C A F B A F F G F G F
1225.533 1225.967 1227.617 1228.917 1230.317 1233.533 1235.800 1237.517 1239.417 1241.600 1247.083 1247.450 1252.700 1253.267
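These are the successive times of events, in seconds. Suppose we want 5-second intervals. My approach is to build a vector holding the start of each interval, then use a loop to find the names of the elements that fall within that interval: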
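N <- 5  # bin width in seconds
# Start time of each bin
ints <- seq(mydata[1], mydata[length(mydata)], N)
out <- list()
for (i in seq_along(ints)) {
  # Names of the events with ints[i] <= time < ints[i] + N
  out[[i]] <- names(mydata[mydata >= ints[i] & mydata < ints[i] + N])
}
out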
[[1]]
[1] "B" "A" "B"
[[2]]
[1] "E" "A" "A" "B"
[[3]]
[1] "G" "G" "C"
[[4]]
[1] "A" "D" "E" "B"
[[5]]
[1] "B" "E"
[[6]]
[1] "E" "G" "F" "A" "C"
[[7]]
[1] "A" "F"
[[8]]
[1] "B" "A" "F"
[[9]]
[1] "F"
[[10]]
[1] "G" "F"
[[11]]
[1] "G" "F"
This is fine for small samples, but I can see it becoming slow for very large samples that are permuted 1000 times.
Best answer
My suggestion is to use findInterval (based on the answer to this earlier question of mine):
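# Prepend -Inf so that every index returned by findInterval is at least 1
mydata2 = c(-Inf, mydata)
# Bin boundaries, extended one bin width (N = 5) past the last event
ints <- seq(mydata[1], mydata[length(mydata)]+N, N)
# The small epsilon keeps an event that sits exactly on a boundary in the bin that starts there
idx = findInterval(ints-1e-10, mydata2)
out<-list()
for(i in 1:(length(ints)-1)){
  out[[i]] <- names(mydata2[(idx[i]+1):(idx[i+1])])
}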
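As you can see, I had to tweak the start a little (prepending a value smaller than the first breakpoint, and offsetting the breakpoints by an epsilon). Here is the result, which is the same as yours: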
> out
[[1]]
[1] "B" "A" "B"
[[2]]
[1] "E" "A" "A" "B"
[[3]]
[1] "G" "G" "C"
[[4]]
[1] "A" "D" "E" "B"
[[5]]
[1] "B" "E"
[[6]]
[1] "E" "G" "F" "A" "C"
[[7]]
[1] "A" "F"
[[8]]
[1] "B" "A" "F"
[[9]]
[1] "F"
[[10]]
[1] "G" "F"
[[11]]
[1] "G" "F"
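One caveat, offered as a hedged sketch rather than as part of the original answer: if a bin contains no events, idx[i] equals idx[i+1], and (idx[i]+1):(idx[i+1]) then counts downward in R, so it would select two neighbouring names instead of none. The sample data has no empty bins, so the output above is unaffected. The toy vector `times` below is made up for illustration; the index is built with seq_len() so that an empty bin yields character(0).

```r
# Toy data (hypothetical, not from the question): no event falls in [5.5, 10.5)
times <- c(A = 0.5, B = 1.2, C = 4.9, D = 5.0, E = 11.3)
N <- 5

times2 <- c(-Inf, times)
ints <- seq(min(times), max(times) + N, by = N)   # 0.5, 5.5, 10.5, 15.5
idx <- findInterval(ints - 1e-10, times2)

bins <- lapply(seq_len(length(ints) - 1), function(i) {
  # seq_len(0) is empty, so an empty bin selects nothing
  pos <- idx[i] + seq_len(idx[i + 1] - idx[i])
  names(times2[pos])
})
bins
```

Here the second element of `bins` is character(0), whereas the descending sequence 6:5 would have returned two names.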
In terms of speed, there is some improvement on the example:
> microbenchmark( jalapic = {out<-list(); for(i in 1:length(ints)){out[[i]] <- names(mydata[mydata>=ints[i] & mydata<ints[i]+N])}},
+ mts = {idx = findInterval(ints2-1e-10, mydata2); out<-list(); for(i in 1:(length(ints)-1)){out[[i]] <- names(mydata2[(idx[i]+1):(idx[i+1])])}},
+ alexis = {split(names(mydata), findInterval(mydata, ints))},
+ R_Yoda = {dt[, groups := cut2(data,ints)]; result <- dt[, paste0(names, collapse=", "), by=groups]})
Unit: microseconds
expr min lq mean median uq max neval
jalapic 67.177 76.9725 85.73347 82.8035 95.866 119.890 100
mts 43.851 52.7150 62.72116 58.3130 73.007 96.099 100
alexis 75.573 86.5360 95.72593 91.4340 100.531 234.649 100
R_Yoda 2032.066 2158.4870 2303.68887 2191.3750 2281.409 8719.314 100
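The `alexis` entry in the benchmark above avoids the explicit loop by passing the bin index from findInterval to split. A minimal sketch on a made-up vector (`times` is hypothetical) is shown here; wrapping the index in a factor with explicit levels is an addition that is not in the benchmarked code, on the assumption that empty bins should be kept as character(0) rather than dropped.

```r
# Toy data (hypothetical, not from the question)
times <- c(A = 0.5, B = 1.2, C = 4.9, D = 5.0, E = 11.3)
N <- 5
ints <- seq(min(times), max(times), by = N)   # bin starts: 0.5, 5.5, 10.5

# Index of the bin each event falls into
bin <- findInterval(times, ints)

# One list element per bin; the factor levels keep bins that have no events
by_bin <- split(names(times), factor(bin, levels = seq_along(ints)))
by_bin
```

The result is a list named by bin number, so `unname(by_bin)` is comparable to the `out` list built by the loops.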
For larger vectors (I chose length 2000), the speed difference is clearer:
set.seed(123)
# 2000 sorted event times with random labels A to G
mydata = sort(runif(n = 2000, min = 0, max = 100))
names(mydata) = sample(LETTERS[1:7], size = 2000, replace = TRUE)
mydata2 = c(-Inf, mydata)
ints2 <- seq(mydata[1], mydata[length(mydata)]+5, N)
# data.table() comes from the data.table package
dt <- data.table(data=mydata, names=names(mydata) )
> microbenchmark( jalapic = {out<-list(); for(i in 1:length(ints)){out[[i]] <- names(mydata[mydata>=ints[i] & mydata<ints[i]+N])}},
+ mts = {idx = findInterval(ints2-1e-10, mydata2); out<-list(); for(i in 1:(length(ints)-1)){out[[i]] <- names(mydata2[(idx[i]+1):(idx[i+1])])}},
+ alexis = {split(names(mydata), findInterval(mydata, ints))},
+ R_Yoda = {dt[, groups := cut2(data,ints)]; result <- dt[, paste0(names, collapse=", "), by=groups]})
Unit: microseconds
expr min lq mean median uq max neval
jalapic 804.243 846.9275 993.9957 862.0890 883.3140 7140.218 100
mts 77.439 88.8685 100.6148 100.0640 106.5955 188.466 100
alexis 187.066 204.7930 220.1689 215.5225 225.3190 299.026 100
R_Yoda 3831.348 4066.4640 4366.5382 4140.1700 4248.8635 11829.923 100
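Since the question is ultimately about repeating this over many permutations, here is one possible sketch. It assumes that a permutation only shuffles the names while the event times stay fixed (the question does not say how the data are randomized). Under that assumption the bin index needs to be computed only once, and each permutation reduces to a sample plus a split. The data and the names `bin`, `n_perm` and `perms` are made up for illustration.

```r
set.seed(1)
# Hypothetical data: 200 sorted event times with random labels
times <- sort(runif(200, min = 0, max = 100))
names(times) <- sample(LETTERS[1:7], size = 200, replace = TRUE)
N <- 5
ints <- seq(min(times), max(times), by = N)

# Bin membership depends only on the times, so compute it once
bin <- factor(findInterval(times, ints), levels = seq_along(ints))

# Each permutation shuffles the labels and regroups them by the fixed bins
n_perm <- 1000
perms <- replicate(n_perm, split(sample(names(times)), bin), simplify = FALSE)
length(perms)
```

Each element of `perms` is a list with one character vector per bin, and the bin sizes are identical across permutations because only the labels move.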
Regarding r - extracting names of a vector by time bins, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/34047920/