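Here is my problem. I have a large vector of positive data. My goal is to remove (set to NA) every sequence of at least N consecutive values that is repeated elsewhere in the vector (all values in a repeated sequence must be strictly > 0).
I wrote a program that works, shown below. X is my vector of values; N is the minimum length of a repeated sequence.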
rmpParNASerieRepetee <- function(X, N)
{
  # Encode the whole vector as one string, using "T" as the separator
  X_ <- paste("T", paste(X, collapse="T"), "T", sep="")
  ind.parcours <- 1
  ind.sup <- c()
  # Loop over the values
  while ( ind.parcours <= (length(X)-N+1) )
  {
    # indices of the current sequence of N values
    deb <- ind.parcours
    fin <- ind.parcours + N-1
    # sequence of N values to search for in the vector
    serie <- X[deb:fin]
    serie_ <- paste("T", paste(serie, collapse="T"), "T", sep="")
    borne <- 1*(ind.parcours < (length(X)-N+1)) + 0*(ind.parcours == (length(X)-N+1))
    # strsplit drops a trailing empty piece, so lower the threshold
    # when the sequence equals the last N values of the vector
    if (sum(X[(length(X)-N+1):length(X)]==serie)==N) borne <- 0
    # split the string vector by the sequence of N values and count the resulting pieces
    if ( length(unlist(strsplit(X_, serie_)))-1 > borne && length(which(serie!=0))>=N)
    { ind.sup <- unique(c(ind.sup, deb:fin)) }
    ind.parcours <- ind.parcours+1
  }
  # replace the flagged positions with NA
  if (length(ind.sup) != 0) { X[ind.sup] <- NA }
  list_return <- list(X=X, Ind.sup=unique(sort(ind.sup)))
  return (list_return)
}
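I think my function is far from optimal: for a vector of 92,000 values with N=18, the computing time is 1 h 15 min. And I have to run this step 1600 times, which would take about 3 months.
Does anyone have a better idea?
Example: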
x <- c(1,2,3,4,0,4,1,2,3,8,9,1,2,3,4,0)
N <- 3
# (1,2,3) is a sequence of 3 elements which is repeated
# (1,2,3,4) is a sequence of 4 elements which is repeated
# no other sequence of length at least 3 (with all values > 0) repeats
# so my result should be:
# NA NA NA NA 0 4 NA NA NA 8 9 NA NA NA NA 0
# The result of my program is:
# $X
# [1] NA NA NA NA 0 4 NA NA NA 8 9 NA NA NA NA 0
# $Ind.sup
# [1] 1 2 3 4 7 8 9 12 13 14 15
Best answer
One approach:
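f <- function(X, N)
{
  # count how many times each distinct value occurs in X
  .rle <- rle(sort(X))
  # keep the values that occur at least N times ...
  res <- .rle$values[.rle$lengths >= N]
  # ... and are strictly positive
  res <- res[res > 0]
  # flag every position holding one of those values
  inds <- X %in% res
  X[inds] <- NA
  list(X = X, Ind = which(inds))
}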
#> f(X, 3)
#$X
# [1] NA NA NA NA 0 0 0 0 NA NA NA NA NA NA 8 9 NA NA NA NA NA NA 0 0 0
#
#$Ind
# [1] 1 2 3 4 9 10 11 12 13 14 17 18 19 20 21 22
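Note that f() counts how often each value occurs in the whole vector, so it does not look at sequences as such. If the goal is the one in the question's example, a vectorised sliding-window version may be closer. The following is only a sketch under a few assumptions of mine: the function name mark_repeated_windows is hypothetical, a window counts as "repeated" when the same N values start at any other position (overlapping or adjacent occurrences included, which the strsplit-based loop may treat differently), and values are compared through their character representation, as in the question's code.

```r
# Hypothetical helper, not part of the original question or answer.
# Flags every position covered by a window of N strictly positive values
# that occurs at least twice in X (at different starting positions).
mark_repeated_windows <- function(X, N) {
  n <- length(X)
  if (N < 1 || n < N) return(list(X = X, Ind.sup = integer(0)))
  n_win <- n - N + 1
  starts <- seq_len(n_win)
  # windows[i, ] holds X[i:(i + N - 1)]
  idx <- as.vector(outer(starts, 0:(N - 1), "+"))
  windows <- matrix(X[idx], nrow = n_win)
  # one string key per window, "T" as separator as in the question
  keys <- do.call(paste, c(as.data.frame(windows), sep = "T"))
  # TRUE for every window whose key appears more than once
  repeated <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
  # TRUE for windows whose N values are all strictly positive
  positive <- rowSums(windows > 0, na.rm = TRUE) == N
  hit <- starts[repeated & positive]
  # expand each flagged window start into the N positions it covers
  ind <- sort(unique(as.vector(outer(hit, 0:(N - 1), "+"))))
  X[ind] <- NA
  list(X = X, Ind.sup = ind)
}

x <- c(1,2,3,4,0,4,1,2,3,8,9,1,2,3,4,0)
res <- mark_repeated_windows(x, 3)
res$Ind.sup
# [1]  1  2  3  4  7  8  9 12 13 14 15
```

Longer repeated sequences such as (1,2,3,4) are covered automatically, because each of their length-N windows is itself repeated. The window matrix has (length(X) - N + 1) rows and N columns, so for 92,000 values and N=18 it stays small; I have not timed it on data of that size.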
Source: a similar question on Stack Overflow about removing sequences of at least N consecutive values from a vector: https://stackoverflow.com/questions/20426949/