我有一个数据框,其中包含 1964 年至 2013 年 76 个站点的每日降雨量值。每行代表特定电台的不同月份。这是数据框的片段 -
Station Year Month Days 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
USC00020750 1964 1 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25 0 23 51 36 0 0 0 0 0 0 0 0
USC00020750 1964 2 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 48 0 0 0 3 0 0 0 0 0 0 Inf Inf
USC00020750 1964 3 31 0 46 51 0 0 36 41 46 0 0 0 0 43 0 0 0 0 0 0 0 0 53 99 140 36 0 0 0 0 0 0
USC00020750 1964 4 30 5 69 23 30 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 33 13 0 0 0 15 0 Inf
USC00020750 1964 5 31 0 0 0 0 0 0 43 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 51 8 0 0 0 0
USC00020750 1964 6 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 38 0 0 0 Inf
USC00020750 1964 7 31 0 0 0 0 0 0 0 0 0 0 0 0 41 0 13 13 0 0 0 0 8 51 0 71 0 10 0 0 20 165 25
USC00020750 1964 8 31 8 30 137 0 0 5 89 0 0 0 18 64 5 0 0 0 0 0 0 0 0 0 0 0 0 76 0 0 0 0 0
USC00020750 1964 9 30 0 0 0 0 0 119 0 0 0 0 0 0 0 41 25 0 0 0 0 0 25 0 0 0 0 0 0 0 0 0 Inf
USC00020750 1964 10 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
USC00020750 1964 11 30 0 5 0 0 0 0 0 0 0 0 91 0 0 0 36 94 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Inf
USC00020750 1964 12 31 0 107 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 79 152 0 0 0 0 0 0 0 0 0 0 0 0
...
Station Year Month Days 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
USW00093129 2013 10 31 0 0 0 0 0 0 0 0 43 15 0 0 10 0 0 0 0 0 0 0 0 0 0 0 0 0 0 41 3 8 0
USW00093129 2013 11 30 0 0 0 23 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 79 18 20 0 0 0 0 0 0 0 Inf
USW00093129 2013 12 31 0 0 175 33 0 0 3 0 0 0 0 0 0 0 0 0 0 0 5 15 0 0 0 0 0 0 0 0 0 0 0
我试图找到每行非零降雨量值的最长一段的长度以及该段的总降雨量。找到最长拉伸(stretch)长度的最简单方法是将数据帧转换为 0 和 1,使用 rle 并沿每行应用 max(y$lengths[y$values!=0])
。但我如何找到这些值的总和呢?
提前感谢您的帮助!
最佳答案
不完全是一句简单的话,但这是有效的:
df <- read.table(header=TRUE,stringsAsFactors=FALSE,check.names=FALSE,text=
"Station Year Month Days 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
USC00020750 1964 1 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25 0 23 51 36 0 0 0 0 0 0 0 0
USC00020750 1964 2 29 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 48 0 0 0 3 0 0 0 0 0 0 Inf Inf
USC00020750 1964 3 31 0 46 51 0 0 36 41 46 0 0 0 0 43 0 0 0 0 0 0 0 0 53 99 140 36 0 0 0 0 0 0
USC00020750 1964 4 30 5 69 23 30 0 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 33 13 0 0 0 15 0 Inf
USC00020750 1964 5 31 0 0 0 0 0 0 43 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 51 8 0 0 0 0
USC00020750 1964 6 30 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 38 0 0 0 Inf
USC00020750 1964 7 31 0 0 0 0 0 0 0 0 0 0 0 0 41 0 13 13 0 0 0 0 8 51 0 71 0 10 0 0 20 165 25
USC00020750 1964 8 31 8 30 137 0 0 5 89 0 0 0 18 64 5 0 0 0 0 0 0 0 0 0 0 0 0 76 0 0 0 0 0
USC00020750 1964 9 30 0 0 0 0 0 119 0 0 0 0 0 0 0 41 25 0 0 0 0 0 25 0 0 0 0 0 0 0 0 0 Inf
USC00020750 1964 10 31 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
USC00020750 1964 11 30 0 5 0 0 0 0 0 0 0 0 91 0 0 0 36 94 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Inf
USC00020750 1964 12 31 0 107 20 0 0 0 0 0 0 0 0 0 0 0 0 0 0 79 152 0 0 0 0 0 0 0 0 0 0 0 0")
res <- lapply(1:nrow(df), function(r){
monthDays <- df[r,'Days']
rain <- as.numeric(df[r,(1:monthDays) + 4])
enc <- rle(rain > 0)
if(all(!enc$values))
return(c(0,0))
len <- enc$lengths
len[!enc$values] <- 0
max.idx <- which.max(len)
lastIdx <- cumsum(enc$lengths)[max.idx]
firstIdx <- lastIdx - enc$lengths[max.idx] + 1
tot <- sum(rain[firstIdx:lastIdx])
stretch <- lastIdx - firstIdx + 1
return(c(stretch,tot))
})
columnsToAdd <- do.call(rbind,res)
colnames(columnsToAdd) <- c('StretchLen','StretchRain')
df2 <- cbind(df,columnsToAdd)
结果:
# We print the result without months values for better readability
> df2[,-(5:35)]
Station Year Month Days StretchLen StretchRain
1 USC00020750 1964 1 31 3 110
2 USC00020750 1964 2 29 1 48
3 USC00020750 1964 3 31 4 328
4 USC00020750 1964 4 30 4 127
5 USC00020750 1964 5 31 2 59
6 USC00020750 1964 6 30 1 38
7 USC00020750 1964 7 31 3 210
8 USC00020750 1964 8 31 3 175
9 USC00020750 1964 9 30 2 66
10 USC00020750 1964 10 31 0 0
11 USC00020750 1964 11 30 2 130
12 USC00020750 1964 12 31 2 127
顺便说一句,如果你想坚持应用,那就像这样:
columnsToAdd <-
t(apply(df[,-(1:3)],MARGIN=1,function(r){
monthDays <- r[1]
rain <- as.numeric(r[-1])
enc <- rle(rain > 0)
if(all(!enc$values))
return(c(0,0))
len <- enc$lengths
len[!enc$values] <- 0
max.idx <- which.max(len)
lastIdx <- cumsum(enc$lengths)[max.idx]
firstIdx <- lastIdx - enc$lengths[max.idx] + 1
tot <- sum(rain[firstIdx:lastIdx])
stretch <- lastIdx - firstIdx + 1
return(c(stretch,tot))
}))
colnames(columnsToAdd) <- c('StretchLen','StretchRain')
df2 <- cbind(df,columnsToAdd)
我不喜欢在 data.frame 上使用 apply
,因为它是为矩阵创建的,因此它会在调用函数之前将列强制为相同类型(因此,如果您处理的是不同的类型你需要小心)。
关于r - 最长非零值字符串的总和,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/33464840/