I am trying to compute a rolling Mahalanobis distance without resorting to for loops, and I am failing miserably.
Here is a sample dataset:
df <- data.frame(label = c(rep("A", 5), rep("B", 5)),
                 date = rep(seq.Date(from = as.Date("2018-01-01"), by = "days", length.out = 5), 2),
                 valx = c(rnorm(5, mean = 0, sd = 1), rnorm(5, mean = 1.5, sd = 1)),
                 valy = c(rnorm(5, mean = 100, sd = 10), rnorm(5, mean = 115, sd = 10)),
                 valz = c(rnorm(5, mean = 0, sd = 10), rnorm(5, mean = 0, sd = 30)))
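I am trying to compute, by group (label), the Mahalanobis distance of valx, valy, and valz, but using only the rows on or before that row's date (date). My current solution is to loop over each label, loop over each date, filter the dataset down to the matching rows, compute the distance with stats::mahalanobis, append that distance to a list, and then do.call and rbind the pieces outside the loop*. Obviously this is not ideal.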
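For reference, the quantity being described can be written down as follows. The symbols are assumptions introduced here for illustration and do not appear in the original post: within one label, let x_t be the vector (valx, valy, valz) on date t, and let \mu_t and \Sigma_t be the sample mean and sample covariance of all rows of that label with date on or before t.

```latex
D_t^2 = (x_t - \mu_t)^{\top} \, \Sigma_t^{-1} \, (x_t - \mu_t)
```

Note that stats::mahalanobis returns this squared form, not its square root.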
I suspect there is a way to write something like:
cum.mdist <- function(df, cols) {...}
df %>%
  group_by(label) %>%
  arrange(date) %>%
  mutate(mdist = xapply(., c(valx, valy, valz), cum.mdist)) %>%
  ungroup()
in a manner similar to how a rolling univariate function can be computed:
cumsd <- function(x) sapply(seq_along(x), function(k, z) sd(z[1:k]), z = x)
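If there were no covariance, I could compute the distance from its components (a rolling variance is easy to compute with the function above), but I believe my variables do have covariance, and I am not sure how to build a rolling covariance matrix...
Is there a solution that avoids for loops?
*The code for the loop solution is as follows: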
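As a rough sketch of what a rolling covariance matrix could look like, the same seq_along pattern used in cumsd can be applied to the rows of a matrix. The helper name cumcov and the small random matrix are hypothetical, introduced only for illustration; this still loops internally via lapply and recomputes each covariance from scratch.

```r
# Hypothetical helper: element k of the result is the sample covariance
# matrix of the first k rows of m.
cumcov <- function(m) {
  m <- as.matrix(m)
  lapply(seq_len(nrow(m)), function(k) cov(m[seq_len(k), , drop = FALSE]))
}

# Illustrative data only (not the dataset from the post)
set.seed(1)
m <- cbind(valx = rnorm(5),
           valy = rnorm(5, mean = 100, sd = 10),
           valz = rnorm(5, mean = 0, sd = 10))

covs <- cumcov(m)
length(covs)   # one 3 x 3 matrix per row of m
covs[[5]]      # covariance of all five rows, same as cov(m)
```

The first entry is all NA, because a single row has no sample covariance, and with three variables the matrices only become invertible from the fourth row on.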
library("tidyverse")
df <- data.frame(label = c(rep("A", 5), rep("B", 5)),
                 date = rep(seq.Date(from = as.Date("2018-01-01"), by = "days", length.out = 5), 2),
                 valx = c(rnorm(5, mean = 0, sd = 1), rnorm(5, mean = 1.5, sd = 1)),
                 valy = c(rnorm(5, mean = 100, sd = 10), rnorm(5, mean = 115, sd = 10)),
                 valz = c(rnorm(5, mean = 0, sd = 10), rnorm(5, mean = 0, sd = 30)))
# One list slot per row of df
mdist.list <- vector(length = nrow(df), mode = "list")
counter <- 1
for (l in seq_along(unique(df$label))) {
  label_data <- df %>%
    filter(label == unique(df$label)[l])
  for (d in seq_along(unique(label_data$date))) {
    # All rows for this label on or before the current date
    label_date_data <- label_data %>%
      filter(date <= unique(label_data$date)[d])
    if (nrow(label_date_data) > 3) {
      # More rows than variables, so the covariance matrix can be inverted
      label_date_data$mdist <- mahalanobis(label_date_data %>% select(contains("val")),
                                           colMeans(label_date_data %>% select(contains("val"))),
                                           cov(label_date_data %>% select(contains("val"))))
    } else {
      label_date_data$mdist <- NA
    }
    # Keep only the row for the current date
    mdist.list[[counter]] <- filter(label_date_data,
                                    date == unique(label_data$date)[d])
    counter <- counter + 1
  }
}
mdist.df <- do.call(rbind, mdist.list)
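Best answer
I'm not sure I have understood your requirements or the expected output correctly, but here is something using data.table to get you started: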
library(data.table)
setDT(df)
df[, mdist :=
    .SD[, transpose(lapply(1L:.N, function(n) {
        ma <- .SD[1L:n]
        ans <- tryCatch(mahalanobis(ma, colMeans(ma), var(ma)), error=function(e) NA)
        ans[length(ans)]
    })), by=.(label), .SDcols=valx:valz]$V1]
Output:
    label       date         valx       valy        valz     mdist
 1:     A 2018-01-01  1.262954285   7.635935  -2.2426789        NA
 2:     A 2018-01-02 -0.326233361  -7.990092   3.7739565        NA
 3:     A 2018-01-03  1.329799263 -11.476570   1.3333636        NA
 4:     A 2018-01-04  1.272429321  -2.894616   8.0418951 2.2500000
 5:     A 2018-01-05  0.414641434  -2.992151  -0.5710677 0.7260652
 6:     B 2018-01-01 -1.539950042  -4.115108  15.1082392        NA
 7:     B 2018-01-02 -0.928567035   2.522234  32.5730809        NA
 8:     B 2018-01-03 -0.294720447  -8.919211 -20.7286152        NA
 9:     B 2018-01-04 -0.005767173   4.356833 -38.5379806 2.2500000
10:     B 2018-01-05  2.404653389 -12.375384   1.4017852 3.0800360
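The identical value 2.25 in the fourth row of both groups is expected rather than a bug. With p = 3 variables, the fourth row is the first one where the sample covariance can be inverted, and with exactly n = p + 1 points every point's squared Mahalanobis distance from the sample mean equals (n - 1)^2 / n, which is 9/4 here. A quick check on arbitrary random data (not the data used above; the variable names are just for illustration):

```r
# With n = p + 1 points, each squared Mahalanobis distance from the
# sample mean is (n - 1)^2 / n, regardless of the actual values.
set.seed(42)
n <- 4
p <- 3
m <- matrix(rnorm(n * p), nrow = n)

d2 <- mahalanobis(m, colMeans(m), cov(m))
all.equal(d2, rep((n - 1)^2 / n, n))
```

So the first informative distances in this example are the ones in the fifth row of each group.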
Data:
set.seed(0L)
df <- data.frame(label = c(rep("A", 5), rep("B", 5)),
                 date = rep(seq.Date(from = as.Date("2018-01-01"), by = "days", length.out = 5), 2),
                 valx = c(rnorm(5, mean = 0, sd = 1), rnorm(5, mean = 0, sd = 1)),
                 valy = c(rnorm(5, mean = 0, sd = 10), rnorm(5, mean = 0, sd = 10)),
                 valz = c(rnorm(5, mean = 0, sd = 10), rnorm(5, mean = 0, sd = 30)))
If you are only looking for a tidyverse solution, I will delete this post.
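For comparison, the same idea can be sketched in base R alone, following the cumsd pattern from the question. This is only a sketch under stated assumptions: the helper name cum_mdist is hypothetical, rows are assumed to have one observation per date within each label, and vapply still loops internally, so this removes the explicit for loops without being vectorized in the strict sense.

```r
# Hypothetical helper (not from the original answer): for each row k, the
# squared Mahalanobis distance of row k from the mean of rows 1..k, using
# the sample covariance of rows 1..k.
cum_mdist <- function(m) {
  m <- as.matrix(m)
  vapply(seq_len(nrow(m)), function(k) {
    # Need more rows than columns for an invertible sample covariance
    if (k <= ncol(m)) return(NA_real_)
    head_k <- m[seq_len(k), , drop = FALSE]
    tryCatch(mahalanobis(m[k, ], colMeans(head_k), cov(head_k)),
             error = function(e) NA_real_)
  }, numeric(1))
}

# Same data as in the "Data" section above
set.seed(0L)
df <- data.frame(label = c(rep("A", 5), rep("B", 5)),
                 date = rep(seq.Date(from = as.Date("2018-01-01"), by = "days", length.out = 5), 2),
                 valx = c(rnorm(5, mean = 0, sd = 1), rnorm(5, mean = 0, sd = 1)),
                 valy = c(rnorm(5, mean = 0, sd = 10), rnorm(5, mean = 0, sd = 10)),
                 valz = c(rnorm(5, mean = 0, sd = 10), rnorm(5, mean = 0, sd = 30)))

# Apply the helper per label, with rows ordered by date inside each group
cols <- c("valx", "valy", "valz")
pieces <- lapply(split(df, df$label), function(g) {
  g <- g[order(g$date), ]
  g$mdist <- cum_mdist(g[, cols])
  g
})
result <- do.call(rbind, pieces)
print(result)
```

Each distance recomputes the mean and covariance from scratch, so the cost grows quickly with group size; for long series an incremental update of the mean and covariance would likely be worth considering.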
Regarding r - vectorized rolling/cumulative Mahalanobis distance in R, a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/54315561/