r - 将数据分成两个独立的组 s.t.最小化具有一个连续预测变量的残差平方和

标签 r algorithm statistics grouping partitioning

将一组数据分成两组的基本算法是什么 s.t.两个单独的残差平方和之和最小化?例如,考虑下面的代码。基本上,您如何在不反复测试每个可能值的情况下计算存储在 best.cutpoint$RSS 中的值?

set.seed(1)
ind.var <- runif(1000, 1, 50000)
dep.var <- ind.var * runif(1000, 2, 3) + rnorm(1000, 100, 500)

dat <- data.frame(ind.var, dep.var)

best.cutpoint <- list(RSS = Inf, cutpoint = NA)
for(cutpoint in sort(unique(ind.var))){
    # partition data
    dat1 <- dat[dat$ind.var > cutpoint,]
    dat2 <- dat[!(dat$ind.var > cutpoint),]

    if(nrow(dat1) < 2 | nrow(dat2) < 2){
        next
    }
    # estimate
    mod1 <- lm(dep.var ~ ind.var, dat = dat1)
    mod2 <- lm(dep.var ~ ind.var, dat = dat2)

    # calculate RSS
    part1.RSS <- sum((dat1$dep.var - (mod1$coefficients['(Intercept)'] + dat1$ind.var * mod1$coefficients['ind.var'])) ^ 2)
    part2.RSS <- sum((dat2$dep.var - (mod2$coefficients['(Intercept)'] + dat2$ind.var * mod2$coefficients['ind.var'])) ^ 2)

    total <- part1.RSS + part2.RSS

    if(total < best.cutpoint$RSS){
        best.cutpoint <- list(RSS = total, cutpoint = cutpoint)
    }
}

从以下可能值范围生成以下结果。

> print(best.cutpoint)
$RSS
[1] 75241532557

$cutpoint
[1] 34351.46

> range(dat$ind.var)
[1]    66.73151 49996.52975

最佳答案

在我看来,您好像在问如何确定 segmented or piecewise linear regression 的断点.如果情况并非如此,请告诉我。

该软件包可用于此目的Segmented

首先让我们生成一些数据:

x<-seq(1:20)
y<-c(seq(1:10),seq(10,100,by=10))
plot(x,y)

这个数据看起来像,

enter image description here

“断点”在哪里很明显。

接下来,让我们用分段包拟合模型:

library(segmented)
lin.mod <- lm(y~x)
segmented.mod <- segmented(lin.mod, seg.Z = ~x, psi=14)

找到断点了吗?

plot(segmented.mod)
points(x,y)

enter image description here

看起来确实如此。

> segmented.mod
Call: segmented.lm(obj = lin.mod, seg.Z = ~x, psi = 14)

Meaningful coefficients of the linear terms:
(Intercept)            x         U1.x  
     0.1818       0.9545       9.0455  

Estimated Break-Point(s) psi1.x : 11.08 

其中 seg.z 和 psi 定义为:

seg.Z a formula with no response variable, such as seg.Z=~x1+x2, indicating the
(continuous) explanatory variables having segmented relationships with the response.
Currently, formulas involving functions, such as seg.Z=~log(x1) or
seg.Z=~sqrt(x1), or selection operators, such as seg.Z=~d[,"x1"] or seg.Z=~d$x1,
are not allowed.
psi named list of vectors. The names have to match the variables of the seg.Z
argument. Each vector includes starting values for the break-point(s) for the
corresponding variable in seg.Z. If seg.Z includes only a variable, psi may be
a numeric vector. A NA value means that ‘K’ quantiles (or equally spaced values)
are used as starting values; K is fixed via the seg.control auxiliary function.

关于r - 将数据分成两个独立的组 s.t.最小化具有一个连续预测变量的残差平方和,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/28402368/

相关文章:

javascript - JS : can a hash map be used to keep track while iterating this array?

r - 创建一个从 R 中的不同列计算连续天数的列

r - 在R中制作四象限比例面积图

r - ggplot 指定经度/纬度轴中断

php - 令人困惑的 PHP BCrypt 实现

algorithm - 随机排序运行时间

php - 我的自定义 PHP stdev 函数 VS mysql 的 STDDEV_POP

r - R 中奇怪的 svm 行为 (e1071)

matlab - 生成特定范围内的随机数

r - 如何将多行 reshape 为具有多列的单行