r - Difficulty fitting piecewise linear data in R

I have the following data (cost of a product vs. time), which looks like this:

annum <- c(1903, 1904, 1905, 1906, 1907, 1908, 1909, 1910, 1911, 1912, 1913, 
    1914, 1915, 1916, 1917, 1918, 1919)
cost <- c(0.0000,  18.6140,  92.1278, 101.9393, 112.0808, 122.5521, 
    133.3532, 144.4843, 244.5052, 275.6068, 295.2592, 317.3145, 
    339.6527, 362.3537, 377.7775, 402.8443, 437.5539)

library(ggplot2)

mydata <- data.frame(annum, cost)

g <- ggplot(mydata, aes(x = annum, y = cost))
g <- g + geom_point()
g <- g + scale_y_continuous(labels = scales::dollar_format())
g

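The plot of this data looks piecewise linear to me: there is a step from 1904 to 1905; then a clear line from 1905 to 1910; then another step; and then a second line from 1911 to the end. (The first point, (1903, 0), is fictitious.)
I tried to model this with the segmented package, but instead of picking something like 1904.5 and 1910.5 as breakpoints, it finds two points between 1911 and 1912.
I have tried some other techniques (e.g. "brute force" and direct fitting from "The R Book"), but I clearly do not understand this as well as I need to. Any help would be greatly appreciated.
Ideally, I would end up with an equation for each segment and a single plot showing the piecewise fit together with confidence intervals for the fit.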

Best answer

One can use the package strucchange for this. Here is a simplified version of the code:

library("strucchange")

startyear <- 1903
cost <- c(0.0000,  18.6140,  92.1278, 101.9393, 112.0808, 122.5521, 
          133.3532, 144.4843, 244.5052, 275.6068, 295.2592, 317.3145, 
          339.6527, 362.3537, 377.7775, 402.8443, 437.5539)

ts <- ts(cost, start=1903)
plot(ts)

## for small data sets, consider reducing the minimum segment length h
bp <- breakpoints(ts ~ time(ts), data=ts, h = 5)

## BIC selection of breakpoints
plot(bp)
breakdates(bp)
fm1 <- lm(ts ~ time(ts) * breakfactor(bp), data=ts)
coef(fm1)

plot(ts, type="p")
lines(ts(fitted(fm1),  start = startyear),  col = 4)
lines(bp)
confint(bp)

lines(confint(bp))

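More information can be found in the package vignette or in one of the related publications, e.g. https://doi.org/10.18637/jss.v007.i02. With the package one can, for example, perform significance tests, estimate confidence intervals, or include covariates.
A segment length of 2 is not possible, because the residual variance cannot be estimated. Likewise, confidence intervals can only be estimated if the segments are long enough. That is why only one breakpoint is shown below, whereas the excellent answer of @Rui Barradas omits the confidence intervals but shows two breakpoints.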
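As a rough illustration of the significance tests mentioned above, the following sketch runs a supF test for a single structural change on the data from the question. It uses Fstats() and sctest() from strucchange; the data frame name d and the trimming value from = 5 are assumptions made for this example and are not part of the original answer.

```r
library("strucchange")

# Data from the question
year <- 1903:1919
cost <- c(0.0000,  18.6140,  92.1278, 101.9393, 112.0808, 122.5521,
          133.3532, 144.4843, 244.5052, 275.6068, 295.2592, 317.3145,
          339.6527, 362.3537, 377.7775, 402.8443, 437.5539)
d <- data.frame(year, cost)

# F statistics for a single break at each candidate position;
# from = 5 (an assumed choice) keeps at least 5 observations on each side
fs <- Fstats(cost ~ year, data = d, from = 5)

# supF test of the null hypothesis "no structural change"
res <- sctest(fs, type = "supF")
res

# Plot the sequence of F statistics
plot(fs)
```

A small p-value would speak against a single straight line for the whole series; the test does not by itself say how many breaks there are, which is what breakpoints() is for.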
[Figure: fit with one breakpoint]

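Here is an example without the first two points and with an additional assumption (het.err = FALSE in confint) to estimate confidence intervals in the case of small segments: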

library("strucchange")

startyear <- 1905
cost <- c(92.1278, 101.9393, 112.0808, 122.5521, 
          133.3532, 144.4843, 244.5052, 275.6068, 295.2592, 317.3145, 
          339.6527, 362.3537, 377.7775, 402.8443, 437.5539)

ts <- ts(cost, start=startyear)
bp <- breakpoints(ts ~ time(ts), data=ts, h = 5)
fm1 <- lm(ts ~ time(ts) * breakfactor(bp), data=ts)
plot(ts, type="p")
lines(ts(fitted(fm1),  start = startyear),  col = 4)
lines(confint(bp, het.err=FALSE))
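
The question also asked for an equation per segment and a confidence band around the fitted lines, which is different from the confidence interval of the break date drawn by lines(confint(bp)). A minimal base R sketch is given here, assuming the break is placed by eye after 1910 (as described in the question) rather than estimated; the names year, seg, fit and band are chosen for this example only.

```r
# Data from the question, without the first two points (1903, 1904)
year <- 1905:1919
cost <- c(92.1278, 101.9393, 112.0808, 122.5521,
          133.3532, 144.4843, 244.5052, 275.6068, 295.2592, 317.3145,
          339.6527, 362.3537, 377.7775, 402.8443, 437.5539)

# Assumption: the break is fixed after 1910; it is not estimated here
seg <- factor(ifelse(year <= 1910, "1905-1910", "1911-1919"))

# One intercept and one slope per segment, so each segment's
# equation can be read directly from the coefficients
fit <- lm(cost ~ 0 + seg + seg:year)
coef(fit)

# Pointwise 95% confidence band for the fitted lines
band <- predict(fit, interval = "confidence")

plot(year, cost)
for (s in levels(seg)) {
  idx <- seg == s
  lines(year[idx], band[idx, "fit"], col = 4)
  lines(year[idx], band[idx, "lwr"], col = 4, lty = 2)
  lines(year[idx], band[idx, "upr"], col = 4, lty = 2)
}
```

Because a single lm is fitted, the band relies on one common residual variance for both segments. Once a break has been estimated, breakfactor(bp) could take the place of the hand-made factor seg.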

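Edits:

Fixed an error in the original version

Added coefficients and confidence intervals

Added images

Added an example that omits the first 2 values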

Regarding "r - Difficulty fitting piecewise linear data in R", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/70141883/