r - Fitting a truncated lognormal distribution with fitdistrplus

Tags: r error-handling statistics data-fitting fitdistrplus

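While trying to fit a lognormal distribution to truncated data, I found and followed these two Stack Overflow posts:

Fitting a lognormal distribution to truncated data in R
Fitting a truncated lognormal distribution in R

However, that solution no longer seems to work, because the dtrunc and ptrunc functions from the truncdist package now fail the checks that fitdistrplus runs on the distribution functions.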

dtruncated_log_normal <- function(x, a,b, meanlog, sdlog)
  dtrunc(x, "lnorm", a=a, b=b, meanlog=meanlog, sdlog=sdlog)

ptruncated_log_normal <- function(q, a,b, meanlog, sdlog)
  ptrunc(q, "lnorm", a=a, b=b, meanlog=meanlog, sdlog=sdlog)

fit <- fitdist(s, "truncated_log_normal", start=list(a=0.001, b=90, meanlog=mean(log(s)), sdlog=sd(log(s))))

We hit essentially every error those checks can raise. The call returns:

Error in fitdist(s, "truncated_log_normal", start = list(a = 0.001, b = 90,  : 
  the function mle failed to estimate the parameters, 
                with the error code 100
In addition: Warning messages:
1: In fitdist(s, "truncated_log_normal", start = list(a = 0.001, b = 90,  :
  The dtruncated_log_normal function should return a vector of with NaN values when input has inconsistent values and not raise an error
2: In fitdist(s, "truncated_log_normal", start = list(a = 0.001, b = 90,  :
  The ptruncated_log_normal function should return a vector of with NaN values when input has inconsistent parameters and not raise an error

A sample of the vector, which has more than 2000 elements:

> dput(head(s,150))
c(88.443, 89.296, 89.327, 87.776, 89.405, 89.824, 89.997, 87.678, 
89.665, 88.814, 88.841, 89.728, 89.365, 89.476, 89.189, 88.251, 
88.939, 89.945, 89.567, 89.613, 89.317, 89.622, 87.674, 89.19, 
89.782, 89.891, 89.954, 89.556, 89.093, 89.637, 89.052, 87.395, 
87.835, 89.357, 87.733, 89.459, 88.197, 88.539, 88.564, 87.857, 
88.74, 88.955, 89.691, 88.102, 89.635, 89.116, 89.584, 88.288, 
86.95, 89.182, 89.435, 88.93, 87.567, 89.083, 88.52, 88.897, 
89.54, 88.557, 89.269, 89.854, 89.31, 88.274, 89.126, 89.431, 
88.257, 88.872, 88.978, 89.03, 87.434, 88.305, 89.656, 87.556, 
89.209, 89.508, 87.781, 88.068, 89.933, 87.256, 88.906, 89.067, 
88.92, 87.947, 88.196, 88.951, 89.594, 88.378, 87.482, 88.817, 
89.65, 89.392, 89.932, 87.896, 89.909, 89.265, 89.954, 89.827, 
87.49, 87.786, 89.208, 89.728, 88.905, 87.566, 86.612, 88.363, 
87.457, 87.639, 88.907, 88.425, 87.244, 88.546, 88.221, 89.293, 
87.469, 87.31, 89.107, 88.442, 89.133, 88.812, 88.418, 89.456, 
88.512, 89.514, 87.446, 88.374, 89.282, 87.415, 89.004, 87.627, 
89.107, 89.168, 89.589, 89.288, 88.496, 89.807, 87.518, 88.796, 
88.001, 87.322, 87.353, 88.055, 88.81, 88.456, 87.876, 87.7, 
88.675, 88.996, 89.479, 86.781, 86.928, 87.356)

Is there a solution that works in 2023? My data has been cleaned and contains no inconsistent, empty, NA, or otherwise odd values.

Best answer

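The first problem is the constraints on the parameters: the model requires 0 < a < min(s) < max(s) < b and 0 < sdlog. The optimizer does not know this and runs into trouble, because values that violate these constraints generate a lot of errors.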

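One way to handle this kind of problem is to reparameterize so that the parameters are unconstrained. For example, you can use log(a) as a parameter, which guarantees that a is positive, log(b-a) as another parameter, which guarantees that b > a, and log(sdlog) as another.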
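As a minimal sketch of that reparameterization (the helper name `to_constrained` is hypothetical and not part of fitdistrplus or truncdist), the mapping from the unconstrained values the optimizer sees back to the model parameters could look like this:

```r
# Hypothetical helper: map unconstrained optimizer parameters
# back to the constrained model parameters.
to_constrained <- function(loga, logbminusa, logsdlog) {
  a <- exp(loga)              # a > 0 for any real loga
  b <- a + exp(logbminusa)    # b > a for any real logbminusa
  sdlog <- exp(logsdlog)      # sdlog > 0 for any real logsdlog
  c(a = a, b = b, sdlog = sdlog)
}

# Illustrative values only
to_constrained(loga = log(85), logbminusa = log(5), logsdlog = log(0.01))
```

Note that this only enforces a > 0, b > a and sdlog > 0. The remaining requirements, a < min(s) and max(s) < b, still have to be handled by the density itself, for example by returning 0 for observations outside [a, b].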

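The next problem is harder. For some of the values the optimizer tries, the probability that the original distribution falls within the truncation interval evaluates to zero. Specifically, I saw this while debugging with a = 0.001, b = 90, sdlog = 0.009529981 (your starting values) and meanlog = 5.176146 (somewhat larger than the starting value).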

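The probability is not really zero; it is a tiny value that rounds down to zero. The fix is to work with log probabilities instead of probabilities, but I don't think you can use that option here: the code where the problem occurs is in truncdist, but the parameters come from fitdistrplus. The two packages would need to work together to solve this, or perhaps you could write specialized versions of dtrunc and ptrunc that do it.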
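The underflow can be reproduced with base R alone. The sketch below uses the parameter values quoted above and compares the endpoint CDF values on the probability scale with the same values on the log scale (`log.p = TRUE` is the documented argument of `plnorm`):

```r
# Parameter values quoted in the answer
a <- 0.001
b <- 90
meanlog <- 5.176146
sdlog <- 0.009529981

# CDF at the endpoints on the probability scale:
# both values are too small to represent and come out as 0
p <- plnorm(c(a, b), meanlog, sdlog)
p[2] - p[1]   # the denominator of the truncated density

# The same CDF values on the log scale are still finite
logp <- plnorm(c(a, b), meanlog, sdlog, log.p = TRUE)
is.finite(logp)
```

Dividing a density by `p[2] - p[1]` is therefore a division by zero here, while the log-scale values still carry usable information.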

Edited to add:

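The basic idea dtrunc uses is that the truncated density equals the regular density divided by the probability of being in the truncation interval, i.e. d/(pb - pa), where d is the full density and pb and pa are the CDF values at the endpoints. The numerical problem is that pb == pa because of rounding.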

The way around this is to rescale everything and work on the log scale. That is, use (d/pb)/(1 - pa/pb) = exp(log(d) - log(pb) - log1p(-exp(log(pa) - log(pb)))).
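Written out step by step, with $d$ the untruncated density at $x$ and $p_a$, $p_b$ the untruncated CDF values at the endpoints (the subscripted symbols are just notation for pa and pb, assuming $0 \le p_a < p_b$), the identity reads:

```latex
\begin{align*}
\frac{d}{p_b - p_a}
  &= \frac{d / p_b}{1 - p_a / p_b} \\
\log \frac{d}{p_b - p_a}
  &= \log d - \log p_b - \log\!\left(1 - \frac{p_a}{p_b}\right) \\
  &= \log d - \log p_b
     - \operatorname{log1p}\!\left(-e^{\log p_a - \log p_b}\right)
\end{align*}
```

Because $p_a < p_b$, the exponent $\log p_a - \log p_b$ is negative, so the exponential lies in $[0, 1)$ and the argument of log1p lies in $(-1, 0]$. No step requires forming $p_b - p_a$ on the probability scale.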

The code below does this and also solves the first problem. It does not use the truncdist package; it does the calculation with base functions.

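But that is not enough! The problem now is that fitdist forces optim to compute the Hessian matrix, and numerical problems make that fail. There is a workaround: I'll define a "custom" optimization function that is just the regular optim() with the hessian argument forced to FALSE.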

The result is below. Sorry, no standard errors.

library(fitdistrplus) 
#> Loading required package: MASS
#> Loading required package: survival

s <- c(88.443, 89.296, 89.327, 87.776, 89.405, 89.824, 89.997, 87.678, 
       89.665, 88.814, 88.841, 89.728, 89.365, 89.476, 89.189, 88.251, 
       88.939, 89.945, 89.567, 89.613, 89.317, 89.622, 87.674, 89.19, 
       89.782, 89.891, 89.954, 89.556, 89.093, 89.637, 89.052, 87.395, 
       87.835, 89.357, 87.733, 89.459, 88.197, 88.539, 88.564, 87.857, 
       88.74, 88.955, 89.691, 88.102, 89.635, 89.116, 89.584, 88.288, 
       86.95, 89.182, 89.435, 88.93, 87.567, 89.083, 88.52, 88.897, 
       89.54, 88.557, 89.269, 89.854, 89.31, 88.274, 89.126, 89.431, 
       88.257, 88.872, 88.978, 89.03, 87.434, 88.305, 89.656, 87.556, 
       89.209, 89.508, 87.781, 88.068, 89.933, 87.256, 88.906, 89.067, 
       88.92, 87.947, 88.196, 88.951, 89.594, 88.378, 87.482, 88.817, 
       89.65, 89.392, 89.932, 87.896, 89.909, 89.265, 89.954, 89.827, 
       87.49, 87.786, 89.208, 89.728, 88.905, 87.566, 86.612, 88.363, 
       87.457, 87.639, 88.907, 88.425, 87.244, 88.546, 88.221, 89.293, 
       87.469, 87.31, 89.107, 88.442, 89.133, 88.812, 88.418, 89.456, 
       88.512, 89.514, 87.446, 88.374, 89.282, 87.415, 89.004, 87.627, 
       89.107, 89.168, 89.589, 89.288, 88.496, 89.807, 87.518, 88.796, 
       88.001, 87.322, 87.353, 88.055, 88.81, 88.456, 87.876, 87.7, 
       88.675, 88.996, 89.479, 86.781, 86.928, 87.356)

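# Density of the truncated lognormal on an unconstrained scale:
# loga = log(a), logbminusa = log(b - a), logsdlog = log(sdlog).
dtruncated_log_normal <- function(x, loga, logbminusa, meanlog, logsdlog) {
  # Map back to the natural parameters
  a <- exp(loga)
  b <- exp(logbminusa) + a
  sdlog <- exp(logsdlog)
  # Points outside [a, b] (or non-finite) get density 0
  goodvals <- is.finite(x) & a <= x & x <= b
  # Log CDF at both endpoints, to avoid underflow in pb - pa
  logpab <- plnorm(c(a, b), meanlog, sdlog, log.p = TRUE)
  result <- rep(0, length(x))
  logd <- dlnorm(x[goodvals], meanlog, sdlog, log = TRUE)
  # d / (pb - pa), computed on the log scale
  result[goodvals] <- exp(logd - logpab[2] - log1p(-exp(logpab[1] - logpab[2])))
  result
}

# Placeholder CDF that only returns NaN
ptruncated_log_normal <- function(q, loga, logbminusa, meanlog, logsdlog) rep(NaN, length(q))

# optim() with the Hessian switched off, whatever the caller requests
myoptim <- function(..., hessian) {
  optim(..., hessian = FALSE)
}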

fit <- fitdist(s, "truncated_log_normal",
               start = list(loga = log(min(s) - 1),
                            logbminusa = log(max(s) - min(s) + 2),
                            meanlog = mean(log(s)),
                            logsdlog = log(sd(log(s)))),
               custom.optim = myoptim)

fit
#> Fitting of the distribution ' truncated_log_normal ' by maximum likelihood 
#> Parameters:
#>             estimate Std. Error
#> loga        4.439900         NA
#> logbminusa  1.654508         NA
#> meanlog     4.490567         NA
#> logsdlog   -4.354114         NA

# Convert back to the original scale:
ests <- as.list(fit$estimate)
with(ests, {   
  a <- exp(loga)
  b <- exp(logbminusa) + a
  sdlog <- exp(logsdlog)
  c(a = a, b = b, meanlog = meanlog, sdlog = sdlog)
  })
#>           a           b     meanlog       sdlog 
#> 84.76649227 89.99700006  4.49056699  0.01285382

Created on 2023-03-19 with reprex v2.0.2
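One possible sanity check, which is not part of the original answer, is to confirm that the log-scale density integrates to roughly 1 over [a, b] at the reported estimates. The sketch below is self-contained: `dtlnorm_log` is a hypothetical helper that applies the same formula in the natural parameters, and the numbers are the rounded estimates printed above.

```r
# Hypothetical helper: truncated lognormal density in natural parameters,
# computed on the log scale as in the answer's code.
dtlnorm_log <- function(x, a, b, meanlog, sdlog) {
  logp <- plnorm(c(a, b), meanlog, sdlog, log.p = TRUE)
  inside <- is.finite(x) & a <= x & x <= b
  out <- numeric(length(x))
  out[inside] <- exp(dlnorm(x[inside], meanlog, sdlog, log = TRUE) -
                       logp[2] - log1p(-exp(logp[1] - logp[2])))
  out
}

# Rounded estimates taken from the printed output
est <- list(a = 84.76649227, b = 89.99700006,
            meanlog = 4.49056699, sdlog = 0.01285382)

# Integrate the density over the truncation interval
total <- integrate(
  function(x) dtlnorm_log(x, est$a, est$b, est$meanlog, est$sdlog),
  lower = est$a, upper = est$b
)$value
total
```

A value close to 1 suggests the normalization is behaving as intended. It says nothing about how well the model fits the data.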

Regarding "r - Fitting a truncated lognormal distribution with fitdistrplus", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/75781356/
