我有一些 CRP(C react 蛋白)的左删失数据,我想知道如何估算低于检测限的值,估算值将在所需范围内(此处:0<估算值<0.2)。
我正在尝试使用“imputeLCMD”包来实现此目的,但由于“imputeLCMD”及其所有依赖项的安装稍微复杂,我也愿意听到其他方法。
请考虑以下 MWE:
# Load libraries
library(dplyr)
library(imputeLCMD)
# Assign the dputted random data to a data frame
df <- structure(list(participant_id = 1:10, CRP = c("2.9", "<0.2",
"<0.2", "8.8", "9.4", "0.5", "5.3", "8.9", "5.5", "<0.2"), LDL_cholesterol = c(195.7,
145.3, 167.8, 157.3, 110.3, 190, 124.6, 104.2, 132.8, 195.5),
fasting_glucose = c(114.5, 104.6, 102, 119.7, 102.8, 105.4,
97.2, 99.7, 84.5, 77.4), creatinine = c(1.5, 1.4, 1.2, 1.3,
0.5, 1, 1.3, 0.7, 0.8, 0.7)), row.names = c(NA, -10L), class = c("tbl_df",
"tbl", "data.frame"))
上面输入并在下面展示的模拟数据框与我的真实数据类似。
# View the random data
head(df, n=5)
#> # A tibble: 5 × 5
#> participant_id CRP LDL_cholesterol fasting_glucose creatinine
#> <int> <chr> <dbl> <dbl> <dbl>
#> 1 1 2.9 196. 114. 1.5
#> 2 2 <0.2 145. 105. 1.4
#> 3 3 <0.2 168. 102 1.2
#> 4 4 8.8 157. 120. 1.3
#> 5 5 9.4 110. 103. 0.5
但是,我有点不知道如何从这里继续。为了使用 imputeLCMD 包估算这些左删失数据的缺失值,我假设必须首先将左删失值转换为 NA:
df <- df %>%
mutate(CRP = na_if(CRP, "<0.2")) %>%
mutate(CRP = as.numeric(CRP))
head(df, n=5)
#> # A tibble: 5 × 5
#> participant_id CRP LDL_cholesterol fasting_glucose creatinine
#> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2.9 196. 114. 1.5
#> 2 2 NA 145. 105. 1.4
#> 3 3 NA 168. 102 1.2
#> 4 4 8.8 157. 120. 1.3
#> 5 5 9.4 110. 103. 0.5
如果我现在运行 imputeLCMD 包中的包装器,我确实会得到某种结果:
# Impute the missing data
df_imputed <- impute.wrapper.SVD(df, K = 4) %>% as.data.frame()
# Round the result
df_imputed <- df_imputed %>% mutate_at(vars(CRP), ~round(., digits = 1))
# Place the original CRP next to the imputed one for comparison
df_imputed <- df_imputed %>% mutate(original_CRP = df$CRP)
df_imputed <- df_imputed %>% select(1,2,6,3,4,5)
# Display the result
head(df_imputed, n=5)
#> participant_id CRP original_CRP LDL_cholesterol fasting_glucose creatinine
#> 1 1 2.9 2.9 195.7 114.5 1.5
#> 2 2 5.8 NA 145.3 104.6 1.4
#> 3 3 5.9 NA 167.8 102.0 1.2
#> 4 4 8.8 8.8 157.3 119.7 1.3
#> 5 5 9.4 9.4 110.3 102.8 0.5
创建于 2023 年 5 月 27 日 reprex v2.0.2
我的问题:
我不知道如何为 imputeLCMD 包设置参数,以便 估算值应为:>0 AND <0.2。
如何确保imputeLCMD不会将participant_id本身作为数字数据 进入插补计算?
我见过一些great alternative approaches for left-censored data所以,但很高兴知道我在“imputeLCMD”上做错了什么(或正确)。
最佳答案
您可以使用最大似然估计来估计分布 您的数据,同时使用以下一些观察结果的信息 定量下限 (LLOQ)。为了获得合理的估算值,您 然后从估计分布的 < LLOQ 区域进行采样。
让我们用一些模拟的对数正态分布数据来演示:
set.seed(42)
# Simulate some underlying log-normal CRP values
# Parameters from: https://stackoverflow.com/a/63938717/4550695
crp <- rlnorm(10000, 1.355, 1.45)
summary(crp)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.011 1.418 3.842 11.158 10.139 2060.559
# Apply a lower limit of quantification to observations
lloq <- 3
crp_obs <- replace(crp, crp < lloq, NA)
summary(crp_obs)
#> Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
#> 3.002 5.019 8.738 18.669 18.107 2060.559 4331
对于估计步骤,我们需要定义一个目标函数;这 具有左删失值的对数正态分布的对数似然:
# Any missing values in x are assumed to be known to be < LLOQ
left_censored_log_normal_log_likelihood <- function(mu, sigma, x, lloq) {
sum(dlnorm(na.omit(x), mu, sigma, log = TRUE)) +
sum(is.na(x)) * plnorm(lloq, mu, sigma, log = TRUE)
}
然后,根据观察到的数据,我们最大化对数似然:
mean_sd <- function(x, ...) {
c(mean(x, ...), sd(x, ...))
}
# Initial values from observed data
theta0 <- mean_sd(log(crp_obs), na.rm = TRUE)
theta0
#> [1] 2.350642 0.924159
fit <- optim(theta0, function(theta) {
-left_censored_log_normal_log_likelihood(theta[1], theta[2], crp_obs, lloq)
})
我们已经可以看到我们恢复了底层参数:
str(fit)
#> List of 5
#> $ par : num [1:2] 1.34 1.45
#> $ value : num 26786
#> $ counts : Named int [1:2] 69 NA
#> ..- attr(*, "names")= chr [1:2] "function" "gradient"
#> $ convergence: int 0
#> $ message : NULL
最后,从拟合分布的 < LLOQ 区域中采样,并 创建完整的数据集:
n <- sum(is.na(crp_obs))
p <- runif(n, 0, plnorm(lloq, fit$par[1], fit$par[2]))
y <- qlnorm(p, fit$par[1], fit$par[2])
summary(y)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.00557 0.62488 1.20758 1.32565 1.97396 2.99494
crp_imp <- replace(crp_obs, which(is.na(crp_obs)), y)
我们已经成功恢复了底层分布:
mean_sd(log(crp_imp))
#> [1] 1.337859 1.461132
summary(crp_imp)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.0056 1.4200 3.8418 11.1576 10.1394 2060.5585
summary(crp)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 0.011 1.418 3.842 11.158 10.139 2060.559
我现在已将其纳入实验 censlm封装:
library(censlm) # remotes::install_github("mikmart/censlm")
#> Loading required package: survival
set.seed(42)
# Simulate left-censored log-normal CRP observations
crp <- rlnorm(10000, 1.355, 1.450)
lloq <- 3
obs <- replace(crp, crp < lloq, lloq)
crpdf <- data.frame(crp, lloq, obs)
# Fit censored linear model and extract (random) imputations
fit <- clm(log(obs) ~ 1, left = log(lloq))
imp <- exp(imputed(fit))
summary(cbind(crpdf, imp))
#> crp lloq obs imp
#> Min. : 0.011 Min. :3 Min. : 3.000 Min. : 0.0056
#> 1st Qu.: 1.418 1st Qu.:3 1st Qu.: 3.000 1st Qu.: 1.4200
#> Median : 3.842 Median :3 Median : 3.842 Median : 3.8418
#> Mean : 11.158 Mean :3 Mean : 11.883 Mean : 11.1576
#> 3rd Qu.: 10.139 3rd Qu.:3 3rd Qu.: 10.139 3rd Qu.: 10.1394
#> Max. :2060.559 Max. :3 Max. :2060.559 Max. :2060.5585
MLE 模型拟合是通过 survival::survreg()
完成的:
summary(fit)
#>
#> Call:
#> survreg(formula = Surv(log(obs), log(obs) > log(lloq), type = "left") ~
#> 1, dist = "gaussian")
#> Value Std. Error z p
#> (Intercept) 1.3421 0.0168 79.8 <2e-16
#> Log(scale) 0.3749 0.0103 36.3 <2e-16
#>
#> Scale= 1.45
#>
#> Gaussian distribution
#> Loglik(model)= -13460.2 Loglik(intercept only)= -13460.2
#> Number of Newton-Raphson Iterations: 4
#> n= 10000
关于r - 在 R 中,如何将左删失缺失数据估算在所需范围内(例如 0<估算值<0.2),我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/76346589/