r - Stan 模型后验预测超出可能的数据范围

标签 r stan rstan

我现在在 Stan 中学习建模的绳索很有趣。现在我正在努力研究我的受试者间和受试者内混合析因实验设计模型。有不同的受试者组,每个受试者都表示他们对三种不同饮料(水、无咖啡因和咖啡)中的每一种饮料减少咖啡因戒断的期望程度。结果变量——减少戒断的预期——通过视觉模拟量表从 0 到 10 进行测量,其中 0 表示没有减少戒断的预期,10 表示非常高的戒断减少预期。我想测试三种不同饮料的预期戒断减少潜力是否存在组间差异。

这是数据

df <- data.frame(id = rep(1:46, each = 3),
                 group = c(3,3,3,1,1,1,3,3,3,1,1,1,3,3,3,1,1,1,3,3,3,2,2,2,1,1,1,3,3,3,3,3,3,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,2,2,2,2,2,2,3,3,3,1,1,1,2,2,2,3,3,3,2,2,2,3,3,3,3,3,3,2,2,2,3,3,3,3,3,3,1,1,1,3,3,3,3,3,3,1,1,1,2,2,2,2,2,2,1,1,1,2,2,2,2,2,2,1,1,1,1,1,1,2,2,2,2,2,2,1,1,1,1,1,1,3,3,3,1,1,1,3,3,3),
                 bevType = rep(c(3,2,1), times = 46),
                 score = c(2.9,1.0,0.0,9.5,5.0,4.5,9.0,3.0,5.0,5.0,0.0,3.0,9.5,2.0,3.0,8.5,0.0,6.0,5.2,3.0,4.0,8.4,7.0,2.0,10.0,0.0,3.0,7.3,1.0,1.8,8.5,2.0,9.0,10.0,5.0,10.0,8.3,2.0,5.0,6.0,0.0,5.0,6.0,0.0,5.0,10.0,0.0,5.0,6.8,1.0,4.8,8.0,1.0,4.0,7.0,4.0,6.0,6.5,1.0,3.1,9.0,1.0,0.0,6.0,0.0,2.0,9.5,4.0,6.0,8.0,1.0,3.8,0.4,0.0,7.0,7.0,0.0,3.0,9.0,2.0,5.0,9.5,2.0,7.0,7.9,5.0,4.9,8.0,1.0,1.0,9.3,5.0,7.9,6.5,2.0,3.0,8.0,2.0,6.0,10.0,0.0,5.0,6.0,0.0,5.0,6.8,0.1,7.0,8.0,3.0,9.1,8.2,0.0,7.9,8.2,5.0,0.0,9.2,1.0,3.1,9.1,3.0,0.6,5.7,2.0,5.1,7.0,0.0,7.4,8.0,1.0,1.5,9.1,4.0,4.3,8.5,8.0,5.0))

现在是模型。该模型有一个总均值参数 a,一个分类预测变量,表示组与总均值的偏差 bGroup,一个表示不同饮料类型与总均值的偏差的术语 bBev,每个受试者的截距 bSubj 的术语,以及饮料交互组的术语 bGxB。我还估计了每种饮料类型的单独噪声参数。

为了允许后验预测检查,我使用 generated quantities block 和 normal_rng 函数从联合后验中提取。

### Step 1: Put data into list
dList <- list(N = 138,
              nSubj = 46,
              nGroup = 3,
              nBev = 3,
              sIndex = df$id,
              gIndex = df$group,
              bIndex = df$bevType,
              score = df$score,
              gMean = 4.718841,
              gSD = 3.17)


#### Step 1 model
write("
      data{
      int<lower=1> N;
      int<lower=1> nSubj;
      int<lower=1> nGroup;
      int<lower=1> nBev;
      int<lower=1,upper=nSubj> sIndex[N];
      int<lower=1,upper=nGroup> gIndex[N];
      int<lower=1,upper=nBev> bIndex[N];
      real score[N];
      real gMean;
      real gSD;
      }

      parameters{
      real a;
      vector[nSubj] bSubj;
      vector[nGroup] bGroup;
      vector[nBev] bBev;
      vector[nBev] bGxB[nGroup];      // vector of vectors, stan no good with matrix
      vector[nBev] sigma;  
      real<lower=0> sigma_a;
      real<lower=0> sigma_s;
      real<lower=0> sigma_g;
      real<lower=0> sigma_b;
      real<lower=0> sigma_gb;
      }

      model{
      vector[N] mu;

      //hyper-priors
      sigma_s ~ normal(0,10);     
      sigma_g ~ normal(0,10);
      sigma_b ~ normal(0,10);
      sigma_gb ~ normal(0,10);


      //priors
      sigma ~ cauchy(0,1);
      a ~ normal(gMean, gSD);
      bSubj ~ normal(0, sigma_s);
      bGroup ~ normal(0,sigma_g);
      bBev ~ normal(0,sigma_b);
      for (i in 1:nGroup) {               //hierarchical prior on interaction
      bGxB[i] ~ normal(0, sigma_gb);
      }

      // likelihood
      for (i in 1:N){
      score[i] ~ normal(a + bGroup[gIndex[i]] + bBev[bIndex[i]] + bSubj[sIndex[i]] + bGxB[gIndex[i]][bIndex[i]], sigma[bIndex[i]]);
      }
      }

      generated quantities{
      real y_draw[N];
      for (i in 1:N) {
      y_draw[i] = normal_rng(a + bGroup[gIndex[i]] + bBev[bIndex[i]] + bSubj[sIndex[i]] + bGxB[gIndex[i]][bIndex[i]], sigma[bIndex[i]]);
      }
      }
      ", file = "temp.stan")

##### Step 3: generate the chains
mod <-  stan(file = "temp.stan",
             data = dList,
             iter = 5000,  
             warmup = 3000, 
             cores = 1,
             chains = 1)

接下来,我们从联合后验中提取绘图,并生成组平均值、95% HPDI 上限和下限的估计值。首先我们需要一个函数来计算 HPDI

HPDIFunct <- function (vector) { 
  sortVec <- sort(vector)
  ninetyFiveVec <- ceiling(.95*length(sortVec))
  fiveVec <- length(sortVec) - length(ninetyFiveVec)
  diffVec <- sapply(1:fiveVec, function (i) sortVec[i + ninetyFiveVec] - sortVec[i])
  minVal <- sortVec[which.min(diffVec)]
  maxVal <- sortVec[which.min(diffVec) + ninetyFiveVec]
  return(list(sortVec, minVal, maxVal))
}

现在从后部提取平局

#### Step 5: Posterior predictive checks
y_draw <- data.frame(extract(mod, pars = "y_draw"))


And plot the mean, lower HPDI and upper HPDI draws of these draws against the actual data. 

df$drawMean <- apply(y_draw, 2, mean)
df$HPDI_Low <- apply(y_draw, 2, function(i) HPDIFunct(i)[[2]][1])
df$HPDI_Hi <- apply(y_draw, 2, function(i) HPDIFunct(i)[[3]][1])

### Step 6: plot posterior draws against actual data
ggplot(df, aes(x = factor(bevType), colour = factor(group))) +
       geom_jitter(aes(y = score), shape = 1, position = position_dodge(width=0.9)) +
       geom_point(aes(y = drawMean), position = position_dodge(width=0.9), stat = "summary", fun.y = "mean", shape = 3, size = 3, stroke = 2) +
       geom_point(aes(y = HPDI_Low), position = position_dodge(width=0.9), stat = "summary", fun.y = "mean", shape = 1, size = 3, stroke = 1) +
       geom_point(aes(y = HPDI_Hi), position = position_dodge(width=0.9), stat = "summary", fun.y = "mean", shape = 1, size = 3, stroke = 1) +
       scale_colour_manual(name = "Experimental Group", labels = c("Group 1", "Group 2", "Group 3"), values = c("#616a6b", "#00AFBB", "#E7B800")) +  
       scale_x_discrete(labels = c("Water", "Decaf", "Coffee")) + 
       labs(x = "Beverage Type", y = "Expectancy of Withdrawal Alleviation") +
       scale_y_continuous(breaks = seq(0,10,2)) +
       theme(axis.text.x = element_text(size = 12),
             axis.title.x = element_text(face = "bold"),
             axis.title.y = element_text(face = "bold"),
             axis.text.y = element_text(size = 12),
             legend.title = element_text(size = 13, face = "bold"))

enter image description here

查看图表,对于水资源预期,该模型似乎很好地代表了数据的中心(十字)和分布(空心圆)。但这打破了 Decaf 和 Coffee 的预期。对于 Decaf 预期,较低的 HPDI 低于可能值的范围(下限 = 0),并且从后部绘制的图的分布(在每组中由空心圆圈表示)太大。 Coffee 组的 HPDI 上限也高于数据范围(上限 = 10),对于实际数据而言,分布太大。

所以我的问题是:

如何限制从关节后方到数据实际范围的绘图?

在 Stan 中是否有某种蛮力方法来限制从后部抽签?或者对三种饮料条件下的方差差异进行更具适应性的估计是否会更有效(在这种情况下,这更像是一个 CV 问题而不是 SO 问题)?

最佳答案

约束后验变量的标准方法是使用链接函数对其进行转换。这就是逻辑回归和泊松回归等广义线性模型 (GLM) 的工作方式。例如,要从正数到无约束,我们使用对数变换。要从 (0, 1) 中的概率变为不受约束,我们使用对数几率变换。

如果您的结果是 1-10 范围内的序数值,则尊重该数据范围的常见方法是 ordinal logistic regression .

关于r - Stan 模型后验预测超出可能的数据范围,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/56828887/

相关文章:

r - 查找(并返回)满足(逻辑)测试的列表的第一个元素

stan - 如何在Stan模型中传递不同长度的向量列表和不同维度的矩阵列表?

python - PyStan API 中的变分推理?

r - 需要修复广义帕累托分布的 Stan 代码以接受实参而不是向量

r - 从R中的 "rstanarm"包获取标准化系数?

regex - 在字符串中仅两个下划线之间提取数字模式

r - 如何根据频率表绘制密度

R函数用阴影绘制不等式

r - 在 rstan 中指定矩阵的先验分布

r - 运行 rstan 程序时编译代码出错(错误 127)