r - 三明治+mlogit : `Error in ef/X : non-conformable arrays` when using `vcovHC()` to compute robust/clustered standard errors

在使用 mlogit() 拟合离散选择问题中的多项 Logit (MNL) 后，我尝试计算鲁棒/集群标准误差。不幸的是，我怀疑我遇到了问题，因为我使用 long 格式的数据(这在我的情况下是必须的)，并在 #Error in ef/X : non-conformable arrays 之后收到错误 sandwich::vcovHC( , "HC0") 。

数据

为了便于说明，请仔细考虑以下数据。它代表来自 5 个个体 (id_ind ) 的数据，他们在 3 个替代方案 (altern) 中进行选择。五个人每人选择了三次；因此我们有 15 种选择情况 ( id_choice )。每个替代方案都由两个通用属性( x1 和 x2 )表示，并且选择在 y 中注册(如果选择则为 1，否则为 0)。

df <- read.table(header = TRUE, text = "
id_ind id_choice altern           x1          x2 y
1       1         1      1  1.586788801  0.11887832 1
2       1         1      2 -0.937965347  1.15742493 0
3       1         1      3 -0.511504401 -1.90667519 0
4       1         2      1  1.079365680 -0.37267925 0
5       1         2      2 -0.009203032  1.65150370 1
6       1         2      3  0.870474033 -0.82558651 0
7       1         3      1 -0.638604013 -0.09459502 0
8       1         3      2 -0.071679538  1.56879334 0
9       1         3      3  0.398263302  1.45735788 1
10      2         4      1  0.291413453 -0.09107974 0
11      2         4      2  1.632831160  0.92925495 0
12      2         4      3 -1.193272276  0.77092623 1
13      2         5      1  1.967624379 -0.16373709 1
14      2         5      2 -0.479859282 -0.67042130 0
15      2         5      3  1.109780885  0.60348187 0
16      2         6      1 -0.025834772 -0.44004183 0
17      2         6      2 -1.255129594  1.10928280 0
18      2         6      3  1.309493274  1.84247199 1
19      3         7      1  1.593558740 -0.08952151 0
20      3         7      2  1.778701074  1.44483791 1
21      3         7      3  0.643191170 -0.24761157 0
22      3         8      1  1.738820924 -0.96793288 0
23      3         8      2 -1.151429915 -0.08581901 0
24      3         8      3  0.606695064  1.06524268 1
25      3         9      1  0.673866953 -0.26136206 0
26      3         9      2  1.176959443  0.85005871 1
27      3         9      3 -1.568225496 -0.40002252 0
28      4        10      1  0.516456176 -1.02081089 1
29      4        10      2 -1.752854918 -1.71728381 0
30      4        10      3 -1.176101700 -1.60213536 0
31      4        11      1 -1.497779616 -1.66301234 0
32      4        11      2 -0.931117325  1.50128532 1
33      4        11      3 -0.455543630 -0.64370825 0
34      4        12      1  0.894843784 -0.69859139 0
35      4        12      2 -0.354902281  1.02834859 0
36      4        12      3  1.283785176 -1.18923098 1
37      5        13      1 -1.293772990 -0.73491317 0
38      5        13      2  0.748091387  0.07453705 1
39      5        13      3 -0.463585127  0.64802031 0
40      5        14      1 -1.946438667  1.35776140 0
41      5        14      2 -0.470448172 -0.61326604 1
42      5        14      3  1.478763383 -0.66490028 0
43      5        15      1  0.588240775  0.84448489 1
44      5        15      2  1.131731049 -1.51323232 0
45      5        15      3  0.212145247 -1.01804594 0
")

问题

因此，我们可以使用 mlogit() 拟合 MNL 并提取其稳健的方差-协方差，如下所示:

library(mlogit)
library(sandwich)
mo <-  mlogit(formula = y ~ x1 + x2|0 , 
              method ="nr",  
              data =  df,
              idx  =  c("id_choice", "altern"))

sandwich::vcovHC(mo, "HC0")
#Error in ef/X : non-conformable arrays

正如我们所看到的，sandwich::vcovHC 产生了一个错误，表明 ef/X 不符合要求。其中 X <- model.matrix(x) 和 ef <- estfun(x, ...) 。查看 the source code on the mirror on GitHub 后，我发现问题来自以下事实:鉴于数据为长格式， ef 具有维度 15 x 2 ，而 X 具有 45 x 2 。

我的解决方法

鉴于演出必须继续，我正在使用从 sandwich 和 I adjusted to accommodate the Stata's output. 借用的一些函数手动计算鲁棒性和集群标准错误

> 稳健的标准错误

这些行的灵感来自 sandwich::meat() 函数。

psi<- estfun(mo)
k <- NCOL(psi)
n <- NROW(psi)
rval <-  (n/(n-1))* crossprod(as.matrix(psi))
vcov(mo) %*% rval %*% vcov(mo)

#            x1         x2
# x1 0.23050261 0.09840356
# x2 0.09840356 0.12765662

Stata 等效

qui clogit y x1 x2 ,group(id_choice) r
mat li e(V)
symmetric e(V)[2,2]
            y:         y:
            x1         x2
y:x1  .23050262
y:x2  .09840356  .12765662

> 集群标准错误

这里，考虑到每个人回答 3 个问题，个体之间很可能存在某种程度的相关性；因此在这种情况下应该优先选择聚类校正。下面我计算了这种情况下的聚类校正，并显示了与 clogit , cluster() 的 Stata 输出的等价性。

id_ind_collapsed <- df$id_ind[!duplicated(mo$model$idx$id_choice,)]
psi_2 <- rowsum(psi, group = id_ind_collapsed )

k_cluster <- NCOL(psi_2)
n_cluster <- NROW(psi_2)
rval_cluster <-  (n_cluster/(n_cluster-1))* crossprod(as.matrix(psi_2))
vcov(mo) %*% rval_cluster %*% vcov(mo)

#           x1        x2
# x1 0.1766707 0.1007703
# x2 0.1007703 0.1180004

Stata 等效项

qui clogit y x1 x2 ,group(id_choice) cluster(id_ind)
symmetric e(V)[2,2]
            y:         y:
            x1         x2
y:x1  .17667075
y:x2   .1007703  .11800038

问题:

我希望在 sandwich 生态系统中适应我的计算，这意味着不手动计算矩阵，而是实际使用 sandwich 函数。是否可以使其与此处描述的长格式模型一起使用？例如，直接提供 meat 和 bread 对象来执行计算？提前致谢。

PS:我注意到 there is a dedicated bread function in sandwich for mlogit ，但我找不到 meat 之类的东西，但无论如何我可能在这里遗漏了一些东西......

最佳答案

为什么 vcovHC 不适用于 mlogit

HC 协方差估计器类只能应用于具有单个线性预测器的模型，其中得分函数(又称估计函数)是所谓的“工作残差”和回归矩阵的乘积。 Zeileis (2006) 论文对此进行了详细解释(参见公式 7)，在包中以 vignette("sandwich-OOP", package = "sandwich") 的形式提供。 ?vcovHC 也指出了这一点，但没有很好地解释。我在 http://sandwich.R-Forge.R-project.org/reference/vcovHC.html 的文档中对此进行了改进现在:

The function meatHC is the real work horse for estimating the meat of HC sandwich estimators - the default vcovHC method is a wrapper calling sandwich and bread. See Zeileis (2006) for more implementation details. The theoretical background, exemplified for the linear regression model, is described below and in Zeileis (2004). Analogous formulas are employed for other types of models, provided that they depend on a single linear predictor and the estimating functions can be represented as a product of “working residual” and regressor vector (Zeileis 2006, Equation 7).

这意味着 vcovHC() 不适用于多项 Logit 模型，因为它们通常对单独的响应类别使用单独的线性预测变量。同样，不支持两部分或障碍模型等。

基本的“稳健”三明治协方差

通常，为了计算基本的 Eicker-Huber-White 三明治协方差矩阵估计器，最佳策略是使用 sandwich() 函数，而不是 vcovHC()功能。前者适用于任何具有 estfun() 和 bread() 方法的模型。

对于线性模型sandwich(..., adjustment = FALSE)(默认)和sandwich(..., adjustment = TRUE)对应于HC0和HC1，分别。在具有 n 个观测值和 k 个回归系数的模型中，前者使用 1/n 进行标准化，后者使用 1/(n-k) 进行标准化.

但是，Stata 在 logit 模型中除以 1/(n-1)，请参阅: Different Robust Standard Errors of Logit Regression in Stata and R 。据我所知，没有明确的理论理由专门使用其中一种或另一种调整。而且在中等大的样本中，这无论如何都没有什么区别。

备注:1/(n-1) 的调整不能直接作为选项在 sandwich() 中使用。然而，巧合的是，它是 vcovCL() 中的默认值，没有指定集群变量(即，将每个观测值视为一个单独的集群)。因此，如果您想获得与 Stata 完全相同的结果，这是一个方便的“技巧”。

聚类协方差

这可以通过vcovCL(..., cluster = ...)“照常”计算。对于 mlogit 模型，您只需考虑 cluster 变量只需要提供一次(而不是以长格式堆叠多次)。

复制 Stata 结果

使用您帖子中的数据和模型:

vcovCL(mo)
##            x1         x2
## x1 0.23050261 0.09840356
## x2 0.09840356 0.12765662
vcovCL(mo, cluster = df$id_choice[1:15])
##           x1        x2
## x1 0.1766707 0.1007703
## x2 0.1007703 0.1180004

关于r - 三明治+mlogit : `Error in ef/X : non-conformable arrays` when using `vcovHC()` to compute robust/clustered standard errors，我们在Stack Overflow上找到一个类似的问题： https://stackoverflow.com/questions/66412110/