r - 修正 lm 函数中的稳健/聚类标准错误或替换结果

标签 r lm

交叉发布于 CrossValidated .
前阵子问this question ,这是关于在使用 IV/2SLS 时纠正标准错误,并且第一阶段有一个 tobit 分布,我从 jay.sf 得到了一个惊人的答案(底部的示例数据)。他为我提供了以下功能:

vcov2sls <- function(s1, s2, data, type=2) {
  ## get y names
  y1.nm <- gsub(".*=\\s(.*)(?=\\s~).*", "\\1", deparse(s1$call)[1], perl=TRUE)
  y2.nm <- as.character(s2$terms)[2]
  ## auxilliary model matrix
  X <- cbind(`(Intercept)`=1, data[, y1.nm, F], model.matrix(s2)[,-(1:2)])
  ## get y
  y <- data[, y2.nm] 
  ## betas second stage
  b <- s2$coefficients
  ## calculate corrected sums of squares
  sse <- sum((y - b %*% t(X))^2)
  rmse <- sqrt(mean(s2$residuals^2))  ## RMSE 2nd stage
  V0 <- vcov(s2)  ## biased vcov 2nd stage
  dof <- s2$df.residual  ## degrees of freedom 2nd stage
  ## calculate corrected RMSE
  rmse.c <- sqrt(sse/dof)
  ## calculate corrected vcov
  V <- (rmse.c/rmse)^2 * V0
  return(V)
}
所以我正在寻找的是一个函数,我可以在其中提供 vcov 矩阵( vcov2sls ),并具有鲁棒性和聚类标准误差。然而,它们似乎都属于相同的 vcov 矩阵选项。所以如果我提供一个,我已经必须确保 se 是集群的和健壮的..所以我想我基本上是在问我如何制作 vcov2sls函数具有鲁棒性和聚集性错误。显然,导致相同实际结果的任何其他解决方案也会很棒。
但是我想使用 jay.sf 的函数,当第一阶段是 lm 时,聚类参与摘要( source ),例如:
first_stage_ols <- lm(y~x, data=DT)
summary(first_stage_ols, robust=T)
有没有办法从 lm 函数中纠正标准错误,或者(在结果中替换它们),或者调整 vcov2sls函数还可以解释稳健/聚类标准错误吗?
编辑:我也知道 lmtest:coeftest存在,但我希望能够使用 weights .见 this link .我在 lmtest:coeftest 中无法确定这是否可行.
编辑 I - 尝试测试人员解决方案
所以我尝试测试人员在两种情况下回答。首先,我从 tobit 移动到 lm,反之亦然。
# Tobit -> LM

library(lmtest)
library(sandwich)
## run with lm ##
s1.tobit <- AER::tobit(taxrate ~ votewon + industry + size + urbanisation + vote, data=DF)
# cluster and adjust ses
s1.robust <-  vcovCL(s1.tobit, cluster = ~ industry)
s1.robust.se <- sqrt(diag(s1.robust))

s1.summary <- summary(s1.tobit)
s1.summary$coefficients[, 2] <- s1.robust.se

yhat <- fitted(s1.tobit)

s2.lm <- lm(sales ~ yhat + industry + size + urbanisation + vote, data=DF)

lmtest::coeftest(s2.lm, vcov.=vcov2sls(s1.summary, s2.lm, DF))

# WORKS!
反之亦然:
# LM -> tobit

library(lmtest)
library(sandwich)
## run with lm ##
s1.lm <- lm(taxrate ~ votewon + industry + size + urbanisation + vote, data=DF)
# cluster and adjust ses
s1.robust <-  vcovCL(s1.lm, cluster = ~ industry)
s1.robust.se <- sqrt(diag(s1.robust))

s1.summary <- summary(s1.lm)
s1.summary$coefficients[, 2] <- s1.robust.se

yhat <- fitted(s1.lm)

s2.tobit <- AER::tobit(sales ~ yhat + industry + size + urbanisation + vote, data=DF)

and then ????

# DOES NOT WORK, NO WAY TO ADD THE VCOV TO TOBIT
编辑结束
编辑 II - 测试 lm_robust 和 manual 之间的 p 值
使用时 lm_robust第一阶段的结果如下:
                Estimate Std. Error    t value    Pr(>|t|)   CI Lower   CI Upper       DF
(Intercept)   25.3890287  2.1726518 11.6857327 0.009184928  15.393996 35.3840612 1.870781
votewon       -0.9900966  2.1099738 -0.4692459 0.687605609 -10.636404  8.6562105 1.882014
industryE     -0.7564888  0.3710393 -2.0388372 0.184868777  -2.434709  0.9217314 1.901678
industryF     -2.6639323  0.3058024 -8.7112866 0.013649538  -4.002800 -1.3250647 1.964416
size          -0.5291956  0.5523497 -0.9580807 0.443894805  -3.036862  1.9784705 1.894753
urbanisationB -1.5851495  2.2454251 -0.7059463 0.554845739 -11.464414  8.2941148 1.954657
urbanisationC -2.7234541  0.3573827 -7.6205532 0.020124544  -4.365749 -1.0811587 1.872744
vote           3.1749142  2.4600297  1.2906000 0.341874112  -9.076839 15.4266675 1.740353
但是,在进行手动计算时,p 值非常不同:
s1.summary

Call:
lm(formula = taxrate ~ votewon + industry + size + urbanisation + 
    vote, data = DF)

Residuals:
     Min       1Q   Median       3Q      Max 
-11.2747  -4.3212  -0.6788   4.3677  10.7369 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    25.3890     2.1506  13.845   <2e-16 ***
votewon        -0.9901     2.1742  -0.676   0.5007    
industryE      -0.7565     0.3492  -0.557   0.5792    
industryF      -2.6639     0.2877  -1.855   0.0668 .  
size           -0.5292     0.5109  -1.250   0.2145    
urbanisationB  -1.5851     2.2311  -1.098   0.2753    
urbanisationC  -2.7235     0.3474  -1.704   0.0918 .  
vote            3.1749     2.4840   2.105   0.0380 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 5.623 on 92 degrees of freedom
Multiple R-squared:  0.1054,    Adjusted R-squared:  0.03734 
F-statistic: 1.549 on 7 and 92 DF,  p-value: 0.1609
而这只是第一阶段。
示例数据
DF <- structure(list(country = c("C", "C", "C", "C", "J", "J", "B", 
"B", "F", "F", "E", "E", "D", "D", "F", "F", "I", "I", "J", "J", 
"E", "E", "C", "C", "I", "I", "I", "I", "I", "I", "C", "C", "H", 
"H", "J", "J", "G", "G", "J", "J", "I", "I", "C", "C", "D", "D", 
"A", "A", "G", "G", "E", "E", "J", "J", "G", "G", "I", "I", "I", 
"I", "J", "J", "G", "G", "E", "E", "G", "G", "E", "E", "F", "F", 
"I", "I", "B", "B", "E", "E", "H", "H", "B", "B", "A", "A", "I", 
"I", "I", "I", "F", "F", "E", "E", "I", "I", "J", "J", "D", "D", 
"F", "F"), year = c(2005, 2010, 2010, 2005, 2005, 2010, 2010, 
2005, 2010, 2005, 2005, 2010, 2010, 2005, 2005, 2010, 2005, 2010, 
2005, 2010, 2010, 2005, 2010, 2005, 2005, 2010, 2005, 2010, 2010, 
2005, 2010, 2005, 2005, 2010, 2010, 2005, 2005, 2010, 2005, 2010, 
2005, 2010, 2005, 2010, 2010, 2005, 2005, 2010, 2010, 2005, 2010, 
2005, 2010, 2005, 2010, 2005, 2010, 2005, 2010, 2005, 2010, 2005, 
2010, 2005, 2010, 2005, 2010, 2005, 2005, 2010, 2005, 2010, 2005, 
2010, 2005, 2010, 2005, 2010, 2005, 2010, 2010, 2005, 2005, 2010, 
2005, 2010, 2010, 2005, 2010, 2005, 2010, 2005, 2005, 2010, 2005, 
2010, 2010, 2005, 2010, 2005), sales = c(15.48, 12.39, 3.72, 
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72, 
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72, 
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72, 
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72, 
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72, 
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72, 
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72, 
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72, 
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9, 15.48, 12.39, 3.72, 
23.61, 4, 31.87, 25.33, 7.64, -0.26, 2.9), industry = c("D", 
"D", "E", "E", "F", "F", "F", "F", "D", "D", "E", "E", "D", "D", 
"E", "E", "F", "F", "F", "F", "D", "D", "F", "F", "E", "E", "D", 
"D", "D", "D", "E", "E", "F", "F", "D", "D", "E", "E", "E", "E", 
"D", "D", "E", "E", "D", "D", "D", "D", "E", "E", "D", "D", "F", 
"F", "D", "D", "D", "D", "E", "E", "D", "D", "E", "E", "D", "D", 
"D", "D", "D", "D", "F", "F", "F", "F", "E", "E", "D", "D", "E", 
"E", "F", "F", "E", "E", "F", "F", "E", "E", "F", "F", "D", "D", 
"D", "D", "D", "D", "D", "D", "F", "F"), urbanisation = c("B", 
"B", "A", "A", "B", "B", "A", "A", "C", "C", "C", "C", "A", "A", 
"B", "B", "C", "C", "A", "A", "C", "C", "B", "B", "A", "A", "A", 
"A", "A", "A", "A", "A", "A", "A", "C", "C", "B", "B", "B", "B", 
"B", "B", "C", "C", "A", "A", "B", "B", "B", "B", "A", "A", "B", 
"B", "A", "A", "A", "A", "B", "B", "C", "C", "A", "A", "C", "C", 
"A", "A", "B", "B", "A", "A", "B", "B", "B", "B", "B", "B", "C", 
"C", "A", "A", "A", "A", "A", "A", "A", "A", "C", "C", "A", "A", 
"B", "B", "A", "A", "B", "B", "B", "B"), size = c(1, 1, 5, 5, 
5, 5, 1, 1, 1, 1, 5, 5, 5, 5, 2, 2, 2, 2, 5, 5, 1, 1, 1, 1, 5, 
5, 5, 5, 4, 4, 5, 5, 5, 5, 4, 4, 2, 2, 5, 5, 1, 1, 1, 1, 2, 2, 
1, 1, 2, 2, 5, 5, 1, 1, 3, 3, 2, 2, 2, 2, 5, 5, 4, 4, 1, 1, 5, 
5, 2, 2, 5, 5, 2, 2, 2, 2, 4, 4, 3, 3, 4, 4, 3, 3, 3, 3, 3, 3, 
5, 5, 3, 3, 2, 2, 3, 3, 1, 1, 5, 5), base_rate = c(14L, 14L, 
14L, 14L, 19L, 19L, 30L, 30L, 20L, 20L, 29L, 29L, 20L, 20L, 20L, 
20L, 24L, 24L, 19L, 19L, 29L, 29L, 14L, 14L, 24L, 24L, 24L, 24L, 
24L, 24L, 14L, 14L, 17L, 17L, 19L, 19L, 33L, 33L, 19L, 19L, 24L, 
24L, 14L, 14L, 20L, 20L, 23L, 23L, 33L, 33L, 29L, 29L, 19L, 19L, 
33L, 33L, 24L, 24L, 24L, 24L, 19L, 19L, 33L, 33L, 29L, 29L, 33L, 
33L, 29L, 29L, 20L, 20L, 24L, 24L, 30L, 30L, 29L, 29L, 17L, 17L, 
30L, 30L, 23L, 23L, 24L, 24L, 24L, 24L, 20L, 20L, 29L, 29L, 24L, 
24L, 19L, 19L, 20L, 20L, 20L, 20L), taxrate = c(12L, 14L, 14L, 
12L, 21L, 18L, 30L, 30L, 20L, 20L, 29L, 30L, 20L, 20L, 20L, 20L, 
24L, 24L, 21L, 18L, 30L, 29L, 14L, 12L, 24L, 24L, 24L, 24L, 24L, 
24L, 14L, 12L, 18L, 19L, 18L, 21L, 33L, 32L, 21L, 18L, 24L, 24L, 
12L, 14L, 20L, 20L, 22L, 25L, 32L, 33L, 30L, 29L, 18L, 21L, 32L, 
33L, 24L, 24L, 24L, 24L, 18L, 21L, 32L, 33L, 30L, 29L, 32L, 33L, 
29L, 30L, 20L, 20L, 24L, 24L, 30L, 30L, 29L, 30L, 18L, 19L, 30L, 
30L, 22L, 25L, 24L, 24L, 24L, 24L, 20L, 20L, 30L, 29L, 24L, 24L, 
21L, 18L, 20L, 20L, 20L, 20L), vote = c(0, 0, 0, 0, 1, 1, 1, 
0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 
1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 
1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 
1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 
1, 0, 1, 1, 1, 1, 0, 1, 1), votewon = c(0, 0, 0, 0, 1, 0, 1, 
0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 
1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 
0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 
1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 
0, 0, 1, 1, 0, 1, 0, 1, 1)), class = "data.frame", row.names = c(NA, 
-100L))

## convert variables to factors beforehand
DF[c(1, 2, 4, 5, 6, 9, 10)] <- lapply(DF[c(1, 2, 4, 5, 6, 9, 10)], factor)

s1.tobit <- AER::tobit(taxrate ~ votewon + industry + size + urbanisation + vote,
                  left=12, right=33, data=DF)
yhat <- fitted(s1.tobit)
s2.tobit <- lm(sales ~ yhat + industry + size + urbanisation + vote, data=DF)

lmtest::coeftest(s2.tobit, vcov.=vcov2sls(s1.tobit, s2.tobit, DF))

最佳答案

编辑:
通过这种方式,您可以在第一阶段运行 lm,调整模型的 SE 并使用它来覆盖 summary(lm) 生成的 SE。然后,您可以估计第二阶段并将您的自定义函数与 coeftest() 一起使用。不确定聚类,但这应该大致有效,不是吗?

library(lmtest)
library(sandwich)
## run with lm ##
s1.lm <- lm(taxrate ~ votewon + industry + size + urbanisation + vote,
          data=DF)
# cluster and adjust ses
s1.robust <-  vcovCL(s1.lm, cluster = ~ industry)
s1.robust.se <- sqrt(diag(s1.robust))

s1.summary <- summary(s1.lm)
s1.summary$coefficients[, 2] <- s1.robust.se

yhat <- fitted(s1.lm)

s2.lm <- lm(sales ~ yhat + industry + size + urbanisation + vote, data=DF)

lmtest::coeftest(s2.lm, vcov.=vcov2sls(s1.summary, s2.lm, DF))

看看 estimatr 包,尤其是 lm_robust 。我不是 100% 确定您打算做什么,但是您可以通过运行以下命令获得可靠的标准错误:
library(estimatr)
lm1 <-
  lm_robust(
    mpg ~ hp,
    data = mtcars,
    clusters = cyl,
    weights = wt,
    se_type = "stata" # alternatives: CR0, CR1
  )
summary(lm1)
使用上面的示例,这大致是您要查找的内容吗?请注意,lm_robust 已经调整了 SE。
s1.lm <- lm_robust(taxrate ~ votewon + industry + size + urbanisation + vote,
                   data = DF)
yhat <- fitted(s1.lm)
s2.lm <- lm(sales ~ yhat + industry + size + urbanisation + vote, data = DF)

> lmtest::coeftest(s2.lm, vcov. = vcov2sls(s1.lm, s2.lm, DF))

t test of coefficients:

               Estimate Std. Error t value Pr(>|t|)
(Intercept)   -18.45116   62.14257 -0.2969   0.7672
yhat            1.57784    2.72176  0.5797   0.5636
industryE       0.98174    5.10677  0.1922   0.8480
industryF       2.09036    7.25181  0.2883   0.7738
size2          -8.85327   12.43454 -0.7120   0.4783
size3          -5.74011    7.14973 -0.8028   0.4242
size4         -10.79326   13.14534 -0.8211   0.4138
size5          -3.38280    5.45691 -0.6199   0.5369
urbanisationB  -1.74588    6.34107 -0.2753   0.7837
urbanisationC  -2.00370    6.48533 -0.3090   0.7581
vote1          -1.01661    6.49424 -0.1565   0.8760

> lmtest::coeftest(s2.lm)

t test of coefficients:

               Estimate Std. Error t value Pr(>|t|)
(Intercept)   -18.45116   46.39576 -0.3977   0.6918
yhat            1.57784    2.03207  0.7765   0.4395
industryE       0.98174    3.81273  0.2575   0.7974
industryF       2.09036    5.41421  0.3861   0.7004
size2          -8.85327    9.28365 -0.9536   0.3428
size3          -5.74011    5.33801 -1.0753   0.2851
size4         -10.79326    9.81434 -1.0997   0.2744
size5          -3.38280    4.07414 -0.8303   0.4086
urbanisationB  -1.74588    4.73425 -0.3688   0.7132
urbanisationC  -2.00370    4.84196 -0.4138   0.6800
vote1          -1.01661    4.84861 -0.2097   0.8344
我会根据您的评论更新我的答案。

关于r - 修正 lm 函数中的稳健/聚类标准错误或替换结果,我们在Stack Overflow上找到一个类似的问题: https://stackoverflow.com/questions/65627475/

相关文章:

r - 2 路 FE 模型在 R 中给出 "empty model"

c - 在集群上安装 R 包时出现奇怪的 C 编译器错误

r - 如何调用 data.table 的基于列的子集而不自动对行进行排序?

c++ - 使用 Rcpp 在 R 中使用矩阵是否有限制?

python - R 和 Python 中线性回归的差异

r - 如何仅在两个因变量之一中拟合折线?

r - R中黄土的标准误

r - 如何在树状图中提取具有高度的某个节点下的标签?

linux - 条形图和其他图形未在终端中显示(Debian 和 R)

r - 具有二次多项式和在断点处平滑连接的直线的分段回归