r - A challenging regression, probably with loops and an F-stat

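I am working with a panel dataset containing thousands of observations. Let me simplify things as much as possible. Suppose I have the following dataset: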

set.seed(123)
gdp_usa=runif(16,8,9)
gdp_bel=c(9.22707,  9.245133,   9.272205,   9.31063,    9.339993,   9.364777,   9.376749,   
      9.364378, 9.393332,   9.447258,   9.491499,   9.537432,   9.572997,   9.631823,
      9.657445, 9.680416)
pot_usa = c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
pot_bel=c(0,    0,  0,  0,  0,  0,  0,  1,  1,  0,  0,  0,  0,  0,  0,  0) 
df=data.frame(country=c(rep("BEL",16),rep("USA",16)),year=c(rep(1990:2005,2)),gdp=c(gdp_bel,gdp_usa),
          potential=c(pot_bel,pot_usa))

What I want to do is run the following regression for each country, probably in a loop:

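Just to make things clearer: for BEL, year_1 = {1997, 1998}, i.e. the years in which the "potential" variable equals 1. This means that for Belgium I should run 2 regressions:

R1: regress Y on X1 and X2, where Y = gdp_bel from 1990 to 2004, X1 = {1990, 1991, ..., 2004} and X2 = {0,0,0,0,0,0,0,0,1,2,3,4,5,6,7}

R2: regress Y_bis on X1_bis and X2_bis, where Y_bis = gdp_bel from 1991 to 2005, X1_bis = {1991, ..., 2005} and X2_bis = {0,0,0,0,0,0,0,0,1,2,3,4,5,6,7}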
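The two windows and the X2 regressor described above can be built mechanically. The following is a minimal sketch, assuming the window is the 7 years on either side of the candidate year and that X2 is max(year - year_i, 0); `make_window` is a hypothetical helper name, not something from the question, and the same column name `X2` is used for both regressions instead of `X2_bis`.

```r
# Belgian GDP series from the question, 1990-2005
gdp_bel <- c(9.22707, 9.245133, 9.272205, 9.31063, 9.339993, 9.364777, 9.376749,
             9.364378, 9.393332, 9.447258, 9.491499, 9.537432, 9.572997, 9.631823,
             9.657445, 9.680416)
bel <- data.frame(year = 1990:2005, gdp = gdp_bel)

# make_window() is a hypothetical helper (an assumption for illustration):
# keep the years within n of year_i and add X2 = max(year - year_i, 0).
make_window <- function(data, year_i, n = 7) {
  w <- data[abs(data$year - year_i) <= n, ]
  w$X2 <- pmax(w$year - year_i, 0)
  w
}

w1 <- make_window(bel, 1997)  # years 1990..2004
w2 <- make_window(bel, 1998)  # years 1991..2005

R1 <- lm(gdp ~ year + X2, data = w1)
R2 <- lm(gdp ~ year + X2, data = w2)

w1$X2  # eight zeros followed by 1..7, as in the description of R1
```

Near the edges of the sample the window is simply truncated, so a candidate year close to 1990 or 2005 would get fewer than 15 observations.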

Then I compute the F-stat for each regression, as follows:

test_result <- anova(R1, update(R1, . ~ . - X2))

And:

test_result2 <- anova(R2, update(R2, . ~ . - X2_bis))

Then I will choose the starting year with the highest F-statistic.
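Since the two nested models differ by a single regressor, the F-statistic from `anova()` is the square of the t-statistic on that regressor in the larger model, so picking the highest F is the same as picking the largest |t| on X2. A small sketch of that check on the R1 window (the variable names `full`, `restricted`, `f_anova` and `t_x2` are illustrative, not from the question):

```r
# First 15 Belgian GDP values from the question (1990-2004, the R1 window)
gdp_bel <- c(9.22707, 9.245133, 9.272205, 9.31063, 9.339993, 9.364777, 9.376749,
             9.364378, 9.393332, 9.447258, 9.491499, 9.537432, 9.572997, 9.631823,
             9.657445)
w <- data.frame(year = 1990:2004, gdp = gdp_bel)
w$X2 <- pmax(w$year - 1997, 0)

full       <- lm(gdp ~ year + X2, data = w)
restricted <- update(full, . ~ . - X2)

# F-statistic for dropping X2 (second row of the anova table)
f_anova <- anova(restricted, full)$F[2]

# t-statistic on X2 in the full model
t_x2 <- summary(full)$coefficients["X2", "t value"]

# The two agree up to floating-point error
all.equal(f_anova, t_x2^2)
```

This also means a single `lm()` fit per candidate year would be enough if speed matters across roughly 200 countries.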

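How can I code this procedure efficiently, taking the following into account:

-I have nearly 200 different countries

-For $i >n$, I can have non-competing starting years. For example, for BEL, if the data ran through 2022 and potential = 1 in 2011 and 2013, then I would have 2 starting years: one would be the winner between 1997 and 1998 (the year with the highest F-statistic) and the other would be the winner between 2011 and 2013.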
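One way to separate non-competing candidates is to sort the candidate years and start a new group whenever the gap to the previous candidate is large. The sketch below assumes that "non-competing" means more than `max_gap` years apart and sets `max_gap = 7` to match the window; both that reading and the helper name `group_candidates` are assumptions, not something stated in the question.

```r
# group_candidates() is a hypothetical helper (an assumption for illustration):
# candidate years more than max_gap apart start a new, non-competing group.
group_candidates <- function(years, max_gap = 7) {
  years <- sort(unique(years))
  if (length(years) == 0) {
    return(data.frame(year = numeric(0), group = integer(0)))
  }
  # TRUE marks the first year of each group; cumsum turns that into group ids
  data.frame(year = years,
             group = cumsum(c(TRUE, diff(years) > max_gap)))
}

# The extended Belgian example from the text
groups <- group_candidates(c(1997, 1998, 2011, 2013))
split(groups$year, groups$group)
```

The winner would then be chosen within each group rather than within each country.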

Update

Following your suggestions, I got what I wanted:

# Function to perform regression and return F-statistic
perform_regression <- function(data, year_i, n) {
  data$year_diff <- pmax(data$year - year_i, 0)
  simple_reg <- lm(gdp ~ year, data = data)
  complex_reg <- lm(gdp ~ year + year_diff, data = data)
  test_result <- anova(simple_reg, complex_reg)
  return(test_result$F[2])  # Return the F-statistic for the complex model
}

# Get unique country names
unique_countries <- unique(df$country)

# Loop through each country
for (country in unique_countries) {
  country_data <- df[df$country == country, ]
  
  # Get potential starting years
  potential_starting_years <- unique(country_data$year[country_data$potential == 1])
  
  best_f_statistic <- -Inf
  best_starting_year <- NA
  
  cat("Country:", country, "\n")
  
  # Loop through potential starting years
  for (year_i in potential_starting_years) {
    filtered_data <- country_data[abs(country_data$year - year_i) <= 7, ]
    f_statistic <- perform_regression(filtered_data, year_i, n = 7)
    
    cat("Year_i:", year_i, "F-statistic:", f_statistic, "\n")
    
    if (f_statistic > best_f_statistic) {
      best_f_statistic <- f_statistic
      best_starting_year <- year_i
    }
  }
  
  cat("Best Starting Year:", best_starting_year, "\n")
  cat("Best F-statistic:", best_f_statistic, "\n\n")
}
df

The final step should be to get a result like this:

   country year      gdp potential  fstat    max
1      BEL 1990 9.227070         0     NA     NA
2      BEL 1991 9.245133         0     NA     NA
3      BEL 1992 9.272205         0     NA     NA
4      BEL 1993 9.310630         0     NA     NA
5      BEL 1994 9.339993         0     NA     NA
6      BEL 1995 9.364777         0     NA     NA
7      BEL 1996 9.376749         0     NA     NA
8      BEL 1997 9.364378         1 25.330 34.380
9      BEL 1998 9.393332         1 34.380 34.380
10     BEL 1999 9.447258         0     NA     NA
11     BEL 2000 9.491499         0     NA     NA
12     BEL 2001 9.537432         0     NA     NA
13     BEL 2002 9.572997         0     NA     NA
14     BEL 2003 9.631823         0     NA     NA
15     BEL 2004 9.657445         0     NA     NA
16     BEL 2005 9.680416         0     NA     NA
17     USA 1990 8.287578         0     NA     NA
18     USA 1991 8.788305         0     NA     NA
19     USA 1992 8.408977         0     NA     NA
20     USA 1993 8.883017         0     NA     NA
21     USA 1994 8.940467         0     NA     NA
22     USA 1995 8.045556         0     NA     NA
23     USA 1996 8.528105         0     NA     NA
24     USA 1997 8.892419         1  0.945  0.945
25     USA 1998 8.551435         0     NA     NA
26     USA 1999 8.456615         0     NA     NA
27     USA 2000 8.956833         0     NA     NA
28     USA 2001 8.453334         0     NA     NA
29     USA 2002 8.677571         0     NA     NA
30     USA 2003 8.572633         0     NA     NA
31     USA 2004 8.102925         0     NA     NA
32     USA 2005 8.899825         0     NA     NA

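Any suggestions?

Best answer

I don't know whether this is a data issue or a matter of how you intended to extract the first F value from the anova result. So I have left it open-ended: I only print the values inside the loop, without selecting, returning, or compiling the best one. But I think I have improved the data part, giving you the seven years before and after each potential date.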

set.seed(123)
gdp_bel <- c(9.22707, 9.245133, 9.272205, 9.31063, 9.339993, 9.364777, 9.376749,
             9.364378, 9.393332, 9.447258, 9.491499, 9.537432, 9.572997, 9.631823,
             9.657445, 9.680416)
pot_bel <- c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0)
pot_usa <- c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0)

df <- data.frame(
  country = c(rep("BEL", 16), rep("USA", 16)),
  year = c(rep(1990:2005, 2)),
  gdp = c(gdp_bel, runif(16, min = 7, max = 9)),
  potential = c(pot_bel, pot_usa)
)



# Function to perform regression and return F-statistic
perform_regression <- function(data, year_i, n) {
  data$year_diff <- pmax(data$year - year_i,0)
  simple_reg <- lm(gdp ~ year  , data = data)
  complx_reg <- lm(gdp ~ year +year_diff , data = data)
  # Use the full element name rather than relying on partial matching of `$fstat`
  simple_fstat <- summary(simple_reg)$fstatistic["value"]
  complex_fstat <- summary(complx_reg)$fstatistic["value"]
  test_result <- anova(simple_reg,complx_reg)
  test_result$F
  cat("\n year_i n ", year_i , " ", n,
      "\nsimple F : ",simple_fstat,
      "\ncomplex F : ",complex_fstat,
      "\ntest res 1 F : " , test_result$F[1],
      "\ntest res 2 F : " , test_result$F[2],
      "\n")
}

# Get unique country names
unique_countries <- unique(df$country)

# Loop through each country
for (country in unique_countries) {
  country_data <- df[df$country == country, ]
  
  # Get potential starting years
  potential_starting_years <- unique(country_data$year[country_data$potential == 1])
  
  best_f_statistic <- -Inf
  best_starting_year <- NA
  print("--------")
  print(country)
  # Loop through potential starting years
  for (year_i in potential_starting_years) {
    filtered_data <- country_data[abs(country_data$year - year_i) <= 7, ]
    f_statistic <- perform_regression(filtered_data, year_i, n = 7)
    # if (f_statistic > best_f_statistic) {
    #   best_f_statistic <- f_statistic
    #   best_starting_year <- year_i
    # }
  }
  # 
  # cat("Country:", country, "\n")
  # cat("Best Starting Year:", best_starting_year, "\n")
  # cat("Best F-statistic:", best_f_statistic, "\n\n")
}
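The selection step left commented out above can also be written so that the F-statistics land directly in the data frame, in the `fstat` / `max` layout the question asks for. This is a minimal sketch under a few assumptions: it uses the question's data (USA GDP drawn with `runif(16, 8, 9)`), takes the maximum per country rather than per group of nearby candidate years, and the helper name `f_for_year` and the `best` data frame are illustrative.

```r
set.seed(123)
gdp_bel <- c(9.22707, 9.245133, 9.272205, 9.31063, 9.339993, 9.364777, 9.376749,
             9.364378, 9.393332, 9.447258, 9.491499, 9.537432, 9.572997, 9.631823,
             9.657445, 9.680416)
df <- data.frame(
  country   = rep(c("BEL", "USA"), each = 16),
  year      = rep(1990:2005, 2),
  gdp       = c(gdp_bel, runif(16, 8, 9)),
  potential = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0,
                0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
)

# f_for_year() is a hypothetical helper: F-statistic for adding the
# trend-break term, estimated on the +/- n year window around year_i.
f_for_year <- function(data, year_i, n = 7) {
  w <- data[abs(data$year - year_i) <= n, ]
  w$year_diff <- pmax(w$year - year_i, 0)
  anova(lm(gdp ~ year, data = w),
        lm(gdp ~ year + year_diff, data = w))$F[2]
}

# Fill fstat only on the candidate rows (potential == 1)
df$fstat <- NA_real_
for (i in which(df$potential == 1)) {
  country_data <- df[df$country == df$country[i], ]
  df$fstat[i] <- f_for_year(country_data, df$year[i])
}

# Per-country maximum, shown only on candidate rows
# (assumption: one group of competing years per country)
country_max <- ave(df$fstat, df$country, FUN = function(x) {
  if (all(is.na(x))) NA_real_ else max(x, na.rm = TRUE)
})
df$max <- ifelse(df$potential == 1, country_max, NA_real_)

# One winning starting year per country
best <- df[!is.na(df$fstat) & df$fstat == df$max, c("country", "year", "fstat")]
best
```

With several non-competing groups per country, the grouping variable passed to `ave()` would need to combine the country with a group id instead of the country alone.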

Regarding "r - A challenging regression, probably with loops and an F-stat", we found a similar question on Stack Overflow: https://stackoverflow.com/questions/76914755/
