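I am working with a panel dataset containing thousands of observations. Let me simplify things as much as possible. Suppose I have the following dataset: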
set.seed(123)
gdp_usa=runif(16,8,9)
gdp_bel=c(9.22707, 9.245133, 9.272205, 9.31063, 9.339993, 9.364777, 9.376749,
9.364378, 9.393332, 9.447258, 9.491499, 9.537432, 9.572997, 9.631823,
9.657445, 9.680416)
pot_usa = c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
pot_bel=c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0)
df=data.frame(country=c(rep("BEL",16),rep("USA",16)),year=c(rep(1990:2005,2)),gdp=c(gdp_bel,gdp_usa),
potential=c(pot_bel,pot_usa))
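Just to make things clearer: for BEL, year_1 = {1997, 1998}, i.e. the years in which the "potential" variable equals 1. This means that for Belgium I should run 2 regressions:
R1: regress Y on X1, X2, where Y = gdp_bel from 1990 to 2004, X1 = {1990, 1991, ..., 2004} and X2 = {0,0,0,0,0,0,0,0,1,2,3,4,5,6,7}
R2: regress Y_bis on X1_bis, X2_bis, where Y_bis = gdp_bel from 1991 to 2005, X1_bis = {1991, ..., 2005} and X2_bis = {0,0,0,0,0,0,0,0,1,2,3,4,5,6,7}
Then I compute the F-statistic for each regression as follows: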
test_result <- anova(R1, update(R1, . ~ . - X2))
and:
test_result2 <- anova(R2, update(R2, . ~ . - X2_bis))
Then I pick the starting year with the highest F-statistic.
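For concreteness, the two Belgian regressions and their F-tests can be sketched as below. This is only an illustration of the procedure described above: `break_f` and `bel` are hypothetical names, and the window of 7 years on each side of the candidate year is inferred from the R1 and R2 definitions.

```r
# Belgium's GDP series from the question, 1990-2005
gdp_bel <- c(9.22707, 9.245133, 9.272205, 9.31063, 9.339993, 9.364777, 9.376749,
             9.364378, 9.393332, 9.447258, 9.491499, 9.537432, 9.572997, 9.631823,
             9.657445, 9.680416)
bel <- data.frame(year = 1990:2005, gdp = gdp_bel)

# Hypothetical helper: F-statistic for adding the trend-break term X2
# to a linear trend, using only the years within n of the candidate year.
break_f <- function(data, start_year, n = 7) {
  w <- data[abs(data$year - start_year) <= n, ]
  w$X1 <- w$year
  w$X2 <- pmax(w$year - start_year, 0)   # 0 up to start_year, then 1, 2, ...
  reduced <- lm(gdp ~ X1, data = w)
  full <- lm(gdp ~ X1 + X2, data = w)
  anova(reduced, full)$F[2]              # row 1 is NA; row 2 holds the F-statistic
}

f_1997 <- break_f(bel, 1997)  # R1: window 1990-2004
f_1998 <- break_f(bel, 1998)  # R2: window 1991-2005
c(`1997` = f_1997, `1998` = f_1998)
```

Because only one regressor is added, this F-statistic equals the squared t-statistic of X2 in the full model, which gives a quick way to check the numbers.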
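How can I code this procedure efficiently, taking the following into account:
- I have nearly 200 different countries
- For $i > n$, I can have a non-competing starting year. For example, for BEL, if the data ran through 2022 and potential = 1 in 2011 and 2013, I would have 2 starting years: one would be the winner (the year with the highest F-statistic) between 1997 and 1998, and the other would be the winner between 2011 and 2013.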
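One way to separate competing from non-competing candidate years is sketched below. The rule used here is an assumption, since the question does not define it precisely: a candidate joins the previous candidate's group when it lies at most `gap` years after it. `competing_groups` and `gap` are hypothetical names.

```r
# Hypothetical helper: split candidate years into groups of competing years.
# Assumption: a candidate more than `gap` years after the previous candidate
# starts a new, non-competing group.
competing_groups <- function(years, gap = 7) {
  years <- sort(unique(years))
  if (length(years) == 0) return(list())
  group <- cumsum(c(TRUE, diff(years) > gap))
  split(years, group)
}

# The Belgian example from the text: two groups, {1997, 1998} and {2011, 2013}
competing_groups(c(1997, 1998, 2011, 2013))
```

The winner within each group would then be the year with the highest F-statistic, for example via `which.max` over that group's F-statistics.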
Update
Following your suggestions, I got what I wanted:
# Function to perform regression and return F-statistic
perform_regression <- function(data, year_i, n) {
  data$year_diff <- pmax(data$year - year_i, 0)
  simple_reg <- lm(gdp ~ year, data = data)
  complex_reg <- lm(gdp ~ year + year_diff, data = data)
  test_result <- anova(simple_reg, complex_reg)
  return(test_result$F[2]) # Return the F-statistic for the complex model
}
# Get unique country names
unique_countries <- unique(df$country)
# Loop through each country
for (country in unique_countries) {
  country_data <- df[df$country == country, ]
  # Get potential starting years
  potential_starting_years <- unique(country_data$year[country_data$potential == 1])
  best_f_statistic <- -Inf
  best_starting_year <- NA
  cat("Country:", country, "\n")
  # Loop through potential starting years
  for (year_i in potential_starting_years) {
    filtered_data <- country_data[abs(country_data$year - year_i) <= 7, ]
    f_statistic <- perform_regression(filtered_data, year_i, n = 7)
    cat("Year_i:", year_i, "F-statistic:", f_statistic, "\n")
    if (f_statistic > best_f_statistic) {
      best_f_statistic <- f_statistic
      best_starting_year <- year_i
    }
  }
  cat("Best Starting Year:", best_starting_year, "\n")
  cat("Best F-statistic:", best_f_statistic, "\n\n")
}
df
The last step would be to get a result like this:
country year gdp potential fstat max
1 BEL 1990 9.227070 0 NA NA
2 BEL 1991 9.245133 0 NA NA
3 BEL 1992 9.272205 0 NA NA
4 BEL 1993 9.310630 0 NA NA
5 BEL 1994 9.339993 0 NA NA
6 BEL 1995 9.364777 0 NA NA
7 BEL 1996 9.376749 0 NA NA
8 BEL 1997 9.364378 1 25.330 34.380
9 BEL 1998 9.393332 1 34.380 34.380
10 BEL 1999 9.447258 0 NA NA
11 BEL 2000 9.491499 0 NA NA
12 BEL 2001 9.537432 0 NA NA
13 BEL 2002 9.572997 0 NA NA
14 BEL 2003 9.631823 0 NA NA
15 BEL 2004 9.657445 0 NA NA
16 BEL 2005 9.680416 0 NA NA
17 USA 1990 8.287578 0 NA NA
18 USA 1991 8.788305 0 NA NA
19 USA 1992 8.408977 0 NA NA
20 USA 1993 8.883017 0 NA NA
21 USA 1994 8.940467 0 NA NA
22 USA 1995 8.045556 0 NA NA
23 USA 1996 8.528105 0 NA NA
24 USA 1997 8.892419 1 0.945 0.945
25 USA 1998 8.551435 0 NA NA
26 USA 1999 8.456615 0 NA NA
27 USA 2000 8.956833 0 NA NA
28 USA 2001 8.453334 0 NA NA
29 USA 2002 8.677571 0 NA NA
30 USA 2003 8.572633 0 NA NA
31 USA 2004 8.102925 0 NA NA
32 USA 2005 8.899825 0 NA NA
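Any suggestions?
Best answer
I don't know whether this is a data issue or a matter of how you meant to extract the first fstat value from the anova result. So I left it open-ended: I only print the values inside the loop, without selecting, returning, or compiling the best one. But I think I improved the data part, so that you get the seven years before and after each potential date.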
set.seed(123)
gdp_bel <- c(9.22707, 9.245133, 9.272205, 9.31063, 9.339993, 9.364777, 9.376749,
9.364378, 9.393332, 9.447258, 9.491499, 9.537432, 9.572997, 9.631823,
9.657445, 9.680416)
pot_bel <- c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0)
pot_usa <- c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
df <- data.frame(
  country = c(rep("BEL", 16), rep("USA", 16)),
  year = c(rep(1990:2005, 2)),
  gdp = c(gdp_bel, runif(16, min = 7, max = 9)),
  potential = c(pot_bel, pot_usa)
)
# Function to perform the regressions and print the F-statistics
perform_regression <- function(data, year_i, n) {
  data$year_diff <- pmax(data$year - year_i, 0)
  simple_reg <- lm(gdp ~ year, data = data)
  complx_reg <- lm(gdp ~ year + year_diff, data = data)
  # Overall F-statistic of each model
  simple_fstat <- summary(simple_reg)$fstatistic["value"]
  complex_fstat <- summary(complx_reg)$fstatistic["value"]
  # F-test of the nested models: F[1] is NA, F[2] tests the added year_diff term
  test_result <- anova(simple_reg, complx_reg)
  cat("\n year_i n ", year_i, " ", n,
      "\nsimple F : ", simple_fstat,
      "\ncomplex F : ", complex_fstat,
      "\ntest res 1 F : ", test_result$F[1],
      "\ntest res 2 F : ", test_result$F[2],
      "\n")
}
# Get unique country names
unique_countries <- unique(df$country)
# Loop through each country
for (country in unique_countries) {
  country_data <- df[df$country == country, ]
  # Get potential starting years
  potential_starting_years <- unique(country_data$year[country_data$potential == 1])
  best_f_statistic <- -Inf
  best_starting_year <- NA
  print("--------")
  print(country)
  # Loop through potential starting years
  for (year_i in potential_starting_years) {
    filtered_data <- country_data[abs(country_data$year - year_i) <= 7, ]
    f_statistic <- perform_regression(filtered_data, year_i, n = 7)
    # if (f_statistic > best_f_statistic) {
    #   best_f_statistic <- f_statistic
    #   best_starting_year <- year_i
    # }
  }
  #
  # cat("Country:", country, "\n")
  # cat("Best Starting Year:", best_starting_year, "\n")
  # cat("Best F-statistic:", best_f_statistic, "\n\n")
}
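To go from printed values to the `fstat` and `max` columns requested in the question, something along these lines might work. It is a sketch rather than a tested solution for the full panel: `break_f` is a hypothetical helper that returns the anova F-statistic instead of printing it, the data are rebuilt exactly as above, and `max` is taken per country, which only matches the request when a country has a single group of competing years.

```r
# Rebuild the example data as above
set.seed(123)
gdp_bel <- c(9.22707, 9.245133, 9.272205, 9.31063, 9.339993, 9.364777, 9.376749,
             9.364378, 9.393332, 9.447258, 9.491499, 9.537432, 9.572997, 9.631823,
             9.657445, 9.680416)
pot_bel <- c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0)
pot_usa <- c(0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0)
df <- data.frame(
  country = c(rep("BEL", 16), rep("USA", 16)),
  year = rep(1990:2005, 2),
  gdp = c(gdp_bel, runif(16, min = 7, max = 9)),
  potential = c(pot_bel, pot_usa)
)

# Hypothetical helper: F-statistic for adding year_diff to a linear trend,
# estimated on the years within n of the candidate year
break_f <- function(data, year_i, n = 7) {
  w <- data[abs(data$year - year_i) <= n, ]
  w$year_diff <- pmax(w$year - year_i, 0)
  anova(lm(gdp ~ year, data = w), lm(gdp ~ year + year_diff, data = w))$F[2]
}

# Fill fstat only on candidate rows (potential == 1)
df$fstat <- NA_real_
for (i in which(df$potential == 1)) {
  country_data <- df[df$country == df$country[i], ]
  df$fstat[i] <- break_f(country_data, df$year[i])
}

# Per-country maximum, written only on candidate rows (assumption: one group per country)
df$max <- ave(df$fstat, df$country, FUN = function(x) {
  if (all(is.na(x))) x else replace(x, !is.na(x), max(x, na.rm = TRUE))
})
df[df$potential == 1, ]
```

With several non-competing groups per country, the grouping variable passed to `ave` would need to identify the group as well as the country.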
Regarding "r - A challenging regression that can perhaps be done with a loop and F-stat", a similar question can be found on Stack Overflow: https://stackoverflow.com/questions/76914755/