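I am trying to implement logistic regression on the Wisconsin breast cancer dataset using glm in R. I analysed the dataset and found that wbc$V7 contains missing values. I imputed the missing values with the Hmisc package and then ran the logistic regression with glm: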
wbc=read.csv(file="https://archive.ics.uci.edu/ml/machine-learning-databases/breast-cancer-wisconsin/breast-cancer-wisconsin.data",header = FALSE)
wbc[wbc=='?']=NA #replacing '?' with NA
a=sapply(wbc,function(x) sum(is.na(x))) #analyse the number of NA in each column
print(a)
library(Hmisc)
wbc$V7=impute(wbc$V7,mode) #impute missing values with mode in V7
wbc$V11[wbc$V11==2]=0; #V11 has either '2' or '4' as entries, replacing '2' with '0' and '4' with '1'
wbc$V11[wbc$V11==4]=1;
model <- glm(V11~V2+V3+V4+V5+V6+V7+V8+V9+V10,family=binomial(),data=wbc)
OUTPUT:
Call:  glm(formula = V11 ~ V2 + V3 + V4 + V5 + V6 + V7 + V8 + V9 + V10,
    family = binomial(), data = wbc)

Coefficients:
(Intercept)           V2           V3           V4           V5           V6
     8.6625       0.4511      -0.1013       0.4842       0.2206       0.1684
        V71         V710          V72          V73          V74          V75
   -18.7466     -14.8168     -17.6684     -16.0272     -15.3552     -16.3765
        V76          V77          V78          V79           V8           V9
     0.7704     -16.2944     -16.6171           NA       0.5052       0.1144
        V10
     0.4550

Degrees of Freedom: 698 Total (i.e. Null);  681 Residual
Null Deviance:      900.5
Residual Deviance: 102.9    AIC: 138.9
Why does the output contain coefficients for V71, V710, V72, V73, V74, V75, V76, V77, V78 and V79 when the wbc data frame only has the columns V1, V2, V3, V4, V5, V6, V7, V8, V9, V10?
Best answer
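If V7 is a factor, glm dummy-codes it automatically when fitting the model. You then get one coefficient per category (level) of the factor, other than the reference level, instead of a single coefficient for V7. The names V71, V710, V72, ... are simply the variable name V7 followed by the level labels "1", "10", "2", and so on.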
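To see the effect in isolation, here is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical stand-ins, not the real dataset). It assumes V7 was read as a factor because of the "?" entries, and it replaces the Hmisc `impute` call with a plain base-R mode imputation so that the example has no package dependency. Converting V7 to numeric before fitting is one possible fix, under the assumption that V7 should be treated as a numeric score.

```r
# Hypothetical toy data: V7 holds the digits 1..10 as text plus "?" for missing,
# so it becomes a factor, just like a column read from a file containing "?".
set.seed(1)
toy <- data.frame(
  V2 = sample(1:10, 200, replace = TRUE),
  V7 = sample(c(as.character(1:10), "?"), 200, replace = TRUE),
  stringsAsFactors = TRUE
)
toy$V11 <- rbinom(200, 1, plogis(-2 + 0.4 * toy$V2))  # made-up 0/1 outcome

class(toy$V7)  # "factor"

# With a factor predictor, glm fits one dummy coefficient per non-reference level.
fit_factor <- glm(V11 ~ V2 + V7, family = binomial(), data = toy)
names(coef(fit_factor))  # includes V71, V710, V72, ...

# Convert to numeric: go through character first, then mark "?" as NA.
v7_chr <- as.character(toy$V7)
v7_chr[v7_chr == "?"] <- NA
toy$V7 <- as.numeric(v7_chr)

# Impute missing values with the most frequent observed value (base-R stand-in
# for the Hmisc call in the question).
mode_value <- as.numeric(names(which.max(table(toy$V7))))
toy$V7[is.na(toy$V7)] <- mode_value

# With a numeric predictor, glm fits a single slope for V7.
fit_numeric <- glm(V11 ~ V2 + V7, family = binomial(), data = toy)
names(coef(fit_numeric))  # "(Intercept)" "V2" "V7"
```

On the real data, checking `class(wbc$V7)` or `str(wbc)` before calling glm shows whether the column is a factor. Passing `na.strings = "?"` to `read.csv` should let the column be read as numeric in the first place.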
Source: the Stack Overflow question "r - Logistic regression after imputation in R": https://stackoverflow.com/questions/51741419/