I admit it, the title sounds weird. The problem I want to address this evening concerns the use of a stepwise procedure on a regression model, and more specifically how categorical variables are handled (and possibly misinterpreted). Consider the following dataset
> db = read.table("http://freakonometrics.free.fr/db2.txt",header=TRUE,sep=";")
First, let us change the reference in our categorical variable (just to get an easier interpretation later on)
> db$X3=relevel(as.factor(db$X3),ref="E")
If we run a logistic regression on the three variables (two continuous, one categorical), we get
> reg=glm(Y~X1+X2+X3,family=binomial,data=db)
> summary(reg)

Call:
glm(formula = Y ~ X1 + X2 + X3, family = binomial, data = db)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.0758   0.1226   0.2805   0.4798   2.0345

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.39528    0.86649  -6.227 4.77e-10 ***
X1           0.51618    0.09163   5.633 1.77e-08 ***
X2           0.24665    0.05911   4.173 3.01e-05 ***
X3A         -0.09142    0.32970  -0.277   0.7816
X3B         -0.10558    0.32526  -0.325   0.7455
X3C          0.63829    0.37838   1.687   0.0916 .
X3D         -0.02776    0.33070  -0.084   0.9331
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 806.29  on 999  degrees of freedom
Residual deviance: 582.29  on 993  degrees of freedom
AIC: 596.29

Number of Fisher Scoring iterations: 6
Now, if we use a stepwise procedure to select variables in the model, we get
> step(reg)
Start:  AIC=596.29
Y ~ X1 + X2 + X3

       Df Deviance    AIC
- X3    4   587.81 593.81
<none>      582.29 596.29
- X2    1   600.56 612.56
- X1    1   617.25 629.25

Step:  AIC=593.81
Y ~ X1 + X2

       Df Deviance    AIC
<none>      587.81 593.81
- X2    1   606.90 610.90
- X1    1   622.44 626.44
So clearly, if our starting point is the regression on the three variables, the stepwise procedure tells us to remove the categorical variable: dropping X3 lowers the AIC from 596.29 to 593.81.
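As a quick sanity check, the AIC column printed by step() can be rebuilt by hand: in the output above it equals the residual deviance plus twice the number of estimated coefficients. The sketch below only reuses numbers printed above; the variable names are mine.

```r
# Numbers copied from the step() output above; variable names are illustrative.
dev_full    <- 582.29  # residual deviance of Y ~ X1 + X2 + X3
dev_reduced <- 587.81  # residual deviance of Y ~ X1 + X2
k_full      <- 7       # intercept, X1, X2 and the four X3 contrasts
k_reduced   <- 3       # intercept, X1, X2

# AIC rebuilt as deviance + 2 * number of coefficients
aic_full    <- dev_full + 2 * k_full
aic_reduced <- dev_reduced + 2 * k_reduced
c(full = aic_full, reduced = aic_reduced)
```

The two values agree with the 596.29 and 593.81 reported by step().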
Now, what if we consider the same model, but written slightly differently, with one indicator variable for each of the five categories?
> X3complete = model.matrix(~0+X3,data=db)
> db2 = data.frame(db,X3complete)
> head(db2)
  Y       X1       X2 X3 X3A X3B X3C X3D X3E
1 1 3.297569 16.25411  B   0   1   0   0   0
2 1 6.418031 18.45130  D   0   0   0   1   0
3 1 5.279068 16.61806  B   0   1   0   0   0
4 1 5.539834 19.72158  C   0   0   1   0   0
5 1 4.123464 18.38634  C   0   0   1   0   0
6 1 7.778443 19.58338  C   0   0   1   0   0
From a technical point of view, it is exactly the same as before, if we look at the regression,
> reg = glm(Y~X1+X2+X3A+X3B+X3C+X3D+X3E,family=binomial,data=db2)
> summary(reg)

Call:
glm(formula = Y ~ X1 + X2 + X3A + X3B + X3C + X3D + X3E, family = binomial,
    data = db2)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.0758   0.1226   0.2805   0.4798   2.0345

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.39528    0.86649  -6.227 4.77e-10 ***
X1           0.51618    0.09163   5.633 1.77e-08 ***
X2           0.24665    0.05911   4.173 3.01e-05 ***
X3A         -0.09142    0.32970  -0.277   0.7816
X3B         -0.10558    0.32526  -0.325   0.7455
X3C          0.63829    0.37838   1.687   0.0916 .
X3D         -0.02776    0.33070  -0.084   0.9331
X3E               NA         NA      NA       NA
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 806.29  on 999  degrees of freedom
Residual deviance: 582.29  on 993  degrees of freedom
AIC: 596.29

Number of Fisher Scoring iterations: 6
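The NA on X3E is the usual dummy-variable trap: the five indicator columns sum to one on every row, so together they duplicate the intercept. A minimal sketch on a made-up factor (the toy data and the names below are assumptions of mine, not the post's dataset) makes this visible:

```r
# Toy factor standing in for db$X3 (hypothetical data, for illustration only)
toy <- data.frame(X3 = factor(rep(c("A", "B", "C", "D", "E"), times = 4)))

# One indicator column per level, built the same way as X3complete above
dummies <- model.matrix(~ 0 + X3, data = toy)

# Every row has exactly one indicator equal to 1 ...
row_totals <- rowSums(dummies)

# ... so adding an intercept column gives 6 columns but a rank of only 5
design <- cbind(Intercept = 1, dummies)
design_rank <- qr(design)$rank
c(columns = ncol(design), rank = design_rank)
```

Because one column is redundant, glm() leaves one coefficient undefined (here X3E), and the fit coincides with the factor version in which E is the reference.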
Both regressions are equivalent. Now, what about a stepwise selection on this new model?
> step(reg)
Start:  AIC=596.29
Y ~ X1 + X2 + X3A + X3B + X3C + X3D + X3E

Step:  AIC=596.29
Y ~ X1 + X2 + X3A + X3B + X3C + X3D

       Df Deviance    AIC
- X3D   1   582.30 594.30
- X3A   1   582.37 594.37
- X3B   1   582.40 594.40
<none>      582.29 596.29
- X3C   1   585.21 597.21
- X2    1   600.56 612.56
- X1    1   617.25 629.25

Step:  AIC=594.3
Y ~ X1 + X2 + X3A + X3B + X3C

       Df Deviance    AIC
- X3A   1   582.38 592.38
- X3B   1   582.41 592.41
<none>      582.30 594.30
- X3C   1   586.30 596.30
- X2    1   600.58 610.58
- X1    1   617.27 627.27

Step:  AIC=592.38
Y ~ X1 + X2 + X3B + X3C

       Df Deviance    AIC
- X3B   1   582.44 590.44
<none>      582.38 592.38
- X3C   1   587.20 595.20
- X2    1   600.59 608.59
- X1    1   617.64 625.64

Step:  AIC=590.44
Y ~ X1 + X2 + X3C

       Df Deviance    AIC
<none>      582.44 590.44
- X3C   1   587.81 593.81
- X2    1   600.73 606.73
- X1    1   617.66 623.66
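One way to read these tables, sketched with numbers copied from the step() outputs above (the variable names are mine): removing a term improves the AIC only when the deviance it was buying is less than twice the number of parameters it uses.

```r
# Residual deviances copied from the step() outputs above
dev_x1_x2     <- 587.81  # Y ~ X1 + X2
dev_x1_x2_x3c <- 582.44  # Y ~ X1 + X2 + X3C
dev_x1_x2_x3  <- 582.29  # Y ~ X1 + X2 + X3 (all categories)

# Change in AIC when a term is removed:
# deviance increase minus 2 for each parameter freed
delta_aic_drop_x3c <- (dev_x1_x2 - dev_x1_x2_x3c) - 2 * 1  # single indicator
delta_aic_drop_x3  <- (dev_x1_x2 - dev_x1_x2_x3)  - 2 * 4  # whole factor

c(drop_X3C = delta_aic_drop_x3c, drop_X3 = delta_aic_drop_x3)
```

A positive value means that removal would worsen the AIC. Almost all of the deviance gained from X3 comes from category C alone, but as a factor it has to pay for four parameters instead of one.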
This time, the stepwise procedure recommends that we keep one category (namely C). So my point is simple: when running a stepwise procedure with a factor, either we keep the factor as it is, or we drop it entirely. If the design should have been changed by pooling some categories together and we forgot to do it, the procedure will suggest removing the whole variable, because carrying four categories that mean the same thing costs too much under the Akaike criterion. And this is exactly what happens here,
> library(car)
> reg = glm(formula = Y ~ X1 + X2 + X3, family = binomial, data = db)
> linearHypothesis(reg,c("X3A=X3B","X3A=X3D","X3A=0"))
Linear hypothesis test

Hypothesis:
X3A - X3B = 0
X3A - X3D = 0
X3A = 0

Model 1: restricted model
Model 2: Y ~ X1 + X2 + X3

  Res.Df Df  Chisq Pr(>Chisq)
1    996
2    993  3 0.1446      0.986
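As a rough cross-check of this Wald test, one can compare deviances instead. Under the three restrictions only the C indicator remains, which is the Y ~ X1 + X2 + X3C model reached by step() earlier, so the sketch below only reuses deviances already printed (the variable names are mine).

```r
# Residual deviances copied from the outputs above
dev_restricted   <- 582.44  # Y ~ X1 + X2 + X3C (A, B, D pooled with E)
dev_unrestricted <- 582.29  # Y ~ X1 + X2 + X3

# Likelihood ratio statistic, compared with a chi-square on 3 degrees of freedom
lr_stat <- dev_restricted - dev_unrestricted
p_value <- pchisq(lr_stat, df = 3, lower.tail = FALSE)
c(statistic = lr_stat, p_value = p_value)
```

The statistic is close to the 0.1446 reported by linearHypothesis, and the conclusion is the same: nothing in the data separates A, B, D and E.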
The hypothesis is far from being rejected (p-value of 0.986), so here we should pool together categories A, B, D and E (E being the reference here). As mentioned in a previous post, categories that should be pooled together need to be pooled as early as possible. If not, the stepwise procedure might lead to some misinterpretations.
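A minimal sketch of how the pooling could be done, on made-up data (the toy data frame and the name X3pooled are assumptions of mine, not part of the original analysis):

```r
# Toy factor standing in for db$X3 (hypothetical data, for illustration only)
set.seed(1)
toy <- data.frame(
  X3 = factor(sample(c("A", "B", "C", "D", "E"), 200, replace = TRUE))
)

# Pool A, B, D and E into a single reference level; keep C on its own
toy$X3pooled <- factor(ifelse(toy$X3 == "C", "C", "ABDE"),
                       levels = c("ABDE", "C"))

# Cross-tabulate to check that only C ended up in the second level
pooling_table <- table(original = toy$X3, pooled = toy$X3pooled)
pooling_table
```

Applied to db$X3, the same recoding would give a two-level factor that could be used in glm(Y ~ X1 + X2 + X3pooled, family = binomial, data = db), and that a stepwise procedure can then keep or drop as a whole.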