Regression on variables, or on categories?

I admit it, the title sounds weird. The problem I want to address this evening is related to the use of the stepwise procedure on a regression model, and to discuss the use of categorical variables (and possible misinterpreations). Consider the following dataset

> db = read.table("http://freakonometrics.free.fr/db2.txt",header=TRUE,sep=";")

First, let us change the reference in our categorical variable (just to get an easier interpretation later on)

> db$X3=relevel(as.factor(db$X3),ref="E")

If we run a logistic regression on the three variables (two continuous, one categorical), we get

> reg=glm(Y~X1+X2+X3,family=binomial,data=db)
> summary(reg)

Call:
glm(formula = Y ~ X1 + X2 + X3, family = binomial, data = db)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.0758   0.1226   0.2805   0.4798   2.0345  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.39528    0.86649  -6.227 4.77e-10 ***
X1           0.51618    0.09163   5.633 1.77e-08 ***
X2           0.24665    0.05911   4.173 3.01e-05 ***
X3A         -0.09142    0.32970  -0.277   0.7816    
X3B         -0.10558    0.32526  -0.325   0.7455    
X3C          0.63829    0.37838   1.687   0.0916 .  
X3D         -0.02776    0.33070  -0.084   0.9331    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 806.29  on 999  degrees of freedom
Residual deviance: 582.29  on 993  degrees of freedom
AIC: 596.29

Number of Fisher Scoring iterations: 6

Now, if we use a stepwise procedure, to select variables in the model, we get

> step(reg)
Start:  AIC=596.29
Y ~ X1 + X2 + X3

       Df Deviance    AIC
- X3    4   587.81 593.81
<none>      582.29 596.29
- X2    1   600.56 612.56
- X1    1   617.25 629.25

Step:  AIC=593.81
Y ~ X1 + X2

       Df Deviance    AIC
<none>      587.81 593.81
- X2    1   606.90 610.90
- X1    1   622.44 626.44

So clearly, we should remove the categorical variable if our starting point was the regression on the three variables.

Now, what if we consider the same model, but slightly different: on the five categories,

> X3complete = model.matrix(~0+X3,data=db)
> db2 = data.frame(db,X3complete)
> head(db2)
  Y       X1       X2 X3 X3A X3B X3C X3D X3E
1 1 3.297569 16.25411  B   0   1   0   0   0
2 1 6.418031 18.45130  D   0   0   0   1   0
3 1 5.279068 16.61806  B   0   1   0   0   0
4 1 5.539834 19.72158  C   0   0   1   0   0
5 1 4.123464 18.38634  C   0   0   1   0   0
6 1 7.778443 19.58338  C   0   0   1   0   0

From a technical point of view, it is exactly the same as before, if we look at the regression,

> reg = glm(Y~X1+X2+X3A+X3B+X3C+X3D+X3E,family=binomial,data=db2)
> summary(reg)

Call:
glm(formula = Y ~ X1 + X2 + X3A + X3B + X3C + X3D + X3E, family = binomial, 
    data = db2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.0758   0.1226   0.2805   0.4798   2.0345  

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.39528    0.86649  -6.227 4.77e-10 ***
X1           0.51618    0.09163   5.633 1.77e-08 ***
X2           0.24665    0.05911   4.173 3.01e-05 ***
X3A         -0.09142    0.32970  -0.277   0.7816    
X3B         -0.10558    0.32526  -0.325   0.7455    
X3C          0.63829    0.37838   1.687   0.0916 .  
X3D         -0.02776    0.33070  -0.084   0.9331    
X3E               NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 806.29  on 999  degrees of freedom
Residual deviance: 582.29  on 993  degrees of freedom
AIC: 596.29

Number of Fisher Scoring iterations: 6

Both regressions are equivalent. Now, what about a stepwise selection on this new model?

> step(reg)
Start:  AIC=596.29
Y ~ X1 + X2 + X3A + X3B + X3C + X3D + X3E

Step:  AIC=596.29
Y ~ X1 + X2 + X3A + X3B + X3C + X3D

       Df Deviance    AIC
- X3D   1   582.30 594.30
- X3A   1   582.37 594.37
- X3B   1   582.40 594.40
<none>      582.29 596.29
- X3C   1   585.21 597.21
- X2    1   600.56 612.56
- X1    1   617.25 629.25

Step:  AIC=594.3
Y ~ X1 + X2 + X3A + X3B + X3C

       Df Deviance    AIC
- X3A   1   582.38 592.38
- X3B   1   582.41 592.41
<none>      582.30 594.30
- X3C   1   586.30 596.30
- X2    1   600.58 610.58
- X1    1   617.27 627.27

Step:  AIC=592.38
Y ~ X1 + X2 + X3B + X3C

       Df Deviance    AIC
- X3B   1   582.44 590.44
<none>      582.38 592.38
- X3C   1   587.20 595.20
- X2    1   600.59 608.59
- X1    1   617.64 625.64

Step:  AIC=590.44
Y ~ X1 + X2 + X3C

       Df Deviance    AIC
<none>      582.44 590.44
- X3C   1   587.81 593.81
- X2    1   600.73 606.73
- X1    1   617.66 623.66

What do we get now? This time, the stepwise procedure recommends that we keep one category (namely C). So my point is simple: when running a stepwise procedure with factors, either we keep the factor as it is, or we drop it. If it is necessary to change the design, by pooling together some categories, and we forgot to do it, then it will be suggested to remove that variable, because having 4 categories meaning the same thing will cost us too much if we use the Akaike criteria. Because this is exactly what happens here

> library(car)
> reg = glm(formula = Y ~ X1 + X2 + X3, family = binomial, data = db)
> linearHypothesis(reg,c("X3A=X3B","X3A=X3D","X3A=0"))

Linear hypothesis test

Hypothesis:
X3A - X3B = 0
X3A - X3D = 0
X3A = 0

Model 1: restricted model
Model 2: Y ~ X1 + X2 + X3

  Res.Df Df  Chisq Pr(>Chisq)
1    996                     
2    993  3 0.1446      0.986

So here, we should pool together categories A, B, D and E (which was here the reference). As mentioned in a previous post, it is necessary to pool together categories that should be pulled together as soon as possible. If not, the stepwise procedure might yield to some misinterpretations.

Regression on variables, or on categories?

Trending Articles

Vimeo 10.7.0 by Vimeo.com, Inc.

Dibujos para colorear de perros

Long Distance Relationship Tagalog Love Quotes

Kahit may Toyo ka

Re:Mutton Pies (lleechef)

FORECLOSURE OF REAL ESTATE MORTGAGE

Re: lwIP PIC32 port - new title : CycloneTCP a new open source stack for...

Ka dewlynnong Nongkhnum, ka jaka ba itynnat tam ha West Khasi Hills

Vimeo 10.7.1 by Vimeo.com, Inc.

Pahiyas 2013 sa Lucban, Quezon

Autor: Adam

Renos para colorear

Two timer Sad tagalog Love quotes

RE: Mutton Pies (mely)

Mga Tala sa “Unang Siglo ng Nobela sa Filipinas” (2009) ni Virgilio S. Almario

Re: lwIP PIC32 port - new title : CycloneTCP a new open source stack for...

Ka longiing longsem kaba skhem bad kaba khlain ka pynlong kein ia ka...

作者：人与自然和谐相处的世外桃源 - 加拉帕戈斯群岛 Galapagos Islands 游记 - 美国信用卡指南

Vimeo 10.12.2 by Vimeo.com, Inc.

PREMATURE CAMPAIGNING – Meron ba nun?