The first part of the course on loss reserving (computing the provision for claims outstanding) will take place in 10 days. The slides are online here, and cover the construction of run-off (payment) triangles. The chain-ladder method (and the formalization proposed by Thomas Mack), as well as its extensions, will be presented. The second part will cover methods based on Poisson regression.
Triangles and loss reserving
Midterm exam, logistic and Poisson regression
The midterm exam for the ACT2040 course will take place Wednesday morning, from 9:00 to 12:00. No documents are allowed, except calculators (standard model, see the course outline), and phones are strictly forbidden. There will be 34 questions on the first part of the course (up to the end of count models, sections 1 to 5 of the slides). 15 questions will deal with the dataset described in a previous post, on the number of extramarital affairs. The task will be to interpret the outputs available online here. I am leaving 36 hours to go through these outputs. A copy will be handed out during the exam (printed 2 pages per sheet, as in the online version: if anyone needs a larger print, please let me know before the exam).
Midterm exam, elements of correction
The midterm exam questions are in a pdf here and, as announced by email, the correction is in the pdf online. Since nobody seems to disagree with the proposed answers, the grades will be posted very soon. Regarding questions 18 and 19, a few additional explanations (which I had not typed up in the pdf). We had seen that the maximum likelihood estimator in a Poisson regression is asymptotically Gaussian,
$$\widehat{\boldsymbol{\beta}}_{\mathcal{P}}\approx\mathcal{N}\big(\boldsymbol{\beta},V_{\mathcal{P}}\big)$$
(asymptotically), with
$$V_{\mathcal{P}}=\big(\boldsymbol{X}^\top\boldsymbol{W}\boldsymbol{X}\big)^{-1},\qquad \boldsymbol{W}=\text{diag}\big(\widehat{\mu}_1,\ldots,\widehat{\mu}_n\big).$$
When we move to a negative binomial type regression, write very generally $\text{Var}(Y_i\vert\boldsymbol{X}_i)=\phi\cdot V(\mu_i)$ (we had seen in class that several specifications are possible for this conditional variance). In that case,
$$\widehat{\boldsymbol{\beta}}_{\mathcal{NB}}\approx\mathcal{N}\big(\boldsymbol{\beta},V_{\mathcal{NB}}\big)$$
with
$$V_{\mathcal{NB}}=\big(\boldsymbol{X}^\top\boldsymbol{W}\boldsymbol{X}\big)^{-1}\big(\boldsymbol{X}^\top\widetilde{\boldsymbol{W}}\boldsymbol{X}\big)\big(\boldsymbol{X}^\top\boldsymbol{W}\boldsymbol{X}\big)^{-1},\qquad \widetilde{\boldsymbol{W}}=\text{diag}\big(\text{Var}(Y_i\vert\boldsymbol{X}_i)\big).$$
In short, everything depends fundamentally on the specification of the conditional variance. In R, it is the type 1 negative binomial (quasi-Poisson) variance that is considered, i.e.
$$\text{Var}(Y_i\vert\boldsymbol{X}_i)=\phi\cdot\mu_i.$$
We then still have a relation of the form
$$\widehat{\boldsymbol{\beta}}_{\mathcal{NB}}\approx\mathcal{N}\big(\boldsymbol{\beta},V_{\mathcal{NB}}\big)$$
with (simplifying a little)
$$V_{\mathcal{NB}}=\phi\cdot\big(\boldsymbol{X}^\top\boldsymbol{W}\boldsymbol{X}\big)^{-1},$$
so that
$$\text{se}\big(\widehat{\beta}_{\mathcal{NB},k}\big)=\sqrt{\phi}\cdot\text{se}\big(\widehat{\beta}_{\mathcal{P},k}\big).$$
But, as announced in class, points were also given to those who simply stated that the variance of the estimators is larger when there is overdispersion.
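To illustrate that relation numerically, here is a minimal sketch (simulated, hypothetical data, not the exam dataset): the quasi-Poisson standard errors are exactly the Poisson ones multiplied by the square root of the estimated dispersion.

set.seed(1)
x <- rnorm(200)
y <- rnbinom(200, mu = exp(1 + 0.5 * x), size = 2)   # overdispersed counts
reg_p  <- glm(y ~ x, family = poisson)
reg_qp <- glm(y ~ x, family = quasipoisson)
phi <- summary(reg_qp)$dispersion                     # estimated dispersion parameter
summary(reg_p)$coefficients[, "Std. Error"] * sqrt(phi)
summary(reg_qp)$coefficients[, "Std. Error"]          # same values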
Solvency and loss reserving
On Wednesday we will discuss in class the solvency of P&C insurance companies. More specifically, we will talk about the provision for claims outstanding (PCO), i.e. “the estimated total cost of ultimate settlement of all claims incurred before the date of record, whether reported or not, less any amounts already paid out in respect thereof.” For a global overview of approaches to these provisions, I refer to Le contrôle de la solvabilité des compagnies d’assurance, available online on the OECD website. The SOA published a report in 2009, Comparison of Incurred But Not Reported (IBNR) Methods, which I encourage you to read.
On Wednesday we will start working with triangles. Among the triangles we will manipulate,
> source("http://perso.univ-rennes1.fr/arthur.charpentier/bases.R")
which contains several objects, including
> PAID
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,] 3209 4372 4411 4428 4435 4456
[2,] 3367 4659 4696 4720 4730   NA
[3,] 3871 5345 5338 5420   NA   NA
[4,] 4239 5917 6020   NA   NA   NA
[5,] 4929 6794   NA   NA   NA   NA
[6,] 5217   NA   NA   NA   NA   NA
as well as the triangle discussed on http://rworkingparty.wikidot.com/
> OthLiabData = read.csv("http://www.casact.org/research/reserve_data/othliab_pos.csv",header=TRUE, sep=",")
> library(ChainLadder)
> library(plyr)
> OL = SumData=ddply(OthLiabData,.(AccidentYear,DevelopmentYear,DevelopmentLag),summarise,IncurLoss=sum(IncurLoss_h1-BulkLoss_h1),
+ CumPaidLoss=sum(CumPaidLoss_h1), EarnedPremDIR=sum(EarnedPremDIR_h1))
> LossTri = as.triangle(OL, origin="AccidentYear",
+ dev = "DevelopmentLag", value="IncurLoss")
> Year = as.triangle(OL, origin="AccidentYear",
+ dev = "DevelopmentLag", value="DevelopmentYear")
> TRIANGLE=LossTri
> TRIANGLE[Year>1997]=NA
> TRIANGLE
      dev
origin      1      2      3      4      5      6      7      8      9     10
  1988 128747 195938 241180 283447 297402 308815 314126 317027 319135 319559
  1989 135147 208767 270979 304488 330066 339871 344742 347800 353245     NA
  1990 152400 238665 297495 348826 359413 364865 372436 372163     NA     NA
  1991 151812 266245 357430 400405 423172 442329 460713     NA     NA     NA
  1992 163737 269170 347469 381251 424810 451221     NA     NA     NA     NA
  1993 187756 358573 431410 476674 504667     NA     NA     NA     NA     NA
  1994 210590 351270 486947 581599     NA     NA     NA     NA     NA     NA
  1995 213141 351363 444272     NA     NA     NA     NA     NA     NA     NA
  1996 237162 378987     NA     NA     NA     NA     NA     NA     NA     NA
  1997 220509     NA     NA     NA     NA     NA     NA     NA     NA     NA
Chain Ladder, with R
A quick post to put online parts of the code typed in class last Wednesday. We started by converting the sheet of the Excel workbook into a text file, to make it easier to read,
> setwd("C:\\Users\\savsalledecours\\Desktop") > triangle=read.table("exACT2040.csv",header=TRUE,sep=";") > triangle ANNEE X0 X1 X2 X3 X4 X5 1 2000 3209 4372 4411 4428 4435 4456 2 2001 3367 4659 4696 4720 4730 NA 3 2002 3871 5345 5398 5420 NA NA 4 2003 4239 5917 6020 NA NA NA 5 2004 4929 6794 NA NA NA NA 6 2005 5217 NA NA NA NA NA
The idea – when importing a triangle – is to end up with a dataset in the form above, with missing values in the lower part of the triangle (we will see why this is convenient when we run a regression). We then computed the development factors and, at the same time, completed the triangle,
> T=triangle[,2:7]
> rownames(T)=triangle$ANNEE
> T2=T
> n=ncol(T)
> L=rep(NA,n-1)
> for(j in 1:(n-1)){
+ L[j]=sum(T[1:(n-j),j+1])/sum(T[1:(n-j),j])
+ T2[(n-j+1):n,j+1]=L[j]*T2[(n-j+1):n,j]
+ }
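For reference, the loop above implements the usual chain-ladder estimators: with $C_{i,j}$ denoting the cumulative payments of accident year $i$ after $j$ development periods,
$$\widehat{\lambda}_j=\frac{\displaystyle\sum_{i=1}^{n-j}C_{i,j+1}}{\displaystyle\sum_{i=1}^{n-j}C_{i,j}},\qquad \widehat{C}_{i,j+1}=\widehat{\lambda}_j\cdot\widehat{C}_{i,j}\ \text{ for the cells in the lower triangle.}$$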
The development factors are
> L [1] 1.380933 1.011433 1.004343 1.001858 1.004735
and the completed triangle is
> T2
       X0       X1       X2       X3       X4       X5
2000 3209 4372.000 4411.000 4428.000 4435.000 4456.000
2001 3367 4659.000 4696.000 4720.000 4730.000 4752.397
2002 3871 5345.000 5398.000 5420.000 5430.072 5455.784
2003 4239 5917.000 6020.000 6046.147 6057.383 6086.065
2004 4929 6794.000 6871.672 6901.518 6914.344 6947.084
2005 5217 7204.327 7286.691 7318.339 7331.939 7366.656
The amount of reserves is then obtained by taking the difference between the ultimate claim amounts (in the last column) and the latest observed payments (on the last diagonal),
> CU=T2[,n]
> Pat=diag(as.matrix(T2[,n:1]))
> Ri=CU-Pat
> R=sum(Ri)
which gives, numerically,
> R [1] 2426.985
We then saw that we could compute a tail factor, assuming an exponential decay of the development factors, and we added a column with the ultimate amount, per accident year,
> logL=log(L-1)
> t=1:5
> b=data.frame(logL,t)
> reg=lm(logL~t,data=b)
> logLp=predict(reg,newdata=data.frame(t=6:100))
> Lp=exp(logLp)+1
> Linf=prod(Lp)
> T3=T2
> T3$Xinf=T3$X5*Linf
We get
> T3
       X0       X1       X2       X3       X4       X5     Xinf
2000 3209 4372.000 4411.000 4428.000 4435.000 4456.000 4459.149
2001 3367 4659.000 4696.000 4720.000 4730.000 4752.397 4755.755
2002 3871 5345.000 5398.000 5420.000 5430.072 5455.784 5459.639
2003 4239 5917.000 6020.000 6046.147 6057.383 6086.065 6090.366
2004 4929 6794.000 6871.672 6901.518 6914.344 6947.084 6951.993
2005 5217 7204.327 7286.691 7318.339 7331.939 7366.656 7371.862
(I leave it to you to adapt the code above to compute the corresponding amount of reserves.) Finally, we showed how to use a weighted regression to compute the development factors,
> T4=as.matrix(T$X0,n,1)
> for(j in 1:(n-1)){
+ Y=T[,j+1]
+ X=T[,j]
+ base=data.frame(X,Y)
+ reg=lm(Y~0+X,weights=1/X)
+ T4=cbind(T4,
+ predict(reg,
+ newdata=data.frame(X=T4[,j]
+ )))
+ }
which gives the same projection as the Chain Ladder method,
> T4
  [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
1 3209 4431.414 4482.076 4501.543 4509.909 4531.263
2 3367 4649.601 4702.758 4723.184 4731.961 4754.367
3 3871 5345.591 5406.705 5430.188 5440.279 5466.039
4 4239 5853.775 5920.698 5946.414 5957.464 5985.673
5 4929 6806.619 6884.435 6914.337 6927.186 6959.986
6 5217 7204.327 7286.691 7318.339 7331.939 7366.656
More next Wednesday, even though we will probably go quite quickly over Mack's method (and the mean squared error calculations, before moving on to the Poisson regression). To be continued…
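As a preview of Mack's model, here is a hedged sketch (not the code used in class), assuming the ChainLadder package is installed and using the cumulative triangle T built above: the MackChainLadder function reproduces the deterministic chain-ladder ultimates and adds the mean squared errors of prediction.

library(ChainLadder)
M <- MackChainLadder(as.matrix(T), est.sigma = "Mack")  # T = cumulative triangle above
summary(M)   # ultimates, IBNR and Mack's standard errors
plot(M)      # diagnostic plots (development, residuals, ...)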
Reserving with negative increments in triangles
A few months ago, I published a post on negative values in triangles, and how to deal with them when using a Poisson regression (the post was written in French). The idea was to use a translation technique:
- Fit the model not on the $Y_i$'s but on the translated values $Y_i+k$, for some constant $k\geq 0$,
- Use that model to make predictions, and then translate those predictions back, $\widehat{Y}_i=\widehat{(Y+k)}_i-k$.
This is what was done to get the following graph, where a Poisson regression was fitted. Black points are the $Y_i$'s while blue points are the translated $Y_i+k$'s, for some $k$. We fit a model on the translated values to get the blue prediction, and then translate it to get the red prediction (on the scale of the $Y_i$'s).
In this example, there were no negative values, but the exercise is still useful to understand the impact of the technique. The prediction, here, is the red line. And clearly, the value of $k$ has an impact on the prediction (since we do not consider, here, a linear model: with a linear model, translating has no impact at all, except on the intercept).
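A minimal sketch of that translation idea, on simulated (hypothetical) data: fit a Poisson model on y + k, predict, then subtract k from the predictions, and compare with the direct fit.

set.seed(123)
x <- seq(1, 10, by = .5)
y <- rpois(length(x), exp(1 + .2 * x))
k <- 10
reg  <- glm(y ~ x, family = poisson)            # benchmark: fit on y directly
regk <- glm((y + k) ~ x, family = poisson)      # fit on the translated counts
p  <- predict(reg,  type = "response")
pk <- predict(regk, type = "response") - k      # translate the predictions back
plot(x, y, pch = 19)
lines(x, p,  col = "red")                        # direct fit
lines(x, pk, col = "blue", lty = 2)              # translated fit (differs, since the model is not linear)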
The alternative mentioned in the previous post was to use this technique for several values of $k$, and then extrapolate:
- For a given $k$, fit a model not on the $Y_i$'s but on the $Y_i+k$'s, use that model to make predictions, and then translate those predictions, $\widehat{(Y+k)}_i-k$.
- Do it for several values of $k$.
- Use those results to extrapolate down to $k=0$ (which is the case we are actually interested in).
In the context of loss reserving, the idea is extremely simple. Consider a triangle with incremental payments
> source("http://perso.univ-rennes1.fr/arthur.charpentier/bases.R") > Y=T=PAID > n=ncol(T) > Y[,2:n]=T[,2:n]-T[,1:(n-1)] > Y [,1] [,2] [,3] [,4] [,5] [,6] [1,] 3209 1163 39 17 7 21 [2,] 3367 1292 37 24 10 NA [3,] 3871 1474 53 22 NA NA [4,] 4239 1678 103 NA NA NA [5,] 4929 1865 NA NA NA NA [6,] 5217 NA NA NA NA NA
Here, we do not have negative values, but we can still check whether translation techniques can be used. The benchmark is the Poisson regression, since we can run it:
> y=as.vector(as.matrix(Y))
> base=data.frame(y,ai=rep(2000:2005,n),bj=rep(0:(n-1),each=n))
> reg=glm(y~as.factor(ai)+as.factor(bj),data=base,family=poisson)
Here, the amount of reserves is the sum of the predicted values in the lower part of the triangle,
> py=predict(reg,newdata=base,type="response")
> sum(py[is.na(base$y)])
[1] 2426.985
which is exactly Chain Ladder’s estimate.
Now, let us use a translation technique to compute the amount of reserves. The code will be
> decal=function(k){
+ reg=glm(y+k~as.factor(ai)+as.factor(bj),data=base,family=poisson)
+ py=predict(reg,newdata=base,type="response")
+ return(sum(py[is.na(base$y)]-k))
+ }
For instance, if we translate by +5, we get
> decal(5) [1] 2454.713
while a translation of +10 would return
> decal(10) [1] 2482.29
Clearly, translations do have an impact on the estimation. Here, just to check, if we do not translate, we do have Chain Ladder’s estimate,
> decal(0) [1] 2426.985
The idea mentioned in the previous post was to try several translations, and then extrapolate to get the value at 0. Here, translations from +10 to +20 give the following estimates
> K=10:20
> (V=Vectorize(decal)(K))
 [1] 2482.290 2487.788 2493.279 2498.765 2504.245 2509.719 2515.187 2520.649
 [9] 2526.106 2531.557 2537.001
We can plot those values, and run a regression
> plot(K,V,xlim=c(0,20),ylim=c(2425,2540)) > abline(h=decal(0),col="red",lty=2)
the dotted horizontal line is Chain Ladder. Now, let us extrapolate
> b=data.frame(K=K,D=V)
> rk=lm(D~K,data=b)
> predict(rk,newdata=data.frame(K=0))
       1
2427.623
One has to admit that it is not that bad. But yesterday evening, Karim asked me why I used a linear regression for the extrapolation. And to be honest, I do not know. The only answer might be that the points are almost on a straight line. So the first time I saw it, I was excited, and I ran a linear regression.
Now, let us see if we can do better. Because here, we only used translations of +10 to +20 (which might be rather small). What if we use much larger values? (Because we might have large negative incremental values.) With the following code, we try, each time, 11 consecutive values of the translation, the smallest one going from 1 to 50,
> hausse=1:50; res=rep(NA,50)
> for(k in hausse){
+ VK=k:(10+k)
+ b=data.frame(K=VK,D=Vectorize(decal)(VK))
+ rk=lm(D~K,data=b)
+ res[k]=predict(rk,newdata=data.frame(K=0))
+ }
> plot(hausse,res,type="l",col="red",ylim=c(2422,2440))
> abline(rk,col="blue")
Here, we compute the reserves when the extrapolation is based on 11 translations, from $k$ to $k+10$, for different values of $k$. The case where $k$ is ten is the one mentioned above,
> res[hausse==10] [1] 2427.623
Actually, it might also be possible to consider not 11 translations, but 26, from $k$ to $k+25$. Here, we get
> hausse=1:50; res=rep(NA,50)
> for(k in hausse){
+ VK=k:(25+k)
+ b=data.frame(K=VK,D=Vectorize(decal)(VK))
+ rk=lm(D~K,data=b)
+ res[k]=predict(rk,newdata=data.frame(K=0))
+ }
> lines(hausse,res,type="l",col="blue",lty=2)
We now have the dotted line
Here, it is getting worse. So let us stick with 11 translations. Perhaps we can try something different, for instance a Poisson regression with a log link (i.e. we consider an exponential extrapolation),
> hausse=1:50; res=rep(NA,50)
> for(k in hausse){
+ VK=k:(10+k)
+ b=data.frame(K=VK,D=Vectorize(decal)(VK))
+ rk=glm(D~K,data=b,family=poisson)
+ res[k]=predict(rk,newdata=data.frame(K=0),type="response")
+ }
> lines(hausse,res,type="l",col="purple")
The purple line will be a Poisson model, with a log link. Perhaps we can try another link function, like a quadratic one
> hausse=1:50; res=rep(NA,50)
> for(k in hausse){
+ VK=k:(10+k)
+ b=data.frame(K=VK,D=Vectorize(decal)(VK))
+ rk=glm(D~K,data=b,family=poisson(link=
+ power(lambda = 2)))
+ res[k]=predict(rk,newdata=data.frame(K=0),type="response")
+ }
> lines(hausse,res,type="l",col="orange")
That would be the orange line,
Here, we need a link function between identity (the linear model, the blue line) and the quadratic one (the orange one), for instance a power function 3/2,
> hausse=1:50; res=rep(NA,50)
> for(k in hausse){
+ VK=k:(10+k)
+ b=data.frame(K=VK,D=Vectorize(decal)(VK))
+ rk=glm(D~K,data=b,family=poisson(link=
+ power(lambda = 1.5)))
+ res[k]=predict(rk,newdata=data.frame(K=0),type="response")
+ }
> lines(hausse,res,type="l",col="green")
Here, it looks like we can use that model for any kind of translation, from +10 till +50, even +100 ! But I do not have any intuition about the use of this power function…
Reinsurance
The last session of the P&C actuarial science course will take place on Wednesday.
Among the complementary readings: Introduction à la réassurance, published by Swiss Re, as well as some more technical documents, such as The Pareto model in property reinsurance, Exposure rating, Designing property reinsurance programmes or Introduction to reinsurance accounting. Several reinsurers (and reinsurance brokers) publish technical studies on their websites, http://swissre.com/, http://munichre.com/, http://aon.com/, http://scor.com/ or http://guycarp.com/. Otherwise, I refer to Peter Antal's lecture notes, quantitative methods in reinsurance.
The slides are online here. Among the orders of magnitude discussed, the costliest catastrophes for insurers:
- Hurricane Katrina (US, Bahamas, Cuba, Aug. 2005), $ 72.3 billion
- Tōhoku earthquake and tsunami (Japan, March 2011), $ 35 billion
- Hurricane Andrew (US, Bahamas, August 1992), $ 25 billion
- September 11 attacks (US) $ 23.1 billion
- Northridge earthquake (US) $ 20.6 billion
- Hurricane Ike (US, Haiti, Dominican Republic, Sept. 2008) $ 20.5 billion
- Hurricane Ivan (US, Barbados, Sept. 2004) $ 14.9 billion
- Hurricane Wilma (US, Mexico, Jamaica, Oct. 2005), $ 14 billion
- Hurricane Rita (US, Cuba, Sept. 2005) $ 11.3 billion
- Hurricane Charley (US, Cuba, Jamaica, Aug. 2004) $ 9.3 billion
By way of comparison, the revenues of the largest reinsurers (premiums written in 2010) were, according to http://www.insurancenetworking.com/…
- Munich Reinsurance Company $ 31.3 billion
- Swiss Reinsurance Company Limited $ 24.7 billion
- Hannover Rueckversicherung AG $ 15.1 billion
- Berkshire Hathaway Inc. $ 14.4 billion
- Lloyd’s $ 13 billion
- SCOR S.E. $ 8.8 billion
- Reinsurance Group of America Inc. $ 7.2 billion
- Allianz S.E. $ 5.7 billion
- PartnerRe Ltd. $ 4.9 billion
- Everest Re Group Ltd. $ 4.2 billion
Overdispersed Poisson and bootstrap
For the last class on claims reserving methods, we stopped at simulation-based methods. Let us pick up where we left off in the last post, where we saw that running a Poisson regression on the incremental payments gives exactly the same amount as the Chain Ladder method,
> Y [,1] [,2] [,3] [,4] [,5] [,6] [1,] 3209 1163 39 17 7 21 [2,] 3367 1292 37 24 10 NA [3,] 3871 1474 53 22 NA NA [4,] 4239 1678 103 NA NA NA [5,] 4929 1865 NA NA NA NA [6,] 5217 NA NA NA NA NA > y=as.vector(as.matrix(Y)) > base=data.frame(y,ai=rep(2000:2005,n),bj=rep(0:(n-1),each=n)) > reg2=glm(y~as.factor(ai)+as.factor(bj),data=base,family=poisson) > summary(reg2) Call: glm(formula = y ~ as.factor(ai) + as.factor(bj), family = poisson, data = base) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 8.05697 0.01551 519.426 < 2e-16 *** as.factor(ai)2001 0.06440 0.02090 3.081 0.00206 ** as.factor(ai)2002 0.20242 0.02025 9.995 < 2e-16 *** as.factor(ai)2003 0.31175 0.01980 15.744 < 2e-16 *** as.factor(ai)2004 0.44407 0.01933 22.971 < 2e-16 *** as.factor(ai)2005 0.50271 0.02079 24.179 < 2e-16 *** as.factor(bj)1 -0.96513 0.01359 -70.994 < 2e-16 *** as.factor(bj)2 -4.14853 0.06613 -62.729 < 2e-16 *** as.factor(bj)3 -5.10499 0.12632 -40.413 < 2e-16 *** as.factor(bj)4 -5.94962 0.24279 -24.505 < 2e-16 *** as.factor(bj)5 -5.01244 0.21877 -22.912 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 46695.269 on 20 degrees of freedom Residual deviance: 30.214 on 10 degrees of freedom (15 observations deleted due to missingness) AIC: 209.52 Number of Fisher Scoring iterations: 4 > base$py2=predict(reg2,newdata=base,type="response") > round(matrix(base$py2,n,n),1) [,1] [,2] [,3] [,4] [,5] [,6] [1,] 3155.7 1202.1 49.8 19.1 8.2 21.0 [2,] 3365.6 1282.1 53.1 20.4 8.8 22.4 [3,] 3863.7 1471.8 61.0 23.4 10.1 25.7 [4,] 4310.1 1641.9 68.0 26.1 11.2 28.7 [5,] 4919.9 1874.1 77.7 29.8 12.8 32.7 [6,] 5217.0 1987.3 82.4 31.6 13.6 34.7 > sum(base$py2[is.na(base$y)]) [1] 2426.985
The most interesting thing to note is perhaps that the Poisson distribution probably has too little variance…
> reg2b=glm(y~as.factor(ai)+as.factor(bj),data=base,family=quasipoisson) > summary(reg2) Call: glm(formula = y ~ as.factor(ai) + as.factor(bj), family = quasipoisson, data = base) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.05697 0.02769 290.995 < 2e-16 *** as.factor(ai)2001 0.06440 0.03731 1.726 0.115054 as.factor(ai)2002 0.20242 0.03615 5.599 0.000228 *** as.factor(ai)2003 0.31175 0.03535 8.820 4.96e-06 *** as.factor(ai)2004 0.44407 0.03451 12.869 1.51e-07 *** as.factor(ai)2005 0.50271 0.03711 13.546 9.28e-08 *** as.factor(bj)1 -0.96513 0.02427 -39.772 2.41e-12 *** as.factor(bj)2 -4.14853 0.11805 -35.142 8.26e-12 *** as.factor(bj)3 -5.10499 0.22548 -22.641 6.36e-10 *** as.factor(bj)4 -5.94962 0.43338 -13.728 8.17e-08 *** as.factor(bj)5 -5.01244 0.39050 -12.836 1.55e-07 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for quasipoisson family taken to be 3.18623) Null deviance: 46695.269 on 20 degrees of freedom Residual deviance: 30.214 on 10 degrees of freedom (15 observations deleted due to missingness) AIC: NA Number of Fisher Scoring iterations: 4
But we will come back to that in a moment. Next, we started looking at the errors made on the upper part of the triangle. Classically, by construction, the Pearson residuals are of the form
$$\widehat{\varepsilon}_i=\frac{Y_i-\widehat{Y}_i}{\sqrt{\widehat{\text{Var}}(Y_i)}}$$
We saw in the ratemaking part of the course that the variance in the denominator can be replaced by the prediction, since in a Poisson model the mean and the variance are equal. So we considered
$$\widehat{\varepsilon}_i=\frac{Y_i-\widehat{Y}_i}{\sqrt{\widehat{Y}_i}}$$
> base$erreur=(base$y-base$py2)/sqrt(base$py2)
> round(matrix(base$erreur,n,n),1)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]  0.9 -1.1 -1.5 -0.5 -0.4    0
[2,]  0.0  0.3 -2.2  0.8  0.4   NA
[3,]  0.1  0.1 -1.0 -0.3   NA   NA
[4,] -1.1  0.9  4.2   NA   NA   NA
[5,]  0.1 -0.2   NA   NA   NA   NA
[6,]  0.0   NA   NA   NA   NA   NA
The problem is that even if $\widehat{Y}_i$ is – asymptotically – a good estimator of $\text{Var}(Y_i)$, this is not the case at finite distance, since we then have a biased estimator of the variance, so the residuals have little chance of having unit variance. Hence, the variance estimator should be corrected, and we set
$$\widetilde{\varepsilon}_i=\sqrt{\frac{n}{n-k}}\cdot\frac{Y_i-\widehat{Y}_i}{\sqrt{\widehat{Y}_i}}$$
(with $n$ the number of observed cells and $k$ the number of estimated parameters), which are the Pearson residuals as they should be used.
> E=base$erreur[is.na(base$y)==FALSE]*sqrt(21/(21-11))
> E
 [1]  1.374976e+00  3.485024e-02  1.693203e-01 -1.569329e+00  1.887862e-01
 [6] -1.459787e-13 -1.634646e+00  4.018940e-01  8.216186e-02  1.292578e+00
[11] -3.058764e-01 -2.221573e+00 -3.207593e+00 -1.484151e+00  6.140566e+00
[16] -7.100321e-01  1.149049e+00 -4.307387e-01 -6.196386e-01  6.000048e-01
[21] -8.987734e-15
> boxplot(E,horizontal=TRUE)
By resampling from these residuals, we can then generate a pseudo-triangle. For the sake of simplicity, we will generate a full rectangle and then restrict ourselves to its upper part,
> Eb=sample(E,size=36,replace=TRUE)
> Yb=base$py2+Eb*sqrt(base$py2)
> Ybna=Yb
> Ybna[is.na(base$y)]=NA
> Tb=matrix(Ybna,n,n)
> round(matrix(Tb,n,n),1)
       [,1]   [,2] [,3] [,4] [,5] [,6]
[1,] 3115.8 1145.4 58.9 46.0  6.4 26.9
[2,] 3179.5 1323.2 54.5 21.3 12.2   NA
[3,] 4245.4 1448.1 61.0  7.9   NA   NA
[4,] 4312.4 1581.7 68.7   NA   NA   NA
[5,] 4948.1 1923.9   NA   NA   NA   NA
[6,] 4985.3     NA   NA   NA   NA   NA
This time, we have a new triangle! We can then do several things:
- complete the triangle with the Chain Ladder method, i.e. compute the expected amounts to be paid in future years,
- generate payment scenarios for future years, by simulating payments from Poisson distributions (centered on the expected amounts we have just computed),
- generate payment scenarios from distributions with more variance than the Poisson. Ideally, we would like to simulate quasi-Poisson distributions, but these are not proper probability distributions. However, we can recall that, in this case, the Gamma distribution should provide a good approximation.
For this last point, we will use the following code to generate quasi-Poisson draws,
> rqpois = function(n, lambda, phi, roundvalue = TRUE) {
+ b = phi
+ a = lambda/phi
+ r = rgamma(n, shape = a, scale = b)
+ if(roundvalue){r=round(r)}
+ return(r)
+ }
I refer to the various lecture notes for more details on the justification, or to an old post. We then write a small function which, starting from a triangle, either sums the expected future payments, or sums simulated payment scenarios,
> CL=function(Tri){
+ y=as.vector(as.matrix(Tri))
+ base=data.frame(y,ai=rep(2000:2005,n),bj=rep(0:(n-1),each=n))
+ reg=glm(y~as.factor(ai)+as.factor(bj),data=base,family=quasipoisson)
+ py2=predict(reg,newdata=base,type="response")
+ pys=rpois(36,py2)
+ pysq=rqpois(36,py2,phi=3.18623)
+ return(list(
+ cl=sum(py2[is.na(base$y)]),
+ sc=sum(pys[is.na(base$y)]),
+ scq=sum(pysq[is.na(base$y)])))
+ }
It then remains to generate a large number of triangles. Note that it is possible to generate triangles with negative increments. To keep things simple, we will set negative payments to zero; the impact on the quantiles should (a priori) be negligible.
> VCL=VR=VRq=rep(NA,1000)   # storage vectors, needed before the loop
> for(s in 1:1000){
+ Eb=sample(E,size=36,replace=TRUE)*sqrt(21/(21-11))
+ Yb=base$py2+Eb*sqrt(base$py2)
+ Yb=pmax(Yb,0)
+ scY=rpois(36,Yb)
+ Ybna=Yb
+ Ybna[is.na(base$y)]=NA
+ Tb=matrix(Ybna,6,6)
+ C=CL(Tb)
+ VCL[s]=C$cl
+ VR[s]=C$sc
+ VRq[s]=C$scq
+ }
Looking at the distribution of the best estimate, we get
> hist(VCL,proba=TRUE,col="light blue",border="white",ylim=c(0,0.003))
> boxplot(VCL,horizontal=TRUE,at=.0023,boxwex = 0.0006,add=TRUE,col="light green")
> D=density(VCL)
> lines(D)
> I=which(D$x<=quantile(VCL,.05))
> polygon(c(D$x[I],rev(D$x[I])),c(D$y[I],rep(0,length(I))),col="blue",border=NA)
> I=which(D$x>=quantile(VCL,.95))
> polygon(c(D$x[I],rev(D$x[I])),c(D$y[I],rep(0,length(I))),col="blue",border=NA)
But we can also visualize the scenarios based on Poisson distributions (equidispersed) or on quasi-Poisson distributions (overdispersed), as sketched below.
In this last case, we can deduce the 99% quantile of future payments,
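A minimal sketch of that comparison (assuming VCL, VR and VRq were filled by the loop above; the exact plot shown in class may differ):

plot(density(VRq), col = "purple", main = "", xlab = "Total future payments")  # quasi-Poisson scenarios
lines(density(VR),  col = "blue")   # Poisson (equidispersed) scenarios
lines(density(VCL), col = "red")    # Chain Ladder best estimates
legend("topright", c("quasi-Poisson scenarios", "Poisson scenarios", "best estimate"),
       col = c("purple", "blue", "red"), lty = 1, bty = "n")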
> quantile(VRq,.99) 99% 2855.01
So the amount of reserves should be increased by about 15% to ensure that the company can meet its obligations in 99% of the cases,
> quantile(VRq,.99)-2426.985 99% 428.025
Loss reserving and ratemaking, final exam
The final exam of the ACT2040 course took place this morning. The questions are online, as well as elements of correction. If you spot any mistakes, please let me know quickly, before I enter the grades.
Poisson regression on non-integers
In the course on claims reserving techniques, I did mention the use of Poisson regression, even if incremental payments were not integers. For instance, we did consider incremental triangles
> source("http://perso.univ-rennes1.fr/arthur.charpentier/bases.R") > INC=PAID > INC[,2:6]=PAID[,2:6]-PAID[,1:5] > INC [,1] [,2] [,3] [,4] [,5] [,6] [1,] 3209 1163 39 17 7 21 [2,] 3367 1292 37 24 10 NA [3,] 3871 1474 53 22 NA NA [4,] 4239 1678 103 NA NA NA [5,] 4929 1865 NA NA NA NA [6,] 5217 NA NA NA NA NA
On those payments, it is natural to use a Poisson regression, to predict future payments
> Y=as.vector(INC)
> D=rep(1:6,each=6)
> A=rep(2001:2006,6)
> base=data.frame(Y,D,A)
> reg=glm(Y~as.factor(D)+as.factor(A),data=base,family=poisson(link="log"))
> Yp=predict(reg,type="response",newdata=base)
> matrix(Yp,6,6)
       [,1]   [,2] [,3] [,4] [,5] [,6]
[1,] 3155.6 1202.1 49.8 19.1  8.2 21.0
[2,] 3365.6 1282.0 53.1 20.4  8.7 22.3
[3,] 3863.7 1471.8 60.9 23.4 10.0 25.7
[4,] 4310.0 1641.8 68.0 26.1 11.2 28.6
[5,] 4919.8 1874.1 77.6 29.8 12.8 32.7
[6,] 5217.0 1987.3 82.3 31.6 13.5 34.7
and the total amount of reserves would be
> sum(Yp[is.na(Y)==TRUE]) [1] 2426.985
Here, payments were in ’000 euros. What if they were in ’000’000 euros ?
> a=1000
> INC/a
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
[1,] 3.209 1.163 0.039 0.017 0.007 0.021
[2,] 3.367 1.292 0.037 0.024 0.010    NA
[3,] 3.871 1.474 0.053 0.022    NA    NA
[4,] 4.239 1.678 0.103    NA    NA    NA
[5,] 4.929 1.865    NA    NA    NA    NA
[6,] 5.217    NA    NA    NA    NA    NA
We can still run a regression here
> reg=glm((Y/a)~as.factor(D)+as.factor(A),data=base,family=poisson(link="log"))
> Yp=predict(reg,type="response",newdata=base)
> sum(Yp[is.na(Y)==TRUE])*a
[1] 2426.985
and the prediction is exactly the same. Actually, it is possible to change the currency, or to multiply by any constant: the Poisson regression will always return the same prediction, as long as we use a log link function,
> homogeneity=function(a=1){
+ reg=glm((Y/a)~as.factor(D)+as.factor(A), data=base,family=poisson(link="log"))
+ Yp=predict(reg,type="response",newdata=base)
+ return(sum(Yp[is.na(Y)==TRUE])*a)
+ }
> Vectorize(homogeneity)(10^(seq(-3,5)))
[1] 2426.985 2426.985 2426.985 2426.985 2426.985 2426.985 2426.985 2426.985 2426.985
The trick here comes from the fact that we like the Poisson interpretation, but a GLM simply means that we want to solve a first-order condition. It is possible to solve that first-order condition explicitly, and it is obtained without requiring the values to be integers. To keep the code simple, the intercept should be related to the last value of the matrix, not the first one.
> base$D=relevel(as.factor(base$D),"6") > base$A=relevel(as.factor(base$A),"2006") > reg=glm(Y~as.factor(D)+as.factor(A), data=base,family=poisson(link="log")) > summary(reg) Call: glm(formula = Y ~ as.factor(D) + as.factor(A), family = poisson(link = "log"), data = base) Deviance Residuals: Min 1Q Median 3Q Max -2.3426 -0.4996 0.0000 0.2770 3.9355 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 3.54723 0.21921 16.182 < 2e-16 *** as.factor(D)1 5.01244 0.21877 22.912 < 2e-16 *** as.factor(D)2 4.04731 0.21896 18.484 < 2e-16 *** as.factor(D)3 0.86391 0.22827 3.785 0.000154 *** as.factor(D)4 -0.09254 0.25229 -0.367 0.713754 as.factor(D)5 -0.93717 0.32643 -2.871 0.004092 ** as.factor(A)2001 -0.50271 0.02079 -24.179 < 2e-16 *** as.factor(A)2002 -0.43831 0.02045 -21.433 < 2e-16 *** as.factor(A)2003 -0.30029 0.01978 -15.184 < 2e-16 *** as.factor(A)2004 -0.19096 0.01930 -9.895 < 2e-16 *** as.factor(A)2005 -0.05864 0.01879 -3.121 0.001799 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 46695.269 on 20 degrees of freedom Residual deviance: 30.214 on 10 degrees of freedom (15 observations deleted due to missingness) AIC: 209.52
The first idea is to solve it numerically with a gradient-based (Newton–Raphson) descent, as follows (the starting point is the vector of coefficients from a linear regression on the log of the observations),
> YNA <- Y
> XNA=matrix(0,length(Y),1+5+5)
> XNA[,1]=rep(1,length(Y))
> for(k in 1:5) XNA[(k-1)*6+1:6,k+1]=k
> u=(1:(length(Y))%%6); u[u==0]=6
> for(k in 1:5) XNA[u==k,k+6]=k
> YnoNA=YNA[is.na(YNA)==FALSE]
> XnoNA=XNA[is.na(YNA)==FALSE,]
> beta=lm(log(YnoNA)~0+XnoNA)$coefficients
> for(s in 1:50){
+ Ypred=exp(XnoNA%*%beta)
+ gradient=t(XnoNA)%*%(YnoNA-Ypred)
+ omega=matrix(0,nrow(XnoNA),nrow(XnoNA));diag(omega)=exp(XnoNA%*%beta)
+ hessienne=-t(XnoNA)%*%omega%*%XnoNA
+ beta=beta-solve(hessienne)%*%gradient}
> beta
             [,1]
 [1,]  3.54723486
 [2,]  5.01244294
 [3,]  2.02365553
 [4,]  0.28796945
 [5,] -0.02313601
 [6,] -0.18743467
 [7,] -0.50271242
 [8,] -0.21915742
 [9,] -0.10009587
[10,] -0.04774056
[11,] -0.01172840
We are not too far away from the values given by R. Actually, it is just fine if we focus on the predictions
> matrix(exp(XNA%*%beta),6,6)
       [,1]   [,2] [,3] [,4] [,5] [,6]
[1,] 3155.6 1202.1 49.8 19.1  8.2 21.0
[2,] 3365.6 1282.0 53.1 20.4  8.7 22.3
[3,] 3863.7 1471.8 60.9 23.4 10.0 25.7
[4,] 4310.0 1641.8 68.0 26.1 11.2 28.6
[5,] 4919.8 1874.1 77.6 29.8 12.8 32.7
[6,] 5217.0 1987.3 82.3 31.6 13.5 34.7
which are exactly the ones obtained above. And here, we clearly see that there is no assumption such as “the response variable should be an integer”. We can also recall that the first-order condition is the same as the one of a weighted least squares model. The catch is that the weights are a function of the prediction, but using an iterative algorithm (iteratively reweighted least squares), we should converge,
> beta=lm(log(YnoNA)~0+XnoNA)$coefficients
> for(i in 1:50){
+ Ypred=exp(XnoNA%*%beta)
+ z=XnoNA%*%beta+(YnoNA-Ypred)/Ypred
+ REG=lm(z~0+XnoNA,weights=Ypred)
+ beta=REG$coefficients
+ }
> beta
     XnoNA1      XnoNA2      XnoNA3      XnoNA4      XnoNA5      XnoNA6
 3.54723486  5.01244294  2.02365553  0.28796945 -0.02313601 -0.18743467
     XnoNA7      XnoNA8      XnoNA9     XnoNA10     XnoNA11
-0.50271242 -0.21915742 -0.10009587 -0.04774056 -0.01172840
which are the same values as the ones we got previously. Here again, the prediction is the same as the one from this so-called Poisson regression,
> matrix(exp(XNA%*%beta),6,6)
       [,1]   [,2] [,3] [,4] [,5] [,6]
[1,] 3155.6 1202.1 49.8 19.1  8.2 20.9
[2,] 3365.6 1282.0 53.1 20.4  8.7 22.3
[3,] 3863.7 1471.8 60.9 23.4 10.0 25.7
[4,] 4310.0 1641.8 68.0 26.1 11.2 28.6
[5,] 4919.8 1874.1 77.6 29.8 12.8 32.7
[6,] 5217.0 1987.3 82.3 31.6 13.5 34.7
Again, it works just fine because GLMs are mainly conditions on the first two moments, and the numerical computations are based on the first-order condition, which has fewer constraints than the interpretation in terms of a Poisson model.
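As a side note, a small sketch (on simulated, non-integer responses, not the triangle above): using family = quasipoisson keeps the same first-order condition, hence the same point estimates, while avoiding the "non-integer x" warnings that family = poisson triggers.

set.seed(1)
x <- rnorm(100)
y <- exp(1 + 0.5 * x) * rgamma(100, shape = 2, rate = 2)    # positive, non-integer responses
reg_p  <- glm(y ~ x, family = poisson(link = "log"))        # warnings, but same fit
reg_qp <- glm(y ~ x, family = quasipoisson(link = "log"))   # no warnings
cbind(coef(reg_p), coef(reg_qp))                            # identical point estimates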
R tutorials
My course on non-life insurance (ACT2040) will start in a few weeks. I will use R to illustrate predictive modeling. A nice introduction for those who do not know R can be found online.
Residuals from a logistic regression
I always claim that graphs are important in econometrics and statistics! Of course, it is usually not that simple. Let me come back to a recent experience. I got an email from Sami yesterday, sending me a graph of residuals and asking what can be done with a graph of residuals obtained from a logistic regression. To get a better understanding, let us consider the following dataset (those are simulated data, but let us assume – as in practice – that we do not know the true model; this is why I decided to embed the code in an R source file),
> source("http://freakonometrics.free.fr/probit.R") > reg=glm(Y~X1+X2,family=binomial)
If we use R’s diagnostic plot, the first one is the scatterplot of the residuals, against predicted values (the score actually)
> plot(reg,which=1)
which is simply
> plot(predict(reg),residuals(reg)) > abline(h=0,lty=2,col="grey")
Why do we have those two lines of points? Because we predict a probability for a variable taking values 0 or 1. If the true value is 0, then we always over-predict, and the residuals have to be negative (the blue points); if the true value is 1, then we under-predict, and the residuals have to be positive (the red points). And of course, there is a monotone relationship… We can see more clearly what is going on when we use colors,
> plot(predict(reg),residuals(reg),col=c("blue","red")[1+Y]) > abline(h=0,lty=2,col="grey")
Points are exactly on a smooth curve, as a function of the predicted value,
Now, we cannot tell much from this graph. If we want to understand more, we have to run a local regression, to see what is going on,
> lines(lowess(predict(reg),residuals(reg)),col="black",lwd=2)
This is exactly what we had with the first diagnostic plot. But with this local regression, we do not get a confidence interval. Can we claim, on the graph above, that the solid dark line is very close to the dotted one?
> library(splines)
> rl=lm(residuals(reg)~bs(predict(reg),8))
> #rl=loess(residuals(reg)~predict(reg))
> y=predict(rl,se=TRUE)
> segments(predict(reg),y$fit+2*y$se.fit,predict(reg),y$fit-2*y$se.fit,col="green")
Yes, we can. But even if we suspect that something can be done, what would this graph suggest?
Actually, that graph is probably not the only way to look at the residuals. Why not plot them against the two explanatory variables? For instance, if we plot the residuals against the second one, we get
> plot(X2,residuals(reg),col=c("blue","red")[1+Y])
> lines(lowess(X2,residuals(reg)),col="black",lwd=2)
> lines(lowess(X2[Y==0],residuals(reg)[Y==0]),col="blue")
> lines(lowess(X2[Y==1],residuals(reg)[Y==1]),col="red")
> abline(h=0,lty=2,col="grey")
The graph is similar to the one we had earlier, and again, there is not much to say,
If we now look at the relationship with the first one, it starts to be more interesting,
> plot(X1,residuals(reg),col=c("blue","red")[1+Y])
> lines(lowess(X1,residuals(reg)),col="black",lwd=2)
> lines(lowess(X1[Y==0],residuals(reg)[Y==0]),col="blue")
> lines(lowess(X1[Y==1],residuals(reg)[Y==1]),col="red")
> abline(h=0,lty=2,col="grey")
since we can clearly identify a quadratic effect. This graph suggests that we should run a regression on the square of the first variable. And it can be seen as a significant effect,
Now, if we run a regression including this quadratic effect, what do we have,
> reg=glm(Y~X1+I(X1^2)+X2,family=binomial)
> plot(predict(reg),residuals(reg),col=c("blue","red")[1+Y])
> lines(lowess(predict(reg)[Y==0],residuals(reg)[Y==0]),col="blue")
> lines(lowess(predict(reg)[Y==1],residuals(reg)[Y==1]),col="red")
> lines(lowess(predict(reg),residuals(reg)),col="black",lwd=2)
> abline(h=0,lty=2,col="grey")
Actually, it looks like we are back where we were initially… So what is my point? My point is that
- graphs (yes, plural) can be used to see what might go wrong, and to get more intuition about possible non-linear transformations,
- graphs are not everything, and they will never be perfect! Here, in theory, the solid line should be a flat, horizontal line. But we also want a model as simple as possible. So, at some stage, we should probably give up and rely on statistical tests and confidence intervals (see the short sketch below). Yes, an almost flat line can be interpreted as flat.
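For instance, a hedged sketch of such a formal check (on the same simulated data as above): compare the models with and without the squared term, using a likelihood ratio test.

reg0 <- glm(Y ~ X1 + X2, family = binomial)              # model without the quadratic effect
reg1 <- glm(Y ~ X1 + I(X1^2) + X2, family = binomial)    # model with the quadratic effect
anova(reg0, reg1, test = "Chisq")                        # a small p-value supports keeping X1^2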
Non-life (P&C) insurance
The first session of the P&C insurance course (ACT2040) will take place on Wednesday. The course outline is online, as well as the slides of the first session. We will see an introduction to ratemaking, with heterogeneity, and a few reminders of statistics.
More information will follow, in particular for the tutorial sessions.
Introduction to logistic regression and trees
For the second ACT2040 class, we will finish the introduction (and the reminders of inferential statistics) and then start the first big section, on logistic regression and classification trees, using a dataset taken from Jed Frees' book, http://instruction.bus.wisc.edu/jfrees/…
> baseavocat=read.table("http://freakonometrics.free.fr/AutoBI.csv",
+ header=TRUE,sep=",")
> tail(baseavocat)
     CASENUM ATTORNEY CLMSEX MARITAL CLMINSUR SEATBELT CLMAGE  LOSS
1335   34204        2      2       2        2        1     26 0.161
1336   34210        2      1       2        2        1     NA 0.576
1337   34220        1      2       1        2        1     46 3.705
1338   34223        2      2       1        2        1     39 0.099
1339   34245        1      2       2        1        1     18 3.277
1340   34253        2      2       2        2        1     30 0.688
We have a binary variable indicating whether an insured – following a car accident – was represented by an attorney (1 if yes, 2 if not). We know the sex of the insured (1 for men, 2 for women) and the marital status (1 if married, 2 if single, 3 for a widow(er), and 4 for a divorced insured). We also know whether the insured was wearing a seatbelt when the accident occurred (1 if yes, 2 if not and 3 if the information is unknown). Finally, there is a variable indicating whether the driver of the vehicle was insured (1 if yes, 2 if not and 3 if the information is unknown). We will recode the data a little to make them easier to read; a possible recoding is sketched below.
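A hedged sketch of that recoding step (the labels are assumptions based on the description above, not the exact code used in class):

baseavocat$ATTORNEY <- factor(baseavocat$ATTORNEY, levels = 1:2, labels = c("yes", "no"))
baseavocat$CLMSEX   <- factor(baseavocat$CLMSEX,   levels = 1:2, labels = c("M", "F"))
baseavocat$MARITAL  <- factor(baseavocat$MARITAL,  levels = 1:4,
                              labels = c("married", "single", "widowed", "divorced"))
baseavocat$SEATBELT <- factor(baseavocat$SEATBELT, levels = 1:3, labels = c("yes", "no", "unknown"))
baseavocat$CLMINSUR <- factor(baseavocat$CLMINSUR, levels = 1:3, labels = c("yes", "no", "unknown"))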
The slides are online on the blog.
Additional theoretical material on trees can be found online at http://genome.jouy.inra.fr/…, http://ensmp.fr/…, or http://ujf-grenoble.fr/… (for information, we will only cover the CART method). I can also refer to the book (and blog) of Stéphane Tuffery, or (in English) to Richard Berk's book, a summary of which is online at http://crim.upenn.edu/….
Non-observable vs. observable heterogeneity factor
This morning, in the ACT2040 class (on non-life insurance), we discussed the difference between observable and non-observable heterogeneity in ratemaking (from an economic perspective). To illustrate that point (we will spend more time, later on, discussing observable and non-observable risk factors), we looked at the following simple example. Let $X$ denote the height of a person. Consider the following dataset,
> Davis=read.table( + "http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-2E/datasets/Davis.txt")
There is a small typo in the dataset, so let us make manual changes here
> Davis[12,c(2,3)]=Davis[12,c(3,2)]
Here, the variable of interest is the height of a given person,
> X=Davis$height
If we look at the histogram, we have
> hist(X,col="light green", border="white",proba=TRUE,xlab="",main="")
Can we assume that we have a Gaussian distribution ?
Maybe not… Here, if we fit a Gaussian distribution, plot it, and add a kernel based estimator, we get
> library(MASS)
> (param <- fitdistr(X,"normal")$estimate)
> f1 <- function(x) dnorm(x,param[1],param[2])
> x=seq(100,210,by=.2)
> lines(x,f1(x),lty=2,col="red")
> lines(density(X))
If you look at that black line, you might think of a mixture, i.e. something like
$$f(x)=p\cdot\varphi(x;\mu_1,\sigma_1)+(1-p)\cdot\varphi(x;\mu_2,\sigma_2)$$
(using standard mixture notations, with $\varphi$ the Gaussian density). Mixtures are obtained when we have a non-observable heterogeneity factor: with probability $p$, we draw from a $\mathcal{N}(\mu_1,\sigma_1^2)$ distribution (call it type [1]), and with probability $1-p$, from a $\mathcal{N}(\mu_2,\sigma_2^2)$ distribution (call it type [2]). So far, nothing new. And we can fit such a mixture distribution, using e.g.
> library(mixtools)
> mix <- normalmixEM(X)
number of iterations= 335
> (param12 <- c(mix$lambda[1],mix$mu,mix$sigma))
[1] 0.4002202 178.4997298 165.2703616 6.3561363 5.9460023
If we plot that mixture of two Gaussian distributions, we get
> f2 <- function(x){ param12[1]*dnorm(x,param12[2],param12[4]) +
+ (1-param12[1])*dnorm(x,param12[3],param12[5]) }
> lines(x,f2(x),lwd=2, col="red")
> lines(density(X))
Not bad. Actually, we can try to maximize the likelihood with our own codes,
> logdf <- function(x,parameter){
+ p <- parameter[1]
+ m1 <- parameter[2]
+ s1 <- parameter[4]
+ m2 <- parameter[3]
+ s2 <- parameter[5]
+ return(log(p*dnorm(x,m1,s1)+(1-p)*dnorm(x,m2,s2)))
+ }
> logL <- function(parameter) -sum(logdf(X,parameter))
> Amat <- matrix(c(1,-1,0,0,0,0,
+ 0,0,0,0,1,0,0,0,0,0,0,0,0,1), 4, 5)
> bvec <- c(0,-1,0,0)
> constrOptim(c(.5,160,180,10,10), logL, NULL, ui = Amat, ci = bvec)$par
[1]   0.5996263 165.2690084 178.4991624   5.9447675   6.3564746
Here, we include some constraints, to ensure that the probability belongs to the unit interval, and that the standard deviation parameters remain positive. Note that the output is close to the previous one.
Let us try something a little bit more complex now. What if we assume that the underlying distributions have the same variance, namely
$$f(x)=p\cdot\varphi(x;\mu_1,\sigma)+(1-p)\cdot\varphi(x;\mu_2,\sigma)$$
In that case, we have to use the previous code, and make small changes,
> logdf <- function(x,parameter){
+ p <- parameter[1]
+ m1 <- parameter[2]
+ s1 <- parameter[4]
+ m2 <- parameter[3]
+ s2 <- parameter[4]
+ return(log(p*dnorm(x,m1,s1)+(1-p)*dnorm(x,m2,s2)))
+ }
> logL <- function(parameter) -sum(logdf(X,parameter))
> Amat <- matrix(c(1,-1,0,0,0,0,0,0,0,0,0,1), 3, 4)
> bvec <- c(0,-1,0)
> (param12c= constrOptim(c(.5,160,180,10), logL, NULL, ui = Amat, ci = bvec)$par)
[1]   0.6319105 165.6142824 179.0623954   6.1072614
This is what we can do if we cannot observe the heterogeneity factor. But wait… we actually have some information in the dataset. For instance, we have the sex of the person. Now, if we look at histograms of height per sex, and kernel based density estimator of the height, per sex, we have
So, it looks like the distributions of heights for males and for females are different. Maybe we can use that variable, which is actually observed, to explain the heterogeneity in our sample. Formally, the idea is to consider a mixture with an observable heterogeneity factor, the sex,
$$f(x)=p_M\cdot\varphi(x;\mu_M,\sigma_M)+(1-p_M)\cdot\varphi(x;\mu_F,\sigma_F)$$
We now have an interpretation of what we called type [1] and type [2] previously: males and females. And here, estimating the parameters is quite simple,
> sex=Davis$sex
> (pM <- mean(sex=="M"))
[1] 0.44
> (paramF <- fitdistr(X[sex=="F"],"normal")$estimate)
      mean         sd
164.714286   5.633808
> (paramM <- fitdistr(X[sex=="M"],"normal")$estimate)
      mean         sd
178.011364   6.404001
And if we plot the density, we have
> f4 <- function(x) pM*dnorm(x,paramM[1],paramM[2])+(1-pM)*dnorm(x,paramF[1],paramF[2])
> lines(x,f4(x),lwd=3,col="blue")
What if, once again, we assume identical variances? Namely, the model becomes
$$f(x)=p_M\cdot\varphi(x;\mu_M,\sigma)+(1-p_M)\cdot\varphi(x;\mu_F,\sigma)$$
Then a natural idea to derive an estimator of the variance, based on the previous computations, is to use the pooled estimator
$$\widehat{\sigma}^2=\frac{\displaystyle\sum_{i:\,\text{sex}_i=M}(X_i-\widehat{\mu}_M)^2+\sum_{i:\,\text{sex}_i=F}(X_i-\widehat{\mu}_F)^2}{n-2}$$
The code is here
> height=Davis$height
> s=sqrt((sum((height[sex=="M"]-paramM[1])^2)+sum((height[sex=="F"]-paramF[1])^2))/(nrow(Davis)-2))
> s
[1] 6.015068
and again, it is possible to plot the associated density,
> f5 <- function(x) pM*dnorm(x,paramM[1],s)+(1-pM)*dnorm(x,paramF[1],s) > lines(x,f5(x),lwd=3,col="blue")
Now, if we think a little about what we have just done, it is simply a linear regression on a factor, the sex of the person,
$$X_i=\beta_0+\beta_1\cdot\boldsymbol{1}(\text{sex}_i=M)+\varepsilon_i$$
where $\varepsilon_i\sim\mathcal{N}(0,\sigma^2)$. And indeed, if we run the code to estimate this linear model,
> summary(lm(height~sex,data=Davis)) Call: lm(formula = height ~ sex, data = Davis) Residuals: Min 1Q Median 3Q Max -16.7143 -3.7143 -0.0114 4.2857 18.9886 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 164.7143 0.5684 289.80 <2e-16 *** sexM 13.2971 0.8569 15.52 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 6.015 on 198 degrees of freedom Multiple R-squared: 0.5488, Adjusted R-squared: 0.5465 F-statistic: 240.8 on 1 and 198 DF, p-value: < 2.2e-16
we get the same estimators for the means and the variance as the ones obtained previously. So, as mentioned this morning in class, if you have a non-observable heterogeneity factor, we can use a mixture model to fit a distribution, but if you can get a proxy of that factor, that is observable, then you can run a regression. But most of the time, that observable variable is just a proxy of a non-observable one…
Poisson regression
On Wednesday, we will finish classification trees and start modeling claim frequency. The slides are online.
As announced during the first class, I suggest starting to read the Practitioner's Guide to Generalized Linear Models. That document corresponds to the minimum expected in this course.
Logistic regression and categorical covariates
A short post to get back – for my nonlife insurance course – on the interpretation of the output of a regression when there is a categorical covariate. Consider the following dataset
> db = read.table("http://freakonometrics.free.fr/db.txt",header=TRUE,sep=";")
> attach(db)
> tail(db)
     Y       X1       X2 X3
995  1 4.801836 20.82947  A
996  1 9.867854 24.39920  C
997  1 5.390730 21.25119  D
998  1 6.556160 20.79811  D
999  1 4.710276 21.15373  A
1000 1 6.631786 19.38083  A
Let us run a logistic regression on that dataset
> reg = glm(Y~X1+X2+X3,family=binomial,data=db) > summary(reg) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -4.45885 1.04646 -4.261 2.04e-05 *** X1 0.51664 0.11178 4.622 3.80e-06 *** X2 0.21008 0.07247 2.899 0.003745 ** X3B 1.74496 0.49952 3.493 0.000477 *** X3C -0.03470 0.35691 -0.097 0.922543 X3D 0.08004 0.34916 0.229 0.818672 X3E 2.21966 0.56475 3.930 8.48e-05 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 552.64 on 999 degrees of freedom Residual deviance: 397.69 on 993 degrees of freedom AIC: 411.69 Number of Fisher Scoring iterations: 7
Here, the reference is modality $A$. This means that for someone with characteristics $(x_1,x_2,A)$, we predict the probability
$$\mathbb{P}(Y=1)=H\big(\widehat{\beta}_0+\widehat{\beta}_1x_1+\widehat{\beta}_2x_2\big)$$
where $H$ denotes the cumulative distribution function of the logistic distribution,
$$H(x)=\frac{e^{x}}{1+e^{x}}$$
For someone with characteristics $(x_1,x_2,B)$, we predict the probability
$$\mathbb{P}(Y=1)=H\big(\widehat{\beta}_0+\widehat{\beta}_1x_1+\widehat{\beta}_2x_2+\widehat{\beta}_B\big)$$
For someone with characteristics $(x_1,x_2,C)$, we predict the probability
$$\mathbb{P}(Y=1)=H\big(\widehat{\beta}_0+\widehat{\beta}_1x_1+\widehat{\beta}_2x_2+\widehat{\beta}_C\big)$$
and so on. Here, if we accept $H_0:\beta_C=0$ (against $H_1:\beta_C\neq 0$), it means that modality $C$ cannot be considered as different from $A$. A quick sketch of those predicted probabilities is given below.
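A hedged sketch (the profile below is a hypothetical one, here the average of X1 and X2): predicted probabilities for the same (X1, X2) profile under each modality of X3, using the model fitted above.

nd <- data.frame(X1 = mean(db$X1), X2 = mean(db$X2), X3 = factor(LETTERS[1:5]))
predict(reg, newdata = nd, type = "response")   # one predicted probability per modality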
A natural idea can be to change the reference modality, and to look at the $p$-values. If we consider the following loop, we get
> M = matrix(NA,5,5)
> rownames(M)=colnames(M)=LETTERS[1:5]
> for(k in 1:5){
+ db$X3 = relevel(X3,LETTERS[k])
+ reg = glm(Y~X1+X2+X3,family=binomial,data=db)
+ M[levels(db$X3)[-1],k] = summary(reg)$coefficients[4:7,4]
+ }
> M
             A            B            C            D            E
A           NA 0.0004771853 9.225428e-01 0.8186723647 8.482647e-05
B 4.771853e-04           NA 4.841204e-04 0.0009474491 4.743636e-01
C 9.225428e-01 0.0004841204           NA 0.7506242347 9.194193e-05
D 8.186724e-01 0.0009474491 7.506242e-01           NA 1.730589e-04
E 8.482647e-05 0.4743636442 9.194193e-05 0.0001730589           NA
and if we simply want to know whether the $p$-value exceeds 5% or not, we get the following,
> M.TF = M>.05
> M.TF
      A     B     C     D     E
A    NA FALSE  TRUE  TRUE FALSE
B FALSE    NA FALSE FALSE  TRUE
C  TRUE FALSE    NA  TRUE FALSE
D  TRUE FALSE  TRUE    NA FALSE
E FALSE  TRUE FALSE FALSE    NA
The first column is obtained when $A$ is the reference; then we see which parameters can be considered as null. The interpretation is the following:
- $C$ and $D$ are not significantly different from $A$,
- $E$ is not significantly different from $B$,
- $A$ and $D$ are not significantly different from $C$,
- $A$ and $C$ are not significantly different from $D$,
- $B$ is not significantly different from $E$.
Note that this only gives some kind of intuition. So let us run a more formal test. Consider the following regression (we remove the intercept to get a model that is easier to interpret),
> library(car) > db$X3=relevel(X3,"A") > reg=glm(Y~0+X1+X2+X3,family=binomial,data=db) > summary(reg) Coefficients: Estimate Std. Error z value Pr(>|z|) X1 0.51664 0.11178 4.622 3.80e-06 *** X2 0.21008 0.07247 2.899 0.00374 ** X3A -4.45885 1.04646 -4.261 2.04e-05 *** X3E -2.23919 1.06666 -2.099 0.03580 * X3D -4.37881 1.04887 -4.175 2.98e-05 *** X3C -4.49355 1.06266 -4.229 2.35e-05 *** X3B -2.71389 1.07274 -2.530 0.01141 *
--- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 1386.29 on 1000 degrees of freedom Residual deviance: 397.69 on 993 degrees of freedom AIC: 411.69 Number of Fisher Scoring iterations: 7
It is possible to use a Fisher-type test to check whether some coefficients are equal or not (more generally, whether some linear constraints are satisfied),
> linearHypothesis(reg,c("X3A=X3C","X3A=X3D","X3B=X3E")) Linear hypothesis test Hypothesis: X3A - X3C = 0 X3A - X3D = 0 - X3E + X3B = 0 Model 1: restricted model Model 2: Y ~ 0 + X1 + X2 + X3 Res.Df Df Chisq Pr(>Chisq) 1 996 2 993 3 0.6191 0.892
Here, we clearly accept the assumption that the first three factors are equal, as well as the last two. What is the next step? Well, if we believe that there are really only two categories, $\{A,C,D\}$ and $\{B,E\}$, let us create that factor,
> X3bis=rep(NA,length(X3)) > X3bis[X3%in%c("A","C","D")]="ACD" > X3bis[X3%in%c("B","E")]="BE" > db$X3bis=as.factor(X3bis) > reg=glm(Y~X1+X2+X3bis,family=binomial,data=db) > summary(reg) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -4.39439 1.02791 -4.275 1.91e-05 *** X1 0.51378 0.11138 4.613 3.97e-06 *** X2 0.20807 0.07234 2.876 0.00402 ** X3bisBE 1.94905 0.36852 5.289 1.23e-07 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 552.64 on 999 degrees of freedom Residual deviance: 398.31 on 996 degrees of freedom AIC: 406.31 Number of Fisher Scoring iterations: 7
Here, all the categories are significant. So we do have a proper model.
Nice tutorials to discover R
A series of tutorials, in R, by Anthony Damico. As claimed on http://twotorials.com/, “how to do stuff in r. two minutes or less, for those of us who prefer to learn by watching and listening“. So far,
- 000 what is r? the lingua statistica, s’il vous plaît
- 001 how to download and install r
- 002 simple shortcuts for the windows r console
- 003 how to do simple arithmetic in r
- 004 how to create a variable with r
- 005 how to use the c (combine) function
- 006 what does object oriented statistical programming really mean?
- 007 how to work with data tables in r
- 008 how to install and load a package in r
- 009 how to use a function and read the help files in r
- 010 how to read csv files into r
- 011 how to create a logical test and then build an if statement in r
- 012 how to build a for loop in r
- 013 how to read spss, stata, and sas files into r
- 014 how to read an excel file (dot xls and dot xlsx) into a data frame with r
- 015 how to work with character strings in r
- 016 how to add comments, save a script file, and make your work reproducible in r
- 017 how to use coercion and convert numbers to characters to logicals in r
- 018 help resources for r: the r-help mailing list, ucla academic technology services, and r-bloggers
- 019 how to access different records within a data frame by using logical tests in r
- 020 how to use the ifelse function in r
- 021 how to put things together in r: c (combine), paste, the paste collapse parameter, and the outer function
- 022 how to generate basic descriptive statistics – means, medians, sums, quantiles – on data tables in r
- 023 how to prevent r from breaking when it hits an error with the try function
- 024 how to run counts, tables, crosstabs, and flat tables on a data frame in r
- 025 how to change and recode variables and values in a data table using the fix and transform functions in r
- 026 how to sort and order stuff in r
- 027 how to run sql commands on data frames in r with the sqldf package
- 028 how to merge stuff together using the merge, cbind, rbind, and rbind.fill functions in r
- 029 how to run analyses across multiple categories of a data table with the tapply and aggregate functions in r
- 030 what is a function anyway and how does it input and output stuff in r
- 031 what are the different logical tests that you can conduct in r
- 032 how to handle NA (missing values) in r: understand and detect NA, remove NA, replace NA with 0
- 033 what does the %in% operator do and how does it differ from the double equals (==) sign?
- 034 how to enter your own data by hand and create a data frame in r
- 035 how to name or rename rows and columns in a data frame in r
- 036 how to output or export data and write tables to csv files in r
- 037 how to round a number in r: round, floor, ceiling, even excel_round
- 038 how to graph simple plots and make scatterplots in r
- 039 how to draw a line, add points, overlay a plot once it’s been created with (low-level) plotting functions in r
- 040 how to remove rows and columns from a data table in r
- 041 did someone send you a script file or computer code for the r programming language? here’s how to run it!
- 042 how do functions within functions within functions work in r?
- 043 when and how to use the %in% operator when accessing rows in a data frame that contains missing values in r
- 044 how and why r repeats an element over and over again without you even knowing it
- 045 how to bin or categorize or group data into new variables in a data table in r with the ifelse function
- 046 how to delete or remove variables, objects, data tables, and other stuff from memory in r
- 047 how many different ways can you slice and dice (subset) a data table in r?
- 048 how to download a data table or zipped file from the web directly into r, save it locally, then load or open it
- 049 how to make custom data tables and aggregate data tables with sql commands in r
- 050 how to count stuff in r and then use those counts to loop through stuff
- 051 how to read and write excel files with the xlsx package in r
- 052 how to output or export data tables from r into different formats: dbf, stata, spss, and sas files
- 053 how to import and export data to and from your windows clipboard with r
- 054 how to merge data frames one-to-one, many-to-one, one-to-many, many-to-many and make sure you did it right in r
- 055 how to (make or generate or draw or display) a (boxplot or histogram or bar graph or pie chart) in r
- 056 how to bin or categorize or group data into new variables in a data table or vector in r with the cut function
- 057 how to use the sample function and sample from a vector or take a random sample of rows from a data frame in r
- 058 how to use the subset function in r for more intuitive subsets of data frames
- 059 how to search within a string for matches, substitute stuff in strings, even trim whitespace in a string in r
- 060 how to output or export data frames in r to a sas transport file (dot xpt) with the write.xport function
- 061 how to generate a random number in r
- 062 how to generate random numbers according to different statistical distributions in r
- 063 how to generate the same random number every time you run a program in r with the set.seed function
- 064 fun one: how many ancestors did you have in each previous generation?
- 065 how to title a plot or graph in r, add axes and other text or words with the axis and text functions
- 066 how to (dynamically) create different objects according to a pattern using the assign function
- 067 how to (dynamically) retrieve different objects according to a pattern using the get function
- 068 how and when to load and save your workspace in r
- 069 how to access the elements of a list object and create your own list variable in r
- 070 how to nest for loops in r and then debug and troubleshoot them if you hit an error you cannot identify
- 071 how to create and nest while loops in r
- 072 how to use the mod (modulus) function in r to calculate a remainder
- 073 how to run your first regression in r (assuming you know some fancy statistics)
- 074 how to calculate a variance and standard deviation and then normalize data in r
- 075 how to clear variables, clear a workspace, clear your screen, even clear unused computer memory in r
- 076 how to load and then run or execute a script in r with the source function
- 077 how to time a function or other processes in r like a stopwatch with the Sys.time() command
- 078 how to initiate empty objects like data tables and matrices in r
- 079 how the if function (if then) differs from ifelse, single-line if, nested if, and brackets in r
- 080 how to beautify your r script with formatR and the tidy.source function
- 081 how to write a for, while, or repeat loop in r with the break and next commands
- 082 how to test for correlation between two variables in r
- 083 how to plot residuals from a regression in r (assuming you know some fancy statistics)
- 084 how to flip a coin or generate a bernoulli or binomial distribution without quaking in fear in r
- 085 how to export or save a plot in r
- 086 fun one: what is each country’s most current available percent of gross national income spent on education?
- 087 how to split a character string in r with the strsplit function
- 088 how to change options and global settings like the number of decimal places shown in r with the options function
- 089 how to run a block of commands at start-up to do stuff like setting your CRAN mirror permanently with r
- 090 fun one: how to calculate your ten, fifteen, or twenty thousandth day on earth with r
- 091 fun one: how to make perfect pour-over coffee and time your french press steeping with r
(I guess there is no need now to make my own….)
ROC curves and classification
To get back to a question asked after the last course (still on non-life insurance), I will spend some time discussing the construction and interpretation of ROC curves. Consider the dataset we used last week,
> db = read.table("http://freakonometrics.free.fr/db.txt",header=TRUE,sep=";")
> attach(db)
The first step is to get a model. For instance, a logistic regression, where some levels of the categorical variable have been merged together,
> X3bis=rep(NA,length(X3))
> X3bis[X3%in%c("A","C","D")]="ACD"
> X3bis[X3%in%c("B","E")]="BE"
> db$X3bis=as.factor(X3bis)
> reg=glm(Y~X1+X2+X3bis,family=binomial,data=db)
From this model, we can predict a probability, not a 0/1 variable,
> S=predict(reg,type="response")
Let S denote this score (actually, we can use either the score or the predicted probability, it will not change the construction of the ROC curve). What if we really want to predict a 0/1 variable, as we usually do in decision theory? The idea is to consider a threshold s, so that
- if S > s, then the prediction will be 1, or "positive" (using a standard terminology)
- if S ≤ s, then the prediction will be 0, or "negative"
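In R, with the notations above, this decision rule is a one-liner (just a sketch; the names s and Yhat are illustrative and not reused later in the post),
> s = .5                        # say, a 50% threshold
> Yhat = ifelse(S > s, 1, 0)    # 1 = "positive", 0 = "negative"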
Then we derive a contingency table, or a confusion matrix
                        observed value
                     "positive"   "negative"
predicted "positive"     TP           FP
value     "negative"     FN           TN
where TP are the so-called true positives, TN the true negatives, FP the false positives (type I errors) and FN the false negatives (type II errors). We can get that contingency table for a given threshold
> roc.curve=function(s,print=FALSE){
+   Ps=(S>s)*1
+   FP=sum((Ps==1)*(Y==0))/sum(Y==0)
+   TP=sum((Ps==1)*(Y==1))/sum(Y==1)
+   if(print==TRUE){
+     print(table(Observed=Y,Predicted=Ps))
+   }
+   vect=c(FP,TP)
+   names(vect)=c("FPR","TPR")
+   return(vect)
+ }
> threshold = 0.5
> roc.curve(threshold,print=TRUE)
        Predicted
Observed   0   1
       0   5 231
       1  19 745
      FPR       TPR 
0.9788136 0.9751309 
Here, we also compute the false positive rate and the true positive rate (a quick numerical check is sketched after the list),
- TPR = TP / P = TP / (TP + FN), also called sensitivity: the probability of being predicted positive, given that the observation is positive (true positive rate)
- FPR = FP / N = FP / (FP + TN): the probability of being predicted positive, given that the observation is negative (false positive rate)
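As a sanity check (this chunk is not in the original code, and tab is a made-up name), those two rates can be recovered directly from the entries of the confusion matrix obtained above with threshold = 0.5,
> tab = table(Observed=Y, Predicted=(S>threshold)*1)
> TP = tab["1","1"]; FN = tab["1","0"]    # observed positives
> FP = tab["0","1"]; TN = tab["0","0"]    # observed negatives
> TP/(TP+FN)                              # TPR (sensitivity), 0.9751 here
> FP/(FP+TN)                              # FPR, 0.9788 here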
The ROC curve is then obtained using several values for the threshold. For convenience, define
> ROC.curve=Vectorize(roc.curve)
First, we can plot S against Y (a standard predicted-versus-observed graph), and visualize true and false positives and negatives, using simple colors,
> I=(((S>threshold)&(Y==0))|((S<=threshold)&(Y==1)))
> plot(S,Y,col=c("red","blue")[I+1],pch=19,cex=.7,xlab="",ylab="")
> abline(v=threshold,col="gray")
And for the ROC curve, simply use
> M.ROC=ROC.curve(seq(0,1,by=.01))
> plot(M.ROC[1,],M.ROC[2,],col="grey",lwd=2,type="l")
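It can help to add the diagonal on that graph, since it is the ROC curve of a classifier that predicts at random (a small optional addition, not in the original code),
> abline(a=0,b=1,lty=2)    # random-guess benchmark: TPR = FPR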
This is the ROC curve. Now, to see why it can be interesting, we need a second model. Consider for instance a classification tree
> library(tree)
> ctr <- tree(Y~X1+X2+X3bis,data=db)
> plot(ctr)
> text(ctr)
To plot its ROC curve, we just need to use the predictions obtained with this second model,
> S=predict(ctr)
All the code described above can be used. Again, we can plot S against Y (observe that we have 5 possible values for S, which makes sense since we do have 5 leaves in our tree). Then, we can plot the ROC curve,
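That step is not shown explicitly in the post; presumably the matrix M.ROC.tree used below is built exactly like M.ROC, since roc.curve() reads whatever S currently is (now the tree scores),
> M.ROC.tree=ROC.curve(seq(0,1,by=.01))
> plot(M.ROC.tree[1,],M.ROC.tree[2,],col="grey",lwd=2,type="l")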
An interesting idea can be to plot the two ROC curves on the same graph, in order to compare the two models
> plot(M.ROC[1,],M.ROC[2,],type="l")
> lines(M.ROC.tree[1,],M.ROC.tree[2,],type="l",col="grey",lwd=2)
The most difficult part is to get a proper interpretation. The tree does not predict well in the lower part of the curve, which corresponds to people with a very high predicted probability. But if our interest lies more in those with a predicted probability below 90%, then we have to admit that the tree does a good job, since its ROC curve is always above the one of the logistic regression.
Regression on variables, or on categories?
I admit it, the title sounds weird. The problem I want to address this evening is related to the use of the stepwise procedure on a regression model, and to the use of categorical variables (and possible misinterpretations). Consider the following dataset
> db = read.table("http://freakonometrics.free.fr/db2.txt",header=TRUE,sep=";")
First, let us change the reference in our categorical variable (just to get an easier interpretation later on)
> db$X3=relevel(as.factor(db$X3),ref="E")
If we run a logistic regression on the three variables (two continuous, one categorical), we get
> reg=glm(Y~X1+X2+X3,family=binomial,data=db)
> summary(reg)

Call:
glm(formula = Y ~ X1 + X2 + X3, family = binomial, data = db)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.0758   0.1226   0.2805   0.4798   2.0345  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.39528    0.86649  -6.227 4.77e-10 ***
X1           0.51618    0.09163   5.633 1.77e-08 ***
X2           0.24665    0.05911   4.173 3.01e-05 ***
X3A         -0.09142    0.32970  -0.277   0.7816    
X3B         -0.10558    0.32526  -0.325   0.7455    
X3C          0.63829    0.37838   1.687   0.0916 .  
X3D         -0.02776    0.33070  -0.084   0.9331    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 806.29  on 999  degrees of freedom
Residual deviance: 582.29  on 993  degrees of freedom
AIC: 596.29

Number of Fisher Scoring iterations: 6
Now, if we use a stepwise procedure to select variables in the model, we get
> step(reg)
Start:  AIC=596.29
Y ~ X1 + X2 + X3

       Df Deviance    AIC
- X3    4   587.81 593.81
<none>      582.29 596.29
- X2    1   600.56 612.56
- X1    1   617.25 629.25

Step:  AIC=593.81
Y ~ X1 + X2

       Df Deviance    AIC
<none>      587.81 593.81
- X2    1   606.90 610.90
- X1    1   622.44 626.44
So clearly, we should remove the categorical variable if our starting point was the regression on the three variables.
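Note that step() also returns the reduced fitted model, so it can be stored and inspected directly (reg.step is an illustrative name; trace=0 simply silences the output shown above),
> reg.step = step(reg, trace=0)    # keeps Y ~ X1 + X2, as above
> summary(reg.step)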
Now, what if we consider the same model, but written slightly differently, with explicit indicators for the five categories,
> X3complete = model.matrix(~0+X3,data=db)
> db2 = data.frame(db,X3complete)
> head(db2)
  Y       X1       X2 X3 X3A X3B X3C X3D X3E
1 1 3.297569 16.25411  B   0   1   0   0   0
2 1 6.418031 18.45130  D   0   0   0   1   0
3 1 5.279068 16.61806  B   0   1   0   0   0
4 1 5.539834 19.72158  C   0   0   1   0   0
5 1 4.123464 18.38634  C   0   0   1   0   0
6 1 7.778443 19.58338  C   0   0   1   0   0
From a technical point of view, it is exactly the same as before, if we look at the regression,
> reg = glm(Y~X1+X2+X3A+X3B+X3C+X3D+X3E,family=binomial,data=db2)
> summary(reg)

Call:
glm(formula = Y ~ X1 + X2 + X3A + X3B + X3C + X3D + X3E, family = binomial, 
    data = db2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.0758   0.1226   0.2805   0.4798   2.0345  

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.39528    0.86649  -6.227 4.77e-10 ***
X1           0.51618    0.09163   5.633 1.77e-08 ***
X2           0.24665    0.05911   4.173 3.01e-05 ***
X3A         -0.09142    0.32970  -0.277   0.7816    
X3B         -0.10558    0.32526  -0.325   0.7455    
X3C          0.63829    0.37838   1.687   0.0916 .  
X3D         -0.02776    0.33070  -0.084   0.9331    
X3E               NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 806.29  on 999  degrees of freedom
Residual deviance: 582.29  on 993  degrees of freedom
AIC: 596.29

Number of Fisher Scoring iterations: 6
Both regressions are equivalent. Now, what about a stepwise selection on this new model?
> step(reg)
Start:  AIC=596.29
Y ~ X1 + X2 + X3A + X3B + X3C + X3D + X3E

Step:  AIC=596.29
Y ~ X1 + X2 + X3A + X3B + X3C + X3D

       Df Deviance    AIC
- X3D   1   582.30 594.30
- X3A   1   582.37 594.37
- X3B   1   582.40 594.40
<none>      582.29 596.29
- X3C   1   585.21 597.21
- X2    1   600.56 612.56
- X1    1   617.25 629.25

Step:  AIC=594.3
Y ~ X1 + X2 + X3A + X3B + X3C

       Df Deviance    AIC
- X3A   1   582.38 592.38
- X3B   1   582.41 592.41
<none>      582.30 594.30
- X3C   1   586.30 596.30
- X2    1   600.58 610.58
- X1    1   617.27 627.27

Step:  AIC=592.38
Y ~ X1 + X2 + X3B + X3C

       Df Deviance    AIC
- X3B   1   582.44 590.44
<none>      582.38 592.38
- X3C   1   587.20 595.20
- X2    1   600.59 608.59
- X1    1   617.64 625.64

Step:  AIC=590.44
Y ~ X1 + X2 + X3C

       Df Deviance    AIC
<none>      582.44 590.44
- X3C   1   587.81 593.81
- X2    1   600.73 606.73
- X1    1   617.66 623.66
What do we get now? This time, the stepwise procedure recommends that we keep one single category (namely C). So my point is simple: when running a stepwise procedure with factors, either we keep the factor as it is, or we drop it. If the design needs to be changed, by pooling some categories together, and we forget to do it, the procedure will suggest removing the whole variable, because keeping four categories that mean the same thing costs too much under the Akaike criterion. And this is exactly what happens here,
> library(car)
> reg = glm(formula = Y ~ X1 + X2 + X3, family = binomial, data = db)
> linearHypothesis(reg,c("X3A=X3B","X3A=X3D","X3A=0"))
Linear hypothesis test

Hypothesis:
X3A - X3B = 0
X3A - X3D = 0
X3A = 0

Model 1: restricted model
Model 2: Y ~ X1 + X2 + X3

  Res.Df Df  Chisq Pr(>Chisq)
1    996                     
2    993  3 0.1446      0.986
So here, we should pool together categories A, B, D and E (the latter being the reference here). As mentioned in a previous post, categories that should be pooled together have to be pooled as early as possible; otherwise, the stepwise procedure might lead to misinterpretations.
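For illustration (this chunk is not in the original post, and X3pool and reg2 are made-up names), pooling those four categories by hand before any model selection could look like
> X3pool = ifelse(db$X3 %in% c("A","B","D","E"), "ABDE", "C")
> db$X3pool = relevel(as.factor(X3pool), ref="ABDE")
> reg2 = glm(Y ~ X1 + X2 + X3pool, family=binomial, data=db)    # two-level factor, ready for step()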