The first part of the course on loss reserving (computing the provision for claims outstanding) will take place in 10 days. The slides are online here, and cover the construction of run-off (payment) triangles. The chain-ladder method (and the formalization proposed by Thomas Mack), as well as its extensions, will be presented. The second part will cover methods based on Poisson regression.
Triangles and loss reserving
Midterm exam, logistic and Poisson regression
The midterm exam for the ACT2040 course will take place Wednesday morning, from 9:00 to 12:00. No documents are allowed, except calculators (standard model, see the course outline), and phones are strictly forbidden. There will be 34 questions on the first part of the course (up to the end of count models, sections 1 to 5 of the slides). 15 questions will deal with the dataset described in a previous post, on the number of extramarital affairs. The task will be to interpret the outputs available online here. I am leaving 36 hours to go through these outputs. A copy will be handed out during the exam (printed 2 pages per sheet, as in the online version: if anyone needs a larger print, please let me know before the exam).
Midterm exam, elements of correction
The midterm exam questions are in a pdf here and, as announced by email, the correction is in the pdf online. Since nobody seems to disagree with the proposed answers, the grades will be posted very soon. Regarding questions 18 and 19, a few additional explanations (which I had not typed up in the pdf). We had seen that the maximum likelihood estimator in a Poisson regression is asymptotically Gaussian,
$$\widehat{\boldsymbol{\beta}}_{\mathcal{P}}\approx\mathcal{N}\big(\boldsymbol{\beta},V_{\mathcal{P}}\big)$$
(asymptotically), with
$$V_{\mathcal{P}}=\big(\boldsymbol{X}^\top\boldsymbol{W}\boldsymbol{X}\big)^{-1},\qquad \boldsymbol{W}=\text{diag}\big(\widehat{\mu}_1,\ldots,\widehat{\mu}_n\big).$$
When we move to a negative binomial type regression, write very generally $\text{Var}(Y_i\vert\boldsymbol{X}_i)=\phi\cdot V(\mu_i)$ (we had seen in class that several specifications are possible for this conditional variance). In that case,
$$\widehat{\boldsymbol{\beta}}_{\mathcal{NB}}\approx\mathcal{N}\big(\boldsymbol{\beta},V_{\mathcal{NB}}\big)$$
with
$$V_{\mathcal{NB}}=\big(\boldsymbol{X}^\top\boldsymbol{W}\boldsymbol{X}\big)^{-1}\big(\boldsymbol{X}^\top\widetilde{\boldsymbol{W}}\boldsymbol{X}\big)\big(\boldsymbol{X}^\top\boldsymbol{W}\boldsymbol{X}\big)^{-1},\qquad \widetilde{\boldsymbol{W}}=\text{diag}\big(\text{Var}(Y_i\vert\boldsymbol{X}_i)\big).$$
In short, everything depends fundamentally on the specification of the conditional variance. In R, it is the type 1 negative binomial (quasi-Poisson) variance that is considered, i.e.
$$\text{Var}(Y_i\vert\boldsymbol{X}_i)=\phi\cdot\mu_i.$$
We then still have a relation of the form
$$\widehat{\boldsymbol{\beta}}_{\mathcal{NB}}\approx\mathcal{N}\big(\boldsymbol{\beta},V_{\mathcal{NB}}\big)$$
with (simplifying a little)
$$V_{\mathcal{NB}}=\phi\cdot\big(\boldsymbol{X}^\top\boldsymbol{W}\boldsymbol{X}\big)^{-1},$$
so that
$$\text{se}\big(\widehat{\beta}_{\mathcal{NB},k}\big)=\sqrt{\phi}\cdot\text{se}\big(\widehat{\beta}_{\mathcal{P},k}\big).$$
But, as announced in class, points were also given to those who simply stated that the variance of the estimators is larger when there is overdispersion.
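To illustrate that relation numerically, here is a minimal sketch (simulated, hypothetical data, not the exam dataset): the quasi-Poisson standard errors are exactly the Poisson ones multiplied by the square root of the estimated dispersion.

set.seed(1)
x <- rnorm(200)
y <- rnbinom(200, mu = exp(1 + 0.5 * x), size = 2)   # overdispersed counts
reg_p  <- glm(y ~ x, family = poisson)
reg_qp <- glm(y ~ x, family = quasipoisson)
phi <- summary(reg_qp)$dispersion                     # estimated dispersion parameter
summary(reg_p)$coefficients[, "Std. Error"] * sqrt(phi)
summary(reg_qp)$coefficients[, "Std. Error"]          # same values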
Solvency and loss reserving
On Wednesday we will discuss in class the solvency of P&C insurance companies. More specifically, we will talk about the provision for claims outstanding (PCO), i.e. “the estimated total cost of ultimate settlement of all claims incurred before the date of record, whether reported or not, less any amounts already paid out in respect thereof.” For a global overview of approaches to these provisions, I refer to Le contrôle de la solvabilité des compagnies d’assurance, available online on the OECD website. The SOA published a report in 2009, Comparison of Incurred But Not Reported (IBNR) Methods, which I encourage you to read.
On Wednesday we will start working with triangles. Among the triangles we will manipulate,
> source("http://perso.univ-rennes1.fr/arthur.charpentier/bases.R")
which contains several objects, including
> PAID
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,] 3209 4372 4411 4428 4435 4456
[2,] 3367 4659 4696 4720 4730   NA
[3,] 3871 5345 5338 5420   NA   NA
[4,] 4239 5917 6020   NA   NA   NA
[5,] 4929 6794   NA   NA   NA   NA
[6,] 5217   NA   NA   NA   NA   NA
as well as the triangle discussed on http://rworkingparty.wikidot.com/
> OthLiabData = read.csv("http://www.casact.org/research/reserve_data/othliab_pos.csv",header=TRUE, sep=",")
> library(ChainLadder)
> library(plyr)
> OL = SumData=ddply(OthLiabData,.(AccidentYear,DevelopmentYear,DevelopmentLag),summarise,IncurLoss=sum(IncurLoss_h1-BulkLoss_h1),
+ CumPaidLoss=sum(CumPaidLoss_h1), EarnedPremDIR=sum(EarnedPremDIR_h1))
> LossTri = as.triangle(OL, origin="AccidentYear",
+ dev = "DevelopmentLag", value="IncurLoss")
> Year = as.triangle(OL, origin="AccidentYear",
+ dev = "DevelopmentLag", value="DevelopmentYear")
> TRIANGLE=LossTri
> TRIANGLE[Year>1997]=NA
> TRIANGLE
      dev
origin      1      2      3      4      5      6      7      8      9     10
  1988 128747 195938 241180 283447 297402 308815 314126 317027 319135 319559
  1989 135147 208767 270979 304488 330066 339871 344742 347800 353245     NA
  1990 152400 238665 297495 348826 359413 364865 372436 372163     NA     NA
  1991 151812 266245 357430 400405 423172 442329 460713     NA     NA     NA
  1992 163737 269170 347469 381251 424810 451221     NA     NA     NA     NA
  1993 187756 358573 431410 476674 504667     NA     NA     NA     NA     NA
  1994 210590 351270 486947 581599     NA     NA     NA     NA     NA     NA
  1995 213141 351363 444272     NA     NA     NA     NA     NA     NA     NA
  1996 237162 378987     NA     NA     NA     NA     NA     NA     NA     NA
  1997 220509     NA     NA     NA     NA     NA     NA     NA     NA     NA
Chain Ladder, with R
A quick post to put online parts of the code typed in class last Wednesday. We started by converting the sheet of the Excel workbook into a text file, to make it easier to read,
> setwd("C:\\Users\\savsalledecours\\Desktop") > triangle=read.table("exACT2040.csv",header=TRUE,sep=";") > triangle ANNEE X0 X1 X2 X3 X4 X5 1 2000 3209 4372 4411 4428 4435 4456 2 2001 3367 4659 4696 4720 4730 NA 3 2002 3871 5345 5398 5420 NA NA 4 2003 4239 5917 6020 NA NA NA 5 2004 4929 6794 NA NA NA NA 6 2005 5217 NA NA NA NA NA
The idea – when importing a triangle – is to end up with a dataset in the form above, with missing values in the lower part of the triangle (we will see why this is convenient when we run a regression). We then computed the development factors and, at the same time, completed the triangle,
> T=triangle[,2:7]
> rownames(T)=triangle$ANNEE
> T2=T
> n=ncol(T)
> L=rep(NA,n-1)
> for(j in 1:(n-1)){
+ L[j]=sum(T[1:(n-j),j+1])/sum(T[1:(n-j),j])
+ T2[(n-j+1):n,j+1]=L[j]*T2[(n-j+1):n,j]
+ }
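For reference, the loop above implements the usual chain-ladder estimators: with $C_{i,j}$ denoting the cumulative payments of accident year $i$ after $j$ development periods,
$$\widehat{\lambda}_j=\frac{\displaystyle\sum_{i=1}^{n-j}C_{i,j+1}}{\displaystyle\sum_{i=1}^{n-j}C_{i,j}},\qquad \widehat{C}_{i,j+1}=\widehat{\lambda}_j\cdot\widehat{C}_{i,j}\ \text{ for the cells in the lower triangle.}$$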
The development factors are
> L [1] 1.380933 1.011433 1.004343 1.001858 1.004735
and the completed triangle is
> T2
       X0       X1       X2       X3       X4       X5
2000 3209 4372.000 4411.000 4428.000 4435.000 4456.000
2001 3367 4659.000 4696.000 4720.000 4730.000 4752.397
2002 3871 5345.000 5398.000 5420.000 5430.072 5455.784
2003 4239 5917.000 6020.000 6046.147 6057.383 6086.065
2004 4929 6794.000 6871.672 6901.518 6914.344 6947.084
2005 5217 7204.327 7286.691 7318.339 7331.939 7366.656
The amount of reserves is then obtained by taking the difference between the ultimate claim amounts (in the last column) and the latest observed payments (on the last diagonal),
> CU=T2[,n]
> Pat=diag(as.matrix(T2[,n:1]))
> Ri=CU-Pat
> R=sum(Ri)
which gives, numerically,
> R [1] 2426.985
We then saw that we could compute a tail factor, assuming an exponential decay of the development factors, and we added a column with the ultimate amount, per accident year,
> logL=log(L-1)
> t=1:5
> b=data.frame(logL,t)
> reg=lm(logL~t,data=b)
> logLp=predict(reg,newdata=data.frame(t=6:100))
> Lp=exp(logLp)+1
> Linf=prod(Lp)
> T3=T2
> T3$Xinf=T3$X5*Linf
We get
> T3
       X0       X1       X2       X3       X4       X5     Xinf
2000 3209 4372.000 4411.000 4428.000 4435.000 4456.000 4459.149
2001 3367 4659.000 4696.000 4720.000 4730.000 4752.397 4755.755
2002 3871 5345.000 5398.000 5420.000 5430.072 5455.784 5459.639
2003 4239 5917.000 6020.000 6046.147 6057.383 6086.065 6090.366
2004 4929 6794.000 6871.672 6901.518 6914.344 6947.084 6951.993
2005 5217 7204.327 7286.691 7318.339 7331.939 7366.656 7371.862
(I leave it to you to adapt the code above to compute the corresponding amount of reserves.) Finally, we showed how to use a weighted regression to compute the development factors,
> T4=as.matrix(T$X0,n,1)
> for(j in 1:(n-1)){
+ Y=T[,j+1]
+ X=T[,j]
+ base=data.frame(X,Y)
+ reg=lm(Y~0+X,weights=1/X)
+ T4=cbind(T4,
+ predict(reg,
+ newdata=data.frame(X=T4[,j]
+ )))
+ }
which gives the same projection as the Chain Ladder method,
> T4
  [,1]     [,2]     [,3]     [,4]     [,5]     [,6]
1 3209 4431.414 4482.076 4501.543 4509.909 4531.263
2 3367 4649.601 4702.758 4723.184 4731.961 4754.367
3 3871 5345.591 5406.705 5430.188 5440.279 5466.039
4 4239 5853.775 5920.698 5946.414 5957.464 5985.673
5 4929 6806.619 6884.435 6914.337 6927.186 6959.986
6 5217 7204.327 7286.691 7318.339 7331.939 7366.656
More next Wednesday, even though we will probably go quite quickly over Mack's method (and the mean squared error calculations, before moving on to the Poisson regression). To be continued…
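As a preview of Mack's model, here is a hedged sketch (not the code used in class), assuming the ChainLadder package is installed and using the cumulative triangle T built above: the MackChainLadder function reproduces the deterministic chain-ladder ultimates and adds the mean squared errors of prediction.

library(ChainLadder)
M <- MackChainLadder(as.matrix(T), est.sigma = "Mack")  # T = cumulative triangle above
summary(M)   # ultimates, IBNR and Mack's standard errors
plot(M)      # diagnostic plots (development, residuals, ...)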
Reserving with negative increments in triangles
A few months ago, I published a post on negative values in triangles, and how to deal with them when using a Poisson regression (the post was written in French). The idea was to use a translation technique:
- Fit the model not on the $Y_i$'s but on the translated values $Y_i+k$, for some constant $k\geq 0$,
- Use that model to make predictions, and then translate those predictions back, $\widehat{Y}_i=\widehat{(Y+k)}_i-k$.
This is what was done to get the following graph, where a Poisson regression was fitted. Black points are the $Y_i$'s while blue points are the translated $Y_i+k$'s, for some $k$. We fit a model on the translated values to get the blue prediction, and then translate it to get the red prediction (on the scale of the $Y_i$'s).
In this example, there were no negative values, but the exercise is still useful to understand the impact of the technique. The prediction, here, is the red line. And clearly, the value of $k$ has an impact on the prediction (since we do not consider, here, a linear model: with a linear model, translating has no impact at all, except on the intercept).
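A minimal sketch of that translation idea, on simulated (hypothetical) data: fit a Poisson model on y + k, predict, then subtract k from the predictions, and compare with the direct fit.

set.seed(123)
x <- seq(1, 10, by = .5)
y <- rpois(length(x), exp(1 + .2 * x))
k <- 10
reg  <- glm(y ~ x, family = poisson)            # benchmark: fit on y directly
regk <- glm((y + k) ~ x, family = poisson)      # fit on the translated counts
p  <- predict(reg,  type = "response")
pk <- predict(regk, type = "response") - k      # translate the predictions back
plot(x, y, pch = 19)
lines(x, p,  col = "red")                        # direct fit
lines(x, pk, col = "blue", lty = 2)              # translated fit (differs, since the model is not linear)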
The alternative mentioned in the previous post was to use this technique for several values of $k$, and then extrapolate:
- For a given $k$, fit a model not on the $Y_i$'s but on the $Y_i+k$'s, use that model to make predictions, and then translate those predictions, $\widehat{(Y+k)}_i-k$.
- Do it for several values of $k$.
- Use those results to extrapolate down to $k=0$ (which is the case we are actually interested in).
In the context of loss reserving, the idea is extremely simple. Consider a triangle with incremental payments
> source("http://perso.univ-rennes1.fr/arthur.charpentier/bases.R") > Y=T=PAID > n=ncol(T) > Y[,2:n]=T[,2:n]-T[,1:(n-1)] > Y [,1] [,2] [,3] [,4] [,5] [,6] [1,] 3209 1163 39 17 7 21 [2,] 3367 1292 37 24 10 NA [3,] 3871 1474 53 22 NA NA [4,] 4239 1678 103 NA NA NA [5,] 4929 1865 NA NA NA NA [6,] 5217 NA NA NA NA NA
Here, we do not have negative values, but we can still check whether translation techniques can be used. The benchmark is the Poisson regression, since we can run it:
> y=as.vector(as.matrix(Y))
> base=data.frame(y,ai=rep(2000:2005,n),bj=rep(0:(n-1),each=n))
> reg=glm(y~as.factor(ai)+as.factor(bj),data=base,family=poisson)
Here, the amount of reserves is the sum of the predicted values in the lower part of the triangle,
> py=predict(reg,newdata=base,type="response")
> sum(py[is.na(base$y)])
[1] 2426.985
which is exactly Chain Ladder’s estimate.
Now, let us use a translation technique to compute the amount of reserves. The code will be
> decal=function(k){
+ reg=glm(y+k~as.factor(ai)+as.factor(bj),data=base,family=poisson)
+ py=predict(reg,newdata=base,type="response")
+ return(sum(py[is.na(base$y)]-k))
+ }
For instance, if we translate by +5, we get
> decal(5) [1] 2454.713
while a translation of +10 would return
> decal(10) [1] 2482.29
Clearly, translations do have an impact on the estimation. Here, just to check, if we do not translate, we do have Chain Ladder’s estimate,
> decal(0) [1] 2426.985
The idea mentioned in the previous post was to try several translations, and then extrapolate to get the value at 0. Here, translations from +10 to +20 give the following estimates
> K=10:20
> (V=Vectorize(decal)(K))
 [1] 2482.290 2487.788 2493.279 2498.765 2504.245 2509.719 2515.187 2520.649
 [9] 2526.106 2531.557 2537.001
We can plot those values, and run a regression
> plot(K,V,xlim=c(0,20),ylim=c(2425,2540)) > abline(h=decal(0),col="red",lty=2)
the dotted horizontal line is Chain Ladder. Now, let us extrapolate
> b=data.frame(K=K,D=V)
> rk=lm(D~K,data=b)
> predict(rk,newdata=data.frame(K=0))
       1
2427.623
One has to admit that it is not that bad. But yesterday evening, Karim asked me why I used a linear regression for the extrapolation. And to be honest, I do not know. The only answer might be that the points are almost on a straight line. So the first time I saw it, I was excited, and I ran a linear regression.
Now, let us see if we can do better. Because here, we only used translations of +10 to +20 (which might be rather small). What if we use much larger values? (Because we might have large negative incremental values.) With the following code, we try, each time, 11 consecutive values of the translation, the smallest one going from 1 to 50,
> hausse=1:50; res=rep(NA,50)
> for(k in hausse){
+ VK=k:(10+k)
+ b=data.frame(K=VK,D=Vectorize(decal)(VK))
+ rk=lm(D~K,data=b)
+ res[k]=predict(rk,newdata=data.frame(K=0))
+ }
> plot(hausse,res,type="l",col="red",ylim=c(2422,2440))
> abline(rk,col="blue")
Here, we compute the reserves when the extrapolation is based on 11 translations, from $k$ to $k+10$, for different values of $k$. The case where $k$ is ten is the one mentioned above,
> res[hausse==10] [1] 2427.623
Actually, it might also be possible to consider not 11 translations, but 26, from $k$ to $k+25$. Here, we get
> hausse=1:50; res=rep(NA,50)
> for(k in hausse){
+ VK=k:(25+k)
+ b=data.frame(K=VK,D=Vectorize(decal)(VK))
+ rk=lm(D~K,data=b)
+ res[k]=predict(rk,newdata=data.frame(K=0))
+ }
> lines(hausse,res,type="l",col="blue",lty=2)
We now have the dotted line
Here, it is getting worse. So let us stick with 11 translations. Perhaps we can try something different, for instance a Poisson regression with a log link (i.e. we consider an exponential extrapolation),
> hausse=1:50; res=rep(NA,50)
> for(k in hausse){
+ VK=k:(10+k)
+ b=data.frame(K=VK,D=Vectorize(decal)(VK))
+ rk=glm(D~K,data=b,family=poisson)
+ res[k]=predict(rk,newdata=data.frame(K=0),type="response")
+ }
> lines(hausse,res,type="l",col="purple")
The purple line will be a Poisson model, with a log link. Perhaps we can try another link function, like a quadratic one
> hausse=1:50; res=rep(NA,50)
> for(k in hausse){
+ VK=k:(10+k)
+ b=data.frame(K=VK,D=Vectorize(decal)(VK))
+ rk=glm(D~K,data=b,family=poisson(link=
+ power(lambda = 2)))
+ res[k]=predict(rk,newdata=data.frame(K=0),type="response")
+ }
> lines(hausse,res,type="l",col="orange")
That would be the orange line,
Here, we need a link function between identity (the linear model, the blue line) and the quadratic one (the orange one), for instance a power function 3/2,
> hausse=1:50; res=rep(NA,50)
> for(k in hausse){
+ VK=k:(10+k)
+ b=data.frame(K=VK,D=Vectorize(decal)(VK))
+ rk=glm(D~K,data=b,family=poisson(link=
+ power(lambda = 1.5)))
+ res[k]=predict(rk,newdata=data.frame(K=0),type="response")
+ }
> lines(hausse,res,type="l",col="green")
Here, it looks like we can use that model for any kind of translation, from +10 till +50, even +100 ! But I do not have any intuition about the use of this power function…
Reinsurance
The last session of the P&C actuarial science course will take place on Wednesday.
Among the complementary readings: Introduction à la réassurance, published by Swiss Re, as well as some more technical documents, such as The Pareto model in property reinsurance, Exposure rating, Designing property reinsurance programmes or Introduction to reinsurance accounting. Several reinsurers (and reinsurance brokers) publish technical studies on their websites, http://swissre.com/, http://munichre.com/, http://aon.com/, http://scor.com/ or http://guycarp.com/. Otherwise, I refer to Peter Antal's lecture notes, quantitative methods in reinsurance.
The slides are online here. Among the orders of magnitude discussed, the costliest catastrophes for insurers:
- Hurricane Katrina (US, Bahamas, Cuba, Aug. 2005), $ 72.3 billion
- Tōhoku earthquake and tsunami (Japan, March 2011), $ 35 billion
- Hurricane Andrew (US, Bahamas, August 1992), $ 25 billion
- September 11 attacks (US) $ 23.1 billion
- Northridge earthquake (US) $ 20.6 billion
- Hurricane Ike (US, Haiti, Dominican Republic, Sept. 2008) $ 20.5 billion
- Hurricane Ivan (US, Barbados, Sept. 2004) $ 14.9 billion
- Hurricane Wilma (US, Mexico, Jamaica, Oct. 2005), $ 14 billion
- Hurricane Rita (US, Cuba, Sept. 2005) $ 11.3 billion
- Hurricane Charley (US, Cuba, Jamaica, Aug. 2004) $ 9.3 billion
By way of comparison, the revenues of the largest reinsurers (premiums written in 2010) were, according to http://www.insurancenetworking.com/…
- Munich Reinsurance Company $ 31.3 billion
- Swiss Reinsurance Company Limited $ 24.7 billion
- Hannover Rueckversicherung AG $ 15.1 billion
- Berkshire Hathaway Inc. $ 14.4 billion
- Lloyd’s $ 13 billion
- SCOR S.E. $ 8.8 billion
- Reinsurance Group of America Inc. $ 7.2 billion
- Allianz S.E. $ 5.7 billion
- PartnerRe Ltd. $ 4.9 billion
- Everest Re Group Ltd. $ 4.2 billion
Overdispersed Poisson and bootstrap
For the last class on claims reserving methods, we stopped at simulation-based methods. Let us pick up where we left off in the last post, where we saw that running a Poisson regression on the incremental payments gives exactly the same amount as the Chain Ladder method,
> Y [,1] [,2] [,3] [,4] [,5] [,6] [1,] 3209 1163 39 17 7 21 [2,] 3367 1292 37 24 10 NA [3,] 3871 1474 53 22 NA NA [4,] 4239 1678 103 NA NA NA [5,] 4929 1865 NA NA NA NA [6,] 5217 NA NA NA NA NA > y=as.vector(as.matrix(Y)) > base=data.frame(y,ai=rep(2000:2005,n),bj=rep(0:(n-1),each=n)) > reg2=glm(y~as.factor(ai)+as.factor(bj),data=base,family=poisson) > summary(reg2) Call: glm(formula = y ~ as.factor(ai) + as.factor(bj), family = poisson, data = base) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 8.05697 0.01551 519.426 < 2e-16 *** as.factor(ai)2001 0.06440 0.02090 3.081 0.00206 ** as.factor(ai)2002 0.20242 0.02025 9.995 < 2e-16 *** as.factor(ai)2003 0.31175 0.01980 15.744 < 2e-16 *** as.factor(ai)2004 0.44407 0.01933 22.971 < 2e-16 *** as.factor(ai)2005 0.50271 0.02079 24.179 < 2e-16 *** as.factor(bj)1 -0.96513 0.01359 -70.994 < 2e-16 *** as.factor(bj)2 -4.14853 0.06613 -62.729 < 2e-16 *** as.factor(bj)3 -5.10499 0.12632 -40.413 < 2e-16 *** as.factor(bj)4 -5.94962 0.24279 -24.505 < 2e-16 *** as.factor(bj)5 -5.01244 0.21877 -22.912 < 2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 46695.269 on 20 degrees of freedom Residual deviance: 30.214 on 10 degrees of freedom (15 observations deleted due to missingness) AIC: 209.52 Number of Fisher Scoring iterations: 4 > base$py2=predict(reg2,newdata=base,type="response") > round(matrix(base$py2,n,n),1) [,1] [,2] [,3] [,4] [,5] [,6] [1,] 3155.7 1202.1 49.8 19.1 8.2 21.0 [2,] 3365.6 1282.1 53.1 20.4 8.8 22.4 [3,] 3863.7 1471.8 61.0 23.4 10.1 25.7 [4,] 4310.1 1641.9 68.0 26.1 11.2 28.7 [5,] 4919.9 1874.1 77.7 29.8 12.8 32.7 [6,] 5217.0 1987.3 82.4 31.6 13.6 34.7 > sum(base$py2[is.na(base$y)]) [1] 2426.985
The most interesting thing to note is perhaps that the Poisson distribution probably has too little variance…
> reg2b=glm(y~as.factor(ai)+as.factor(bj),data=base,family=quasipoisson) > summary(reg2) Call: glm(formula = y ~ as.factor(ai) + as.factor(bj), family = quasipoisson, data = base) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 8.05697 0.02769 290.995 < 2e-16 *** as.factor(ai)2001 0.06440 0.03731 1.726 0.115054 as.factor(ai)2002 0.20242 0.03615 5.599 0.000228 *** as.factor(ai)2003 0.31175 0.03535 8.820 4.96e-06 *** as.factor(ai)2004 0.44407 0.03451 12.869 1.51e-07 *** as.factor(ai)2005 0.50271 0.03711 13.546 9.28e-08 *** as.factor(bj)1 -0.96513 0.02427 -39.772 2.41e-12 *** as.factor(bj)2 -4.14853 0.11805 -35.142 8.26e-12 *** as.factor(bj)3 -5.10499 0.22548 -22.641 6.36e-10 *** as.factor(bj)4 -5.94962 0.43338 -13.728 8.17e-08 *** as.factor(bj)5 -5.01244 0.39050 -12.836 1.55e-07 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for quasipoisson family taken to be 3.18623) Null deviance: 46695.269 on 20 degrees of freedom Residual deviance: 30.214 on 10 degrees of freedom (15 observations deleted due to missingness) AIC: NA Number of Fisher Scoring iterations: 4
But we will come back to that in a moment. Next, we started looking at the errors made on the upper part of the triangle. Classically, by construction, the Pearson residuals are of the form
$$\widehat{\varepsilon}_i=\frac{Y_i-\widehat{Y}_i}{\sqrt{\widehat{\text{Var}}(Y_i)}}$$
We saw in the ratemaking part of the course that the variance in the denominator can be replaced by the prediction, since in a Poisson model the mean and the variance are equal. So we considered
$$\widehat{\varepsilon}_i=\frac{Y_i-\widehat{Y}_i}{\sqrt{\widehat{Y}_i}}$$
> base$erreur=(base$y-base$py2)/sqrt(base$py2)
> round(matrix(base$erreur,n,n),1)
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,]  0.9 -1.1 -1.5 -0.5 -0.4    0
[2,]  0.0  0.3 -2.2  0.8  0.4   NA
[3,]  0.1  0.1 -1.0 -0.3   NA   NA
[4,] -1.1  0.9  4.2   NA   NA   NA
[5,]  0.1 -0.2   NA   NA   NA   NA
[6,]  0.0   NA   NA   NA   NA   NA
The problem is that even if $\widehat{Y}_i$ is – asymptotically – a good estimator of $\text{Var}(Y_i)$, this is not the case at finite distance, since we then have a biased estimator of the variance, so the residuals have little chance of having unit variance. Hence, the variance estimator should be corrected, and we set
$$\widetilde{\varepsilon}_i=\sqrt{\frac{n}{n-k}}\cdot\frac{Y_i-\widehat{Y}_i}{\sqrt{\widehat{Y}_i}}$$
(with $n$ the number of observed cells and $k$ the number of estimated parameters), which are the Pearson residuals as they should be used.
> E=base$erreur[is.na(base$y)==FALSE]*sqrt(21/(21-11))
> E
 [1]  1.374976e+00  3.485024e-02  1.693203e-01 -1.569329e+00  1.887862e-01
 [6] -1.459787e-13 -1.634646e+00  4.018940e-01  8.216186e-02  1.292578e+00
[11] -3.058764e-01 -2.221573e+00 -3.207593e+00 -1.484151e+00  6.140566e+00
[16] -7.100321e-01  1.149049e+00 -4.307387e-01 -6.196386e-01  6.000048e-01
[21] -8.987734e-15
> boxplot(E,horizontal=TRUE)
By resampling from these residuals, we can then generate a pseudo-triangle. For the sake of simplicity, we will generate a full rectangle and then restrict ourselves to its upper part,
> Eb=sample(E,size=36,replace=TRUE)
> Yb=base$py2+Eb*sqrt(base$py2)
> Ybna=Yb
> Ybna[is.na(base$y)]=NA
> Tb=matrix(Ybna,n,n)
> round(matrix(Tb,n,n),1)
       [,1]   [,2] [,3] [,4] [,5] [,6]
[1,] 3115.8 1145.4 58.9 46.0  6.4 26.9
[2,] 3179.5 1323.2 54.5 21.3 12.2   NA
[3,] 4245.4 1448.1 61.0  7.9   NA   NA
[4,] 4312.4 1581.7 68.7   NA   NA   NA
[5,] 4948.1 1923.9   NA   NA   NA   NA
[6,] 4985.3     NA   NA   NA   NA   NA
This time, we have a new triangle! We can then do several things:
- complete the triangle with the Chain Ladder method, i.e. compute the expected amounts to be paid in future years,
- generate payment scenarios for future years, by simulating payments from Poisson distributions (centered on the expected amounts we have just computed),
- generate payment scenarios from distributions with more variance than the Poisson. Ideally, we would like to simulate quasi-Poisson distributions, but these are not proper probability distributions. However, we can recall that, in this case, the Gamma distribution should provide a good approximation.
For this last point, we will use the following code to generate quasi-Poisson draws,
> rqpois = function(n, lambda, phi, roundvalue = TRUE) {
+ b = phi
+ a = lambda/phi
+ r = rgamma(n, shape = a, scale = b)
+ if(roundvalue){r=round(r)}
+ return(r)
+ }
I refer to the various lecture notes for more details on the justification, or to an old post. We then write a small function which, starting from a triangle, either sums the expected future payments, or sums simulated payment scenarios,
> CL=function(Tri){
+ y=as.vector(as.matrix(Tri))
+ base=data.frame(y,ai=rep(2000:2005,n),bj=rep(0:(n-1),each=n))
+ reg=glm(y~as.factor(ai)+as.factor(bj),data=base,family=quasipoisson)
+ py2=predict(reg,newdata=base,type="response")
+ pys=rpois(36,py2)
+ pysq=rqpois(36,py2,phi=3.18623)
+ return(list(
+ cl=sum(py2[is.na(base$y)]),
+ sc=sum(pys[is.na(base$y)]),
+ scq=sum(pysq[is.na(base$y)])))
+ }
It then remains to generate a large number of triangles. Note that it is possible to generate triangles with negative increments. To keep things simple, we will set negative payments to zero; the impact on the quantiles should (a priori) be negligible.
> VCL=VR=VRq=rep(NA,1000)   # storage vectors, needed before the loop
> for(s in 1:1000){
+ Eb=sample(E,size=36,replace=TRUE)*sqrt(21/(21-11))
+ Yb=base$py2+Eb*sqrt(base$py2)
+ Yb=pmax(Yb,0)
+ scY=rpois(36,Yb)
+ Ybna=Yb
+ Ybna[is.na(base$y)]=NA
+ Tb=matrix(Ybna,6,6)
+ C=CL(Tb)
+ VCL[s]=C$cl
+ VR[s]=C$sc
+ VRq[s]=C$scq
+ }
Looking at the distribution of the best estimate, we get
> hist(VCL,proba=TRUE,col="light blue",border="white",ylim=c(0,0.003))
> boxplot(VCL,horizontal=TRUE,at=.0023,boxwex = 0.0006,add=TRUE,col="light green")
> D=density(VCL)
> lines(D)
> I=which(D$x<=quantile(VCL,.05))
> polygon(c(D$x[I],rev(D$x[I])),c(D$y[I],rep(0,length(I))),col="blue",border=NA)
> I=which(D$x>=quantile(VCL,.95))
> polygon(c(D$x[I],rev(D$x[I])),c(D$y[I],rep(0,length(I))),col="blue",border=NA)
But we can also visualize the scenarios based on Poisson distributions (equidispersed) or on quasi-Poisson distributions (overdispersed), as sketched below.
In this last case, we can deduce the 99% quantile of future payments,
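A minimal sketch of that comparison (assuming VCL, VR and VRq were filled by the loop above; the exact plot shown in class may differ):

plot(density(VRq), col = "purple", main = "", xlab = "Total future payments")  # quasi-Poisson scenarios
lines(density(VR),  col = "blue")   # Poisson (equidispersed) scenarios
lines(density(VCL), col = "red")    # Chain Ladder best estimates
legend("topright", c("quasi-Poisson scenarios", "Poisson scenarios", "best estimate"),
       col = c("purple", "blue", "red"), lty = 1, bty = "n")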
> quantile(VRq,.99) 99% 2855.01
So the amount of reserves should be increased by about 15% to ensure that the company can meet its obligations in 99% of the cases,
> quantile(VRq,.99)-2426.985 99% 428.025
Loss reserving and ratemaking, final exam
The final exam of the ACT2040 course took place this morning. The questions are online, as well as elements of correction. If you spot any mistakes, please let me know quickly, before I enter the grades.
Poisson regression on non-integers
In the course on claims reserving techniques, I did mention the use of Poisson regression, even if incremental payments were not integers. For instance, we did consider incremental triangles
> source("http://perso.univ-rennes1.fr/arthur.charpentier/bases.R") > INC=PAID > INC[,2:6]=PAID[,2:6]-PAID[,1:5] > INC [,1] [,2] [,3] [,4] [,5] [,6] [1,] 3209 1163 39 17 7 21 [2,] 3367 1292 37 24 10 NA [3,] 3871 1474 53 22 NA NA [4,] 4239 1678 103 NA NA NA [5,] 4929 1865 NA NA NA NA [6,] 5217 NA NA NA NA NA
On those payments, it is natural to use a Poisson regression, to predict future payments
> Y=as.vector(INC)
> D=rep(1:6,each=6)
> A=rep(2001:2006,6)
> base=data.frame(Y,D,A)
> reg=glm(Y~as.factor(D)+as.factor(A),data=base,family=poisson(link="log"))
> Yp=predict(reg,type="response",newdata=base)
> matrix(Yp,6,6)
       [,1]   [,2] [,3] [,4] [,5] [,6]
[1,] 3155.6 1202.1 49.8 19.1  8.2 21.0
[2,] 3365.6 1282.0 53.1 20.4  8.7 22.3
[3,] 3863.7 1471.8 60.9 23.4 10.0 25.7
[4,] 4310.0 1641.8 68.0 26.1 11.2 28.6
[5,] 4919.8 1874.1 77.6 29.8 12.8 32.7
[6,] 5217.0 1987.3 82.3 31.6 13.5 34.7
and the total amount of reserves would be
> sum(Yp[is.na(Y)==TRUE]) [1] 2426.985
Here, payments were in ’000 euros. What if they were in ’000’000 euros ?
> a=1000
> INC/a
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
[1,] 3.209 1.163 0.039 0.017 0.007 0.021
[2,] 3.367 1.292 0.037 0.024 0.010    NA
[3,] 3.871 1.474 0.053 0.022    NA    NA
[4,] 4.239 1.678 0.103    NA    NA    NA
[5,] 4.929 1.865    NA    NA    NA    NA
[6,] 5.217    NA    NA    NA    NA    NA
We can still run a regression here
> reg=glm((Y/a)~as.factor(D)+as.factor(A),data=base,family=poisson(link="log"))
> Yp=predict(reg,type="response",newdata=base)
> sum(Yp[is.na(Y)==TRUE])*a
[1] 2426.985
and the prediction is exactly the same. Actually, it is possible to change the currency, or to multiply by any constant: the Poisson regression will always return the same prediction, as long as we use a log link function,
> homogeneity=function(a=1){
+ reg=glm((Y/a)~as.factor(D)+as.factor(A), data=base,family=poisson(link="log"))
+ Yp=predict(reg,type="response",newdata=base)
+ return(sum(Yp[is.na(Y)==TRUE])*a)
+ }
> Vectorize(homogeneity)(10^(seq(-3,5)))
[1] 2426.985 2426.985 2426.985 2426.985 2426.985 2426.985 2426.985 2426.985 2426.985
The trick here comes from the fact that we like the Poisson interpretation, but a GLM simply means that we want to solve a first-order condition. It is possible to solve that first-order condition explicitly, and it is obtained without requiring the values to be integers. To keep the code simple, the intercept should be related to the last value of the matrix, not the first one.
> base$D=relevel(as.factor(base$D),"6") > base$A=relevel(as.factor(base$A),"2006") > reg=glm(Y~as.factor(D)+as.factor(A), data=base,family=poisson(link="log")) > summary(reg) Call: glm(formula = Y ~ as.factor(D) + as.factor(A), family = poisson(link = "log"), data = base) Deviance Residuals: Min 1Q Median 3Q Max -2.3426 -0.4996 0.0000 0.2770 3.9355 Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) 3.54723 0.21921 16.182 < 2e-16 *** as.factor(D)1 5.01244 0.21877 22.912 < 2e-16 *** as.factor(D)2 4.04731 0.21896 18.484 < 2e-16 *** as.factor(D)3 0.86391 0.22827 3.785 0.000154 *** as.factor(D)4 -0.09254 0.25229 -0.367 0.713754 as.factor(D)5 -0.93717 0.32643 -2.871 0.004092 ** as.factor(A)2001 -0.50271 0.02079 -24.179 < 2e-16 *** as.factor(A)2002 -0.43831 0.02045 -21.433 < 2e-16 *** as.factor(A)2003 -0.30029 0.01978 -15.184 < 2e-16 *** as.factor(A)2004 -0.19096 0.01930 -9.895 < 2e-16 *** as.factor(A)2005 -0.05864 0.01879 -3.121 0.001799 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 (Dispersion parameter for poisson family taken to be 1) Null deviance: 46695.269 on 20 degrees of freedom Residual deviance: 30.214 on 10 degrees of freedom (15 observations deleted due to missingness) AIC: 209.52
The first idea is to solve it numerically with a gradient-based (Newton–Raphson) descent, as follows (the starting point is the vector of coefficients from a linear regression on the log of the observations),
> YNA <- Y
> XNA=matrix(0,length(Y),1+5+5)
> XNA[,1]=rep(1,length(Y))
> for(k in 1:5) XNA[(k-1)*6+1:6,k+1]=k
> u=(1:(length(Y))%%6); u[u==0]=6
> for(k in 1:5) XNA[u==k,k+6]=k
> YnoNA=YNA[is.na(YNA)==FALSE]
> XnoNA=XNA[is.na(YNA)==FALSE,]
> beta=lm(log(YnoNA)~0+XnoNA)$coefficients
> for(s in 1:50){
+ Ypred=exp(XnoNA%*%beta)
+ gradient=t(XnoNA)%*%(YnoNA-Ypred)
+ omega=matrix(0,nrow(XnoNA),nrow(XnoNA));diag(omega)=exp(XnoNA%*%beta)
+ hessienne=-t(XnoNA)%*%omega%*%XnoNA
+ beta=beta-solve(hessienne)%*%gradient}
> beta
             [,1]
 [1,]  3.54723486
 [2,]  5.01244294
 [3,]  2.02365553
 [4,]  0.28796945
 [5,] -0.02313601
 [6,] -0.18743467
 [7,] -0.50271242
 [8,] -0.21915742
 [9,] -0.10009587
[10,] -0.04774056
[11,] -0.01172840
We are not too far away from the values given by R. Actually, it is just fine if we focus on the predictions
> matrix(exp(XNA%*%beta),6,6)
       [,1]   [,2] [,3] [,4] [,5] [,6]
[1,] 3155.6 1202.1 49.8 19.1  8.2 21.0
[2,] 3365.6 1282.0 53.1 20.4  8.7 22.3
[3,] 3863.7 1471.8 60.9 23.4 10.0 25.7
[4,] 4310.0 1641.8 68.0 26.1 11.2 28.6
[5,] 4919.8 1874.1 77.6 29.8 12.8 32.7
[6,] 5217.0 1987.3 82.3 31.6 13.5 34.7
which are exactly the ones obtained above. And here, we clearly see that there is no assumption such as “the response variable should be an integer”. We can also recall that the first-order condition is the same as the one of a weighted least squares model. The catch is that the weights are a function of the prediction, but using an iterative algorithm (iteratively reweighted least squares), we should converge,
> beta=lm(log(YnoNA)~0+XnoNA)$coefficients
> for(i in 1:50){
+ Ypred=exp(XnoNA%*%beta)
+ z=XnoNA%*%beta+(YnoNA-Ypred)/Ypred
+ REG=lm(z~0+XnoNA,weights=Ypred)
+ beta=REG$coefficients
+ }
> beta
     XnoNA1      XnoNA2      XnoNA3      XnoNA4      XnoNA5      XnoNA6
 3.54723486  5.01244294  2.02365553  0.28796945 -0.02313601 -0.18743467
     XnoNA7      XnoNA8      XnoNA9     XnoNA10     XnoNA11
-0.50271242 -0.21915742 -0.10009587 -0.04774056 -0.01172840
which are the same values as the ones we got previously. Here again, the prediction is the same as the one from this so-called Poisson regression,
> matrix(exp(XNA%*%beta),6,6)
       [,1]   [,2] [,3] [,4] [,5] [,6]
[1,] 3155.6 1202.1 49.8 19.1  8.2 20.9
[2,] 3365.6 1282.0 53.1 20.4  8.7 22.3
[3,] 3863.7 1471.8 60.9 23.4 10.0 25.7
[4,] 4310.0 1641.8 68.0 26.1 11.2 28.6
[5,] 4919.8 1874.1 77.6 29.8 12.8 32.7
[6,] 5217.0 1987.3 82.3 31.6 13.5 34.7
Again, it works just fine because GLMs are mainly conditions on the first two moments, and the numerical computations are based on the first-order condition, which has fewer constraints than the interpretation in terms of a Poisson model.
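As a side note, a small sketch (on simulated, non-integer responses, not the triangle above): using family = quasipoisson keeps the same first-order condition, hence the same point estimates, while avoiding the "non-integer x" warnings that family = poisson triggers.

set.seed(1)
x <- rnorm(100)
y <- exp(1 + 0.5 * x) * rgamma(100, shape = 2, rate = 2)    # positive, non-integer responses
reg_p  <- glm(y ~ x, family = poisson(link = "log"))        # warnings, but same fit
reg_qp <- glm(y ~ x, family = quasipoisson(link = "log"))   # no warnings
cbind(coef(reg_p), coef(reg_qp))                            # identical point estimates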
R tutorials
My course on non-life insurance (ACT2040) will start in a few weeks. I will use R to illustrate predictive modeling. A nice introduction for those who do not know R can be found online.
Residuals from a logistic regression
I always claim that graphs are important in econometrics and statistics! Of course, it is usually not that simple. Let me come back to a recent experience. I got an email from Sami yesterday, sending me a graph of residuals and asking what can be done with a graph of residuals obtained from a logistic regression. To get a better understanding, let us consider the following dataset (those are simulated data, but let us assume – as in practice – that we do not know the true model; this is why I decided to embed the code in an R source file),
> source("http://freakonometrics.free.fr/probit.R") > reg=glm(Y~X1+X2,family=binomial)
If we use R’s diagnostic plot, the first one is the scatterplot of the residuals, against predicted values (the score actually)
> plot(reg,which=1)
which is simply
> plot(predict(reg),residuals(reg)) > abline(h=0,lty=2,col="grey")
Why do we have those two lines of points? Because we predict a probability for a variable taking values 0 or 1. If the true value is 0, then we always over-predict, and the residuals have to be negative (the blue points); if the true value is 1, then we under-predict, and the residuals have to be positive (the red points). And of course, there is a monotone relationship… We can see more clearly what is going on when we use colors,
> plot(predict(reg),residuals(reg),col=c("blue","red")[1+Y]) > abline(h=0,lty=2,col="grey")
Points are exactly on a smooth curve, as a function of the predicted value,
Now, we cannot tell much from this graph. If we want to understand more, we have to run a local regression, to see what is going on,
> lines(lowess(predict(reg),residuals(reg)),col="black",lwd=2)
This is exactly what we had with the first diagnostic plot. But with this local regression, we do not get a confidence interval. Can we claim, on the graph above, that the solid dark line is very close to the dotted one?
> library(splines)
> rl=lm(residuals(reg)~bs(predict(reg),8))
> #rl=loess(residuals(reg)~predict(reg))
> y=predict(rl,se=TRUE)
> segments(predict(reg),y$fit+2*y$se.fit,predict(reg),y$fit-2*y$se.fit,col="green")
Yes, we can. But even if we suspect that something can be done, what would this graph suggest?
Actually, that graph is probably not the only way to look at the residuals. Why not plot them against the two explanatory variables? For instance, if we plot the residuals against the second one, we get
> plot(X2,residuals(reg),col=c("blue","red")[1+Y])
> lines(lowess(X2,residuals(reg)),col="black",lwd=2)
> lines(lowess(X2[Y==0],residuals(reg)[Y==0]),col="blue")
> lines(lowess(X2[Y==1],residuals(reg)[Y==1]),col="red")
> abline(h=0,lty=2,col="grey")
The graph is similar to the one we had earlier, and again, there is not much to say,
If we now look at the relationship with the first one, it starts to be more interesting,
> plot(X1,residuals(reg),col=c("blue","red")[1+Y])
> lines(lowess(X1,residuals(reg)),col="black",lwd=2)
> lines(lowess(X1[Y==0],residuals(reg)[Y==0]),col="blue")
> lines(lowess(X1[Y==1],residuals(reg)[Y==1]),col="red")
> abline(h=0,lty=2,col="grey")
since we can clearly identify a quadratic effect. This graph suggests that we should run a regression on the square of the first variable. And it can be seen as a significant effect,
Now, if we run a regression including this quadratic effect, what do we have,
> reg=glm(Y~X1+I(X1^2)+X2,family=binomial)
> plot(predict(reg),residuals(reg),col=c("blue","red")[1+Y])
> lines(lowess(predict(reg)[Y==0],residuals(reg)[Y==0]),col="blue")
> lines(lowess(predict(reg)[Y==1],residuals(reg)[Y==1]),col="red")
> lines(lowess(predict(reg),residuals(reg)),col="black",lwd=2)
> abline(h=0,lty=2,col="grey")
Actually, it looks like we are back where we were initially… So what is my point? My point is that
- graphs (yes, plural) can be used to see what might go wrong, and to get more intuition about possible non-linear transformations,
- graphs are not everything, and they will never be perfect! Here, in theory, the solid line should be a flat, horizontal line. But we also want a model as simple as possible. So, at some stage, we should probably give up and rely on statistical tests and confidence intervals (see the short sketch below). Yes, an almost flat line can be interpreted as flat.
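For instance, a hedged sketch of such a formal check (on the same simulated data as above): compare the models with and without the squared term, using a likelihood ratio test.

reg0 <- glm(Y ~ X1 + X2, family = binomial)              # model without the quadratic effect
reg1 <- glm(Y ~ X1 + I(X1^2) + X2, family = binomial)    # model with the quadratic effect
anova(reg0, reg1, test = "Chisq")                        # a small p-value supports keeping X1^2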
Non-life (P&C) insurance
The first session of the P&C insurance course (ACT2040) will take place on Wednesday. The course outline is online, as well as the slides of the first session. We will see an introduction to ratemaking, with heterogeneity, and a few reminders of statistics.
More information will follow, in particular for the tutorial sessions.
Introduction to logistic regression and trees
For the second ACT2040 class, we will finish the introduction (and the reminders of inferential statistics) and then start the first big section, on logistic regression and classification trees, using a dataset taken from Jed Frees' book, http://instruction.bus.wisc.edu/jfrees/…
> baseavocat=read.table("http://freakonometrics.free.fr/AutoBI.csv",
+ header=TRUE,sep=",")
> tail(baseavocat)
     CASENUM ATTORNEY CLMSEX MARITAL CLMINSUR SEATBELT CLMAGE  LOSS
1335   34204        2      2       2        2        1     26 0.161
1336   34210        2      1       2        2        1     NA 0.576
1337   34220        1      2       1        2        1     46 3.705
1338   34223        2      2       1        2        1     39 0.099
1339   34245        1      2       2        1        1     18 3.277
1340   34253        2      2       2        2        1     30 0.688
We have a binary variable indicating whether an insured – following a car accident – was represented by an attorney (1 if yes, 2 if not). We know the sex of the insured (1 for men, 2 for women) and the marital status (1 if married, 2 if single, 3 for a widow(er), and 4 for a divorced insured). We also know whether the insured was wearing a seatbelt when the accident occurred (1 if yes, 2 if not and 3 if the information is unknown). Finally, there is a variable indicating whether the driver of the vehicle was insured (1 if yes, 2 if not and 3 if the information is unknown). We will recode the data a little to make them easier to read; a possible recoding is sketched below.
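A hedged sketch of that recoding step (the labels are assumptions based on the description above, not the exact code used in class):

baseavocat$ATTORNEY <- factor(baseavocat$ATTORNEY, levels = 1:2, labels = c("yes", "no"))
baseavocat$CLMSEX   <- factor(baseavocat$CLMSEX,   levels = 1:2, labels = c("M", "F"))
baseavocat$MARITAL  <- factor(baseavocat$MARITAL,  levels = 1:4,
                              labels = c("married", "single", "widowed", "divorced"))
baseavocat$SEATBELT <- factor(baseavocat$SEATBELT, levels = 1:3, labels = c("yes", "no", "unknown"))
baseavocat$CLMINSUR <- factor(baseavocat$CLMINSUR, levels = 1:3, labels = c("yes", "no", "unknown"))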
The slides are online on the blog.
Additional theoretical material on trees can be found online at http://genome.jouy.inra.fr/…, http://ensmp.fr/…, or http://ujf-grenoble.fr/… (for information, we will only cover the CART method). I can also refer to the book (and blog) of Stéphane Tuffery, or (in English) to Richard Berk's book, a summary of which is online at http://crim.upenn.edu/….
Non-observable vs. observable heterogeneity factor
This morning, in the ACT2040 class (on non-life insurance), we discussed the difference between observable and non-observable heterogeneity in ratemaking (from an economic perspective). To illustrate that point (we will spend more time, later on, discussing observable and non-observable risk factors), we looked at the following simple example. Let $X$ denote the height of a person. Consider the following dataset,
> Davis=read.table( + "http://socserv.socsci.mcmaster.ca/jfox/Books/Applied-Regression-2E/datasets/Davis.txt")
There is a small typo in the dataset, so let us make manual changes here
> Davis[12,c(2,3)]=Davis[12,c(3,2)]
Here, the variable of interest is the height of a given person,
> X=Davis$height
If we look at the histogram, we have
> hist(X,col="light green", border="white",proba=TRUE,xlab="",main="")
Can we assume that we have a Gaussian distribution ?
Maybe not… Here, if we fit a Gaussian distribution, plot it, and add a kernel based estimator, we get
> library(MASS)
> (param <- fitdistr(X,"normal")$estimate)
> f1 <- function(x) dnorm(x,param[1],param[2])
> x=seq(100,210,by=.2)
> lines(x,f1(x),lty=2,col="red")
> lines(density(X))
If you look at that black line, you might think of a mixture, i.e. something like
$$f(x)=p\cdot\varphi(x;\mu_1,\sigma_1)+(1-p)\cdot\varphi(x;\mu_2,\sigma_2)$$
(using standard mixture notations, with $\varphi$ the Gaussian density). Mixtures are obtained when we have a non-observable heterogeneity factor: with probability $p$, we draw from a $\mathcal{N}(\mu_1,\sigma_1^2)$ distribution (call it type [1]), and with probability $1-p$, from a $\mathcal{N}(\mu_2,\sigma_2^2)$ distribution (call it type [2]). So far, nothing new. And we can fit such a mixture distribution, using e.g.
> library(mixtools)
> mix <- normalmixEM(X)
number of iterations= 335
> (param12 <- c(mix$lambda[1],mix$mu,mix$sigma))
[1] 0.4002202 178.4997298 165.2703616 6.3561363 5.9460023
If we plot that mixture of two Gaussian distributions, we get
> f2 <- function(x){ param12[1]*dnorm(x,param12[2],param12[4]) +
+ (1-param12[1])*dnorm(x,param12[3],param12[5]) }
> lines(x,f2(x),lwd=2, col="red")
> lines(density(X))
Not bad. Actually, we can try to maximize the likelihood with our own codes,
> logdf <- function(x,parameter){
+ p <- parameter[1]
+ m1 <- parameter[2]
+ s1 <- parameter[4]
+ m2 <- parameter[3]
+ s2 <- parameter[5]
+ return(log(p*dnorm(x,m1,s1)+(1-p)*dnorm(x,m2,s2)))
+ }
> logL <- function(parameter) -sum(logdf(X,parameter))
> Amat <- matrix(c(1,-1,0,0,0,0,
+ 0,0,0,0,1,0,0,0,0,0,0,0,0,1), 4, 5)
> bvec <- c(0,-1,0,0)
> constrOptim(c(.5,160,180,10,10), logL, NULL, ui = Amat, ci = bvec)$par
[1]   0.5996263 165.2690084 178.4991624   5.9447675   6.3564746
Here, we include some constraints, to ensure that the probability belongs to the unit interval, and that the standard deviation parameters remain positive. Note that the output is close to the previous one.
Let us try something a little bit more complex now. What if we assume that the underlying distributions have the same variance, namely
$$f(x)=p\cdot\varphi(x;\mu_1,\sigma)+(1-p)\cdot\varphi(x;\mu_2,\sigma)$$
In that case, we have to use the previous code, and make small changes,
> logdf <- function(x,parameter){
+ p <- parameter[1]
+ m1 <- parameter[2]
+ s1 <- parameter[4]
+ m2 <- parameter[3]
+ s2 <- parameter[4]
+ return(log(p*dnorm(x,m1,s1)+(1-p)*dnorm(x,m2,s2)))
+ }
> logL <- function(parameter) -sum(logdf(X,parameter))
> Amat <- matrix(c(1,-1,0,0,0,0,0,0,0,0,0,1), 3, 4)
> bvec <- c(0,-1,0)
> (param12c= constrOptim(c(.5,160,180,10), logL, NULL, ui = Amat, ci = bvec)$par)
[1]   0.6319105 165.6142824 179.0623954   6.1072614
This is what we can do if we cannot observe the heterogeneity factor. But wait… we actually have some information in the dataset. For instance, we have the sex of the person. Now, if we look at histograms of height per sex, and kernel based density estimator of the height, per sex, we have
So, it looks like the distributions of heights for males and for females are different. Maybe we can use that variable, which is actually observed, to explain the heterogeneity in our sample. Formally, the idea is to consider a mixture with an observable heterogeneity factor, the sex,
$$f(x)=p_M\cdot\varphi(x;\mu_M,\sigma_M)+(1-p_M)\cdot\varphi(x;\mu_F,\sigma_F)$$
We now have an interpretation of what we called type [1] and type [2] previously: males and females. And here, estimating the parameters is quite simple,
> sex=Davis$sex
> (pM <- mean(sex=="M"))
[1] 0.44
> (paramF <- fitdistr(X[sex=="F"],"normal")$estimate)
      mean         sd
164.714286   5.633808
> (paramM <- fitdistr(X[sex=="M"],"normal")$estimate)
      mean         sd
178.011364   6.404001
And if we plot the density, we have
> f4 <- function(x) pM*dnorm(x,paramM[1],paramM[2])+(1-pM)*dnorm(x,paramF[1],paramF[2])
> lines(x,f4(x),lwd=3,col="blue")
What if, once again, we assume identical variances? Namely, the model becomes
$$f(x)=p_M\cdot\varphi(x;\mu_M,\sigma)+(1-p_M)\cdot\varphi(x;\mu_F,\sigma)$$
Then a natural idea to derive an estimator of the variance, based on the previous computations, is to use the pooled estimator
$$\widehat{\sigma}^2=\frac{\displaystyle\sum_{i:\,\text{sex}_i=M}(X_i-\widehat{\mu}_M)^2+\sum_{i:\,\text{sex}_i=F}(X_i-\widehat{\mu}_F)^2}{n-2}$$
The code is here
> height=Davis$height
> s=sqrt((sum((height[sex=="M"]-paramM[1])^2)+sum((height[sex=="F"]-paramF[1])^2))/(nrow(Davis)-2))
> s
[1] 6.015068
and again, it is possible to plot the associated density,
> f5 <- function(x) pM*dnorm(x,paramM[1],s)+(1-pM)*dnorm(x,paramF[1],s) > lines(x,f5(x),lwd=3,col="blue")
Now, if we think a little about what we have just done, it is simply a linear regression on a factor, the sex of the person,
$$X_i=\beta_0+\beta_1\cdot\boldsymbol{1}(\text{sex}_i=M)+\varepsilon_i$$
where $\varepsilon_i\sim\mathcal{N}(0,\sigma^2)$. And indeed, if we run the code to estimate this linear model,
> summary(lm(height~sex,data=Davis)) Call: lm(formula = height ~ sex, data = Davis) Residuals: Min 1Q Median 3Q Max -16.7143 -3.7143 -0.0114 4.2857 18.9886 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 164.7143 0.5684 289.80 <2e-16 *** sexM 13.2971 0.8569 15.52 <2e-16 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 Residual standard error: 6.015 on 198 degrees of freedom Multiple R-squared: 0.5488, Adjusted R-squared: 0.5465 F-statistic: 240.8 on 1 and 198 DF, p-value: < 2.2e-16
we get the same estimators for the means and the variance as the ones obtained previously. So, as mentioned this morning in class, if you have a non-observable heterogeneity factor, we can use a mixture model to fit a distribution, but if you can get a proxy of that factor, that is observable, then you can run a regression. But most of the time, that observable variable is just a proxy of a non-observable one…
Poisson regression
On Wednesday, we will finish classification trees and start modeling claim frequency. The slides are online.
As announced during the first class, I suggest starting to read the Practitioner's Guide to Generalized Linear Models. That document corresponds to the minimum expected in this course.
Logistic regression and categorical covariates
A short post to get back – for my nonlife insurance course – on the interpretation of the output of a regression when there is a categorical covariate. Consider the following dataset
> db = read.table("http://freakonometrics.free.fr/db.txt",header=TRUE,sep=";")
> attach(db)
> tail(db)
     Y       X1       X2 X3
995  1 4.801836 20.82947  A
996  1 9.867854 24.39920  C
997  1 5.390730 21.25119  D
998  1 6.556160 20.79811  D
999  1 4.710276 21.15373  A
1000 1 6.631786 19.38083  A
Let us run a logistic regression on that dataset
> reg = glm(Y~X1+X2+X3,family=binomial,data=db) > summary(reg) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -4.45885 1.04646 -4.261 2.04e-05 *** X1 0.51664 0.11178 4.622 3.80e-06 *** X2 0.21008 0.07247 2.899 0.003745 ** X3B 1.74496 0.49952 3.493 0.000477 *** X3C -0.03470 0.35691 -0.097 0.922543 X3D 0.08004 0.34916 0.229 0.818672 X3E 2.21966 0.56475 3.930 8.48e-05 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 552.64 on 999 degrees of freedom Residual deviance: 397.69 on 993 degrees of freedom AIC: 411.69 Number of Fisher Scoring iterations: 7
Here, the reference is modality $A$. This means that for someone with characteristics $(x_1,x_2,A)$, we predict the probability
$$\mathbb{P}(Y=1)=H\big(\widehat{\beta}_0+\widehat{\beta}_1x_1+\widehat{\beta}_2x_2\big)$$
where $H$ denotes the cumulative distribution function of the logistic distribution,
$$H(x)=\frac{e^{x}}{1+e^{x}}$$
For someone with characteristics $(x_1,x_2,B)$, we predict the probability
$$\mathbb{P}(Y=1)=H\big(\widehat{\beta}_0+\widehat{\beta}_1x_1+\widehat{\beta}_2x_2+\widehat{\beta}_B\big)$$
For someone with characteristics $(x_1,x_2,C)$, we predict the probability
$$\mathbb{P}(Y=1)=H\big(\widehat{\beta}_0+\widehat{\beta}_1x_1+\widehat{\beta}_2x_2+\widehat{\beta}_C\big)$$
and so on. Here, if we accept $H_0:\beta_C=0$ (against $H_1:\beta_C\neq 0$), it means that modality $C$ cannot be considered as different from $A$. A quick sketch of those predicted probabilities is given below.
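A hedged sketch (the profile below is a hypothetical one, here the average of X1 and X2): predicted probabilities for the same (X1, X2) profile under each modality of X3, using the model fitted above.

nd <- data.frame(X1 = mean(db$X1), X2 = mean(db$X2), X3 = factor(LETTERS[1:5]))
predict(reg, newdata = nd, type = "response")   # one predicted probability per modality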
A natural idea can be to change the reference modality, and to look at the $p$-values. If we consider the following loop, we get
> M = matrix(NA,5,5)
> rownames(M)=colnames(M)=LETTERS[1:5]
> for(k in 1:5){
+ db$X3 = relevel(X3,LETTERS[k])
+ reg = glm(Y~X1+X2+X3,family=binomial,data=db)
+ M[levels(db$X3)[-1],k] = summary(reg)$coefficients[4:7,4]
+ }
> M
             A            B            C            D            E
A           NA 0.0004771853 9.225428e-01 0.8186723647 8.482647e-05
B 4.771853e-04           NA 4.841204e-04 0.0009474491 4.743636e-01
C 9.225428e-01 0.0004841204           NA 0.7506242347 9.194193e-05
D 8.186724e-01 0.0009474491 7.506242e-01           NA 1.730589e-04
E 8.482647e-05 0.4743636442 9.194193e-05 0.0001730589           NA
and if we simply want to know whether the $p$-value exceeds 5% or not, we get the following,
> M.TF = M>.05
> M.TF
      A     B     C     D     E
A    NA FALSE  TRUE  TRUE FALSE
B FALSE    NA FALSE FALSE  TRUE
C  TRUE FALSE    NA  TRUE FALSE
D  TRUE FALSE  TRUE    NA FALSE
E FALSE  TRUE FALSE FALSE    NA
The first column is obtained when $A$ is the reference; then we see which parameters can be considered as null. The interpretation is the following:
- $C$ and $D$ are not significantly different from $A$,
- $E$ is not significantly different from $B$,
- $A$ and $D$ are not significantly different from $C$,
- $A$ and $C$ are not significantly different from $D$,
- $B$ is not significantly different from $E$.
Note that this only gives some kind of intuition. So let us run a more formal test. Consider the following regression (we remove the intercept to get a model that is easier to interpret),
> library(car) > db$X3=relevel(X3,"A") > reg=glm(Y~0+X1+X2+X3,family=binomial,data=db) > summary(reg) Coefficients: Estimate Std. Error z value Pr(>|z|) X1 0.51664 0.11178 4.622 3.80e-06 *** X2 0.21008 0.07247 2.899 0.00374 ** X3A -4.45885 1.04646 -4.261 2.04e-05 *** X3E -2.23919 1.06666 -2.099 0.03580 * X3D -4.37881 1.04887 -4.175 2.98e-05 *** X3C -4.49355 1.06266 -4.229 2.35e-05 *** X3B -2.71389 1.07274 -2.530 0.01141 *
--- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 1386.29 on 1000 degrees of freedom Residual deviance: 397.69 on 993 degrees of freedom AIC: 411.69 Number of Fisher Scoring iterations: 7
It is possible to use a Fisher-type test to check whether some coefficients are equal or not (more generally, whether some linear constraints are satisfied),
> linearHypothesis(reg,c("X3A=X3C","X3A=X3D","X3B=X3E")) Linear hypothesis test Hypothesis: X3A - X3C = 0 X3A - X3D = 0 - X3E + X3B = 0 Model 1: restricted model Model 2: Y ~ 0 + X1 + X2 + X3 Res.Df Df Chisq Pr(>Chisq) 1 996 2 993 3 0.6191 0.892
Here, we clearly accept the assumption that the first three factors are equal, as well as the last two. What is the next step? Well, if we believe that there are really only two categories, $\{A,C,D\}$ and $\{B,E\}$, let us create that factor,
> X3bis=rep(NA,length(X3)) > X3bis[X3%in%c("A","C","D")]="ACD" > X3bis[X3%in%c("B","E")]="BE" > db$X3bis=as.factor(X3bis) > reg=glm(Y~X1+X2+X3bis,family=binomial,data=db) > summary(reg) Coefficients: Estimate Std. Error z value Pr(>|z|) (Intercept) -4.39439 1.02791 -4.275 1.91e-05 *** X1 0.51378 0.11138 4.613 3.97e-06 *** X2 0.20807 0.07234 2.876 0.00402 ** X3bisBE 1.94905 0.36852 5.289 1.23e-07 *** --- Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 (Dispersion parameter for binomial family taken to be 1) Null deviance: 552.64 on 999 degrees of freedom Residual deviance: 398.31 on 996 degrees of freedom AIC: 406.31 Number of Fisher Scoring iterations: 7
Here, all the categories are significant. So we do have a proper model.
Nice tutorials to discover R
A series of tutorials, in R, by Anthony Damico. As claimed on http://twotorials.com/, “how to do stuff in r. two minutes or less, for those of us who prefer to learn by watching and listening“. So far,
- 000 what is r? the lingua statistica, s’il vous plaît
- 001 how to download and install r
- 002 simple shortcuts for the windows r console
- 003 how to do simple arithmetic in r
- 004 how to create a variable with r
- 005 how to use the c (combine) function
- 006 what does object oriented statistical programming really mean?
- 007 how to work with data tables in r
- 008 how to install and load a package in r
- 009 how to use a function and read the help files in r
- 010 how to read csv files into r
- 011 how to create a logical test and then build an if statement in r
- 012 how to build a for loop in r
- 013 how to read spss, stata, and sas files into r
- 014 how to read an excel file (dot xls and dot xlsx) into a data frame with r
- 015 how to work with character strings in r
- 016 how to add comments, save a script file, and make your work reproducible in r
- 017 how to use coercion and convert numbers to characters to logicals in r
- 018 help resources for r: the r-help mailing list, ucla academic technology services, and r-bloggers
- 019 how to access different records within a data frame by using logical tests in r
- 020 how to use the ifelse function in r
- 021 how to put things together in r: c (combine), paste, the paste collapse parameter, and the outer function
- 022 how to generate basic descriptive statistics – means, medians, sums, quantiles – on data tables in r
- 023 how to prevent r from breaking when it hits an error with the try function
- 024 how to run counts, tables, crosstabs, and flat tables on a data frame in r
- 025 how to change and recode variables and values in a data table using the fix and transform functions in r
- 026 how to sort and order stuff in r
- 027 how to run sql commands on data frames in r with the sqldf package
- 028 how to merge stuff together using the merge, cbind, rbind, and rbind.fill functions in r
- 029 how to run analyses across multiple categories of a data table with the tapply and aggregate functions in r
- 030 what is a function anyway and how does it input and output stuff in r
- 031 what are the different logical tests that you can conduct in r
- 032 how to handle NA (missing values) in r: understand and detect NA, remove NA, replace NA with 0
- 033 what does the %in% operator do and how does it differ from the double equals (==) sign?
- 034 how to enter your own data by hand and create a data frame in r
- 035 how to name or rename rows and columns in a data frame in r
- 036 how to output or export data and write tables to csv files in r
- 037 how to round a number in r: round, floor, ceiling, even excel_round
- 038 how to graph simple plots and make scatterplots in r
- 039 how to draw a line, add points, overlay a plot once it’s been created with (low-level) plotting functions in r
- 040 how to remove rows and columns from a data table in r
- 041 did someone send you a script file or computer code for the r programming language? here’s how to run it!
- 042 how do functions within functions within functions work in r?
- 043 when and how to use the %in% operator when accessing rows in a data frame that contains missing values in r
- 044 how and why r repeats an element over and over again without you even knowing it
- 045 how to bin or categorize or group data into new variables in a data table in r with the ifelse function
- 046 how to delete or remove variables, objects, data tables, and other stuff from memory in r
- 047 how many different ways can you slice and dice (subset) a data table in r?
- 048 how to download a data table or zipped file from the web directly into r, save it locally, then load or open it
- 049 how to make custom data tables and aggregate data tables with sql commands in r
- 050 how to count stuff in r and then use those counts to loop through stuff
- 051 how to read and write excel files with the xlsx package in r
- 052 how to output or export data tables from r into different formats: dbf, stata, spss, and sas files
- 053 how to import and export data to and from your windows clipboard with r
- 054 how to merge data frames one-to-one, many-to-one, one-to-many, many-to-many and make sure you did it right in r
- 055 how to (make or generate or draw or display) a (boxplot or histogram or bar graph or pie chart) in r
- 056 how to bin or categorize or group data into new variables in a data table or vector in r with the cut function
- 057 how to use the sample function and sample from a vector or take a random sample of rows from a data frame in r
- 058 how to use the subset function in r for more intuitive subsets of data frames
- 059 how to search within a string for matches, substitute stuff in strings, even trim whitespace in a string in r
- 060 how to output or export data frames in r to a sas transport file (dot xpt) with the write.xport function
- 061 how to generate a random number in r
- 062 how to generate random numbers according to different statistical distributions in r
- 063 how to generate the same random number every time you run a program in r with the set.seed function
- 064 fun one: how many ancestors did you have in each previous generation?
- 065 how to title a plot or graph in r, add axes and other text or words with the axis and text functions
- 066 how to (dynamically) create different objects according to a pattern using the assign function
- 067 how to (dynamically) retrieve different objects according to a pattern using the get function
- 068 how and when to load and save your workspace in r
- 069 how to access the elements of a list object and create your own list variable in r
- 070 how to nest for loops in r and then debug and troubleshoot them if you hit an error you cannot identify
- 071 how to create and nest while loops in r
- 072 how to use the mod (modulus) function in r to calculate a remainder
- 073 how to run your first regression in r (assuming you know some fancy statistics)
- 074 how to calculate a variance and standard deviation and then normalize data in r
- 075 how to clear variables, clear a workspace, clear your screen, even clear unused computer memory in r
- 076 how to load and then run or execute a script in r with the source function
- 077 how to time a function or other processes in r like a stopwatch with the Sys.time() command
- 078 how to initiate empty objects like data tables and matrices in r
- 079 how the if function (if then) differs from ifelse, single-line if, nested if, and brackets in r
- 080 how to beautify your r script with formatR and the tidy.source function
- 081 how to write a for, while, or repeat loop in r with the break and next commands
- 082 how to test for correlation between two variables in r
- 083 how to plot residuals from a regression in r (assuming you know some fancy statistics)
- 084 how to flip a coin or generate a bernoulli or binomial distribution without quaking in fear in r
- 085 how to export or save a plot in r
- 086 fun one: what is each country’s most current available percent of gross national income spent on education?
- 087 how to split a character string in r with the strsplit function
- 088 how to change options and global settings like the number of decimal places shown in r with the options function
- 089 how to run a block of commands at start-up to do stuff like setting your CRAN mirror permanently with r
- 090 fun one: how to calculate your ten, fifteen, or twenty thousandth day on earth with r
- 091 fun one: how to make perfect pour-over coffee and time your french press steeping with r
(I guess there is no need now to make my own….)
ROC curves and classification
To get back to a question asked after the last course (still on non-life insurance), I will spend some time discussing the construction and interpretation of ROC curves. Consider the dataset we used last week,
> db = read.table("http://freakonometrics.free.fr/db.txt",header=TRUE,sep=";")
> attach(db)
The first step is to get a model. For instance, a logistic regression, where some levels of the categorical variable have been merged together,
> X3bis=rep(NA,length(X3))
> X3bis[X3%in%c("A","C","D")]="ACD"
> X3bis[X3%in%c("B","E")]="BE"
> db$X3bis=as.factor(X3bis)
> reg=glm(Y~X1+X2+X3bis,family=binomial,data=db)
From this model, we can predict a probability, not a 0/1 variable,
> S=predict(reg,type="response")
Let S denote this score (actually, we can use either the score or the predicted probability, it will not change the construction of the ROC curve). What if we really want to predict a 0/1 variable, as we usually do in decision theory? The idea is to consider a threshold s, so that
- if S > s, then the prediction will be 1, or "positive" (using a standard terminology)
- if S ≤ s, then the prediction will be 0, or "negative"
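In R, with the notations above, this decision rule is a one-liner (just a sketch; the names s and Yhat are illustrative and not reused later in the post),
> s = .5                        # say, a 50% threshold
> Yhat = ifelse(S > s, 1, 0)    # 1 = "positive", 0 = "negative"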
Then we derive a contingency table, or a confusion matrix
                        observed value
                     "positive"   "negative"
predicted "positive"     TP           FP
value     "negative"     FN           TN
where TP are the so-called true positives, TN the true negatives, FP the false positives (type I errors) and FN the false negatives (type II errors). We can get that contingency table for a given threshold
> roc.curve=function(s,print=FALSE){
+   Ps=(S>s)*1
+   FP=sum((Ps==1)*(Y==0))/sum(Y==0)
+   TP=sum((Ps==1)*(Y==1))/sum(Y==1)
+   if(print==TRUE){
+     print(table(Observed=Y,Predicted=Ps))
+   }
+   vect=c(FP,TP)
+   names(vect)=c("FPR","TPR")
+   return(vect)
+ }
> threshold = 0.5
> roc.curve(threshold,print=TRUE)
        Predicted
Observed   0   1
       0   5 231
       1  19 745
      FPR       TPR 
0.9788136 0.9751309 
Here, we also compute the false positive rate and the true positive rate (a quick numerical check is sketched after the list),
- TPR = TP / P = TP / (TP + FN), also called sensitivity: the probability of being predicted positive, given that the observation is positive (true positive rate)
- FPR = FP / N = FP / (FP + TN): the probability of being predicted positive, given that the observation is negative (false positive rate)
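As a sanity check (this chunk is not in the original code, and tab is a made-up name), those two rates can be recovered directly from the entries of the confusion matrix obtained above with threshold = 0.5,
> tab = table(Observed=Y, Predicted=(S>threshold)*1)
> TP = tab["1","1"]; FN = tab["1","0"]    # observed positives
> FP = tab["0","1"]; TN = tab["0","0"]    # observed negatives
> TP/(TP+FN)                              # TPR (sensitivity), 0.9751 here
> FP/(FP+TN)                              # FPR, 0.9788 here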
The ROC curve is then obtained using several values for the threshold. For convenience, define
> ROC.curve=Vectorize(roc.curve)
First, we can plot S against Y (a standard predicted-versus-observed graph), and visualize true and false positives and negatives, using simple colors,
> I=(((S>threshold)&(Y==0))|((S<=threshold)&(Y==1)))
> plot(S,Y,col=c("red","blue")[I+1],pch=19,cex=.7,xlab="",ylab="")
> abline(v=threshold,col="gray")
And for the ROC curve, simply use
> M.ROC=ROC.curve(seq(0,1,by=.01))
> plot(M.ROC[1,],M.ROC[2,],col="grey",lwd=2,type="l")
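It can help to add the diagonal on that graph, since it is the ROC curve of a classifier that predicts at random (a small optional addition, not in the original code),
> abline(a=0,b=1,lty=2)    # random-guess benchmark: TPR = FPR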
This is the ROC curve. Now, to see why it can be interesting, we need a second model. Consider for instance a classification tree
> library(tree)
> ctr <- tree(Y~X1+X2+X3bis,data=db)
> plot(ctr)
> text(ctr)
To plot its ROC curve, we just need to use the predictions obtained with this second model,
> S=predict(ctr)
All the code described above can be used. Again, we can plot S against Y (observe that we have 5 possible values for S, which makes sense since we do have 5 leaves in our tree). Then, we can plot the ROC curve,
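That step is not shown explicitly in the post; presumably the matrix M.ROC.tree used below is built exactly like M.ROC, since roc.curve() reads whatever S currently is (now the tree scores),
> M.ROC.tree=ROC.curve(seq(0,1,by=.01))
> plot(M.ROC.tree[1,],M.ROC.tree[2,],col="grey",lwd=2,type="l")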
An interesting idea can be to plot the two ROC curves on the same graph, in order to compare the two models
> plot(M.ROC[1,],M.ROC[2,],type="l")
> lines(M.ROC.tree[1,],M.ROC.tree[2,],type="l",col="grey",lwd=2)
The most difficult part is to get a proper interpretation. The tree does not predict well in the lower part of the curve, which corresponds to people with a very high predicted probability. But if our interest lies more in those with a predicted probability below 90%, then we have to admit that the tree does a good job, since its ROC curve is always above the one of the logistic regression.
Regression on variables, or on categories?
I admit it, the title sounds weird. The problem I want to address this evening is related to the use of the stepwise procedure on a regression model, and to the use of categorical variables (and possible misinterpretations). Consider the following dataset
> db = read.table("http://freakonometrics.free.fr/db2.txt",header=TRUE,sep=";")
First, let us change the reference in our categorical variable (just to get an easier interpretation later on)
> db$X3=relevel(as.factor(db$X3),ref="E")
If we run a logistic regression on the three variables (two continuous, one categorical), we get
> reg=glm(Y~X1+X2+X3,family=binomial,data=db)
> summary(reg)

Call:
glm(formula = Y ~ X1 + X2 + X3, family = binomial, data = db)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.0758   0.1226   0.2805   0.4798   2.0345  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.39528    0.86649  -6.227 4.77e-10 ***
X1           0.51618    0.09163   5.633 1.77e-08 ***
X2           0.24665    0.05911   4.173 3.01e-05 ***
X3A         -0.09142    0.32970  -0.277   0.7816    
X3B         -0.10558    0.32526  -0.325   0.7455    
X3C          0.63829    0.37838   1.687   0.0916 .  
X3D         -0.02776    0.33070  -0.084   0.9331    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 806.29  on 999  degrees of freedom
Residual deviance: 582.29  on 993  degrees of freedom
AIC: 596.29

Number of Fisher Scoring iterations: 6
Now, if we use a stepwise procedure to select variables in the model, we get
> step(reg)
Start:  AIC=596.29
Y ~ X1 + X2 + X3

       Df Deviance    AIC
- X3    4   587.81 593.81
<none>      582.29 596.29
- X2    1   600.56 612.56
- X1    1   617.25 629.25

Step:  AIC=593.81
Y ~ X1 + X2

       Df Deviance    AIC
<none>      587.81 593.81
- X2    1   606.90 610.90
- X1    1   622.44 626.44
So clearly, we should remove the categorical variable if our starting point was the regression on the three variables.
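Note that step() also returns the reduced fitted model, so it can be stored and inspected directly (reg.step is an illustrative name; trace=0 simply silences the output shown above),
> reg.step = step(reg, trace=0)    # keeps Y ~ X1 + X2, as above
> summary(reg.step)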
Now, what if we consider the same model, but written slightly differently, with explicit indicators for the five categories,
> X3complete = model.matrix(~0+X3,data=db)
> db2 = data.frame(db,X3complete)
> head(db2)
  Y       X1       X2 X3 X3A X3B X3C X3D X3E
1 1 3.297569 16.25411  B   0   1   0   0   0
2 1 6.418031 18.45130  D   0   0   0   1   0
3 1 5.279068 16.61806  B   0   1   0   0   0
4 1 5.539834 19.72158  C   0   0   1   0   0
5 1 4.123464 18.38634  C   0   0   1   0   0
6 1 7.778443 19.58338  C   0   0   1   0   0
From a technical point of view, it is exactly the same as before, if we look at the regression,
> reg = glm(Y~X1+X2+X3A+X3B+X3C+X3D+X3E,family=binomial,data=db2)
> summary(reg)

Call:
glm(formula = Y ~ X1 + X2 + X3A + X3B + X3C + X3D + X3E, family = binomial, 
    data = db2)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.0758   0.1226   0.2805   0.4798   2.0345  

Coefficients: (1 not defined because of singularities)
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -5.39528    0.86649  -6.227 4.77e-10 ***
X1           0.51618    0.09163   5.633 1.77e-08 ***
X2           0.24665    0.05911   4.173 3.01e-05 ***
X3A         -0.09142    0.32970  -0.277   0.7816    
X3B         -0.10558    0.32526  -0.325   0.7455    
X3C          0.63829    0.37838   1.687   0.0916 .  
X3D         -0.02776    0.33070  -0.084   0.9331    
X3E               NA         NA      NA       NA    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 806.29  on 999  degrees of freedom
Residual deviance: 582.29  on 993  degrees of freedom
AIC: 596.29

Number of Fisher Scoring iterations: 6
Both regressions are equivalent. Now, what about a stepwise selection on this new model?
> step(reg)
Start:  AIC=596.29
Y ~ X1 + X2 + X3A + X3B + X3C + X3D + X3E

Step:  AIC=596.29
Y ~ X1 + X2 + X3A + X3B + X3C + X3D

       Df Deviance    AIC
- X3D   1   582.30 594.30
- X3A   1   582.37 594.37
- X3B   1   582.40 594.40
<none>      582.29 596.29
- X3C   1   585.21 597.21
- X2    1   600.56 612.56
- X1    1   617.25 629.25

Step:  AIC=594.3
Y ~ X1 + X2 + X3A + X3B + X3C

       Df Deviance    AIC
- X3A   1   582.38 592.38
- X3B   1   582.41 592.41
<none>      582.30 594.30
- X3C   1   586.30 596.30
- X2    1   600.58 610.58
- X1    1   617.27 627.27

Step:  AIC=592.38
Y ~ X1 + X2 + X3B + X3C

       Df Deviance    AIC
- X3B   1   582.44 590.44
<none>      582.38 592.38
- X3C   1   587.20 595.20
- X2    1   600.59 608.59
- X1    1   617.64 625.64

Step:  AIC=590.44
Y ~ X1 + X2 + X3C

       Df Deviance    AIC
<none>      582.44 590.44
- X3C   1   587.81 593.81
- X2    1   600.73 606.73
- X1    1   617.66 623.66
What do we get now? This time, the stepwise procedure recommends that we keep one single category (namely C). So my point is simple: when running a stepwise procedure with factors, either we keep the factor as it is, or we drop it. If the design needs to be changed, by pooling some categories together, and we forget to do it, the procedure will suggest removing the whole variable, because keeping four categories that mean the same thing costs too much under the Akaike criterion. And this is exactly what happens here,
> library(car)
> reg = glm(formula = Y ~ X1 + X2 + X3, family = binomial, data = db)
> linearHypothesis(reg,c("X3A=X3B","X3A=X3D","X3A=0"))
Linear hypothesis test

Hypothesis:
X3A - X3B = 0
X3A - X3D = 0
X3A = 0

Model 1: restricted model
Model 2: Y ~ X1 + X2 + X3

  Res.Df Df  Chisq Pr(>Chisq)
1    996                     
2    993  3 0.1446      0.986
So here, we should pool together categories A, B, D and E (the latter being the reference here). As mentioned in a previous post, categories that should be pooled together have to be pooled as early as possible; otherwise, the stepwise procedure might lead to misinterpretations.
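For illustration (this chunk is not in the original post, and X3pool and reg2 are made-up names), pooling those four categories by hand before any model selection could look like
> X3pool = ifelse(db$X3 %in% c("A","B","D","E"), "ABDE", "C")
> db$X3pool = relevel(as.factor(X3pool), ref="ABDE")
> reg2 = glm(Y ~ X1 + X2 + X3pool, family=binomial, data=db)    # two-level factor, ready for step()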