The sentence that I keep repeating in my regression classes is “please, look at your data”. In our previous post, we’ve been playing like most econometricians: we did not look at the data. Actually, if we look at the distribution of individual losses in the dataset, we see the following,
It looks like there are fixed-cost claims in our database. How do we deal with them in the standard case (e.g. in the Loss Models textbook)? We can use a mixture of, at least, three distributions here,
$$f(y)=p_1 f_1(y)+p_2\,\delta_{\kappa}(y)+p_3 f_3(y)$$
with
- a distribution for small claims, $f_1(\cdot)$, e.g. an exponential distribution
- a Dirac mass in $\kappa$, i.e. $\delta_{\kappa}(\cdot)$
- a distribution for larger claims, $f_3(\cdot)$, e.g. a Gamma, or a lognormal, distribution
> I1=which(couts$cout<1120)
> I2=which((couts$cout>=1120)&(couts$cout<1220))
> I3=which(couts$cout>=1220)
> (p1=length(I1)/nrow(couts))
[1] 0.3284823
> (p2=length(I2)/nrow(couts))
[1] 0.4152807
> (p3=length(I3)/nrow(couts))
[1] 0.256237
> X=couts$cout
> (kappa=mean(X[I2]))
[1] 1171.998
> X0=X[I3]-kappa
> u=seq(0,10000,by=20)
> F1=pexp(u,1/mean(X[I1]))
> F2= (u>kappa)
> F3=plnorm(u-kappa,mean(log(X0)),sd(log(X0))) * (u>kappa)
> F=F1*p1+F2*p2+F3*p3
> lines(u,F)
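In our previous post, we’ve discussed the idea that all parameters might be related to some covariates, i.e.
$$f(y|\boldsymbol{X})=p_1(\boldsymbol{X})f_1(y|\boldsymbol{X})+p_2(\boldsymbol{X})\,\delta_{\kappa}(y)+p_3(\boldsymbol{X})f_3(y|\boldsymbol{X})$$
which yields the following premium model,
$$\mathbb{E}(Y|\boldsymbol{X})=\underbrace{\mathbb{E}(Y|\boldsymbol{X},Y\leq s_1)}_{A}\cdot p_1(\boldsymbol{X})+\underbrace{\mathbb{E}(Y|\boldsymbol{X},s_1<Y\leq s_2)}_{B}\cdot p_2(\boldsymbol{X})+\underbrace{\mathbb{E}(Y|\boldsymbol{X},Y>s_2)}_{C}\cdot p_3(\boldsymbol{X})$$
where $s_1<s_2$ are the two thresholds separating small, fixed and large claims, and $p_j(\boldsymbol{X})$ is the probability that a claim falls in class $j$.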
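For the three conditional expectations (the $A$, $B$ and $C$ terms, one per class of claims), that’s easy: we can use standard models we’ve seen in the course. For the probabilities, we should use a multinomial model. Recall that for the logistic regression model, if $(\pi,1-\pi)=(\pi_1,\pi_2)$, then
$$\log\frac{\pi}{1-\pi}=\log\frac{\pi_1}{\pi_2}=\boldsymbol{X}'\boldsymbol{\beta}$$
i.e.
$$\pi_1=\frac{\exp(\boldsymbol{X}'\boldsymbol{\beta})}{1+\exp(\boldsymbol{X}'\boldsymbol{\beta})}$$
and
$$\pi_2=\frac{1}{1+\exp(\boldsymbol{X}'\boldsymbol{\beta})}$$
To derive a multivariate extension, write
$$\pi_1=\frac{\exp(\boldsymbol{X}'\boldsymbol{\beta}_1)}{1+\exp(\boldsymbol{X}'\boldsymbol{\beta}_1)+\exp(\boldsymbol{X}'\boldsymbol{\beta}_2)},\qquad \pi_2=\frac{\exp(\boldsymbol{X}'\boldsymbol{\beta}_2)}{1+\exp(\boldsymbol{X}'\boldsymbol{\beta}_1)+\exp(\boldsymbol{X}'\boldsymbol{\beta}_2)}$$
and
$$\pi_3=\frac{1}{1+\exp(\boldsymbol{X}'\boldsymbol{\beta}_1)+\exp(\boldsymbol{X}'\boldsymbol{\beta}_2)}$$
Again, maximum likelihood techniques can be used, since
$$\mathcal{L}(\boldsymbol{\beta},\boldsymbol{y})\propto\prod_{i=1}^n\prod_{j=1}^3\pi_{i,j}^{Y_{i,j}}$$
where here, the variable $Y_i$, which takes three levels, is split into three indicators $Y_{i,1},Y_{i,2},Y_{i,3}$ (like any categorical explanatory variable in a standard regression model). Thus, up to an additive constant,
$$\log\mathcal{L}(\boldsymbol{\beta},\boldsymbol{y})=\sum_{i=1}^n\left(\sum_{j=1}^2 Y_{i,j}\,\boldsymbol{X}_i'\boldsymbol{\beta}_j-\log\left[1+\exp(\boldsymbol{X}_i'\boldsymbol{\beta}_1)+\exp(\boldsymbol{X}_i'\boldsymbol{\beta}_2)\right]\right)$$
and, as for the logistic regression, we can then use the Newton-Raphson algorithm to compute the maximum likelihood estimate numerically. In R, first we have to define the levels, e.g.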
> seuils=c(0,1120,1220,1e+12)
> couts$tranches=cut(couts$cout,breaks=seuils,
+ labels=c("small","fixed","large"))
> head(couts,5)
  nocontrat    no garantie    cout exposition zone puissance agevehicule
1      1870 17219      1RC 1692.29       0.11    C         5           0
2      1963 16336      1RC  422.05       0.10    E         9           0
3      4263 17089      1RC  549.21       0.65    C        10           7
4      5181 17801      1RC  191.15       0.57    D         5           2
5      6375 17485      1RC 2031.77       0.47    B         7           4
  ageconducteur bonus marque carburant densite region tranches
1            52    50     12         E      73     13    large
2            78    50     12         E      72     13    small
3            27    76     12         D      52      5    small
4            26   100     12         D      83      0    small
5            46    50      6         E      11     13    large
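Before handing the fit over to a package, the maximum-likelihood step can be sketched by hand. The block below is only an illustration under assumptions: the data are simulated (a single hypothetical covariate `x` and made-up coefficients `beta_true`), level 3 is taken as the baseline, and `optim` with BFGS stands in for the Newton-Raphson iterations.

```r
# Simulated data (hypothetical, not the couts dataset)
set.seed(42)
n <- 500
x <- rnorm(n)
X <- cbind(1, x)                              # design matrix with intercept
beta_true <- cbind(c(0.5, 1), c(-0.5, -1))    # columns: beta_1, beta_2 (level 3 = baseline)

# Multinomial-logit probabilities, one row per observation
eta <- X %*% beta_true
P <- cbind(exp(eta), 1) / (1 + rowSums(exp(eta)))

# Draw a three-level response and split it into three indicator columns
y <- apply(P, 1, function(p) sample(1:3, 1, prob = p))
Y <- outer(y, 1:3, "==") * 1

# Negative log-likelihood of the multinomial logit model
negloglik <- function(b) {
  B <- matrix(b, ncol = 2)
  eta <- X %*% B
  -sum(Y[, 1:2] * eta) + sum(log(1 + rowSums(exp(eta))))
}

# Quasi-Newton maximisation (BFGS), used here instead of Newton-Raphson
fit <- optim(rep(0, 4), negloglik, method = "BFGS")
beta_hat <- matrix(fit$par, ncol = 2)
print(round(beta_hat, 2))
```

The estimates in `beta_hat` should land near `beta_true`, up to sampling noise. A packaged routine may pick a different baseline level, so coefficients are only comparable after relabelling the levels.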
Then, we can run a multinomial regression, from
> library(nnet)
using some selected covariates
> reg=multinom(tranches~ageconducteur+agevehicule+zone+carburant,data=couts)
# weights: 30 (18 variable)
initial value 2113.730043
iter 10 value 2063.326526
iter 20 value 2059.206691
final value 2059.134802
converged
The output is here
> summary(reg)
Call:
multinom(formula = tranches ~ ageconducteur + agevehicule + zone +
    carburant, data = couts)

Coefficients:
      (Intercept) ageconducteur agevehicule      zoneB      zoneC
fixed  -0.2779176   0.012071029  0.01768260 0.05567183 -0.2126045
large  -0.7029836   0.008581459 -0.01426202 0.07608382  0.1007513
           zoneD      zoneE      zoneF   carburantE
fixed -0.1548064 -0.2000597 -0.8441011 -0.009224715
large  0.3434686  0.1803350 -0.1969320  0.039414682

Std. Errors:
      (Intercept) ageconducteur agevehicule     zoneB     zoneC     zoneD
fixed   0.2371936   0.003738456  0.01013892 0.2259144 0.1776762 0.1838344
large   0.2753840   0.004203217  0.01189342 0.2746457 0.2122819 0.2151504
          zoneE     zoneF carburantE
fixed 0.1830139 0.3377169  0.1106009
large 0.2160268 0.3624900  0.1243560
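To connect this output with the multinomial formulas, the fitted probabilities for one profile can be recomputed by hand. This is a sketch: the coefficients are copied from the output above, the profile (a 40-year-old driver, a 5-year-old car, zone C, fuel D) is hypothetical, and it assumes that `multinom` takes the first level, `small`, as the baseline, with zone A and fuel D as reference categories (which is what the coefficient names suggest).

```r
# Coefficients copied from summary(reg); only those needed for the profile
coefs <- rbind(
  fixed = c(intercept = -0.2779176, ageconducteur = 0.012071029,
            agevehicule = 0.01768260, zoneC = -0.2126045),
  large = c(intercept = -0.7029836, ageconducteur = 0.008581459,
            agevehicule = -0.01426202, zoneC = 0.1007513)
)

# Hypothetical profile: 40-year-old driver, 5-year-old car, zone C, fuel D
profile <- c(intercept = 1, ageconducteur = 40, agevehicule = 5, zoneC = 1)

# Linear predictors for "fixed" and "large"; the baseline "small" has predictor 0
eta <- drop(coefs %*% profile)

# Multinomial-logit probabilities, baseline first
probs <- c(small = 1, exp(eta)) / (1 + sum(exp(eta)))
print(round(probs, 4))
```

With the baseline first, the probability of `small` plays the role of the term with numerator 1 in the formulas, and the two rows of coefficients are the log-odds of `fixed` and `large` against `small`.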
To visualize the impact of a single covariate, one can also use spline functions
> library(splines)
> reg=multinom(tranches~agevehicule,data=couts)
# weights: 9 (4 variable)
initial value 2113.730043
final value 2072.462863
converged
> reg=multinom(tranches~bs(agevehicule),data=couts)
# weights: 15 (8 variable)
initial value 2113.730043
iter 10 value 2070.496939
iter 20 value 2069.787720
iter 30 value 2069.659958
final value 2069.479535
converged
For instance, if the covariate is the age of the car, we do have the following probabilities
> predict(reg,newdata=data.frame(agevehicule=5),type="probs")
    small     fixed     large
0.3388947 0.3869228 0.2741825
and for all ages from 0 to 20,
For instance, for new cars, the proportion of fixed costs is rather small (here in purple), and keeps increasing with the age of the car. If the covariate is the population density in the area where the driver lives, we obtain the following probabilities
> reg=multinom(tranches~bs(densite),data=couts)
# weights: 15 (8 variable)
initial value 2113.730043
iter 10 value 2068.469825
final value 2068.466349
converged
> predict(reg,newdata=data.frame(densite=90),type="probs")
    small     fixed     large
0.3484422 0.3473315 0.3042263
Based on those probabilities, it is then possible to derive the expected cost of a claim, given some covariates (e.g. the density). But first, define subsets of the whole dataset
> sousbaseA=couts[couts$tranches=="small",]
> sousbaseB=couts[couts$tranches=="fixed",]
> sousbaseC=couts[couts$tranches=="large",]
with the fixed cost estimated by the average cost in the middle class,
> (k=mean(sousbaseB$cout))
[1] 1171.998
Then, let us run our four models,
> reg=multinom(tranches~bs(densite),data=couts)
> regA=glm(cout~bs(densite),data=sousbaseA,family=Gamma(link="log"))
> regB=glm(cout~1,data=sousbaseB,family=Gamma(link="log"))
> regC=glm((cout-k)~bs(densite),data=sousbaseC,family=Gamma(link="log"))
We can now compute predictions based on those models,
> nouveau=data.frame(densite=seq(10,100))
> proba=predict(reg,newdata=nouveau,type="probs")
> predA=predict(regA,newdata=nouveau,type="response")
> predB=predict(regB,newdata=nouveau,type="response")
> predC=predict(regC,newdata=nouveau,type="response")+k
> pred=cbind(predA,predB,predC)
To visualize the impact of each component on the premium, we can compute probabilities, as well as expected costs (given a cost in each subset),
> cbind(proba,pred)[seq(10,90,by=10),]
       small     fixed     large    predA    predB    predC
10 0.3344014 0.4241790 0.2414196 423.3746 1171.998 7135.904
20 0.3181240 0.4471869 0.2346892 428.2537 1171.998 6451.890
30 0.3076710 0.4626572 0.2296718 438.5509 1171.998 5499.030
40 0.3032872 0.4683247 0.2283881 451.4457 1171.998 4615.051
50 0.3052378 0.4620219 0.2327404 463.8545 1171.998 3961.994
60 0.3136136 0.4417057 0.2446807 472.3596 1171.998 3586.833
70 0.3279413 0.4056971 0.2663616 473.3719 1171.998 3513.601
80 0.3464842 0.3534126 0.3001032 463.5483 1171.998 3840.078
90 0.3652932 0.2868006 0.3479061 440.4925 1171.998 4912.379
Now, it is possible to plot those figures in a graph,
> barplot(t(proba*pred))
> abline(h=mean(couts$cout),lty=2)
(the dashed horizontal line is the average cost of a claim, in our dataset).
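The total height of each bar is the expected cost of a claim for that row, i.e. the probability-weighted sum of the three expected costs. As a small check, the sketch below recomputes it for the rows labelled 10, 50 and 90 of the table printed by `cbind(proba,pred)`; the values are copied from that output, and the objects are rebuilt by hand only for illustration.

```r
# Probabilities for the rows labelled 10, 50 and 90 (copied from the table)
proba <- rbind("10" = c(small = 0.3344014, fixed = 0.4241790, large = 0.2414196),
               "50" = c(small = 0.3052378, fixed = 0.4620219, large = 0.2327404),
               "90" = c(small = 0.3652932, fixed = 0.2868006, large = 0.3479061))

# Expected costs within each class, for the same rows
pred <- rbind("10" = c(predA = 423.3746, predB = 1171.998, predC = 7135.904),
              "50" = c(predA = 463.8545, predB = 1171.998, predC = 3961.994),
              "90" = c(predA = 440.4925, predB = 1171.998, predC = 4912.379))

# Expected cost of a claim: sum over classes of probability times expected cost
premium <- rowSums(proba * pred)
print(round(premium, 1))
```

In these rows the large-claim term `p3 * predC` contributes the most to the total, even though `large` is not the most likely class.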