Segmentation en tarification, compléments

January 10, 2013, 8:34 am

Dans le premier cours d’actuariat IARD, nous avons vu l’importance de la ségmentation, et son implication sur le calcul des primes (passer d’une espérance mathématique à une espérance conditionnelle). Pour aller un peu plus loin, quelques compléments,

chapitre 9,10, 11 dans Wendler & Modlin (2009) http://casact.org/lib…

pour une lecture plus économique de la problématique de la segmentation en assurance

Hoy (1982) http://jstor.org/…
Dahlby (1983) http://ppge.ufrgs.br/…

ou pour une lecture plus légale

Paris (2004) http://books.google.ca/…

Sinon, plusieurs articles de vulgarisation peuvent être lu sur internet,

Éthique, statistique et tarification http://ffsa.fr/…
La segmentation : un fondement essentiel de l’assurance http://ffsa.fr/…
La segmentation dans les assurances… http://actu-assurance-sante.net/…
Jusqu’où faut-il segmenter? http://lalibre.be/…

La première démo aura lieu lundi, en salle informatique. Karim sera une introduction au langage R, à la manipulation des variables (qualitatives et quantitatives). Je mettrais en ligne les transparents en fin de semaine, et les codes seront mis en ligne dans le courant de la semaine prochaine.

Comme annoncé hier, il n’y aura pas cours mercredi prochain. Le mercredi suivant, nous verrons la modélisation des variables indicatrices, i.e. la régression logistique, et les arbres de régression. On supposera que le modèle linéaire aura été vu, je mets un lien vers les transparents du cours ACT6420 de la session passée, notes de cours transparents1 et transparents2. Il est aussi possible de relire Frees (2010), chapitres 3, 4, 5 et 6.

Frees (2010), chapitre 11 (p 305-342)
Greene (2012), sections 17.2 et 17.3 (p 863-715)
de Jong & Heller (2008), chapitre 7

Pour commencer à pratiquer la régression logistique, on utilisera la petite fonction suivante

logit = function(formula, lien="logit", data=NULL) {
glm(formula,family=binomial(link=lien),data)
}

Sinon, la Casualty Actuarial Society a mis en ligne plusieurs documents en ligne sur les arbres de régression (qui sont peu abordés dans les livres mentionnés auparavant),

Derig & Francis (2006) http://casact.org/…
Guszcza (2004) http://casact.org/…
Guszcza (2005) http://casact.org/…
Hadidi (2003) http://casact.org/…

pour une comparaison de toutes les méthodes

Francis (2007) http://casact.org/…
Monsour (2004) http://casact.org/…

Les transparents seront mis en ligne en fin de semaine prochaine. A suivre donc…

Arthur Charpentier

Arthur Charpentier, professor at UQaM in Actuarial Science. Former professor-assistant at ENSAE Paristech, associate professor at Ecole Polytechnique and assistant professor in Economics at Université de Rennes 1. Graduated from ENSAE, Master in Mathematical Economics (Paris Dauphine), PhD in Mathematics (KU Leuven), and Fellow of the French Institute of Actuaries.

More Posts - Website

Follow Me:

↧

R for actuarial science

January 10, 2013, 12:53 pm

≫ Next: Régression logistique et arbres

≪ Previous: Segmentation en tarification, compléments

As mentioned in the Appendix of Modern Actuarial Risk Theory, “R (and S) is the ‘lingua franca’ of data analysis and statistical computing, used in academia, climate research, computer science, bioinformatics, pharmaceutical industry, customer analytics, data mining, finance and by some insurers. Apart from being stable, fast, always up-to-date and very versatile, the chief advantage of R is that it is available to everyone free of charge. It has extensive and powerful graphics abilities, and is developing rapidly, being the statistical tool of choice in many academic environments.”

R is based on the S statistical programming language developed by Joe Chambers at Bell labs in the 80’s. To be more specific, R is an open-source implementation of the S language, developed by Robert Gentlemn and Ross Ihaka. It is a vector based language, which makes it extremely interesting for actuarial computations. For instance, consider some Life Tables,

> TD[39:52,]       > TV[39:52,]
     Age    Lx         Age    Lx
  39  38 95237          38 97753
  40  39 94997          39 97648
  41  40 94746          40 97534
  42  41 94476          41 97413
  43  42 94182          42 97282
  44  43 93868          43 97138
  45  44 93515          44 96981
  46  45 93133          45 96810
  47  46 92727          46 96622
  48  47 92295          47 96424
  49  48 91833          48 96218
  50  49 91332          49 95995
  51  50 90778          50 95752
  52  51 90171          51 95488

Those (French) Life Tables can be found here

> TD <- read.table(
+ "http://perso.univ-rennes1.fr/arthur.charpentier/TD8890.csv",sep=";",header=TRUE)
> TV <- read.table(
+ "http://perso.univ-rennes1.fr/arthur.charpentier/TV8890.csv",sep=";",header=TRUE)

From those vectors, it is possible to construct the matrix of death probabilities, $http://latex.codecogs.com/gif.latex?\boldsymbol{P}=[\text{%20}_{k}p_x]$ , using for instance

>  Lx <- TD$Lx
>  m <- length(Lx)
>  p <- matrix(0,m,m); d <- p
>  for(i in 1:(m-1)){
+  p[1:(m-i),i] <- Lx[1+(i+1):m]/Lx[i+1]
+  d[1:(m-i),i] <- (Lx[(1+i):(m)]-Lx[(1+i):(m)+1])/Lx[i+1]}
>  diag(d[(m-1):1,]) <- 0
>  diag(p[(m-1):1,]) <- 0
>  q <- 1-p

One can compute easily, e.g., the (curtate) expectation of life defined as

$http://latex.codecogs.com/gif.latex?e_x%20=\mathbb{E}(K_x)=\sum_{k=1}^\infty%20k\cdot%20\text{%20}_{k|1}q_x%20=%20\sum_{k=1}^\infty%20\text{%20}_{k}p_x$

and one can compute the vector of life expectancy, at various ages $http://latex.codecogs.com/gif.latex?\boldsymbol{e}=[e_x]$ , as

> life.exp = function(x){sum(p[1:nrow(p),x])}
> e = Vectorize(life.exp)(1:m)

An actually, any kind of actuarial quantity can be derived from those matrices. The expected present value (or actuarial value) of a temporary life annuity-due is, for instance,

$http://latex.codecogs.com/gif.latex?\ddot{a}_{x:\overline{n}|}=\sum_{k=0}^{n-1}%20\nu^k%20\cdot%20{}_{k}p_x%20=\frac{1-A_{x:\overline{n}|}}{1-\nu}$

The code to compute those functions is here

> for(j in 1:(m-1)){ adots[,j]<-cumsum(1/(1+i)^(0:(m-1))*c(1,p[1:(m-1),j])) }

or consider the expected present value of a term insurance

$http://latex.codecogs.com/gif.latex?%20A^1_{x:\overline{n}|}%20=\sum_{k=0}^{n-1}%20\nu^{k+1}%20\cdot%20\text{%20}_{k|}q_x$

with the following code

> for(j in 1:(m-1)){ A[,j]<-cumsum(1/(1+i)^(1:m)*d[,j]) }

Some more details can be found in the first part of the notes of the crash courses of last summer, in Meielisalp. Vector – or matrices – are extremely convenient to work with, when dealing with life contingencies. It is also possible to model prospective mortality. Here, the mortality is not only function of the age , but also time ,

> t(DTF)[1:10,1:10]
    1899  1900  1901  1902  1903  1904  1905  1906  1907  1908
0  64039 61635 56421 53321 52573 54947 50720 53734 47255 46997
1  12119 11293 10293 10616 10251 10514  9340 10262 10104  9517
2   6983  6091  5853  5734  5673  5494  5028  5232  4477  4094
3   4329  3953  3748  3654  3382  3283  3294  3262  2912  2721
4   3220  3063  2936  2710  2500  2360  2381  2505  2213  2078
5   2284  2149  2172  2020  1932  1770  1788  1782  1789  1751
6   1834  1836  1761  1651  1664  1433  1448  1517  1428  1328
7   1475  1534  1493  1420  1353  1228  1259  1250  1204  1108
8   1353  1358  1255  1229  1251  1169  1132  1134  1083   961
9   1175  1225  1154  1008  1089   981  1027  1025   957   885

Thus, we now have a force of mortality matrix $http://latex.codecogs.com/gif.latex?\boldsymbol{\mu}=[\mu_{x,t}]$ , or surface

It is also possible to use R packages to estimate a Lee-Carter model of the mortality rate,

$http://latex.codecogs.com/gif.latex?\log%20\mu%20_{x,t}%20=\alpha%20_{x}%20+\beta%20_{x}%20\cdot%20\kappa_{t}%20+\varepsilon%20_{x,t}$

> library(demography)
> MUH =matrix(DEATH$Male/EXPOSURE$Male,nL,nC)
> POPH=matrix(EXPOSURE$Male,nL,nC)
> BASEH <- demogdata(data=MUH, pop=POPH, ages=AGE, years=YEAR, type="mortality",
+ label="France", name="Hommes", lambda=1)
> RES=residuals(LCH,"pearson")

One can easily study residuals, for instance as a function of the age,

or a function of the year,

Some more details can be found in the second part of the notes of the crash courses of last summer, in Meielisalp.

R is also interesting because of its huge number of libraries, that can be used for predictive modeling. One can easily use smoothing functions in regression, or regression trees,

> TREE = tree((nbr>0)~ageconducteur,data=sinistres,split="gini",mincut = 1)
> age = data.frame(ageconducteur=18:90)
> y1 = predict(TREE,age)
> reg = glm((nbr>0)~bs(ageconducteur),data=sinistres,family="binomial")
> y = predict(reg,age,type="response")

Some practitioners might be scared because the legend claims that R is not as good as SAS to handle large databases. Actually, a lot of functions can be used to import datasets. The most convenient one is probably

> baseCOUT = read.table("http://freakonometrics.free.fr/baseCOUT.csv",
+  sep=";",header=TRUE,encoding="latin1")
>  tail(baseCOUT,4)
     numeropol  debut_pol    fin_pol freq_paiement langue  type_prof alimentation type_territoire
6512     87291 2002-10-16 2003-01-22       mensuel      A Professeur   Vegetarien          Urbain
6513     87301 2002-10-01 2003-09-30       mensuel      A Technicien   Vegetarien          Urbain
6514     87417 2002-10-24 2003-10-21       mensuel      F Technicien   Vegetalien     Semi-urbain
6515     88128 2003-01-17 2004-01-16       mensuel      F     Avocat   Vegetarien     Semi-urbain
             utilisation presence_alarme marque_voiture sexe exposition age duree_permis age_vehicule i   coutsin
6512 Travail-occasionnel             oui           FORD    M  0.2684932  47           29           28 1 1274.5901
6513              Loisir             oui          HONDA    M  0.9972603  44           24           25 1  278.0745
6514 Travail-occasionnel             non     VOLKSWAGEN    F  0.9917808  23            3           11 1  403.1242
6515              Loisir             non           FIAT    F  0.9972603  23            4           11 1  230.9565

But if the dataset is too large, it is also possible to specify which variables might be interesting, using

> mycols = rep("NULL", 18)
> mycols[c(1,4,5,12,13,14,18)] <- NA
> baseCOUTsubC = read.table("http://freakonometrics.free.fr/baseCOUT.csv",
+  colClasses = mycols,sep=";",header=TRUE,encoding="latin1")
> head(baseCOUTsubC,4)
  numeropol freq_paiement langue sexe exposition age    coutsin
1         6        annuel      A    M  0.9945205  42   279.5839
2        27       mensuel      F    M  0.2438356  51   814.1677
3        27       mensuel      F    M  1.0000000  53   136.8634
4        76       mensuel      F    F  1.0000000  42   608.7267

It is also possible (before running a code on the entire dataset) to import only the first lines of the dataset.

> baseCOUTsubCR = read.table("http://freakonometrics.free.fr/baseCOUT.csv",
+  colClasses = mycols,sep=";",header=TRUE,encoding="latin1",nrows=100)
> tail(baseCOUTsubCR,4)
    numeropol freq_paiement langue sexe exposition age   coutsin
97       1193       mensuel      F    F  0.9972603  55  265.0621
98       1204       mensuel      F    F  0.9972603  38 9547.7267
99       1231       mensuel      F    M  1.0000000  40  442.7267
100      1245        annuel      F    F  0.6767123  48  179.1925

It is also possible to import a zipped file. The file itself has a smaller size, and it can usually be imported faster.

> import.zip = function(file){
+ temp = tempfile()
+ download.file(file,temp);
+ read.table(unz(temp, "baseFREQ.csv"),sep=";",header=TRUE,encoding="latin1")}
> system.time(import.zip("http://freakonometrics.free.fr/baseFREQ.csv.zip"))
trying URL 'http://freakonometrics.free.fr/baseFREQ.csv.zip'
Content type 'application/zip' length 692655 bytes (676 Kb)
opened URL
==================================================
downloaded 676 Kb
   user  system elapsed 
      0.762       0.029       4.578 
> system.time(read.table("http://freakonometrics.free.fr/baseFREQ.csv", 
+ sep=";",header=TRUE,encoding="latin1"))
   user  system elapsed 
      0.591       0.072       9.277

Finally, note that it is possible to import any kind of dataset, not only a text file. Even a Microsoft Excel folder. On a Windows computer, one can use SQL queries

> sheet = "c:\\Documents and Settings\\user\\excelsheet.xls"
> connection = odbcConnectExcel(sheet)
> spreadsheet = sqlTables(connection)
> query = paste("SELECT * FROM",spreadsheet$TABLE_NAME[1],sep=" ")
> result = sqlQuery(connection,query)

Then, once the dataset is imported, several functions can be used,

> cost = aggregate(coutsin~ AgeSex,mean, data=baseCOUT)
> frequency = merge(aggregate(nbsin~ AgeSex,sum, data=baseFREQ),
+ aggregate(exposition~ AgeSex,sum, data=baseFREQ))
> frequency$freq = frequency$nbsin/frequency$exposition
> base.freq.cost = merge(frequency, cost)

Finally, R is interesting for its graphical interface. “If you can picture it in your head, chances are good that you can make it work in R. R makes it easy to read data, generate lines and points, and place them where you want them. Its very flexible and super quick. When youve only got two or three hours until deadline, R can be brilliant” as said Amanda Cox, a graphics editor at the New York Times. “R is particularly valuable in deadline situations when data is scant and time is precious.”.
Several cases were considered on the blog http ://chartsnthings.tumblr.com/…. First, we start with a simple graph, here State Government control in the US

Then try to find a nice visual representation, e.g.

And finally, you can just print it in your favorite newspaper,

And you can get any kind of graphs,

And not only about politics,

Graphs are important. “Its not just about producing graphics for publication. Its about playing around and making a bunch of graphics that help you explore your data. This kind of graphical analysis is a really useful way to help you understand what you’re dealing with, because if you cant see it, you cant really understand it. But when you start graphing it out, you can really see what you’ve got” as said Peter Aldhous, San Francisco bureau chief of New Scientist magazine. Even for actuaries. “The commercial insurance underwriting process was rigorous but also quite subjective and based on intuition. R enables us to communicate our analytic results in appealing and innovative ways to non-technical audiences through rapid development lifecycles. R helps us show our clients how they can improve their processes and effectiveness by enabling our consultants to conduct analyses efficiently”, as explained by John Lucker, team of advanced analytics professionals at Deloitte Consulting Principal, in http://blog.revolutionanalytics.com/r-is-hot/. See also Andrew Gelman’s view, on graphs, http://www.stat.columbia.edu/…

So yes, actuaries might be interested to use R for actuarial communication, as mentioned in http ://www.londonr.org/…

The Actuarial Toolkit (see http ://www.actuaries.org.uk/…) stresses the interest of R, “The power of the language R lies with its functions for statistical modelling, data analysis and graphics ; its ability to read and write data from various data sources; as well as the opportunity to embed R in excel or other languages like VBA. In the way SAS is good for data manipulations, R is superior for modelling and graphical output“.

From 2011, Asia Capital Reinsurance Group (ACR) uses R to Solve Big Data Challenges (see http ://www.reuters.com/…). And Lloyd’s uses motion charts created with R to provide analysis to investors (as discussed on http ://blog.revolutionanalytics.com/…)

A lot of information can be found on http ://jeffreybreen.wordpress.com/…

Markus Gesmann mentioned on his blog a lot of interesting graphs used for actuarial reporting, http ://lamages.blogspot.ca/…

Further, R is free. Which can be compared with SAS, $6,000 per PC, or $28,000 per processor on a server (as mentioned on http ://en.wikipedia.org/…)

It is also becoming more and more popular, as a programming language. As mentioned on this month Transparent Language Popularity (see http ://lang-index.sourceforge.net/), R is ranked 12. Far away after C or Java, but before Matlab (22) or SAS (27). On StackOverFlow (see http ://stackoverflow.com/) is also far being C++ (399,232 occurrences) or Java (348,418), but with 21,818 occurrences, it appears before Matlab (14,580) and SAS (899). As mentioned on http ://r4stats.com/articles/popularity/ R is becoming more and more popular, on listserv discussion traffic

It is clearly the most popular software in data analysis, as mentioned by the Rexer Analytics survey, in 2009

What about actuaries ? In a survey (see http ://palisade.com/…), R was not extremely popular.

If we consider only statistical softwares, SAS is still far ahead, among UK and CAS actuaries

But, as mentioned by Mike King, Quantitative Analyst, Bank of America, “I cant think of any programming language that has such an incredible community of users. If you have a question, you can get it answered quickly by leaders in the field. That means very little downtime.” This was also mentioned by Glenn Meyers, in the Actuarial Review “The most powerful reason for using R is the community” (in http ://nytimes.com/…). For instance, http ://r-bloggers.com/ has contributions from more than 425 R users.

As said by Bo Cowgill, from Google “The best thing about R is that it was developed by statisticians. The worst thing about R is that it was developed by statisticians.”

Arthur Charpentier

More Posts - Website

Follow Me:

↧

Régression logistique et arbres

January 21, 2013, 7:26 am

≫ Next: Régression de Poisson, et biais minimal

≪ Previous: R for actuarial science

Pour le cours de mercredi prochain, la base utilisée sera une base tirée du livre de Jed Frees, http://instruction.bus.wisc.edu/jfrees/…

> baseavocat=read.table("http://freakonometrics.free.fr/AutoBI.csv",header=TRUE,sep=",")
> tail(baseavocat)
     CASENUM ATTORNEY CLMSEX MARITAL CLMINSUR SEATBELT CLMAGE  LOSS
1335   34204        2      2       2        2        1     26 0.161
1336   34210        2      1       2        2        1     NA 0.576
1337   34220        1      2       1        2        1     46 3.705
1338   34223        2      2       1        2        1     39 0.099
1339   34245        1      2       2        1        1     18 3.277
1340   34253        2      2       2        2        1     30 0.688

On dispose d’une variable dichotomique indiquant si un assuré – suite à un accident de la route – a été représenté par un avocat (1 si oui, 2 si non). On connaît le sexe de l’assuré (1 pour les hommes et 2 pour les femmes), le statut marital (1 s’il est marié, 2 s’il est célibataire, 3 pour un veuf, et 4 pour un assuré divorcé). On sait aussi si l’assuré portait ou non une ceinture de sécurité lorsque l’accident s’est produit (1 si oui, 2 si non et 3 si l’information n’est pas connue). Enfin, une information pour savoir si le conducteur du véhicule était ou non assuré (1 si oui, 2 si non et 3 si l’information n’est pas connue). On va recoder un peu les données afin de les rendre plus claires à lire.

Les transparents du cours sont en ligne ici,

Sur les arbres de régression, je mettrais en ligne un billet, afin d’illustrer la méthode. En attendant des compléments théoriques peuvent se trouver en ligne http://genome.jouy.inra.fr/…, http://ensmp.fr/…, ou http://ujf-grenoble.fr/… (pour information, nous ne verrons que la méthode CART). Je peux renvoyer au livre (et au blog) de Stéphane Tuffery, ou (en anglais) au livre de Richard Berk, dont un résumé se trouve en ligne sur http://crim.upenn.edu/….

La semaine suivante, nous aborderons la régression de Poisson, les méthodes de biais minimal, et introduirons les modèles linéaires généralisés. Je renvoie au chapitre sur la tarification a priori du Denuit & Charpentier (2005), aux chapitres 12 et 13 de Frees (2010) ou encore les chapitres 5 et 6 du De Jong & Heller (2008). Pour les plus curieux qui veulent comprendre les liens entre les modèles linéaires généralisés et la tarification par crédibilité, je renvoie à l’article de Klinker (2010)

Arthur Charpentier

More Posts - Website

Follow Me:

↧

Régression de Poisson, et biais minimal

January 25, 2013, 1:09 pm

≫ Next: Regression tree using Gini’s index

≪ Previous: Régression logistique et arbres

Lors du prochain cours d’actuariat, nous allons finir les arbres de régression, et introduire la régression de Poisson. Les transparents sont en ligne ici,

Je vais présenter la régression en Poisson, en faisant un parallèle avec la régression logistique, la session suivante portera sur la généralisation obtenue avec les modèles linéaires généralisés. Sur la régression de Poisson, je suggère de lire Frees (2010) chapitre 12 (p 343-361), Greene (2012), section 18.3 (p 802-828) ou encore de Jong Heller (2008) chapitre 6. Sur les méthodes de biais minimal, de Jong Heller (2008), section 1.3 et l’article de Sholom Feldblum, http://www.casact.org/…. Sur le passage de ces dernières méthodes (introduites par Robert Bailey dans les années 60, http://www.casact.org/… et http://www.casact.org/…), je recommande la lecture de l’article de Ben Zehnwirth, Ratemaking From Bailey and Simon (1960) to Generalized Linear Regression Models, en ligne sur http://www.casact.org/…

Comme annoncé au premier cours, j’essaye de mettre en ligne les transparents au fur et à mesure, mais j’avais pris l’habitude d’écrire au tableau ces dernières années. Il faut donc que je tape tout. Pour le devoir un courriel sera envoyé d’ici la fin de semaine à tous les groupes qui se sont inscrits.

Arthur Charpentier

More Posts - Website

Follow Me:

↧

Regression tree using Gini’s index

January 26, 2013, 10:39 pm

≫ Next: The law of small numbers

≪ Previous: Régression de Poisson, et biais minimal

In order to illustrate the construction of regression tree (using the CART methodology), consider the following simulated dataset,

> set.seed(1)
> n=200
> X1=runif(n)
> X2=runif(n)
> P=.8*(X1<.3)*(X2<.5)+
+   .2*(X1<.3)*(X2>.5)+
+   .8*(X1>.3)*(X1<.85)*(X2<.3)+
+   .2*(X1>.3)*(X1<.85)*(X2>.3)+
+   .8*(X1>.85)*(X2<.7)+
+   .2*(X1>.85)*(X2>.7) 
> Y=rbinom(n,size=1,P)  
> B=data.frame(Y,X1,X2)

with one dichotomos varible (the variable of interest, ), and two continuous ones (the explanatory ones and ).

> tail(B)
    Y        X1        X2
195 0 0.2832325 0.1548510
196 0 0.5905732 0.3483021
197 0 0.1103606 0.6598210
198 0 0.8405070 0.3117724
199 0 0.3179637 0.3515734
200 1 0.7828513 0.1478457

The theoretical partition is the following

Here, the sample can be plotted below (be careful, the first variate is on the y-axis above, and the x-axis below) with blue dots when equals one, and red dots when is null,

> plot(X1,X2,col="white")
> points(X1[Y=="1"],X2[Y=="1"],col="blue",pch=19)
> points(X1[Y=="0"],X2[Y=="0"],col="red",pch=19)

In order to construct the tree, we need a partition critera. The most standard one is probably Gini’s index, which can be writen, when ‘s are splited in two classes, denoted here

L'image “http://perso.univ-rennes1.fr/arthur.charpentier/latex/arbre-comp-04.png” ne peut être affichée car elle contient des erreurs.

or when ‘s are splited in three classes, denoted

etc. Here, are just counts of observations that belong to partition such that takes value . But it is possible to consider other criteria, such as the chi-square distance,

where, classically

http://perso.univ-rennes1.fr/arthur.charpentier/latex/arbre-comp-02.png

when we consider two classes (one knot) or, in the case of three classes (two knots)

http://perso.univ-rennes1.fr/arthur.charpentier/latex/arbre-comp-05.png

Here again, the idea is to maximize that distance: the idea is to discriminate, so we want samples as not independent as possible. To compute Gini’s index consider

> GINI=function(y,i){
+ T=table(y,i)
+ nx=apply(T,2,sum)
+ pxy=T/matrix(rep(nx,each=2),2,ncol(T))
+ vxy=pxy*(1-pxy)
+ zx=apply(vxy,2,sum)
+ n=sum(T)
+ -sum(nx/n*zx)
+ }

We simply construct the contingency table, and then, compute the quantity given above. Assume, first, that there is only one explanatory variable. We split the sample in two, with all possible spliting values , i.e.

Then, we compute Gini’s index, for all those values. The knot is the value that maximizes Gini’s index. Once we have our first knot, we keep it (call it, from now on ). And we reiterate, by seeking the best second choice: given one knot, consider the value that splits the sample in three, and give the highest Gini’s index, Thus, we consider either the following partition

or this one

I.e. we cut either below, or above the previous knot. And we iterate. The code can be something like that,

> X=X2
> u=(sort(X)[2:n]+sort(X)[1:(n-1)])/2
> knot=NULL
> for(s in 1:4){
+ vgini=rep(NA,length(u))
+ for(i in 1:length(u)){
+ kn=c(knot,u[i])
+ F=function(x){sum(x<=kn)}
+ I=Vectorize(F)(X)
+ vgini[i]=GINI(Y,I)
+ }
+ plot(u,vgini)
+ k=which.max(vgini)
+ cat("knot",k,u[k],"\n")
+ knot=c(knot,u[k])
+ u=u[-k]
+ }
knot 69 0.3025479 
knot 133 0.5846202 
knot 72 0.3148172 
knot 111 0.4811517

At the first step, the value of Gini’s index was the following,

which was maximal around 0.3. Then, this value is considered as fixed. And we try to construct a partition in three parts (spliting either below or above 0.3). We get the following plot for Gini’s index (as a function of this second knot)

which is maximum when the split the sample around 0.6 (which becomes our second knot). Etc. Now, let us compare our code with the standard R function,

> tree(Y~X2,method="gini")
node), split, n, deviance, yval
      * denotes terminal node

 1) root 200 49.8800 0.4750  
   2) X2 < 0.302548 69 12.8100 0.7536 *
   3) X2 > 0.302548 131 28.8900 0.3282  
     6) X2 < 0.58462 65 16.1500 0.4615  
      12) X2 < 0.324591 7  0.8571 0.1429 *
      13) X2 > 0.324591 58 14.5000 0.5000 *
     7) X2 > 0.58462 66 10.4400 0.1970 *

We do obtain similar knots: the first one is 0.302 and the second one 0.584. So, constructing tree is not that difficult…

Now, what if we consider our two explanatory variables? The story remains the same, except that the partition is now a bit more complex to write. To find the first knot, we consider all values on the two components, and again, keep the one that maximizes Gini’s index,

> n=nrow(B)
> u1=(sort(X1)[2:n]+sort(X1)[1:(n-1)])/2
> u2=(sort(X2)[2:n]+sort(X2)[1:(n-1)])/2
> gini=matrix(NA,nrow(B)-1,2)
> for(i in 1:length(u1)){
+ I=(X1<u1[i])
+ gini[i,1]=GINI(Y,I)
+ I=(X2<u2[i])
+ gini[i,2]=GINI(Y,I)
+ }
> mg=max(gini)
> i=1+sum(mg==max(gini[,2]))
> par(mfrow = c(1, 2))
> plot(u1,gini[,1],ylim=range(gini),col="green",type="b",xlab="X1",ylab="Gini index")
> abline(h=mg,lty=2,col="red")
> if(i==1){points(u1[which.max(gini[,1])],mg,pch=19,col="red")
+          segments(u1[which.max(gini[,1])],mg,u1[which.max(gini[,1])],-100000)}
> plot(u2,gini[,2],ylim=range(gini),col="green",type="b",xlab="X2",ylab="Gini index")
> abline(h=mg,lty=2,col="red")
> if(i==2){points(u2[which.max(gini[,2])],mg,pch=19,col="red")
+          segments(u2[which.max(gini[,2])],mg,u2[which.max(gini[,2])],-100000)}
> u2[which.max(gini[,2])]
[1] 0.3025479

The graphs are the following: either we split on the first component (and we obtain the partition on the right, below),

or we split on the second one (and we get the following partition),

Here, it is optimal to split on the second variate, first. And actually, we get back to the one-dimensional case discussed previously: as expected, it is optimal to split around 0.3. This is confirmed with the code below,

> library(tree)
> arbre=tree(Y~X1+X2,data=B,method="gini")
> arbre$frame[1:4,]
     var   n       dev      yval splits.cutleft splits.cutright
1     X2 200 49.875000 0.4750000      <0.302548       >0.302548
2     X1  69 12.811594 0.7536232      <0.800113       >0.800113
4 <leaf>  57  8.877193 0.8070175                               
5 <leaf>  12  3.000000 0.5000000

For the second knot, four cases should be considered: spliting on the second variable (again), either above, or below the previous knot (see below on the left) or spliting on the first one. Then whe have wither a partition below or above the previous knot (see below on the right),

Etc. To visualize the tree, the code is the following

> plot(arbre)
> text(arbre)
> partition.tree(arbre)

Note that we can also visualize the partition. Nice, isn’t it?

To go further, the book Classification and Regression Trees by Leo Breiman (and co-authors) is awesome. Note that there are also interesting sections in the bible Elements of Statistical Learning: Data Mining, Inference, and Prediction by Trevor Hastie, Robert Tibshirani and Jerome Friedman (which can be downloaded from http://www.stanford.edu/~hastie/…)

Arthur Charpentier

More Posts - Website

Follow Me:

↧

The law of small numbers

January 28, 2013, 7:55 am

≫ Next: Introduction aux modèles linéaires généralisés

≪ Previous: Regression tree using Gini’s index

In insurance, the law of large numbers (named loi des grands nombres initially by Siméon Poisson, see e.g. http://en.wikipedia.org/…) is usually mentioned to legitimate large portfolios, because of pooling and diversification: the larger the pool, the more ‘predictable’ the losses will be (in a given period). Of course, under standard statistical assumption, namely finite expected value, and independence (see http://freakonometrics.blog.free.fr/…. for a discussion, in French). Since in insurance, catastrophes are usually rare – and extremely costly – and actuaries might be interested to model occurrence of that small number of events (see e.g. Aldous’ book on that specific topic, that can be downloaded from http://stat.berkeley.edu/…). The theorem behind is sometimes called the law of small numbers (from the book published by Ladislaus Bortkiewicz, but we’ll get back to that story later on, see also Whitaker (1914) http://biomet.oxfordjournals.org/… or the book recently published by Michael Falk, Jürg Hüsler and Rolf-Dieter Reiss).

The Poisson distribution

The so-called Poisson distribution (see http://en.wikipedia.org/…) was introduced by Siméon Poisson in 1837 (in Recherches sur la Probabilité des Jugements en Matière Criminelle et en Matière Civile, Précédées des Règles Générales du Calcul des Probabilités, see http://gallica.bnf.fr/…). But it had been defined more than a century before, by Abraham De Moivre, in 17111, in De Mensura Sortis seu; de Probabilitate Eventuum in Ludis a Casu Fortuito Pendentibus (see e.g. the review in http://www.jstor.org/…). Let denote a counting random variable, then it said to be Poisson distributed if there is $http://latex.codecogs.com/gif.latex?\lambda\in(0,\infty)$ such that

$http://latex.codecogs.com/gif.latex?\mathbb{P}(N=k)=e^{-\lambda}\frac{\lambda^k}{k!},\forall%20k\in\mathbb{N}$

De Moivre obtained that distribution from an approximation of the binomial distribution. Recall that the binomial distribution is a standard distribution in actuarial science, for instance to model the number of deaths among insured. If individual death probabilities are identical, say , and if deaths are independent events, then

$http://latex.codecogs.com/gif.latex?\mathbb{P}(N=k)=\binom{n}{k}p^k(1-p)^{n-k},\forall%20k\in\{0,1,\cdots,n\}$
And if $http://latex.codecogs.com/gif.latex?n\rightarrow\infty$ and $http://latex.codecogs.com/gif.latex?np\rightarrow%20\lambda$ , then

$http://latex.codecogs.com/gif.latex?\mathbb{P}(N=k)\rightarrow%20e^{-\lambda}\frac{\lambda^k}{k!}$ Again, this is an asymptotic theorem, which is valid when we have a lot of observations ( $http://latex.codecogs.com/gif.latex?n\rightarrow\infty$ ), but also, the probability of occurrence should be extremely small (since $http://latex.codecogs.com/gif.latex?p\sim\lambda/n$ ), which is why to use the term small numbers. Siméon Poisson was not interested by mathematical approximations: his main point was to get a distribution with nice goodness of fit properties for the data he was working on. He wanted to get a better understanding of cours d’assises (jury panel, might be a valid translation of the French term). A jury consists of 12 jurors who voted to determine whether a defendant was guilty. When guilt was predominant, with at least 8 votes against 4, the defendant was convicted (which was 47% of criminal cases). 5 with 7 votes against, the opinion of professional judges was requested (11% of criminal trials again). Using these statistics we can demonstrate that a defendant brought before an assize court is guilty of the order of 68%, and the probability that a juror is not wrong by voting (condemning an innocent or releasing a culprit) was about 54%. He sought to calculate the probability that a defendant is wrongfully convicted, and gets 2%. And 28% of exonerated defendants are in fact guilty. Siméon Poisson introduced this law to get probabilities easily. But the law he considered is central in probability….

The law of small numbers

The heuristic of the main theorem, related to the Poisson distribution is the following: let $http://latex.codecogs.com/gif.latex?X_1,%20\cdots,X_n$ denote i.i.d random variables taking values in $http://latex.codecogs.com/gif.latex?%20\mathbb{R}^d$ (in a general setting, one component can be the time, the other one an upper region of interest, where some stochastic process might be). Let $http://latex.codecogs.com/gif.latex?\mathcal{A}_n\subset\mathbb{R}^d$ . If $http://latex.codecogs.com/gif.latex?\mathbb{P}(X_i%20\in%20\mathcal{A}_n)\rightarrow%200$ as $http://latex.codecogs.com/gif.latex?n\rightarrow\infty$ (or $http://latex.codecogs.com/gif.latex?\mathbb{P}(X_i%20\in%20\mathcal{A}_n)=O(n^{-1})$ to be a little bit more specific about the assumptions), let denote the (random variable characterizing) count of events $http://latex.codecogs.com/gif.latex?\{X_i%20\in%20\mathcal{A}_n\}$ , then can be approximated by a Poisson distribution with parameter $http://latex.codecogs.com/gif.latex?\lambda%20=n%20\times%20\mathbb%20P(X_i%20\in%20\mathcal{A}_n)$ .
The heuristic is that if we consider a large number of observations, and if we count how many are in a given (small) region, then the number of such observations is Poisson distributed.

n=1000
X=runif(n)*10-1.5
Y=runif(n)*10-1.5
plot(X,Y,axis=FALSE,cex=.6)
u=seq(-1,1,by=.01)
v=sqrt(1-u^2)
polygon(c(u,rev(u)),c(v,rev(-v)),col="yellow",border=NA)
I=(X^2+Y^2)<1
points(X[I],Y[I],cex=.6,pch=19,col="red")

If we run some simulations,

>  n=1000
>  ns=100000
>  N=rep(NA,ns)
> for(s in 1:ns){
+ X=runif(n)*10-1.5
+ Y=runif(n)*10-1.5
+ I=(X^2+Y^2)<1
+ N[s]=sum(I)
+ }
> hist(N,breaks=0:60,probability=TRUE,col="yellow")
> mean(N)
[1] 31.41257

The parameter of the Poisson distribution is the area of the yellow disk, over the area of the square, i.e.

> (lambda=10*pi)
[1] 31.41593
> lines(0:60-.5,dpois(0:60,lambda),type="b",col="red")

To get an interpretation related to insurance modeling, let $http://latex.codecogs.com/gif.latex?\mathcal{A}$ denote an upper layer in a reinsurance contract, i.e. $http://latex.codecogs.com/gif.latex?\mathcal{A}=\{x%3Ed\}$ for some deductible . Let ‘s denote individual losses. Then the number of claims that hit this upper layer can be modeled with a Poisson distribution. More precisely, if deductible becomes extremely large (and $http://latex.codecogs.com/gif.latex?\mathbb{P}(X_i%20\in%20\mathcal{A})\rightarrow%200$ ), we obtain the point-over-threshold model in extreme value theory (see e.g. http://brale.math.hr/~iugrina/… or http://fire.nist.gov/bfrlpubs/…): if has a Poisson distribution and, conditionally on , $http://latex.codecogs.com/gif.latex?X_1,\cdots,X_N$ are independent identically distributed generalized Pareto random variables, then $http://latex.codecogs.com/gif.latex?\max\{X_1,\cdots,X_N\}$ has the generalized extreme value distribution. Thus, exceedances models (for rare events) are closely related to Poisson processes.

The Poisson process

As mentioned above, the Poisson distribution appears when events occur somehow randomly and independently, over time. It is then natural to study the time between two occurences (or two claims, in an insurance context).

Poisson distribution, and claims occurrence

It is neither Siméon Poisson nor De Moivre, but Ladislaus Von Bortkiewicz who first mentioned the Poisson distribution as the law of small numbers. In 1898 (see http://archive.org/…), he studied the number number of soldiers killed by being kicked by a horse, from 1875 till 1894, in 200 corps (more precisely 10 corps over 20 ans).

He did obtain the following distribution (here, the parameter of the Poisson distribution is 0.61, i.e. the average number of death per year)

number of death per year	Empirical counts	Poisson distribution
0	109	108.67
1	65	66.21
2	22	20.22
3	3	4.11
4	1	0.63
5 and more	0	0.08

It is possible to find a lot of cases where the Poisson distribution fits extremely well. For instance, if we consider the number of hurricanes, that landed in Florida after 1850,

number of hurricanes per year	empirical frequency	Poisson frequency
0	30	27.16
1	48	47.99
2	37	42.41
3	29	24.98
4	8	11.03
5	3	3.90
6	3	1.15
7	1	0.29
8 and more	0	0.08

Poisson distribution, and return period

The return period was introduced by Emil Gumbel, in hydrology, to link probabilities and durations (see e.g. http://freakonometrics.blog.free.fr/…). A decennial event has an occurence probability of 1/10. 10 is then the average waiting time before occurence. This does not mean that the event will not occur before 10 years, or has to occur before 10 years. Consider a return period (in years), then the yearly probability of non-occurrence is .

And the probability of non-occurence over years is then . It is standard to summarize this property with the following table,

	return period
Number of years () without catastrophes		10	20	50	100	200
	10	65.1%	40.1%	18.3%	9.6%	4.9%
	20	87.8%	64.2%	33.2%	18.2%	9.5%
	50	99.5%	92.3%	63.6%	39.5%	22.5%
	100	99.9%	99.4%	86.7%	63.4%	39.5%
	200	99.9%	99.9%	98.2%	86.6%	63.3%

The diagonal in the table above is extremely interesting. It looks like there is some kind of convergence towards a limiting value (here 63.2%). Indeed, the number of events observed over n years have a Binomial distribution, with probability , which will converge towards the Poisson distribution with parameter 1. The probability of not having a catastrophe is then $http://latex.codecogs.com/gif.latex?1-\exp(-1)$ , which is equal to 0.632.

Rare probabilities and the Poisson distribution

The Poisson distribution keeps appearing when computing probabilies of rare events. For instance, the probability to have at least one incident in a nuclear plant in France, over a 50 year period. Assume that the annual probability of an incident in a reactor is small, e.g. 0.05%. Assume further that reactors are independent among them, and in time. The probability to have an incident over 80 reactors in 50 years is (exactly)

$http://latex.codecogs.com/gif.latex?\mathbb{P}(N\neq0)=1-(1-p)^{50%20\times%2080}$

Of course, a linear approximation is not correct (even if it was mentioned in some French newspaper, as explained in an old post http://freakonometrics.blog.free.fr/…)

$http://latex.codecogs.com/gif.latex?\mathbb%20P(N\neq%200)\neq%2050\times%2080\times%20p$

On the other hand

$http://latex.codecogs.com/gif.latex?\mathbb%20P(N\neq 0)=1-(1-p)^{50\times80%20}%20\sim1-\exp\left(-50\times80\times%20p%20\right)$

> p=0.00005
> 1-(1-p)^(50*80)
[1] 0.1812733
> 1-exp(-50*80*p)
[1] 0.1812692

which is the probability that is null when has a Poisson distribution with parameter $http://latex.codecogs.com/gif.latex?\lambda=50\times80\times%20p$ . We clearly see here an application of De Moivre’s approximation in risk management.

Another way of looking at this problem is based on the following idea: given the fact that in 45 years of observations on 450 reactors worldwide (roughly), three major accidents were observed including Three Mile Island (1979) and Fukushima (2011), i.e. the average time between accidents can be estimated at 16 years. For a single reactor, we can assume that the average time to wait before an incident is 450 times 16 years, i.e 7200 years. Or the probability to have one incident, over one year, for one reactor is 1 over 7200 (this is the idea behind the return period concept). If we assume that the arrival of accidents occurs randomly and independently of each other (as defined above) then the number of major accidents observed over a period of 50 years in France follows a Poisson distribution with parameter 50 / (7200/80). Also, the probability of having no major accident over 50 years, with 80 reactors can be estimated by

$http://latex.codecogs.com/gif.latex?1-\exp(-50\times%2080/7200)$

i.e.

> 1-exp(-50*80/7200)
[1] 0.4262466

(keeping in mind all the uncertainty around the estimated waiting time before a major accident to a single reactor!).

Arthur Charpentier

More Posts - Website

Follow Me:

↧

Introduction aux modèles linéaires généralisés

January 29, 2013, 3:07 am

≫ Next: Base de données de tarification

≪ Previous: The law of small numbers

J’ai un peu d’avance dans le cours. Je vais mettre en ligne les transparents pour la semaine prochaine (normalement), où nous aborderons la classe des modèles linéaires généralisés. Les transparents sont en ligne ici.

Je n’ai pas mis de section sur lesGeneralized Additive Models, on se contentera de la section sur le lissage évoquée à la fin des transparents sur la modélisation de la fréquence. Afin de légitimer les méthodes de lissage (sur l’âge de l’assuré en particulier), je renvoie vers un graphique produit il y a plusieurs années par un cabinet de conseil, qui notait que la forme de la fonction de lissage, liant l’âge à la fréquence est identique, dans tous les pays,

Mais je pense que je ferais un billet dédié au lissage, dans la problématique de la tarification en assurance IARD.

Arthur Charpentier

More Posts - Website

Follow Me:

↧

Base de données de tarification

January 30, 2013, 1:26 pm

≫ Next: Overdispersion with different exposures

≪ Previous: Introduction aux modèles linéaires généralisés

Pour compléter le cours de ce matin, un mot rapide sur les bases, et plus particulièrement la base de contrats. Au sujet des variables,

densite est la densité de population dans la commune où habite le conducteur principal,
zone : zone A B C D E ou F, selon la densité en nombre d’habitants par km2 de la commune de résidence (A =”1-50″, B=”50-100″, C=”100-500″, D=”500-2,000″, E=”2,000-10,000″, F=”10,000+”.

A titre d’information, la répartition de la population en France se fait de la manière suivante

marque : marque du véhicule selon la table suivante (1 Renault Nissan; 2 Peugeot Citroën ; 3 Volkswagen Audi Skoda Seat ; 4 Opel GM; 5 Ford ; 6 Fiat ; 10 Mercedes Chrysler ; 11 BMW Mini ;12 Autres japonaises et coréennes ; 13 Autres européennes ; 14 Autres marques et marques inconnues). Cette variable n’est pas une variable numérique
region : code à 2 chiffres (ce qui n’est pas une valeur numérique) donnant les 22 régions françaises (code INSEE), soit géographiquement

ageconducteur : âge du conducteur principal en début de la couverture,
agevehicule : âge du véhicule en début de période.

Je demande de ne pas utiliser la variable de bonus, qui fait intervenir une information utilisée en tarification a posteriori (qui ne fait pas l’objet de ce cours).

Arthur Charpentier

More Posts - Website

Follow Me:

↧

Overdispersion with different exposures

February 1, 2013, 8:17 am

≫ Next: Fréquence de sinistres, et surdispersion

≪ Previous: Base de données de tarification

In actuarial science, and insurance ratemaking, taking into account the exposure can be a nightmare (in datasets, some clients have been here for a few years – we call that exposure – while others have been here for a few months, or weeks). Somehow, simple results because more complicated to compute just because we have to take into account the fact that exposure is an heterogeneous variable.

The exposure in insurance ratemaking can be seen as a problem of censored data (in my dataset, the exposure is always smaller than 1 since observations are contracts, not policyholders),

the number of claims on the period is unobserved
the number of claims on is observed (as well as )

And as always, the variable of interest is the unobserved one, because we have to price insurance contract with a cover period of one (full) year. So we have to model the yearly frequency of insurance claims.

In our dataset, we have ‘s – or more generally also some additional covariates $http://latex.codecogs.com/gif.latex?(Y_i,E_i,\boldsymbol{X}_i)$ ‘s. For ratemaking, we need to estimate $http://latex.codecogs.com/gif.latex?\mathbb{E}(N\vert\boldsymbol{X}=\boldsymbol{x})$ and perhaps also $http://latex.codecogs.com/gif.latex?\text{Var}(N|\boldsymbol{X}=\boldsymbol{x})$ (for instance to test if the Poisson assumption is valid, or not). To estimate the expected value, a natural estimate for $http://latex.codecogs.com/gif.latex?\mathbb{E}(N)$ (forget about covariates as a start) is
$http://latex.codecogs.com/gif.latex?m_N=\frac{\sum_{i=1}^n%20Y_i}{\sum_{i=1}^n%20E_i}$
which is also the weight average of annualized individual counts
$http://latex.codecogs.com/gif.latex?m_N=\sum_{i=1}^n%20\frac{%20E_i}{\sum_{i=1}^n%20E_i}%20\cdot%20\frac{Y_i}{E_i}$
We consider the ratio of the total number of claims to the total exposure-to-
risk. This estimate appears for instance if we consider a Poisson process, so that $http://latex.codecogs.com/gif.latex?N\sim\mathcal{P}(\lambda)$ while $http://latex.codecogs.com/gif.latex?Y\sim\mathcal{P}(\lambda%20\cdot%20E)$ . Then, the likelihood is

$http://latex.codecogs.com/gif.latex?\mathcal{L}(\lambda,\boldsymbol{Y},\boldsymbol{E})=\prod_{i=1}^n%20\frac{e^{-\lambda%20E_i}%20[\lambda%20E_i]^{Y_i}}{Y_i!}$

i.e.

$http://latex.codecogs.com/gif.latex?\log%20\mathcal{L}(\lambda,\boldsymbol{Y},\boldsymbol{E})%20=%20-\lambda%20\sum_{i=1}^n%20E_i%20+\sum_{i=1}^n%20Y_i%20\log[\lambda%20E_i]%20-%20\log\left(\prod_{i=1}^n%20Y_i!\right)$

The first order condition is here

$http://latex.codecogs.com/gif.latex?\frac{\partial}{\partial%20\lambda}\log%20\mathcal{L}(\lambda,\boldsymbol{Y},\boldsymbol{E})%20=%20%20-%20\sum_{i=1}^n%20E_i%20+\frac{1}{\lambda}\sum_{i=1}^n%20Y_i%20=0$

which is satisfied if

$http://latex.codecogs.com/gif.latex?\widehat{\lambda}=\frac{\sum_{i=1}^n%20Y_i}{\sum_{i=1}^n%20E_i}$

So, we do have an estimator for the expected value, and a natural estimator for $http://latex.codecogs.com/gif.latex?\mathbb{E}(N\vert\boldsymbol{X}=\boldsymbol{x})$ is then (if we consider categorical covariates)
$http://latex.codecogs.com/gif.latex?m_{N|\boldsymbol{x}}%20=\frac{\sum_{i,\boldsymbol{X}_i=\boldsymbol{x}}%20Y_i}{\sum_%20{i,\boldsymbol{X}_i=\boldsymbol{x}}%20E_i}$

Now, we need an estimate for the variance, or more precisely the conditional variable. Assume (as a starting point) that all have the same exposure . For instance, if is one half, insured were observed only the first six months. Then with $http://latex.codecogs.com/gif.latex?Y\overset{\mathcal%20L}{=}Y%27$ ( is the number of claims on the first six months, while are the number of claims on the last six months), i.e. $http://latex.codecogs.com/gif.latex?\text{Var}(N)=\text{Var}(Y)+%20\text{Var}(Y%27)$ if we assume independent increments. I.e.
$http://latex.codecogs.com/gif.latex?\text{Var}(N)=2\text{Var}(Y)$ , or conversely $http://latex.codecogs.com/gif.latex?E%20\cdot\text{Var}(N)=\text{Var}(Y)$ . More generally, it is reasonable to assume that

$http://latex.codecogs.com/gif.latex?\text{Var}(Y)=E\cdot%20\text{Var}(N)$
for all values of . And then
$http://latex.codecogs.com/gif.latex?\text{Var}\left(\frac{Y}{E}\right)=\frac{1}{E}\cdot%20\text{Var}(N)$
Thus, it seems legitimate to assume that the empirical variance of can be written
$http://latex.codecogs.com/gif.latex?S_N^2=E\cdot%20S_{Y/E}^2$
Since the average of is $http://latex.codecogs.com/gif.latex?\overline{N}=m_N$ , then
$http://latex.codecogs.com/gif.latex?S_N^2=E\cdot%20\frac{1}{n}\sum_{i=1}^n%20\left[\frac{Y_i}{E}-\overline{N}\right]^2}%20=%20\frac{1}{n}\sum_{i=1}^n%20E\left[\frac{Y_i}{E}-\overline{N}\right]^2}$
or equivalently
$http://latex.codecogs.com/gif.latex?S_N^2=\frac{1}{n}\sum_{i=1}^n%20\frac{E}{E^2}\left[Y_i-\overline{N}\cdot%20E\right]^2}%20=\frac{1}{n}\sum_{i=1}^n%20\frac{1}{E}[Y_i-\overline{N}\cdot%20E]^2$ i.e.
$http://latex.codecogs.com/gif.latex?S_N^2=\frac{\sum_{i=1}^n%20[Y_i-\overline{N}\cdot%20E]^2%20}{nE}$
Thus, with different ‘s, it would be legitimate (I guess) to consider
$http://latex.codecogs.com/gif.latex?S_N^2=\frac{\sum_{i=1}^n%20[Y_i-\overline{N}\cdot%20E_i]^2%20}{\sum_{i=1}^n%20E_i}$
Thus, an estimator for $http://latex.codecogs.com/gif.latex?\text{Var}(N|\boldsymbol{X}=\boldsymbol{x})$ is
$http://latex.codecogs.com/gif.latex?S_{N|\boldsymbol{x}}^2=\frac{\sum_{i,\boldsymbol{X}_i=\boldsymbol{x}}%20[Y_i-\overline{N}\cdot%20E_i]^2}{\sum_{i,\boldsymbol{X}_i=\boldsymbol{x}%20}%20E_i}$

This can be used to test is the Poisson assumption is valid to model frequency. Consider the following dataset,

>  sinistre=read.table("http://freakonometrics.free.fr/sinistreACT2040.txt",
+  header=TRUE,sep=";")
>  sinistres=sinistre[sinistre$garantie=="1RC",]
>  sinistres=sinistres[sinistres$cout>0,]
>  contrat=read.table("http://freakonometrics.free.fr/contractACT2040.txt",
+  header=TRUE,sep=";")
>  T=table(sinistres$nocontrat)
>  T1=as.numeric(names(T))
>  T2=as.numeric(T)
>  nombre1 = data.frame(nocontrat=T1,nbre=T2)
>  I = contrat$nocontrat%in%T1
>  T1= contrat$nocontrat[I==FALSE]
>  nombre2 = data.frame(nocontrat=T1,nbre=0)
>  nombre=rbind(nombre1,nombre2)
>  baseFREQ = merge(contrat,nombre)

Here, we do have our two variables of interest, the exposure, per contract,

>  E <- baseFREQ$exposition

and the (observed) number of claims (during that time frame)

>  Y <- baseFREQ$nbre

It is possible to compute without covariates, the average (yearly) number of claims, per contract, and the associated variance

> (mean=weighted.mean(Y/E,E))
[1] 0.07279295
> (variance=sum((Y-mean*E)^2)/sum(E)) 
[1] 0.08778567

It looks like the variance is (slightly) larger than the average (we’ll see in a few weeks how to test it, more formally). It is possible to add covariates, for instance the density of population, in the area where the policyholder lives,

>  X=as.factor(baseFREQ$densite)
>  for(i in 1:length(levels(X))){
+ 	   Ei=E[X==levels(X)[i]]
+ 	   Yi=Y[X==levels(X)[i]]
+  (meani=weighted.mean(Yi/Ei,Ei))    # moyenne 
+  (variancei=sum((Yi-meani*Ei)^2)/sum(Ei))    # variance
+ cat("Density, zone",levels(X)[i],"average =",meani," variance =",variancei,"\n")
+ }
Density, zone 11 average = 0.07962411  variance = 0.08711477 
Density, zone 21 average = 0.05294927  variance = 0.07378567 
Density, zone 22 average = 0.09330982  variance = 0.09582698 
Density, zone 23 average = 0.06918033  variance = 0.07641805 
Density, zone 24 average = 0.06004009  variance = 0.06293811 
Density, zone 25 average = 0.06577788  variance = 0.06726093 
Density, zone 26 average = 0.0688496   variance = 0.07126078 
Density, zone 31 average = 0.07725273  variance = 0.09067 
Density, zone 41 average = 0.03649222  variance = 0.03914317 
Density, zone 42 average = 0.08333333  variance = 0.1004027 
Density, zone 43 average = 0.07304602  variance = 0.07209618 
Density, zone 52 average = 0.06893741  variance = 0.07178091 
Density, zone 53 average = 0.07725661  variance = 0.07811935 
Density, zone 54 average = 0.07816105  variance = 0.08947993 
Density, zone 72 average = 0.08579731  variance = 0.09693305 
Density, zone 73 average = 0.04943033  variance = 0.04835521 
Density, zone 74 average = 0.1188611   variance = 0.1221675 
Density, zone 82 average = 0.09345635  variance = 0.09917425 
Density, zone 83 average = 0.04299708  variance = 0.05259835 
Density, zone 91 average = 0.07468126  variance = 0.3045718 
Density, zone 93 average = 0.08197912  variance = 0.09350102 
Density, zone 94 average = 0.03140971  variance = 0.04672329

Perhaps graphs would be a nice tool to play with, to visualize that information

> plot(meani,variancei,cex=sqrt(Ei),col="grey",pch=19,
+ xlab="Empirical average",ylab="Empirical variance")
> points(meani,variancei,cex=sqrt(Ei))

The size of the circles is related to the size of the group (the area is proportional to the total exposure within the group). The first diagonal corresponds to the Poisson model, i.e. the variance should be equal to the mean. It is also possible to consider other covariates, like the gas type

or the car brand,

It is also possible to consider the age of the driver as a categorical variate

Actually, the age is interesting: we can observe on that dataset a feature that Jean-Philippe Boucher observed also on his own datasets. Let us look more carefully where are the different ages,

On the right, we can observe young (unexperienced) drivers. That was expected. But some classes are below the first diagonal: the expected frequency is large, but not the variance. I.e. we know for sure that young drivers have more car accidents. It is not an heterogeneous class, on the contrary: young drivers can be seen as a relatively homogeneous class, with a high frequency of car accidents.

With the original dataset (here, I use only a subset with 50,000 clients), we do obtain the following graph:

If we do not observe underdispersion for young drivers, observe that those are incredibly homogeneous classes. With a clear impact of experience, since circles are moving downward from age 18 to 25.

Another disturbing story (this was – one more time – suggestion from Jean-Philippe) that it might be possible to consider the exposure as a standard variable, and see if the coefficient is actually equal to 1. Without any covariate,

>  reg=glm(Y~log(E),family=poisson("log"))
>  summary(reg)

Call:
glm(formula = Y ~ log(E), family = poisson("log"))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.3988  -0.3388  -0.2786  -0.1981  12.9036  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.83045    0.02822 -100.31   <2e-16 ***
log(E)       0.53950    0.02905   18.57   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 12931  on 49999  degrees of freedom
Residual deviance: 12475  on 49998  degrees of freedom
AIC: 16150

Number of Fisher Scoring iterations: 6

i.e. the parameter is clearly strictly smaller than 1. And it is neither related to significance,

> library(car)
> linearHypothesis(reg,"log(E)",1)
Linear hypothesis test

Hypothesis:
log(E) = 1

Model 1: restricted model
Model 2: Y ~ log(E)

  Res.Df Df  Chisq Pr(>Chisq)    
1  49999                         
2  49998  1 251.19  < 2.2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

nor to the fact that I did not take into account covariates,

> reg=glm(nbre~log(exposition)+carburant+as.factor(ageconducteur)+as.factor(densite),family=poisson("log"),data=baseFREQ)
>  summary(reg)

Call:
glm(formula = nbre ~ log(exposition) + carburant + as.factor(ageconducteur) + 
    as.factor(densite), family = poisson("log"), data = baseFREQ)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.7114  -0.3200  -0.2637  -0.1896  12.7104  

Coefficients:
                              Estimate Std. Error z value Pr(>|z|)    
(Intercept)                  -14.07321  181.04892  -0.078 0.938042    
log(exposition)                0.56781    0.03029  18.744  < 2e-16 ***
carburantE                    -0.17979    0.04630  -3.883 0.000103 ***
as.factor(ageconducteur)19    12.18354  181.04915   0.067 0.946348    
as.factor(ageconducteur)20    12.48752  181.04902   0.069 0.945011

(etc). So it might be a too strong assumption to assume that the exposure is an exogenous variate here. But that’s another story !

Arthur Charpentier

More Posts - Website

Follow Me:

↧

Fréquence de sinistres, et surdispersion

February 1, 2013, 1:03 pm

≫ Next: Natura non facit saltus

≪ Previous: Overdispersion with different exposures

Je continue à mettre en ligne les transparents qui serviront de support pour le cours ACT2040. Dans cette dernière partie sur la modélisation de la fréquence de sinistre, on parlera de surdispersion. Les transparents sont en ligne ici,

Sinon, parmi les références complémentaires, je peux suggérer plusieurs documents rédigés par des praticiens, comme Meyers (2009) http://casact.org/education/…, Isamail & Jemain (2009) http://casact.org/pubs/… ou encore le document très intéressant (et critique) de Schmid (2011) http://casact.org/education/…. Les plus motivés pourront aussi survoler les section 2.3 et 2.4. du livre Denuit et al. (2007), en ligne sur http://books.google.ca/…

Arthur Charpentier

More Posts - Website

Follow Me:

↧

Natura non facit saltus

February 5, 2013, 11:58 am

≫ Next: Large claims, and ratemaking

≪ Previous: Fréquence de sinistres, et surdispersion

(see John Wilkins’ article on the – interesting – history of that phrase http://scienceblogs.com/evolvingthoughts/…). We will see, this week in class, several smoothing techniques, for insurance ratemaking. As a starting point, assume that we do not want to use segmentation techniques: everyone will pay exactly the same price.

no segmentation of the premium

And that price should be related to the pure premium, which is proportional to the frequency (or the annualized frequency, as discussed previously), since

$http://latex.codecogs.com/gif.latex?\mathbb{E}_{\mathbb{P}}\left(\sum_{i=1}^N%20Y_i\right)=\mathbb{E}_{\mathbb{P}}(N)%20\cdot%20\mathbb{E}_{\mathbb{P}}(Y_i)$

The probability measure is mentioned here just to recall that we can use any measure. Even $http://latex.codecogs.com/gif.latex?\mathbb{P}_{\boldsymbol{X}}$ (based on some covariates). Without any covariate, the expected frequency should be

> regglm0=glm(nbre~1+offset(log(exposition)),data=sinistres,family=poisson)
> summary(regglm0)

Call:
glm(formula = nbre ~ 1 + offset(log(exposition)), family = poisson, 
    data = sinistres)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.5033  -0.3719  -0.2588  -0.1376  13.2700  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -2.6201     0.0228  -114.9   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 12680  on 49999  degrees of freedom
Residual deviance: 12680  on 49999  degrees of freedom
AIC: 16353

Number of Fisher Scoring iterations: 6

> exp(coefficients(regglm0))
(Intercept) 
 0.07279295

Thus, if we do not want to take into account potential heterogeneity, we should assume that $http://latex.codecogs.com/gif.latex?N\sim\mathcal{P}(\lambda)$ where $http://latex.codecogs.com/gif.latex?\lambda$ is closed to 7.28%. Yes, as mentioned in class, it is rather common to see $http://latex.codecogs.com/gif.latex?\lambda$ as a percentage, i.e. a probability, since

$http://latex.codecogs.com/gif.latex?\mathbb{P}(N\neq%200)=1-e^{-\lambda}\approx%20\lambda$

i.e. $http://latex.codecogs.com/gif.latex?\lambda$ can be interpreted as the probability of not have a claim (see also the law of small numbers). Let us visualize this as a function of the age of the driver,

> a=18:100
> yp=predict(regglm0,newdata=data.frame(ageconducteur=a,exposition=1),type="response",se.fit=TRUE)
> yp0=yp$fit
> yp1=yp$fit+2*yp$se.fit
> yp2=yp$fit-2*yp$se.fit
> plot(a,yp0,type="l",ylim=c(.03,.12))
> abline(v=40,col="grey")
> lines(a,yp1,lty=2)
> lines(a,yp2,lty=2)
> k=23
> points(a[k],yp0[k],pch=3,lwd=3,col="red")
> segments(a[k],yp1[k],a[k],yp2[k],col="red",lwd=3)

We do predict the same frequency for all drivers, e.g. for some drive aged 40,

> cat("Frequency =",yp0[k]," confidence interval",yp1[k],yp2[k])
Frequency = 0.07279295  confidence interval 0.07611196 0.06947393

Let us now consider the case where we try to take into account heterogeneity, e.g. by age,

The (standard) Poisson regression

The idea of the (log-)Poisson regression is to assume that instead of having $http://latex.codecogs.com/gif.latex?N\sim\mathcal{P}(\lambda)$ , we should have $http://latex.codecogs.com/gif.latex?N|\boldsymbol{X}\sim\mathcal{P}(\lambda_{\boldsymbol{X}})$ , where

$http://latex.codecogs.com/gif.latex?\lambda_{\boldsymbol{X}}=\exp(\beta_0+\beta_1%20\boldsymbol{X}_1+\cdots+\beta_k\boldsymbol{X}_k)$

in a very general setting. Here, let us consider only one explanatory variable, i.e.

$http://latex.codecogs.com/gif.latex?\lambda_{X}=\exp(\beta_0+\beta_1%20{X})$

Here, we have

> yp=predict(regglm1,newdata=data.frame(ageconducteur=a,exposition=1),
+ type="response",se.fit=TRUE)
> yp0=yp$fit
> yp1=yp$fit+2*yp$se.fit
> yp2=yp$fit-2*yp$se.fit
> plot(a,yp0,type="l",ylim=c(.03,.12))
> abline(v=40,col="grey")
> lines(a,yp1,lty=2)
> lines(a,yp2,lty=2)
> points(a[k],yp0[k],pch=3,lwd=3,col="red")
> segments(a[k],yp1[k],a[k],yp2[k],col="red",lwd=3)

i.e. the prediction for the annualized claim frequency for our 40 year old driver is now 7.74% (which is slightly higher than what we had before, 7.28%)

> cat("Frequency =",yp0[k]," confidence interval",yp1[k],yp2[k])
Frequency = 0.07740574  confidence interval 0.08117512 0.07363636

It is possible to compute not the expected frequency , but the ratio $http://latex.codecogs.com/gif.latex?\mathbb{E}(N|X)/\mathbb{E}(N)$ .

Above the horizontal blue line, the premium will be higher than the one obtained without segmentation, and (of course) lower below. Here, drivers younger than 44 year old will pay more, while driver older than 44 year old will be less. We have discussed, in the introduction, the necessity of segmentation. If we consider two companies, one segmenting, while the other one has a flat rate, then older drivers will go to the first company (since insurance is cheaper) while younger ones will go to the second one (again, it is cheaper). The problem is that the second company implicitly hopes that older drivers will compensate the risk. But since they’re gone, insurance will be too cheap, and the company will loose money (if not goes bankrupt). So companies have to use segmentation techniques to survive. Now, the problem is that we cannot be sure that this exponential decay of the premium is the proper way the premium should evolve as a function of the age. An alternative can be to use nonparametric techniques to visualize to true influence of the age on claims frequency.

A pure nonparametric model

A first model can be to consider a premium, per age. This can be done considering the age of the driver as a factor in the regression,

> regglm2=glm(nbre~as.factor(ageconducteur)+offset(log(exposition)),
+ data=sinistres,family=poisson)
> yp=predict(regglm2,newdata=data.frame(ageconducteur=a0,exposition=1),
+ type="response",se.fit=TRUE)
> yp0=yp$fit
> yp1=yp$fit+2*yp$se.fit
> yp2=yp$fit-2*yp$se.fit
> plot(a0,yp0,type="l",ylim=c(.03,.12))
> abline(v=40,col="grey")

Here, the forecast for our 40 year old driver is slightly lower than be previous one, but the confidence interval is much larger (since we focus on a very small subclass of the portfolio: drivers aged exactly 40)

Frequency = 0.06686658  confidence interval 0.08750205 0.0462311

Here, we consider too small classes, and the premium is too erratic: the premium will decrease of 20% from age 40 to 41, and then increase of 50% from age 41 to 42,

> diff(log(yp0[23:25]))
        24         25 
-0.2330241  0.5223478

There is no chance that the company will keep the insured with this strategy. This discontinuity of the premium is clearly an important issue here.

Using age classes

An alternative can be to consider age classes, from very young drivers to senior drivers.

> level1=seq(15,105,by=5)
> regglmc1=glm(nbre~cut(ageconducteur,level1)+offset(log(exposition)),
+ data=sinistres,family=poisson)
> summary(regglmc1)

Coefficients:
                                   Estimate Std. Error z value Pr(>|z|)    
(Intercept)                         -1.6036     0.1741  -9.212  < 2e-16 ***
cut(ageconducteur, level1)(20,25]   -0.4200     0.1948  -2.157   0.0310 *  
cut(ageconducteur, level1)(25,30]   -0.9378     0.1903  -4.927 8.33e-07 ***
cut(ageconducteur, level1)(30,35]   -1.0030     0.1869  -5.367 8.02e-08 ***
cut(ageconducteur, level1)(35,40]   -1.0779     0.1866  -5.776 7.65e-09 ***
cut(ageconducteur, level1)(40,45]   -1.0264     0.1858  -5.526 3.28e-08 ***
cut(ageconducteur, level1)(45,50]   -0.9978     0.1856  -5.377 7.58e-08 ***
cut(ageconducteur, level1)(50,55]   -1.0137     0.1855  -5.464 4.65e-08 ***
cut(ageconducteur, level1)(55,60]   -1.2036     0.1939  -6.207 5.40e-10 ***
cut(ageconducteur, level1)(60,65]   -1.1411     0.2008  -5.684 1.31e-08 ***
cut(ageconducteur, level1)(65,70]   -1.2114     0.2085  -5.811 6.22e-09 ***
cut(ageconducteur, level1)(70,75]   -1.3285     0.2210  -6.012 1.83e-09 ***
cut(ageconducteur, level1)(75,80]   -0.9814     0.2271  -4.321 1.55e-05 ***
cut(ageconducteur, level1)(80,85]   -1.4782     0.3371  -4.385 1.16e-05 ***
cut(ageconducteur, level1)(85,90]   -1.2120     0.5294  -2.289   0.0221 *  
cut(ageconducteur, level1)(90,95]   -0.9728     1.0150  -0.958   0.3379    
cut(ageconducteur, level1)(95,100] -11.4694   144.2817  -0.079   0.9366    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 

> yp=predict(regglmc1,newdata=data.frame(ageconducteur=a,exposition=1),
+ type="response",se.fit=TRUE)
> yp0=yp$fit
> yp1=yp$fit+2*yp$se.fit
> yp2=yp$fit-2*yp$se.fit
> plot(a,yp0,ylim=c(.03,.12),type="s")
> abline(v=40,col="grey")
> lines(a,yp1,lty=2,type="s")
> lines(a,yp2,lty=2,type="s")

Here we obtain the following predictions,

and for our 40 year old driver, the frequency is now 6.84%.

Frequency = 0.0684573  confidence interval 0.07766717 0.05924742

But our classes were defined arbitrarily here. Perhaps should we consider other classes, to see if the prediction is sensitive to the cutting values,

> level2=level1-2
> regglmc2=glm(nbre~cut(ageconducteur,level2)+offset(log(exposition)),
+ data=sinistres,family=poisson)

which yields the following values for our 40 year old driver,

Frequency = 0.07050614  confidence interval 0.07980422 0.06120807

So here, we did not remove the discontinuity problem. An idea here can be to consider moving regions: if the goal is to predict the frequency for a 40 year old driver, perhaps the class should be (somehow) centered around 40. And center the interval around 35 for drivers aged 35. Etc.

Moving average

Thus, it is natural to consider some local regressions, where only drivers aged almost 40 should be considered. This almost concept is related to the bandwidth. For instance, drivers between 35 and 45 can be considered as being almost40. In practice we can either consider a subset function, or we can use weights in the regressions

> value=40
> h=5
> sinistres$omega=(abs(sinistres$ageconducteur-value)<=h)*1
> regglmomega=glm(nbre~ageconducteur+offset(log(exposition)),
+ data=sinistres,family=poisson,weights=omega)

To see what’s going on, let us consider an animated plot, where the age of interest is changing,

Here, for our 40 year old drive, we get

Frequency = 0.06913391  confidence interval 0.07535564 0.06291218

We do obtain a curve that can be interpreted as a local regression. But here, we do not take into account that 35 is not as close to 40 as 39 could be. An here, 34 is assumed to be very far away from 40. Clearly, we could improve that technique: kernel functions can considered, i.e. the closer to 40, the larger the weight.

> value=40
> h=5
> sinistres$omega=dnorm(abs(sinistres$ageconducteur-value)/h)
> regglmomega=glm(nbre~ageconducteur+offset(log(exposition)),
+ data=sinistres,family=poisson,weights=omega)

which can be plotted below

Here, our prediction for our 40 year old drive is

Frequency = 0.07040464  confidence interval 0.07981521 0.06099408

This is the idea of kernel regression techniques. But as explained in the slides, other non parametric techniques can be considered, like spline functions.

Smoothing with splines

In R, it is simple to use spline function (somehow much more simple than kernel smoothers)

> library(splines)
> regglmbs=glm(nbre~bs(ageconducteur)+offset(log(exposition)),
+ data=sinistres,family=poisson)

The prediction for our 40 year old driver is now

Frequency = 0.06928169  confidence interval 0.07397124 0.06459215

Note that this techniques is related to another class of models, the so-called Generalized Additive Models, i.e. GAMs.

> library(mgcv)
> reggam=gam(nbre~s(ageconducteur)+offset(log(exposition)),
+ data=sinistres,family=poisson)

The prediction is extremely close to the one we obtained above (the main differences being observed for very old drivers)

Frequency = 0.06912683  confidence interval 0.07501663 0.06323702

Comparison of the different models

Somehow, one way or another, all those models are valid. So perhaps we should compare them,

On the graph above, we can visualize the upper and the lower bound of the prediction, for the 9 models. The horizontal line is the predicted value without taking into account heterogeneity. It is possible to consider relative values, with respect to this value,

Arthur Charpentier

More Posts - Website

Follow Me:

↧

Large claims, and ratemaking

February 13, 2013, 10:18 am

≫ Next: Visualizing overdispersion (with trees)

≪ Previous: Natura non facit saltus

During the course, we have seen that it is natural to assume that not only the individual claims frequency can be explained by some covariates, but individual costs too. Of course, appropriate families should be considered to model the distribution of the cost , given some covariates $http://latex.codecogs.com/gif.latex?\boldsymbol{X}$ .Here is the dataset we’ll use,

>  sinistre=read.table("http://freakonometrics.free.fr/sinistreACT2040.txt",
+  header=TRUE,sep=";")
>  sinistres=sinistre[sinistre$garantie=="1RC",]
>  sinistres=sinistres[sinistres$cout>0,]
>  contrat=read.table("http://freakonometrics.free.fr/contractACT2040.txt",
+  header=TRUE,sep=";")
>  couts=merge(sinistres,contrat)
> tail(couts)
     nocontrat    no garantie    cout exposition zone puissance agevehicule
1919   6104006 11933      1RC 5376.04       0.37    E         6           1
1920   6107355 12349      1RC   51.63       0.74    E         4           1
1921   6108364 13229      1RC 1320.00       0.74    B         9           1
1922   6109171 11567      1RC 1320.00       0.74    B        13           1
1923   6111208 14161      1RC  970.20       0.49    E        10           5
1924   6111650 14476      1RC 1940.40       0.48    E         4           0
     ageconducteur bonus marque carburant densite region
1919            32    57     12         E      93     10
1920            45    57     12         E      72     10
1921            32   100     12         E      83      0
1922            56    50     12         E      93     13
1923            30    90     12         E      53      2
1924            69    50     12         E      93     13

Here, each line is a claim. Usual families to model the cost are the Gamma distribution, or the inverse Gaussian. Or the lognormal distribution (which is not in the exponential family, but one can assume that the logarithm of the cost can be modeled with a Gaussian distribution). Consider here only one covariate, e.g. the age of the car, and two different models: a Gamma one, and a lognormal one.

> age=0:20
> reggamma.sp <- glm(cout~agevehicule,family=Gamma(link="log"),
+ data=couts)
> Pgamma <- predict(reggamma.sp,newdata=data.frame(agevehicule=age),type="response")

For the Gamma regression, it is a simple GLM, so it is not difficult. For a lognormal distribution, one should remember that the expected value of a lognormal distribution is not the exponential of the underlying Gaussian distribution. A correction should be made, here to get an unbiased estimator for the average cost,

> reglm.sp <- lm(log(cout)~agevehicule,data=baseCOUT)
> sigma <- summary(reglm.sp)$sigma
> mu <- predict(reglm.sp,newdata=data.frame(agevehicule=age))
> Pln <- exp(mu+sigma^2/2)

We can plot those two predictions on a single graph,

> plot(age,Pgamma,xlab="",ylab="",col="red",type="b",pch=4)
> lines(age,Pln,col="blue",type="b")

Here it is,

Observe that it is also possible to use splines, since there might be no reason for the age to appear here in a multiplicative way,

Here, the two models are rather close. Nevertheless, one should remember that the Gamma model can be extremely sensitive to large claims (I mean here really large claims). On the other hand, with the log-transformation for the lognormal model, it seams that this model is less sensitive to large events. Actually, if I use the complete dataset, the regressions are the following,

i.e. with a lognormal distribution, the average cost is decreasing with the age of the car, while it is increasing with a Gamma model. The main reason here is that there is one large (not to say huge) claim in the dataset,

> couts[which.max(couts$cout),]
         cout exposition zone puissance agevehicule ageconducteur
7842  4024601       0.22    B         9          13            19
     marque carburant densite region
7842      2         E      93     24

One young driver got a $ 4 million claim, with a 13 year old car. This is an outliers for the Gamma regression, that clearly influences the estimation (the second largest if only one third of this one). Since there is a clear influence of large claims on the estimation of the average cost, a natural idea might be to remove those large claims. Or perhaps to see them as different from normal claims: normal claims can be explained by some covariates, but perhaps that those large claims should be shared not only within its own class, but within all the insured on the portfolio. To formalize this idea, observe that we can write

$http://latex.codecogs.com/gif.latex?\mathbb{E}(Y|\boldsymbol{X})%20=%20{\color{Blue}%20{\underbrace{\mathbb{E}(Y|\boldsymbol{X},Y\leq%20s)}_{A}%20\cdot%20{\underbrace{\mathbb{P}(Y\leq%20s|\boldsymbol{X})}_{B}}}}+{\color{Red}%20{{\underbrace{\mathbb{E}(Y|Y%3E%20s,%20\boldsymbol{X})%20}_{C}}\cdot%20{\underbrace{\mathbb{P}(Y%3E%20s|%20\boldsymbol{X})}_{B}}}}$

where the blue part is associated to normal-sized claims, while large ones correspond to the red part. It is then possible to run three regressions: one on normal sized claims, one on large claims, and one on the indicator of having a large claims, given that a claim occurred. The code here is something like that: a large claim – here – is above $ 10,000 (one has a fix it)

> s= 10000
> couts$normal=(couts$cout<=s)
> mean(couts$normal)
[1] 0.9818087

which represent 2% of the claims in our dataset.We can run 3 sets of regressions, with smoothed regression on the age of the car. The first one to model large claims individual costs,

> indice = which(couts$cout>s)
> mean(couts$cout[indice])
[1] 34471.59
> library(splines)
> regB=glm(cout~bs(agevehicule),data=couts,
+ subset=indice,family=Gamma(link="log"))
> ypB=predict(regB,newdata=data.frame(agevehicule=age),type="response")
> ypB2=mean(couts$cout[indice])

the second one to model normal claims individual costs,

> indice = which(couts$cout<=s)
> mean(couts$cout[indice])
[1] 1335.878
> regA=glm(cout~bs(agevehicule),data=couts,
+ subset=indice,family=Gamma(link="log"))
> ypA=predict(regA,newdata=data.frame(agevehicule=age),type="response")
> ypA2=mean(couts$cout[indice])

And finally, a third one, on the probability of having a normal sized claim, given that a claim occurred

> regC=glm(normal~bs(agevehicule),data=couts,family=binomial)
> ypC=predict(regC,newdata=data.frame(agevehicule=age),type="response")
> regC2=glm(normal~1,data=couts,family=binomial)
> ypC2=predict(regC2,newdata=data.frame(agevehicule=age),type="response")

Note that we to have, each time something that can be interpreted either as $http://latex.codecogs.com/gif.latex?\mathbb{E}(Y|\boldsymbol{X},Y\gtrless%20%20s)$ , or $http://latex.codecogs.com/gif.latex?\mathbb{E}(Y|Y\gtrless%20%20s)$ – i.e. no covariate is considered on the later. On the graph below, we did plot

where Gamma regressions – with splines – are considered for the average costs, while logistic regressions – again with splines – are considered to model probabilities.

(but careful with splines: on borders, since we do not have a lot of observations, the behavior can be… odd. And adjustments should be made to obtain an adequate level of premium). If it is legitimate to assume that normal-sized claims can be explained by some covariates, perhaps large claims (or extremely large ones) are just purely random, i.e. not function of any covariate, at all. I.e.

$http://latex.codecogs.com/gif.latex?\mathbb{E}(Y|\boldsymbol{X})%20=%20{\color{Blue}%20{\underbrace{\mathbb{E}(Y|\boldsymbol{X},Y\leq%20s)}_{A}%20\cdot%20{\underbrace{\mathbb{P}(Y\leq%20s|\boldsymbol{X})}_{B}}}}+{\color{Red}%20{{\underbrace{\mathbb{E}(Y|Y%3E%20s)%20}_{C%27}}\cdot%20{\underbrace{\mathbb{P}(Y%3E%20s|%20\boldsymbol{X})}_{B}}}}$

To go one step further, it might also be possible to assume that not only the size of the claim (given that it is a large one) is not a function of any covariate, but perhaps neither is the probability of having an extremely large claim, too

$http://latex.codecogs.com/gif.latex?\mathbb{E}(Y|\boldsymbol{X})%20=%20{\color{Blue}%20{\underbrace{\mathbb{E}(Y|\boldsymbol{X},Y\leq%20s)}_{A}%20\cdot%20{\underbrace{\mathbb{P}(Y\leq%20s)}_{B%27}}}}+{\color{Red}%20{{\underbrace{\mathbb{E}(Y|Y%3E%20s)%20}_{C%27}}\cdot%20{\underbrace{\mathbb{P}(Y%3E%20s)}_{B%27}}}}$

From the first part, we’ve seen that the distribution considered had an impact on the prediction, and in the second, we’ve seen that the definition of large claims (and how to deal with them) also has an impact. So clearly, actuaries have some leverage when working on ratemaking…

Arthur Charpentier

More Posts - Website

Follow Me:

↧

Visualizing overdispersion (with trees)

February 13, 2013, 11:14 am

≫ Next: Modeling individual losses with mixtures

≪ Previous: Large claims, and ratemaking

This week, we started to discuss overdispersion when modeling claims frequency. In my previous post, I discussed computations of empirical variances with different exposure. But I did use only one factor to compute classes. Of course, it is possible to use much more factors. For instance, using cartesian products of factors,

> X=as.factor(paste(sinistres$carburant,sinistres$zone,
+ cut(sinistres$ageconducteur,breaks=c(17,24,40,65,101))))
> E=sinistres$exposition
> Y=sinistres$nbre
> vm=vv=ve=rep(NA,length(levels(X)))
>   for(i in 1:length(levels(X))){
+  	   ve[i]=Ei=E[X==levels(X)[i]]
+  	   Yi=Y[X==levels(X)[i]]
+   vm[i]=meani=weighted.mean(Yi/Ei,Ei)    # moyenne 
+   vv[i]=variancei=sum((Yi-meani*Ei)^2)/sum(Ei)    # variance
+  cat("Class ",levels(X)[i],"average =",meani," variance =",variancei,"\n")
+ }
Class D A (17,24]  average = 0.06274415  variance = 0.06174966 
Class D A (24,40]  average = 0.07271905  variance = 0.07675049 
Class D A (40,65]  average = 0.05432262  variance = 0.06556844 
Class D A (65,101] average = 0.03026999  variance = 0.02960885 
Class D B (17,24]  average = 0.2383109   variance = 0.2442396 
Class D B (24,40]  average = 0.06662015  variance = 0.07121064 
Class D B (40,65]  average = 0.05551854  variance = 0.05543831 
Class D B (65,101] average = 0.0556386   variance = 0.0540786 
Class D C (17,24]  average = 0.1524552   variance = 0.1592623 
Class D C (24,40]  average = 0.0795852   variance = 0.09091435 
Class D C (40,65]  average = 0.07554481  variance = 0.08263404 
Class D C (65,101] average = 0.06936605  variance = 0.06684982 
Class D D (17,24]  average = 0.1584052   variance = 0.1552583 
Class D D (24,40]  average = 0.1079038   variance = 0.121747 
Class D D (40,65]  average = 0.06989518  variance = 0.07780811 
Class D D (65,101] average = 0.0470501   variance = 0.04575461 
Class D E (17,24]  average = 0.2007164   variance = 0.2647663 
Class D E (24,40]  average = 0.1121569   variance = 0.1172205 
Class D E (40,65]  average = 0.106563    variance = 0.1068348 
Class D E (65,101] average = 0.1572701   variance = 0.2126338 
Class D F (17,24]  average = 0.2314815   variance = 0.1616788 
Class D F (24,40]  average = 0.1690485   variance = 0.1443094 
Class D F (40,65]  average = 0.08496827  variance = 0.07914423 
Class D F (65,101] average = 0.1547769   variance = 0.1442915 
Class E A (17,24]  average = 0.1275345   variance = 0.1171678 
Class E A (24,40]  average = 0.04523504  variance = 0.04741449 
Class E A (40,65]  average = 0.05402834  variance = 0.05427582 
Class E A (65,101] average = 0.04176129  variance = 0.04539265 
Class E B (17,24]  average = 0.1114712   variance = 0.1059153 
Class E B (24,40]  average = 0.04211314  variance = 0.04068724 
Class E B (40,65]  average = 0.04987117  variance = 0.05096601 
Class E B (65,101] average = 0.03123003  variance = 0.03041192 
Class E C (17,24]  average = 0.1256302   variance = 0.1310862 
Class E C (24,40]  average = 0.05118006  variance = 0.05122782 
Class E C (40,65]  average = 0.05394576  variance = 0.05594004 
Class E C (65,101] average = 0.04570239  variance = 0.04422991 
Class E D (17,24]  average = 0.1777142   variance = 0.1917696 
Class E D (24,40]  average = 0.06293331  variance = 0.06738658 
Class E D (40,65]  average = 0.08532688  variance = 0.2378571 
Class E D (65,101] average = 0.05442916  variance = 0.05724951 
Class E E (17,24]  average = 0.1826558   variance = 0.2085505 
Class E E (24,40]  average = 0.07804062  variance = 0.09637156 
Class E E (40,65]  average = 0.08191469  variance = 0.08791804 
Class E E (65,101] average = 0.1017367   variance = 0.1141004 
Class E F (17,24]  average = 0           variance = 0 
Class E F (24,40]  average = 0.07731177  variance = 0.07415932 
Class E F (40,65]  average = 0.1081142   variance = 0.1074324 
Class E F (65,101] average = 0.09071118  variance = 0.1170159

Again, one can plot the variance against the average,

> plot(vm,vv,cex=sqrt(ve),col="grey",pch=19,
+ xlab="Empirical average",ylab="Empirical variance")
> points(vm,vv,cex=sqrt(ve))
> abline(a=0,b=1,lty=2)

An alternative is to use a tree. The tree can be obtained from another variable (the insured had, or had not, a claim, during the period considered) but it should be rather close to the one we would like to model (the number of claims over the period considered). Here, I did use the whole database (with more that 600,000 lines)

> library(tree)
> T=tree((nombre>0)~as.factor(zone)+as.factor(puissance)+
+ as.factor(marque)+as.factor(carburant)+as.factor(region)+
+ agevehicule+ageconducteur,data=baseFREQ,
+ split =  "gini",minsize =25000)

The tree is the following

> plot(T)
> text(T)

Now, each knot defines a class, and it is possible to use it to define a class. Which is supposed to be homogeneous.

> X=as.factor(T$where)
> E=sinistres$exposition
> Y=sinistres$nbre
> vm=vv=ve=rep(NA,length(levels(X)))
>   for(i in 1:length(levels(X))){
+  	   ve[i]=Ei=E[X==levels(X)[i]]
+  	   Yi=Y[X==levels(X)[i]]
+   vm[i]=meani=weighted.mean(Yi/Ei,Ei)    # moyenne 
+   vv[i]=variancei=sum((Yi-meani*Ei)^2)/sum(Ei)    # variance
+  cat("Class ",levels(X)[i],"average =",meani," variance =",variancei,"\n")
+  }
Class  6 average =   0.04010406  variance = 0.04424163 
Class  8 average =   0.05191127  variance = 0.05948133 
Class  9 average =   0.07442635  variance = 0.08694552 
Class  10 average =  0.4143646   variance = 0.4494002 
Class  11 average =  0.1917445   variance = 0.1744355 
Class  15 average =  0.04754595  variance = 0.05389675 
Class  20 average =  0.08129577  variance = 0.0906322 
Class  22 average =  0.05813419  variance = 0.07089811 
Class  23 average =  0.06123807  variance = 0.07010473 
Class  24 average =  0.06707301  variance = 0.07270995 
Class  25 average =  0.3164557   variance = 0.2026906 
Class  26 average =  0.08705041  variance = 0.108456 
Class  27 average =  0.06705214  variance = 0.07174673 
Class  30 average =  0.05292652  variance = 0.06127301 
Class  31 average =  0.07195285  variance = 0.08620593 
Class  32 average =  0.08133722  variance = 0.08960552 
Class  34 average =  0.1831559   variance = 0.2010849 
Class  39 average =  0.06173885  variance = 0.06573939 
Class  41 average =  0.07089419  variance = 0.07102932 
Class  44 average =  0.09426152  variance = 0.1032255 
Class  47 average =  0.03641669  variance = 0.03869702 
Class  49 average =  0.0506601   variance = 0.05089276 
Class  50 average =  0.06373107  variance = 0.06536792 
Class  51 average =  0.06762947  variance = 0.06926191 
Class  56 average =  0.06771764  variance = 0.07122379 
Class  57 average =  0.04949142  variance = 0.05086885 
Class  58 average =  0.2459016   variance = 0.2451116 
Class  59 average =  0.05996851  variance = 0.0615773 
Class  61 average =  0.07458053  variance = 0.0818608 
Class  63 average =  0.06203737  variance = 0.06249892 
Class  64 average =  0.07321618  variance = 0.07603106 
Class  66 average =  0.07332127  variance = 0.07262425 
Class  68 average =  0.07478147  variance = 0.07884597 
Class  70 average =  0.06566728  variance = 0.06749411 
Class  71 average =  0.09159605  variance = 0.09434413 
Class  75 average =  0.03228927  variance = 0.03403198 
Class  76 average =  0.04630848  variance = 0.04861813 
Class  78 average =  0.05342351  variance = 0.05626653 
Class  79 average =  0.05778622  variance = 0.05987139 
Class  80 average =  0.0374993   variance = 0.0385351 
Class  83 average =  0.06721729  variance = 0.07295168 
Class  86 average =  0.09888492  variance = 0.1131409 
Class  87 average =  0.1019186   variance = 0.2051122 
Class  88 average =  0.05281703  variance = 0.0635244 
Class  91 average =  0.08332136  variance = 0.09067632 
Class  96 average =  0.07682093  variance = 0.08144446 
Class  97 average =  0.0792268   variance = 0.08092019 
Class  99 average =  0.1019089   variance = 0.1072126 
Class  100 average = 0.1018262   variance = 0.1081117 
Class  101 average = 0.1106647   variance = 0.1151819 
Class  103 average = 0.08147644  variance = 0.08411685 
Class  104 average = 0.06456508  variance = 0.06801061 
Class  107 average = 0.1197225   variance = 0.1250056 
Class  108 average = 0.0924619   variance = 0.09845582 
Class  109 average = 0.1198932   variance = 0.1209162

Here, when ploting the empirical variance (per knot) against the empirial average of claims, we get

Here, we can identify classes where remaining heterogeneity.

Arthur Charpentier

More Posts - Website

Follow Me:

↧

Modeling individual losses with mixtures

February 15, 2013, 11:00 am

≫ Next: Modélisation des coûts individuels en tarification

≪ Previous: Visualizing overdispersion (with trees)

Usually, the sentence that I keep saying in my regression classes is “please, look at your data“. In our previous post, we’ve been playing like most econometricians: we did not look at the data. Actually, if we look at the distribution of individual losses, in the dataset, we see the following,

> n=nrow(couts)
> plot(sort(couts$cout),(1:n)/(n+1),xlim=c(0,10000),type="s",lwd=2,col="green")

It looks like there are fixed costs claims in our database. How do we deal with it in the standard case (e.g. in Loss Models textbook) ? We can use a mixture of – at least – three distributions here,

with

a distribution for small claims, $http://latex.codecogs.com/gif.latex?{\color{Blue}%20f_1(}\cdot{\color{Blue}%20)}$ , e.g. an exponential distribution
a Dirac mass in $http://latex.codecogs.com/gif.latex?{\color{Magenta}%20\kappa}$ , i.e. $http://latex.codecogs.com/gif.latex?{\color{Magenta}%20\delta_{\kappa}(}\cdot{\color{Magenta}%20)}$
a distribution for larger claims, $http://latex.codecogs.com/gif.latex?{\color{Red}%20f_3(}\cdot{\color{Red}%20)}$ , e.g. a Gamma, or a lognormal, distribution

>  I1=which(couts$cout<1120)
>  I2=which((couts$cout>=1120)&(couts$cout<1220))
>  I3=which(couts$cout>=1220)
>  (p1=length(I1)/nrow(couts))
[1] 0.3284823
>  (p2=length(I2)/nrow(couts))
[1] 0.4152807
>  (p3=length(I3)/nrow(couts))
[1] 0.256237
>  X=couts$cout
>  (kappa=mean(X[I2]))
[1] 1171.998
>  X0=X[I3]-kappa
>  u=seq(0,10000,by=20)
>  F1=pexp(u,1/mean(X[I1]))
>  F2= (u>kappa)
>  F3=plnorm(u-kappa,mean(log(X0)),sd(log(X0))) * (u>kappa)
>  F=F1*p1+F2*p2+F3*p3
>  lines(u,F)

In our previous post, we’ve discussed the idea that all parameters might be related to some covariates, i.e.

$http://latex.codecogs.com/gif.latex?f(y|\boldsymbol{X})%20=%20p_1(\boldsymbol{X})%20{\color{Blue}%20f_1(}y|\boldsymbol{X}{\color{Blue}%20)}%20+%20p_2(\boldsymbol{X})%20{\color{Magenta}%20\delta_{\kappa}(}y{\color{Magenta}%20)}%20+%20p_3(\boldsymbol{X})%20{\color{Red}%20f_3(}y|\boldsymbol{X}{\color{Red}%20)}$

which yield the following premium model,

$http://latex.codecogs.com/gif.latex?\mathbb{E}(Y|\boldsymbol{X})%20=%20{\color{Blue}%20{\underbrace{\mathbb{E}(Y|\boldsymbol{X},Y\leq%20s_1)}_{A}%20\cdot%20{\underbrace{\mathbb{P}(Y\leq%20s_1|\boldsymbol{X})}_{D}}}}\\+{\color{Purple}%20{{\underbrace{\mathbb{E}(Y|Y\in(%20s_1,s_2],%20\boldsymbol{X})%20}_{B}}\cdot%20{\underbrace{\mathbb{P}(Y\in(%20s_1,s_2]|%20\boldsymbol{X})}_{D}}}}\\+{\color{Red}%20{{\underbrace{\mathbb{E}(Y|Y%3E%20s_2,%20\boldsymbol{X})%20}_{C}}\cdot%20{\underbrace{\mathbb{P}(Y%3E%20s_2|%20\boldsymbol{X})}_{D}}}}$

For the $http://latex.codecogs.com/gif.latex?{\color{Blue}%20A}$ , $http://latex.codecogs.com/gif.latex?{\color{Magenta}%20B}$ and $http://latex.codecogs.com/gif.latex?{\color{Red}%20C}$ terms, that’s easy, we can use standard models we’ve seen in the course. For the probability, we should use a multinomial model. Recall that for the logistic regression model, if $http://latex.codecogs.com/gif.latex?(\pi,1-\pi)=(\pi_1,\pi_2)$ , then

$http://latex.codecogs.com/gif.latex?\log%20\frac{\pi}{1-\pi}=\log%20\frac{\pi_1}{\pi_2}%20=\boldsymbol{X}%27\boldsymbol{\beta}$

i.e.

$http://latex.codecogs.com/gif.latex?\pi_1%20=%20\frac{\exp(\boldsymbol{X}%27\boldsymbol{\beta})}{1+\exp(\boldsymbol{X}%27\boldsymbol{\beta})}$

and

$http://latex.codecogs.com/gif.latex?\pi_2%20=%20\frac{1}{1+\exp(\boldsymbol{X}%27\boldsymbol{\beta})}$

To derive a multivariate extension, write

$http://latex.codecogs.com/gif.latex?\pi_1%20=%20\frac{\exp(\boldsymbol{X}%27\boldsymbol{\beta}_1)}{1+\exp(\boldsymbol{X}%27\boldsymbol{\beta}_1)+\exp(\boldsymbol{X}%27\boldsymbol{\beta}_2)}$

$http://latex.codecogs.com/gif.latex?\pi_2%20=%20\frac{\exp(\boldsymbol{X}%27\boldsymbol{\beta}_2)}{1+\exp(\boldsymbol{X}%27\boldsymbol{\beta}_1)+\exp(\boldsymbol{X}%27\boldsymbol{\beta}_2)}$

and

$http://latex.codecogs.com/gif.latex?\pi_3%20=%20\frac{1}{1+\exp(\boldsymbol{X}%27\boldsymbol{\beta}_1)+\exp(\boldsymbol{X}%27\boldsymbol{\beta}_2)}$

Again, maximum likelihood techniques can be used, since

$http://latex.codecogs.com/gif.latex?\mathcal{L}(\boldsymbol{\pi},\boldsymbol{y})\propto%20\prod_{i=1}^n%20\prod_{j=1}^3%20\pi_{i,j}^{Y_{i,j}}$

where here, variable $http://latex.codecogs.com/gif.latex?Y_{i}$ – which take three levels – is splitted in three indicators (like any categorical explanatory variables in standard regression model). Thus,

$http://latex.codecogs.com/gif.latex?\log%20\mathcal{L}(\boldsymbol{\beta},\boldsymbol{y})\propto%20\sum_{i=1}^n%20\sum_{j=1}^2%20\left(Y_{i,j}%20\boldsymbol{X}_i%27\boldsymbol{\beta}_j\right)%20-%20n_i\log\left[1+1+\exp(\boldsymbol{X}%27\boldsymbol{\beta}_1)+\exp(\boldsymbol{X}%27\boldsymbol{\beta}_2)\right]$

and, as for the logistic regression, then use Newton Raphson’ algorithm to compute numerically the maximum likelihood. In R, first we have to define the levels, e.g.

> seuils=c(0,1120,1220,1e+12)
> couts$tranches=cut(couts$cout,breaks=seuils,
+ labels=c("small","fixed","large"))
> head(couts,5)
  nocontrat    no garantie    cout exposition zone puissance agevehicule
1      1870 17219      1RC 1692.29       0.11    C         5           0
2      1963 16336      1RC  422.05       0.10    E         9           0
3      4263 17089      1RC  549.21       0.65    C        10           7
4      5181 17801      1RC  191.15       0.57    D         5           2
5      6375 17485      1RC 2031.77       0.47    B         7           4
  ageconducteur bonus marque carburant densite region tranches
1            52    50     12         E      73     13    large
2            78    50     12         E      72     13    small
3            27    76     12         D      52      5    small
4            26   100     12         D      83      0    small
5            46    50      6         E      11     13    large

Then, we can run a multinomial regression, from

> library(nnet)

using some selected covariates

> reg=multinom(tranches~ageconducteur+agevehicule+zone+carburant,data=couts)
# weights:  30 (18 variable)
initial  value 2113.730043 
iter  10 value 2063.326526
iter  20 value 2059.206691
final  value 2059.134802 
converged

The output is here

> summary(reg)
Call:
multinom(formula = tranches ~ ageconducteur + agevehicule + zone + 
    carburant, data = couts)

Coefficients:
      (Intercept) ageconducteur agevehicule      zoneB      zoneC
fixed  -0.2779176   0.012071029  0.01768260 0.05567183 -0.2126045
large  -0.7029836   0.008581459 -0.01426202 0.07608382  0.1007513
           zoneD      zoneE      zoneF   carburantE
fixed -0.1548064 -0.2000597 -0.8441011 -0.009224715
large  0.3434686  0.1803350 -0.1969320  0.039414682

Std. Errors:
      (Intercept) ageconducteur agevehicule     zoneB     zoneC     zoneD
fixed   0.2371936   0.003738456  0.01013892 0.2259144 0.1776762 0.1838344
large   0.2753840   0.004203217  0.01189342 0.2746457 0.2122819 0.2151504
          zoneE     zoneF carburantE
fixed 0.1830139 0.3377169  0.1106009
large 0.2160268 0.3624900  0.1243560

To visualize the impact of a covariate (one, only), one can use also spline functions

> library(splines)
> reg=multinom(tranches~agevehicule,data=couts)
# weights:  9 (4 variable)
initial  value 2113.730043 
final  value 2072.462863 
converged
> reg=multinom(tranches~bs(agevehicule),data=couts)
# weights:  15 (8 variable)
initial  value 2113.730043 
iter  10 value 2070.496939
iter  20 value 2069.787720
iter  30 value 2069.659958
final  value 2069.479535 
converged

For instance, if the covariate is the age of the car, we do have the following probabilities

> predict(reg,newdata=data.frame(agevehicule=5),type="probs")
    small     fixed     large 
0.3388947 0.3869228 0.2741825

and for all ages from 0 to 20,

For instance, for new cars, the proportion of fixed costs is rather small (here in purple), and keeps increasing with the age of the car. If the covariate is the density of population in the area the driver lives, we do obtain the following probabilities

> reg=multinom(tranches~bs(densite),data=couts)
# weights:  15 (8 variable)
initial  value 2113.730043 
iter  10 value 2068.469825
final  value 2068.466349 
converged
> predict(reg,newdata=data.frame(densite=90),type="probs")
    small     fixed     large 
0.3484422 0.3473315 0.3042263

Based on those probabilities, it is then possible to derive the expected cost of a claims, given some covariates (e.g. the density). But first, define subsets of the whole dataset

> sbaseA=couts[couts$tranches=="small",]
> sbaseB=couts[couts$tranches=="fixed",]
> sbaseC=couts[couts$tranches=="large",]

with a threshold given by

> (k=mean(sousbaseB$cout))
[1] 1171.998

Then, let us run our four models,

> reg=multinom(tranches~bs(densite),data=couts)
> regA=glm(cout~bs(densite),data=sousbaseA,family=Gamma(link="log"))
> regB=glm(cout~1,data=sousbaseB,family=Gamma(link="log"))
> regC=glm((cout-k)~bs(densite),data=sousbaseC,family=Gamma(link="log"))

We can now compute predictions based on those models,

> nouveau=data.frame(densite=seq(10,100))
> proba=predict(reg,newdata=nouveau,type="probs")
> predA=predict(regA,newdata=nouveau,type="response")
> predB=predict(regB,newdata=nouveau,type="response")
> predC=predict(regC,newdata=nouveau,type="response")+k
> pred=cbind(predA,predB,predC)

To visualize the impact of each component on the premium, we can compute probabilities, are well as expected costs (given a cost in each subset),

> cbind(proba,pred)[seq(10,90,by=10),]
       small     fixed     large    predA    predB    predC
10 0.3344014 0.4241790 0.2414196 423.3746 1171.998 7135.904
20 0.3181240 0.4471869 0.2346892 428.2537 1171.998 6451.890
30 0.3076710 0.4626572 0.2296718 438.5509 1171.998 5499.030
40 0.3032872 0.4683247 0.2283881 451.4457 1171.998 4615.051
50 0.3052378 0.4620219 0.2327404 463.8545 1171.998 3961.994
60 0.3136136 0.4417057 0.2446807 472.3596 1171.998 3586.833
70 0.3279413 0.4056971 0.2663616 473.3719 1171.998 3513.601
80 0.3464842 0.3534126 0.3001032 463.5483 1171.998 3840.078
90 0.3652932 0.2868006 0.3479061 440.4925 1171.998 4912.379

Now, it is possible to plot those figures in a graph,

> barplot(t(proba*pred))
> abline(h=mean(couts$cout),lty=2)

(the dotted horizontal line is the average cost of a claim, in our dataset).

Arthur Charpentier

More Posts - Website

Follow Me:

↧

Modélisation des coûts individuels en tarification

February 17, 2013, 7:12 pm

≫ Next: Further readings on GLMs and ratemaking

≪ Previous: Modeling individual losses with mixtures

Avant de terminer le cours sur la tarification, on va parler de la modélisation des coûts individuels. On parlera de lois Gamma et de lois lognormales (sur cette dernière, je suggère de relire ce qui avait été dit dans le cours de modèles de régression sur les modèles log-linéaires, rappelé dans un court billet publié à l’automne). On parlera aussi de mélanges de lois, et de lois multinomiales. Les transparents sont en ligne ici,

Pour aller plus loin, il y a l’article de Fu & Moncher (2004) sur la comparaison Gamma versus lognormale, http://casact.org/… ou Holler, Sommer & Trahair (1999) http://casact.org/… qui proposait un état de l’art, il y a une quinzaine d’années. Sinon, je recommande la lecture du Practitioner’s Guide to Generalized Linear Models, en ligne sur http://casact.org/….

Arthur Charpentier

More Posts - Website

Follow Me:

↧

Semaine de relâche et données de comptage

February 22, 2013, 10:56 am

≫ Next: Données pour la régression logistique, et de Poisson

≪ Previous: Further readings on GLMs and ratemaking

Comme annoncé en cours (pour ceux qui souhaitent profiter de la semaine de relâche pour se préparer) une partie de l’examen intra portera sur la base

> base=read.table("http://freakonometrics.free.fr/baseaffairs.txt",header=TRUE)
> tail(base)
    SEX AGE YEARMARRIAGE CHILDREN RELIGIOUS EDUCATION OCCUPATION SATISFACTION Y
596   1  47         15.0        1         3        16          4            2 7
597   1  22          1.5        1         1        12          2            5 1
598   0  32         10.0        1         2        18          5            4 6
599   1  32         10.0        1         2        17          6            5 2
600   1  22          7.0        1         3        18          6            2 2
601   0  32         15.0        1         3        14          1            5 1

Il s’agit d’une base construite à partir des données de l’article A Theory of Extramarital Affairs, de Ray Fair, paru en 1978 dabs Journal of Political Economy. La variable d’intérêt est (comme son nom l’indique) Y, le nombre d’aventures extra-conjugales pendant l’année passée, avec plusieurs variables explicatives

sex: 0 pour une femme, et 1 pour un homme
age: âge de la personne interrogée
yearmarriage: nombre d’années de mariage
children: 0 si la personne n’a pas d’enfants (avec son épouse) et 1 si elle en a
religious: degré de “religiosité”, entre 1 (anti-religieuse) à 5 (très religieuse)
education: nombre d’aéées d’éducation, 9=grade school, 12=high school, …, à 20=PhD
occupation: construit suivant l’échelle d’Hollingshead (cf http://cba.uah.edu/berkowd/….)
- Higher executives of large concerns, proprietors, and major professionals (1)
- Business managers, proprietors of medium-sized businesses, and lesser professionals (2)
- Administrative personnel, owners of small businesses, and minor professionals (3)
- Clerical and sales workers, technicians, and owners of little businesses (4)
- Skilled manual employees (5)
- Machine operators and semiskilled employees (6)
- Unskilled employees (7)
satisfaction: perception de son mariage, de très mécontente (1) à très contente (5)

Je ne répondrais pas, a priori, aux questions portant sur ces données. Bon courage, et bonne semaine de relâche.

Arthur Charpentier

More Posts - Website

Follow Me:

↧

Données pour la régression logistique, et de Poisson

March 4, 2013, 6:58 am

≫ Next: Readings on IBNR and claims reserving

≪ Previous: Semaine de relâche et données de comptage

Pour le cours de mercredi, deux petites bases, pour se pratiquer à modéliser des variable 0/1 ou une variable de comptage,

> base = read.table("http://freakonometrics.free.fr/base-glm-act2040.txt",
+ header=TRUE)

ou encore

> base = read.table("http://freakonometrics.free.fr/base-pratique-act2040.txt",
+ header=TRUE)

Sinon, une base plus complète pour faire de la tarification,

> BASEN=read.table("http://freakonometrics.free.fr/baseN.txt",header=TRUE,sep=";")
> BASEY=read.table("http://freakonometrics.free.fr/baseY.txt",header=TRUE,sep=";")
> head(BASEN)
ageconducteur agepermis sexeconducteur situationfamiliale  habitation zone
1            57        39              F             Celiba peri-urbain    8
2            54        35              H             Celiba      urbain    3
3            51        32              F             Celiba      urbain    1
4            53        35              H              Marie       rural    4
5            61        43              H              Marie      urbain    8
6            60        29              F              Marie peri-urbain    1
agevehicule proprietaire    payment  marque         poids     usage
1          12    locataire     Annuel  AUTRES     8.>3500kg PROMENADE
2          20     sans mrp Semestriel PEUGEOT 4.3100-3199kg PROMENADE
3           4     sans mrp     Annuel  RAPIDO     1.<2700kg PROMENADE
4           1     sans mrp     Annuel  AUTRES 3.3000-3099kg PROMENADE
5           1 proprietaire     Annuel    FIAT 6.3300-3399kg PROMENADE
6          10     sans mrp    Mensuel    FIAT     8.>3500kg PROMENADE
exposition nombre   voiture
1          1      0 Monospace
2          1      0   Berline
3          1      0  sans avp
4          1      0  sans avp
5          1      1 Monospace
6          1      0  sans avp

Parmi les variables, la description (sommaire) est la suivante,

ageconducteur: âge du conducteur principal du véhicule
agepermis: ancienneté du permis de conduire du conducteur principal du véhicule
sexeconducteur: sexe du conducteur principal (H ou F)
situationfamiliale: situation familiale du conducteur principal (“Celiba”, “Marie” ou “Veuf/Div”)
habitation: zone d’habitation du conducteur principal (“peri-urbain”, “rural” ou “urbain” )
zone: zone d’habitation (allant de 1 à 8)
agevehicule: age du véhicule
proprietaire: si le conducteur principal possède un contrat Habitation, son statut (“locataire” ou “proprietaire”) Sinon “sans mrp”
payment:type de fractionnement de la prime d’assurance automobile (“Annuel”, “Mensuel” ou “Semestriel”)
marque: marque du véhicule

> levels(BASEN[,10])
[1] "ADRIA"       "AUTOSTAR"    "AUTRES"      "BURSTNER MOBIL"
[5] "CHALLENGER"  "CHAUSSON"    "CITROEN"     "FIAT"
[9] "FORD"        "HYMERMOBIL"  "MERCEDES"    "PEUGEOT"
[13] "PILOTE"     "RAPIDO"      "RENAULT"     "VOLKSWAGEN"

poids: classe de poids du véhicule

> levels(BASEN[,11])
[1] "1.<2700kg"    "2.2700-2999kg""3.3000-3099kg""4.3100-3199kg"
[5] "5.3200-3299kg""6.3300-3399kg""7.3400-3499kg""8.>3500kg"

usage: utilisation du véhicule principal (“PROMENADE” ou “TOUS_DEPLACEMENTS”)
exposition: exposition, en années
nombre: nombre d’accident responsabilité civile du conducteur principal, pendant l’année passée
cout: cout du sinistre
voiture: type de véhicule

> levels(BASEN[,15])
[1] "Berline"            "Break"              "Buggy"
[4] "Cabriolet"          "Combispace"         "Coup\xe9"
[7] "Coup\xe9 Cabriolet" "Jeep"               "Minibus"
[10] "Minispace"          "Monospace"         "sans avp"

La variable d’intérêt est ici le nombre d’accident,

> table(BASEN$nombre)

    0     1 
60155  3264

La base est un peu particulière – on en parlera en classe – les assurés ayant eu 0 ou 1 accident dans l’année.

Arthur Charpentier

More Posts - Website

Follow Me:

↧

Readings on IBNR and claims reserving

March 4, 2013, 8:18 am

≫ Next: Multiple (smoothed) regression and portfolio exposure

≪ Previous: Données pour la régression logistique, et de Poisson

The second part of the course on nonlife insurance will be dedicated to IBNR and claims reserving techniques. The main reference is the textbook by Mario Wüthrich and Michael Merz (a preliminary version can be downloaded from http://actuaries.ch/…)

The first reference is Best Estimates for Reserves by by Glen Barnett and Ben Zehnwirth is available online http://casact.org/pubs/…. In 2004 , Ben Zehnwirth, Julie Sims and Mark Shapland published Will Your Next Reserve Increase Be Your Last, available on http://contingencies.org/janfeb04/…. Finally, on simulation based techniques, The Actuary published an article entitled about the bootstrap, http://insureware.com/Library/… For further readings, here are some articles, found in the CAS forums, the ASTIN conferences, etc,

The Chain Ladder and Tweedie Distributed Claims Data by Greg Taylor
Chain-Ladder Bias: Its Reason and Meaning by Leigh Joseph Halliwell
Estimating Predictive Distributions for Loss Reserve Models by Glenn Meyers
Using a Bayesian Approach for Claims Reserving by Mario Wuthrich
Obtaining Predictive Distributions for Reserves Which Incorporate Expert Opinion by Richard Verrall
Loss Reserve Estimates: A Statistical Approach for Determining “Reasonableness” by Mark Shapland
Munich Chain Ladder: A Reserving Method that Reduces the Gap between IBNR Projections Based on Paid Losses and IBNR Projections Based on Incurred Losses by Gerhard Quarg and Thomas Mack
The Bornhuetter-Ferguson Principle by Klaus Schmidt and Mathias Zocher
On the Importance of Dispersion Modeling for Claims Reserving: An Application with the Tweedie Distribution by Jean-Philippe Boucher and Danail Davidov
Quantifying Uncertainty in Reserve Estimates by Zia Rehman and Stuart Klugman
Robustifying Reserving by Gary Venter, Dumaria Rulina Tampubolon, see also Robustifying Reserving
Bootstrap Estimation of the Predictive Distributions of Reserves Using Paid and Incurred Claims by Huijuan Liu and Richard Verrall
Predictive Distributions for Reserves Which Separate True IBNR and IBNER Claims by Huijuan Liu and Richard Verrall
The Retrospective Testing of Stochastic Loss Reserve Models, by Glenn Meyers
A GLM-Based Approach to Adjusting for Changes in Case Reserve Adequacy by Larry Decker
A Method for Modelling Varying Run-off Evolutions in Claims Reserving by Richard Verrall
A Nonlinear Regression Model of Incurred But Not Reported Losses by Scott Stelljes
Back-Testing the ODP Bootstrap of the Paid Chain-Ladder Model with Actual Historical Claims Data by Jessica (Weng Kah) Leong, Shaun Wang and Han Chen
A Practical Way to Estimate One-Year Reserve Risk by Ira Robbin
Closed-Form Distribution of Prediction Uncertainty in Chain Ladder Reserving by Bayesian Approach by Ji Yao
The Prediction Error of Bornhuetter-Ferguson by Thomas Mack
Stochastic Loss Reserving with the Collective Risk Model by Glenn Meyers
On the Accuracy of Loss Reserving Methodology Tapio Boles and Andy Staudt
Bootstrapping Generalized Linear Models for Development Triangles Using Deviance Residuals by Thomas Hartl
Fitting a GLM to Incomplete Development Triangles by Thomas Hartl
Gauss—Markov Loss Prediction in a Linear Model by Alexander Ludwig and Klaus Schmidt
Anatomy of Actuarial Methods of Loss Reserving by Prakash Narayan
Bootstrap Modeling: Beyond the Basics by Mark R. Shapland and Jessica (Weng Kah) Leong
GLM Invariants by Fred Klinker
The Retrospective Testing of Stochastic Loss Reserve Models by Glenn Meyers and Peng Shi
A GLM-Based Approach to Adjusting for Changes in Case Reserve Adequacy by Larry Decker
Testing the Assumptions of Age-to-age Factors, by Gary Venter, see also http://rationalargumentator.com/actuaryguide/…

Arthur Charpentier

More Posts - Website

Follow Me:

↧

Multiple (smoothed) regression and portfolio exposure

March 9, 2013, 2:16 am

≫ Next: Triangles et provisionnement

≪ Previous: Readings on IBNR and claims reserving

Wednesday, in class, we’ve seen how to visualize a multiple regression model (with two continuous explanatory variables). Here, the goal is to predict the average cost of an insurance claim, using some covariates, e.g. the age of the driver, and the age of the car (recall that losses here are liability losses). The prediction obtained from a (standard) generalized linear model, with a log-link

> reg1=glm(cout~ageconducteur+agevehicule,data=base,family=Gamma(link="log"))

The code to visualize the predicted average cost is the following: first, we have to compute predictions for specific values,

> pred=function(x,y){
+ predict(reg,newdata=data.frame(ageconducteur=x,
+ agevehicule=y),type="response")

Then, we use this function to compute values on a grid,

> X=seq(20,80,by=5)
> Y=0:20
> Z=outer(X,Y,p)
> image(X,Y,Z,col=rev(heat.colors(101)))
> contour(X,Y,Z,add=TRUE,
+ levels=c(1400,1800,2000,2200,2400,2600,2800,3000,3200,4000,5000))

If we use factors, and not continuous variates (cut versions of those two variates),

> reg2=glm(cout~cut(ageconducteur,breaks=c(0,22,35,55,80,100))*
+               cut(agevehicule,breaks=c(-1,1,3,5,10,100)),
+ data=base,family=Gamma(link="log"))

(note that we consider the Cartesian product, so values are computed for each product of factors, age of the driver and age of the car) we obtain

Obviously, we’re missing something here: the most expensive class with one model is the cheapeast for the other one! Of course, it might come from our classes (that were chosen a bit randomly), but it might be interesting to use nonlinear functions of the ages. So, let us use splines to smooth those two variables,

> reg3=glm(cout~bs(ageconducteur)+bs(agevehicule),data=base,
+ family=Gamma(link="log"))

With additive smoothed functions, we obtained a symmetric graph (due to the additive property)

while with a bivariate spline

> library(mgcv)
+ reg4=gam(cout~s(ageconducteur,agevehicule),data=base,
+ family=Gamma(link="log"))

(for some odd reasons, I could not use – easily – a bivariate spline in the Generalized Linear Model, but it did work considering a Generalized Additive Model – which is, by no means additive now). We can identify here some regions where the average cost can be extremely expensive… But, as mentioned wednesday, one should keep in mind that some parts of the square above are not reached. More precisely, the distribution of the portfolio, as a function of those two covariates is the following

Thus, the proportion of young drivers driving a brand new car, and the proportion of old drivers driving a very old car is rather small… If the goal is to find niches, one should look at the prediction more carefully, but if the goal is to make that everyone gets an insurance cover, maybe we should allow that some drivers are under-priced (especially when are rare in the portfolio). And one should keep in mind that average costs are extremely sensitive to large losses, as discussed previously http://freakonometrics.hypotheses.org/3490 (and in class)

In the univariate case, I have migrated an old post, we I tried to reproduce (in R and in French) some standard graphs in the insurance industry: it is always interesting to visualize not only the prediction obtained from our models, but also the size of each class in the portfolio,

The post is online here http://freakonometrics.hypotheses.org/1224

Arthur Charpentier

More Posts - Website

Follow Me:

↧