Some heuristics about local regression and kernel smoothing

October 8, 2013, 6:53 pm

≫ Next: Some heuristics about spline smoothing

≪ Previous: Regression on variables, or on categories?

In a standard linear model, we assume that . Alternatives can be considered, when the linear assumption is too strong.

Polynomial regression

A natural extension might be to assume some polynomial function,

Again, in the standard linear model approach (with a conditional normal distribution using the GLM terminology), parameters can be obtained using least squares, where a regression of on is considered.

Even if this polynomial model is not the real one, it might still be a good approximation for . Actually, from Stone-Weierstrass theorem, if is continuous on some interval, then there is a uniform approximation of by polynomial functions.

Just to illustrate, consider the following (simulated) dataset

set.seed(1)
n=10
xr = seq(0,n,by=.1)
yr = sin(xr/2)+rnorm(length(xr))/2
db = data.frame(x=xr,y=yr)
plot(db)

with the standard regression line

reg = lm(y ~ x,data=db)
abline(reg,col="red")

Consider some polynomial regression. If the degree of the polynomial function is large enough, any kind of pattern can be obtained,

reg=lm(y~poly(x,5),data=db)

But if the degree is too large, then too many ‘oscillations’ are obtained,

reg=lm(y~poly(x,25),data=db)

and the estimation might be be seen as no longer robust: if we change one point, there might be important (local) changes

plot(db)
attach(db)
lines(xr,predict(reg),col="red",lty=2)
yrm=yr;yrm[31]=yr[31]-2 
regm=lm(yrm~poly(xr,25)) 
lines(xr,predict(regm),col="red")

Local regression

Actually, if our interest is to have locally a good approximation of , why not use a local regression?

This can be done easily using a weighted regression, where, in the least square formulation, we consider

(it is possible to consider weights in the GLM framework, but let’s keep that for another post). Two comments here:

here I consider a linear model, but any polynomial model can be considered. Even a constant one. In that case, the optimization problem is

which can be solve explicitly, since

so far, nothing was mentioned about the weights. The idea is simple, here: if you can a good prediction at point , then should be proportional to some distance between and : if is too far from , then it should not have to much influence on the prediction.

For instance, if we want to have a prediction at some point , consider . With this model, we remove observations too far away,

Actually, here, it is the same as

reg=lm(yr~xr,subset=which(abs(xr-x0)<1)

A more general idea is to consider some kernel function that gives the shape of the weight function, and some bandwidth (usually denoted h) that gives the length of the neighborhood, so that

This is actually the so-called Nadaraya-Watson estimator of function .
In the previous case, we did consider a uniform kernel , with bandwith ,

But using this weight function, with a strong discontinuity may not be the best idea… Why not a Gaussian kernel,

This can be done using

fitloc0 = function(x0){
w=dnorm((xr-x0))
reg=lm(y~1,data=db,weights=w)
return(predict(reg,newdata=data.frame(x=x0)))}

On our dataset, we can plot

ul=seq(0,10,by=.01)
vl0=Vectorize(fitloc0)(ul)
u0=seq(-2,7,by=.01)
linearlocalconst=function(x0){
w=dnorm((xr-x0))
plot(db,cex=abs(w)*4)
lines(ul,vl0,col="red")
axis(3)
axis(2)
reg=lm(y~1,data=db,weights=w)
u=seq(0,10,by=.02)
v=predict(reg,newdata=data.frame(x=u))
lines(u,v,col="red",lwd=2)
abline(v=c(0,x0,10),lty=2)
}
linearlocalconst(2)

Here, we want a local regression at point 2. The horizonal line below is the regression (the size of the point is proportional to the wieght). The curve, in red, is the evolution of the local regression

Let us use an animation to visualize the construction of the curve. One can use

library(animate)

but for some reasons, I cannot install the package easily on Linux. And it is not a big deal. We can still use a loop to generate some graphs

vx0=seq(1,9,by=.1)
vx0=c(vx0,rev(vx0))
graphloc=function(i){
name=paste("local-reg-",100+i,".png",sep="")
png(name,600,400)
linearlocalconst(vx0[i])
dev.off()}

for(i in 1:length(vx0)) graphloc(i)

and then, in a terminal, I simply use

    convert -delay 25 /home/freak/local-reg-1*.png /home/freak/local-reg.gif

Of course, it is possible to consider a linear model, locally,

fitloc1 = function(x0){
w=dnorm((xr-x0))
reg=lm(y~poly(x,degree=1),data=db,weights=w)
return(predict(reg,newdata=data.frame(x=x0)))}

or even a quadratic (local) regression,

fitloc2 = function(x0){
w=dnorm((xr-x0))
reg=lm(y~poly(x,degree=2),data=db,weights=w)
return(predict(reg,newdata=data.frame(x=x0)))}

Of course, we can change the bandwidth

To conclude the technical part this post, observe that, in practise, we have to choose the shape of the weight function (the so-called kernel). But there are (simple) technique to select the “optimal” bandwidth h. The idea of cross validation is to consider

where is the prediction obtained using a local regression technique, with bandwidth . And to get a more accurate (and optimal) bandwith is obtained using a model estimated on a sample where the ith observation was removed. But again, that is not the main point in this post, so let’s keep that for another one…

Perhaps we can try on some real data? Inspired from a great post on http://f.briatte.org/teaching/ida/092_smoothing.html, by François Briatte, consider the Global Episode Opinion Survey, from some TV show, http://geos.tv/index.php/index?sid=189 , like Dexter.

library(XML)
library(downloader)
file = "geos-tww.csv"
html = htmlParse("http://www.geos.tv/index.php/list?sid=189&collection=all")
html = xpathApply(html, "//table[@id='collectionTable']")[[1]]
data = readHTMLTable(html)
data = data[,-3]
names(data)=c("no",names(data)[-1])
data=data[-(61:64),]

Let us reshape the dataset,

data$no = 1:96
data$mu = as.numeric(substr(as.character(data$Mean), 0, 4))
data$se =  sd(data$mu,na.rm=TRUE)/sqrt(as.numeric(as.character(data$Count)))
data$season = 1 + (data$no - 1)%/%12
data$season = factor(data$season)
plot(data$no,data$mu,ylim=c(6,10))
segments(data$no,data$mu-1.96*data$se,
data$no,data$mu+1.96*data$se,col="light blue")

As done by François, we compute some kind of standard error, just to reflect uncertainty. But we won’t really use it.

plot(data$no,data$mu,ylim=c(6,10))
abline(v=12*(0:8)+.5,lty=2)
for(s in 1:8){reg=lm(mu~no,data=db,subset=season==s)
lines((s-1)*12+1:12,predict(reg)[1:12],col="red") }

Henre, we assume that all seasons should be considered as completely independent… which might not be a great assumption.

db = data
NW = ksmooth(db$no,db$mu,kernel = "normal",bandwidth=5)
plot(data$no,data$mu)
lines(NW,col="red")

We can try to look the curve with a larger bandwidth. The problem is that there is a missing value, at the end. If we (arbitrarily) fill it, we can run a kernel regression,

db$mu[95]=7
NW = ksmooth(db$no,db$mu,kernel = "normal",bandwidth=12) 
plot(data$no,data$mu,ylim=c(6,10)) 
lines(NW,col="red")

↧

Some heuristics about spline smoothing

October 8, 2013, 7:55 pm

≫ Next: Introduction aux GLM

≪ Previous: Some heuristics about local regression and kernel smoothing

Let us continue our discussion on smoothing techniques in regression. Assume that . where is some unkown function, but assumed to be sufficently smooth. For instance, assume that is continuous, that exists, and is continuous, that exists and is also continuous, etc. If is smooth enough, Taylor’s expansion can be used. Hence, for $http://latex.codecogs.com/gif.latex?x\in(\alpha,\beta)$

which can also be writen as

for some ‘s. The first part is simply a polynomial.

The second part, is some integral. Using Riemann integral, observe that

for some ‘s, and some

Thus,

Nice! We have our linear regression model. A natural idea is then to consider a regression of on $http://latex.codecogs.com/gif.latex?\boldsymbol{X}$ where

$http://latex.codecogs.com/gif.latex?\boldsymbol{X}%20=%20(1,X,X^2,\cdots,X^d,(X-x_1)_+^d,\cdots,(X-x_k)_+^d%20)$

given some knots $http://latex.codecogs.com/gif.latex?\{x_1,\cdots,x_k\}$ . To make things easier to understand, let us work with our previous dataset,

plot(db)

If we consider one knot, and an expansion of order 1,

attach(db)
library(splines)
B=bs(xr,knots=c(3),Boundary.knots=c(0,10),degre=1)
reg=lm(yr~B)
lines(xr[xr<=3],predict(reg)[xr<=3],col="red")
lines(xr[xr>=3],predict(reg)[xr>=3],col="blue")

The prediction obtained with this spline can be compared with regressions on subsets (the doted lines)

reg=lm(yr~xr,subset=xr<=3)
lines(xr[xr<=3],predict(reg)[xr<=3],col="red",lty=2)
reg=lm(yr~xr,subset=xr>=3)
lines(xr[xr>=3],predict(reg),col="blue",lty=2)

It is different, since we have here three parameters (and not four, as for the regressions on the two subsets). One degree of freedom is lost, when asking for a continuous model. Observe that it is possible to write, equivalently

reg=lm(yr~bs(xr,knots=c(3),Boundary.knots=c(0,10),degre=1),data=db)

So, what happened here?

B=bs(xr,knots=c(2,5),Boundary.knots=c(0,10),degre=1)
matplot(xr,B,type="l")
abline(v=c(0,2,5,10),lty=2)

Here, the functions that appear in the regression are the following

Now, if we run the regression on those two components, we get

B=bs(xr,knots=c(2,5),Boundary.knots=c(0,10),degre=1)
matplot(xr,B,type="l")
abline(v=c(0,2,5,10),lty=2)

If we add one knot, we get

the prediction is

reg=lm(yr~B)
lines(xr,predict(reg),col="red")

Of course, we can choose much more knots,

B=bs(xr,knots=1:9,Boundary.knots=c(0,10),degre=1)
reg=lm(yr~B)
lines(xr,predict(reg),col="red")

We can even get a confidence interval

reg=lm(yr~B)
P=predict(reg,interval="confidence")
plot(db,col="white")
polygon(c(xr,rev(xr)),c(P[,2],rev(P[,3])),col="light blue",border=NA)
points(db)
reg=lm(yr~B)
lines(xr,P[,1],col="red")
abline(v=c(0,2,5,10),lty=2)

And if we keep the two knots we chose previously, but consider Taylor’s expansion of order 2, we get

B=bs(xr,knots=c(2,5),Boundary.knots=c(0,10),degre=2)
matplot(xr,B,type="l")
abline(v=c(0,2,5,10),lty=2)

So, what’s going on? If we consider the constant, and the first component of the spline based matrix, we get

k=2
plot(db)
B=cbind(1,B)
lines(xr,B[,1:k]%*%coefficients(reg)[1:k],col=k-1,lty=k-1)

If we add the constant term, the first term and the second term, we get the part on the left, before the first knot,

k=3
lines(xr,B[,1:k]%*%coefficients(reg)[1:k],col=k-1,lty=k-1)

and with three terms from the spline based matrix, we can get the part between the two knots,

k=4
lines(xr,B[,1:k]%*%coefficients(reg)[1:k],col=k-1,lty=k-1)

and finallty, when we sum all the terms, we get this time the part on the right, after the last knot,

k=5
lines(xr,B[,1:k]%*%coefficients(reg)[1:k],col=k-1,lty=k-1)

This is what we get using a spline regression, quadratic, with two (fixed) knots. And can can even get confidence intervals, as before

reg=lm(yr~B)
P=predict(reg,interval="confidence")
plot(db,col="white")
polygon(c(xr,rev(xr)),c(P[,2],rev(P[,3])),col="light blue",border=NA)
points(db)
reg=lm(yr~B)
lines(xr,P[,1],col="red")
abline(v=c(0,2,5,10),lty=2)

The great idea here is to use functions , that will insure continuity at point .

Of course, we can use those splines on our Dexter application,

Here again, using linear spline function, it is possible to impose a continuity constraint,

plot(data$no,data$mu,ylim=c(6,10))
abline(v=12*(0:8)+.5,lty=2)
reg=lm(mu~bs(no,knots=c(12*(1:7)+.5),Boundary.knots=c(0,97),
degre=1),data=db)
lines(c(1:94,96),predict(reg),col="red")

But we can also consider some quadratic splines,

plot(data$no,data$mu,ylim=c(6,10))
abline(v=12*(0:8)+.5,lty=2)
reg=lm(mu~bs(no,knots=c(12*(1:7)+.5),Boundary.knots=c(0,97),
degre=2),data=db)
lines(c(1:94,96),predict(reg),col="red")

↧

Introduction aux GLM

October 8, 2013, 9:00 pm

≫ Next: Surdispersion et comptage

≪ Previous: Some heuristics about spline smoothing

Cette semaine, on finit la régression de Poisson (temporairement) avant de présenter la théorie des GLM. Les transparents sont en ligne. On en aura besoin pour aller plus loin sur les modèles avec surdispersion, pour modéliser la fréquence de sinistre, mais aussi pour modéliser les coûts.

↧

Surdispersion et comptage

October 15, 2013, 10:13 am

≫ Next: Modélisation des coûts individuels

≪ Previous: Introduction aux GLM

Cette semaine, au cours d’assurance non-vie, on abordera la surdispersion, qui clôturera la partie du cours sur la modélisation de la fréquence de sinistres. Les transparents sont en ligne. Mais avant de parler de surdispersion, on finira la présentation des GLM. Je mets un lien vers le chapitre 15 du livre de John Fox Applied regression analysis and generalized linear models ainsi que le livre de James K. Lindsey Applying Generalized Linear Models. Je voudrais aussi renvoyer vers les notes de cours de Germán Rodríguez, avec des notes sur la régression de Poisson (avec un petit complément sur la notion de overdispersion).

Les instructions pour le second devoir seront envoyées par courriel.

↧

Modélisation des coûts individuels

October 22, 2013, 4:13 pm

≫ Next: GLM, non-linearity and heteroscedasticity

≪ Previous: Surdispersion et comptage

Cette semaine, même si le réseau de l’UQAM est down, on va continuer le cours et finir la section sur la modélisation de la surdispersion pour la fréquence de sinistres. On devrait ensuite commencer la modélisation des coûts individuels. En particulier, on passera du temps autour de deux points,

la distinction lognormale et gamma
l’écrêtement des gros sinistres

Les transparents sont en ligne. Et la base des coûts est celle évoquée au second cours.

↧

GLM, non-linearity and heteroscedasticity

October 22, 2013, 5:00 pm

≫ Next: Pricing Reinsurance Contracts

≪ Previous: Modélisation des coûts individuels

Last week in the non-life insurance course, we’ve seen the theory of the Generalized Linear Models, emphasizing the two important components

the link function (which is actually the key component in predictive modeling)
the distribution, or the variance function

Just to illustrate, consider my favorite dataset

lin.mod = lm(dist~speed,data=cars)

A linear model means here

where the residuals are assumed to be centered, independent, and with identical variance. If we visualize that linear regression, we usually see something like that

The idea here (in GLMs) is to assume

which will produce the same model as the one describe previously, based on some error term. That model can be visualized below,

attach(cars)
n=2
X= cars$speed 
Y=cars$dist
df=data.frame(X,Y)
vX=seq(min(X)-2,max(X)+2,length=n)
vY=seq(min(Y)-15,max(Y)+15,length=n)
mat=persp(vX,vY,matrix(0,n,n),zlim=c(0,.1),theta=-30,ticktype ="detailed", box = FALSE)
reggig=glm(Y~X,data=df,family=gaussian(link="identity"))
x=seq(min(X),max(X),length=501)
C=trans3d(x,predict(reggig,newdata=data.frame(X=x),type="response"),rep(0,length(x)),mat)
lines(C,lwd=2)
sdgig=sqrt(summary(reggig)$dispersion)
x=seq(min(X),max(X),length=501)
y1=qnorm(.95,predict(reggig,newdata=data.frame(X=x),type="response"), sdgig)
C=trans3d(x,y1,rep(0,length(x)),mat)
lines(C,lty=2)
y2=qnorm(.05,predict(reggig,newdata=data.frame(X=x),type="response"), sdgig)
C=trans3d(x,y2,rep(0,length(x)),mat)
lines(C,lty=2)
C=trans3d(c(x,rev(x)),c(y1,rev(y2)),rep(0,2*length(x)),mat)
polygon(C,border=NA,col="yellow")
C=trans3d(X,Y,rep(0,length(X)),mat)
points(C,pch=19,col="red")
n=8
vX=seq(min(X),max(X),length=n)
mgig=predict(reggig,newdata=data.frame(X=vX))
sdgig=sqrt(summary(reggig)$dispersion)
for(j in n:1){
stp=251
x=rep(vX[j],stp)
y=seq(min(min(Y)-15,qnorm(.05,predict(reggig,newdata=data.frame(X=vX[j]),type="response"), sdgig)),max(Y)+15,length=stp)
z0=rep(0,stp)
z=dnorm(y, mgig[j], sdgig)
C=trans3d(c(x,x),c(y,rev(y)),c(z,z0),mat)
polygon(C,border=NA,col="light blue",density=40)
C=trans3d(x,y,z0,mat)
lines(C,lty=2)
C=trans3d(x,y,z,mat)
lines(C,col="blue")}

We do have two parts here: the linear increase of the average, and the constant variance of the normal distribution .

On the other hand, if we assume a Poisson regression,

poisson.reg = glm(dist~speed,data=cars,family=poisson(link="log"))

we have something like

This time, two things have changed simultaneously: our model is no longer linear, it is an exponential one , and the variance is also increasing with the explanatory variable , since with a Poisson regression,

If we adapt the previous code, we get

The problem is that we changed two things when we introduced the Poisson regression from the linear model. So let us look at what happens when we change the two components independently. First, we can change the link function, with a Gaussian model but this time a multiplicative model (with a logarithm link function)

gaussian.reg = glm(dist~speed,data=cars,family=gaussian(link="log"))

which is still, here, an homoscedasctic model, but this time non-linear. Or we can change the link function in the Poisson regression, to get a linear model, but heteroscedastic

poisson.lin = glm(dist~speed,data=cars,family=poisson(link="identity"))

So this is basically what GLMs are about….

↧

Pricing Reinsurance Contracts

October 24, 2013, 5:08 pm

≫ Next: Réassurance

≪ Previous: GLM, non-linearity and heteroscedasticity

In order to illustrate the next section of the non-life insurance course, consider the following example¹, inspired from http://sciencepolicy.colorado.edu/…. This is the so-called “Normalized Hurricane Damages in the United States” dataset, for the period 1900-2005, from Pielke et al. (2008). The dataset is available in xls format, so we have to spend some time to import it,

> library(gdata)
> db=read.xls(
+ "http://sciencepolicy.colorado.edu/publications/special/public_data_may_2007.xls",
+ sheet=1)
trying URL 'http://sciencepolicy.colorado.edu/publications/special/public_data_may_2007.xls'

Content type 'application/vnd.ms-excel' length 119296 bytes (116 Kb)
opened URL
==================================================
downloaded 116 Kb

perl: warning: Setting locale failed.
perl: warning: Please check that your locale settings:
	LANGUAGE = "fr_CA:fr",
	LC_ALL = (unset),
	LANG = "fr_CA.UTF-8"
    are supported and installed on your system.
perl: warning: Falling back to the standard locale ("C").

The problem with excel spreadsheets is that some columns might have pre-specified format (here, losses are with a format 000,000,000 for instance)

> tail(db)
    Year Hurricane.Description State Category Base.Economic.Damage
202 2005                 Cindy    LA        1          320,000,000
203 2005                Dennis    FL        3        2,230,000,000
204 2005               Katrina LA,MS        3       81,000,000,000
205 2005               Ophelia    NC        1        1,600,000,000
206 2005                  Rita    TX        3       10,000,000,000
207 2005                 Wilma    FL        3       20,600,000,000
    Normalized.PL05 Normalized.CL05  X X.1
202     320,000,000     320,000,000 NA  NA
203   2,230,000,000   2,230,000,000 NA  NA
204  81,000,000,000  81,000,000,000 NA  NA
205   1,600,000,000   1,600,000,000 NA  NA
206  10,000,000,000  10,000,000,000 NA  NA
207  20,600,000,000  20,600,000,000 NA  NA

To get data in a format we can play with, consider the following function,

> stupidcomma = function(x){
+ x=as.character(x)
+ for(i in 1:10){x=sub(",","",as.character(x))}
+ return(as.numeric(x))}

and let’s convert those values into numbers,

> base=db[,1:4]
> base$Base.Economic.Damage=Vectorize(stupidcomma)(db$Base.Economic.Damage)
> base$Normalized.PL05=Vectorize(stupidcomma)(db$Normalized.PL05)
> base$Normalized.CL05=Vectorize(stupidcomma)(db$Normalized.CL05)

Here is the dataset we will use, from now on,

> tail(base)
    Year Hurricane.Description State Category Base.Economic.Damage
202 2005                 Cindy    LA        1             3.20e+08
203 2005                Dennis    FL        3             2.23e+09
204 2005               Katrina LA,MS        3             8.10e+10
205 2005               Ophelia    NC        1             1.60e+09
206 2005                  Rita    TX        3             1.00e+10
207 2005                 Wilma    FL        3             2.06e+10
    Normalized.PL05 Normalized.CL05
202        3.20e+08        3.20e+08
203        2.23e+09        2.23e+09
204        8.10e+10        8.10e+10
205        1.60e+09        1.60e+09
206        1.00e+10        1.00e+10
207        2.06e+10        2.06e+10

We can visualize the normalized costs of hurricanes, from 1900 till 2005, with the 207 hurricanes (here the x-axis is not time, it is simply the index of the loss)

> plot(base$Normalized.PL05/1e9,type="h",ylim=c(0,155))

As usual, there are two components when computing the pure premium of an insurance contract. The number of claims (or here hurricanes) and the individual losses of each claim. We’ve seen – above – individual losses, let us focus now on the annual frequency.

> TB <- table(base$Year)
> years <- as.numeric(names(TB))
> counts <- as.numeric(TB)
> years0=(1900:2005)[which(!(1900:2005)%in%years)]
> db <- data.frame(years=c(years,years0),
+ counts=c(counts,rep(0,length(years0))))
> db[88:93,]
   years counts
88  2003      3
89  2004      6
90  2005      6
91  1902      0
92  1905      0
93  1907      0

On average, we experience about 2 (major) hurricanes per year,

> mean(db$counts)
[1] 1.95283

In predictive modeling (here, we wish to price a reinsurance contract for, say, 2014), we need probably to take into account some possible trend in the hurricane occurrence frequency. We can consider either a linear trend,

> reg0 <- glm(counts~years,data=db,family=poisson(link="identity"),
+ start=lm(counts~years,data=db)$coefficients)

or an exponential one,

> reg1 <- glm(counts~years,data=db,family=poisson(link="log"))

We can plot those three predictions, and get a prediction for the number of (major) hurricanes in 2014,

> plot(years,counts,type='h',ylim=c(0,6),xlim=c(1900,2020))
> cpred1=predict(reg1,newdata=data.frame(years=1890:2030),type="response")
> lines(1890:2030,cpred1,col="blue")
> cpred0=predict(reg0,newdata=data.frame(years=1890:2030),type="response")
> lines(1890:2030,cpred0,col="red")
> abline(h=mean(db$counts),col="black")
> (predictions=cbind(constant=mean(db$counts),linear=
+ cpred0[126],exponential=cpred1[126]))
    constant   linear exponential
126  1.95283 3.573999    4.379822
> points(rep((1890:2030)[126],3),prediction,col=c("black","red","blue"),pch=19)

Observe that changing the model will change the pure premium: with a flat prediction, we expect less than 2 (major) hurricanes, but with the exponential trend, we expect more than 4…

This is for the expected frequency. Now, we should find a suitable model to compute the pure premium of a reinsurance treaty, with a (high) deductible, and a limited (but large) cover. As we will seen in class next week, the appropriate model is a Pareto distribution (see Hagstrœm (1925), Huyghues-Beaufond (1991) or a survey – in French – published a few years ago).

We can use Hill’s plot to estimate the tail index,

> library(evir)
> hill(base$Normalized.PL05)

Clearly, costs of major hurricanes are heavy tailed.

Now, consider an insurance company, in the U.S., with 5% market share (just to illustrate). We will consider there \tilde Y_i= Y_i/20. The losses are given below. Consider a reinsurance treaty, with a deductible of 2 (billion) and a limited cover of 4 (billion),

For our Pareto model, consider only losses above 500 millions,

> threshold=.5
> (gpd.PL <- gpd(base$Normalized.PL05/1e9/20,threshold)$par.ests)
       xi      beta 
0.4424669 0.6705315

Keep in mind the 1 hurricane out of 8 reaches that level

> mean(base$Normalized.CL05/1e9/20>.5)
[1] 0.1256039

Given that the loss exceeds 500 millions, we can now compute the expected value of the reinsurance contact,

To compute it we can use

> E <- function(yinf,ysup,xi,beta){
+   as.numeric(integrate(function(x) (x-yinf)*dgpd(x,xi,mu=threshold,beta),
+   lower=yinf,upper=ysup)$value+
+   (1-pgpd(ysup,xi,mu=threshold,beta))*(ysup-yinf))
+ }

[Nov 5th] there is a typo in the previous function, since the threshold should be used, here, as a parameter in the function, if you want to play with that function an see the impact of the threshold (see a more recent post on the same topic, but a different dataset)… but here, we do not change the threshold, so it is not a big deal.

Now, it is probably time to bring all the pieces together. We might expect a bit less than 2 (major) hurricanes per year,

> predictions[1]
[1] 1.95283

and each hurricane has 12.5% chances to cost more than 500 million for our insurance company,

> mean(base$Normalized.PL05/1e9/20>.5)
[1] 0.1256039

and given that an hurricane exceeds 500 million loss, then the expected repayment by the reinsurance company is (in millions)

> E(2,6,gpd.PL[1],gpd.PL[2])*1e3
[1] 330.9865

So the pure premium of the reinsurance contract is simply

> predictions[1]*mean(base$Normalized.PL05/1e9/20>.5)*
+ E(2,6,gpd.PL[1],gpd.PL[2])*1e3
[1] 81.18538

for a cover of 4 billion, in excess of 2.

^{1.This example will be found in the Reinsurance and Extremal Events chapter in the forthcoming Computational Actuarial Science with R, by Eric Gilleland and Mathieu Ribatet.}

↧

Réassurance

October 29, 2013, 12:02 am

≫ Next: More significant? so what…

≪ Previous: Pricing Reinsurance Contracts

Mercredi, on finira la modélisation des coûts individuels de sinistres en évoquant la mutualisation. Si on a le temps, on parlera aussi de réassurance. Les transparents sont en ligne.

Sinon, histoire d’illustrer les aspects pratiques de la tarification, j’utiliserais peut-être la base xls des gros sinistres en perte d’exploitation, en France, sur la période 1985-2000. Côté lectures complémentaires, je recommande la lecture de Introduction à la réassurance, publié par Swiss Re, ou ainsi que quelques documents plus techniques, comme The Pareto model in property reinsurance , Exposure rating, ou Designing property reinsurance programmes encore Introduction to reinsurance accounting. Plusieurs réassureurs (et courtiers en réassurance) publient des études techniques sur leurs sites, http://swissre.com/, http://munichre.com/, http://aon.com/, http://scor.com/ ou encorehttp://guycarp.com/. Sinon je renvois aux notes de cours de Peter Antal, quantitative methods in reinsurance.

Et histoire de mettre à jour mes transparents, les sinistres les plus chers, pour les compagnies d’assurance et de réassurance : http://businessinsider.com/… donne le classement suivant en dollars de 2010 (on pourra aussi consulter http://media.swissre.com/…)

Hurricane Katrina (US, Bahamas, Cuba, Aug. 2005), $ 72.3 billion
Tōhoku earthquake and tsunami (Japan, March 2011), $ 35 billion
Hurricane Andrew (US, Bahamas, August 1992), $ 25 billion
September 11 attacks (US) $ 23.1 billion
Northridge earthquake (US) $ 20.6 billion
Hurricane Ike (US, Haiti, Dominican Republic, Sept. 2005) $ 20.5 billion
Hurricane Ivan (US, Barbados, Sept. 2004) $ 14.9 billion
Hurrican Wilman (US, Mexico, Jamaica, Oct. 2005), $ 14 billion
Hurricane Rita (US, Cuba, Sept. 2005) $ 11.3 billion
Hurricane Charley (US, Cuba, Jamaica) $ 9.3. billion

A titre de comparaison, les chiffres d’affaires des plus gros réassureurs (prime émise en 2010) étaient, selonhttp://www.insurancenetworking.com/…

Munich Reinsurance Company $ 31.3 billion
Swiss Reinsurance Company Limited $ 24.7 billion
Hannover Rueckversicherung AG $ 15.1 billion
Berkshire Hathaway Inc. $ 14.4 billion
Lloyd’s $ 13 billion
SCOR S.E. $ 8.8 billion
Reinsurance Group of America Inc. $ 7.2 billion
Allianz S.E. $ 5.7 billion
PartnerRe Ltd. $ 4.9 billion
Everest Re Group Ltd. $ 4.2 billion

↧

More significant? so what…

October 30, 2013, 7:41 pm

≫ Next: Pricing reinsurance contracts, another case study

≪ Previous: Réassurance

Following my non-life insurance class, this morning, I had an interesting question from a student, that I will try to illustrate, and reformulate as accurately as possible. Consider a simple regression model, with one variable of interest, and one possible explanatory variable. Assume that we have two possible models, with the following output (yes, I do hide interesting parts here, but it is to get quickly to my student’s point)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.92883    0.06391  14.534   <2e-16 ***
X           -0.12499    0.06108  -2.046   0.0421 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

for the first model – a GLM with some distribution, and some link function – and

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.92901    0.06270  14.817   <2e-16 ***
X           -0.09883    0.05816  -1.699   0.0909 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

for the second one – with another GLM, with another distribution, but the same link function (I guess I could have changed it, but it does not really matter here). Then, I got the following statement “I would like to choose the first model because the explanatory variable is more significant, and therefore, this model should have a stronger predictive power“.

That’s a nice idea, isn’t it ? Actually, I guess this is why I love teaching, because I will never be able to think about such an idea by myself. Because when you look at that statement, somehow it could make sense. Except that from my point of view, it is not valid at all. My first thought was to recall is standard example in statistical inference : you cannot not claim that a distribution is better than another one just by looking at the parameter estimates.

> fitdistr(Y,"normal")
      mean          sd    
  0.93685011   0.90700830 
 (0.06413517) (0.04535042)
> fitdistr(Y,"exponential")
      rate   
  1.06740661 
 (0.07547704)

Can I claim that the Gaussian distribution is better than the exponential one because parameter estimates have smaller standard deviation ? Because somehow, this is what we did when we claimed previously that the first model was better than the second one.

Let me get back on the outputs of the two regressions, and let me explain what I did. Actually, I wanted to have a story close to the one on the Gaussian versus exponential fit. So I did generate some exponential random variable,

> set.seed(5)
> n=200
> U=runif(n); 
> Y=-log(U)

Here, we can visualize the histogram of this sample, as well as the the estimated exponential distribution

> hist(Y,proba=TRUE,col="light green",border="white",lwd=2,breaks=seq(0,5.3333333333333,by=.333333333))
> x=seq(0,6,by=.02)
> lines(x,dexp(x,1/mean(Y)),col="red",lty=2)

On top of that, let us fit a gamma distribution. Using a GLM (where the regression is here on a constant – only), just to practice because later on, we will use a gamma regression on that variable

> reg0=glm(Y~1,family=Gamma(link="identity"))
> a=reg0$coefficient
> b=summary(reg0)$dispersion
> lines(x,dgamma(x,shape=1/b,scale=a*b),col="blue")

Now, we need a covariate, to run some regressions. What I wanted is some variable slightly correlated with our previous variable. Slightly, just to make sure that our -value in the regression will be close to 5% or 10%. So here, I did generate a variable so that the pair has Clayton copula, with coefficient 0.1 (which is small, extremely small)

> a=.1
> set.seed(5)
> n=200
> U=runif(n); 
> V=(U^(-a)*(runif(n)^(-a/(1+a))-1)+1)^(-1/a)
> Y=-log(U)
> X=qnorm(V)

To visualize the copula of the variables, we can use

> cop=function(u,v){
+ (a+1)*(u*v)^(-(a+1))*
+ (u^(-a)+v^(-a)-1)^(-(2*a+1)/a) }
> x=y=seq(.05,.95,by=.05)
> z=outer(x,y,cop)
> mat=persp(x,y,z,col="green",shade=TRUE,xlim=c(0,1),ylim=c(0,1),zlim=c(0,2),theta=-30,
+ ticktype ="detailed",zlab="")

We should be not far away from the independence (actually, there is a negative – significant – correlation (Pearson’s correlation)). Now, consider two models,

a Gaussian model (here a standard linear model)
a gamma model, with a linear link function

The outputs are the following (you will recognize the outputs given previously)

> reg1=lm(Y~X)
> reg2=glm(Y~X,family=Gamma(link="identity"))
> summary(reg1)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.92883    0.06391  14.534   <2e-16 ***
X           -0.12499    0.06108  -2.046   0.0421 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9021 on 198 degrees of freedom
Multiple R-squared:  0.02071,	Adjusted R-squared:  0.01576 
F-statistic: 4.187 on 1 and 198 DF,  p-value: 0.04206

> summary(reg2)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.92901    0.06270  14.817   <2e-16 ***
X           -0.09883    0.05816  -1.699   0.0909 .  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 0.9086447)

    Null deviance: 229.72  on 199  degrees of freedom
Residual deviance: 226.58  on 198  degrees of freedom
AIC: 379.22

Number of Fisher Scoring iterations: 10

And here are the two predictions,

So, which model should we use? As usual, my answer will be “let’s have a look at the data” instead of looking only at tables of figures. Using some code posted a few days ago, let us visualize the two regressions. The Gaussian model is here

(for the lower part, I do not go below 0 since we do have, here, a positive variable that we would like to model) while the gamma on is here

And if we believe that the explanatory variable has no predictive power (since we can claim that the parameter is not significant in the regression), and we remove it from the regression, we get

Here, I do believe that the gamma (not to say the exponential) model is better because it is clearly more coherent with properties of the variable of interest. I trust more the confidence interval obtained above on the gamma model, than the one obtained with a Gaussian distribution. Even if the parameter in the regression is “more significant”.

↧

Pricing reinsurance contracts, another case study

November 5, 2013, 8:44 am

≫ Next: Modèle de régression et interaction(s) entre facteurs

≪ Previous: More significant? so what…

A reinsurance case study for tomorrow’s class. The goal will be to price some nonproportional reinsurance contract, for business interruption claims. Consider the following dataset,

> library(gdata)
>  db=read.xls(
+ "http://perso.univ-rennes1.fr/arthur.charpentier/SIN_1985_2000-PE.xls",
+  sheet=1)
Content type 'application/vnd.ms-excel' length 183808 bytes (179 Kb)
open URL
==================================================
downloaded 179 Kb

As for any (standard) insurance contract, there are two parts in the pricing

the expected number of claims
the average cost of individual claims

Here, we do not have covariates (but it might be possible to use some, like the kind of industry, the location, etc).

Let us start with the expected number of claims, per year. Here is the daily frequency,

The data are rather old… but somehow, it is a good thing since after ten years, we can expect that most of the claims have been settled (we’ll discuss claims dynamic starting next week). To plot the graph above, we use

> date=db$DSUR
> D=as.Date(as.character(date),format="%Y%m%d")
> vD=seq(min(D),max(D),by=1)
> sD=table(D)
> d1=as.Date(names(sD))
> d2=vD[-which(vD%in%d1)]
> vecteur.date=c(d1,d2)
> vecteur.cpte=c(as.numeric(sD),rep(0,length(d2)))
> base=data.frame(date=vecteur.date,cpte=vecteur.cpte)
> plot(vecteur.date,vecteur.cpte,type="h",xlim=as.Date(as.character(
+ c(19850101,20111231)),format="%Y%m%d"))

Then, we can get a prediction of the daily number of business interruption claims, e.g. for any day in 2010 (assume that we had to price a reinsurance contract a few years ago), using a (standard) Poisson regression

> regdate=glm(cpte~date,data=base,family=poisson(link="log"))
> nd2010=data.frame(date=seq(as.Date(as.character(20100101),format="%Y%m%d"),
+ as.Date(as.character(20101231),format="%Y%m%d"),by=1))
> pred2010 =predict(regdate,newdata=nd2010,type="response")
> sum(pred2010)
[1] 159.4757

Observe that using old data has drawbacks, since we got much more uncertainty if we use a regression on time (to include some possible trend)

Say we have something like 160 claims over a given year, on average.

> plot(D,db$COUTSIN,type="h")

Let us now focus on the cost of those claims. We have 2,400 claims in our dataset, to fit a model (or at least estimate how much a reinsurance contract might cost us). Assume that we would like to purchase a reinsurance contract for our very large claims. Like the two largest per year. Over 16 years, the decutible should be close to the cost of the 32nd largest claim, which was close to 15 million.

> quantile(db$COUTSIN,1-32/2400)/1e6
98.66667% 
 15.34579 
> abline(h=quantile(db$COUTSIN,1-32/2400),col="blue")

So consider some reinsurance contract with a deductible of 15 million. Unfortunately, we cannot find unlimited covers. So let us assume that a reinsurance company agrees for such a deductible, but with a limited cover of 35 million. The average cost (for the reinsurance company) is $http://latex.codecogs.com/gif.latex?\mathbb{E}(g(X))$ where

$http://latex.codecogs.com/gif.latex?g(x)=\min\{35,\max\{x-15,0\}\}$

A first idea is to look at the first cost, i.e. the empirical average of that indemnity, on our portfolio. The indemnity function is

> indemn=function(x) pmin((x-15)*(x>15),50-15)

we can check on a few losses that it is actually what we wish to compute

> indemn(5)
[1] 0
> indemn(20)
[1] 5
> indemn(50)
[1] 35

Now, if the compute the average repayment by the reinsurance company, over 16 years, we get

> mean(indemn(db$COUTSIN/1e6))
[1] 0.1624292

So, per claim, the reinsurance company will pay, on average 162,430. With 160 claims per year, the pure premium should be close to 26 million

> mean(indemn(db$COUTSIN/1e6))*160
[1] 25.98867

(again, for a 35 million cover, for some claims that should occur, on average, twice a year). As we will see, a standard model in reinsurance is the Pareto distribution (or to be more specific, a Generalized Pareto one),

There are three parameters here

the threshold $http://latex.codecogs.com/gif.latex?\mu$ (that we will consider as fixed, but we will see its impact on reinsurance pricing)
the scale parameter $http://latex.codecogs.com/gif.latex?\sigma$ (called $http://latex.codecogs.com/gif.latex?\beta$ in R)
the tail index $http://latex.codecogs.com/gif.latex?\xi$

The strategy is to consider a threshold below our deductible, e.g. 12 million. Then, given that the loss exceed 12 million, we can fit a Generalized Pareto distribution,

> gpd.PL <- gpd(db$COUTSIN,12e6)$par.ests
> gpd.PL
          xi         beta 
7.004147e-01 4.400115e+06

and compute

>  E <- function(yinf,ysup,xi,beta,threshold){
+    as.numeric(integrate(function(x) (x-yinf)*dgpd(x,xi,mu=threshold,beta),
+    lower=yinf,upper=ysup)$value+
+    (1-pgpd(ysup,xi,mu=threshold,beta))*(ysup-yinf))
+  }

Here, given that a claim exceeds 12 million, the average repayment is close to 6 million

> E(15e6,50e6,gpd.PL[1],gpd.PL[2],12e6)
[1] 6058125

Now, we have to take into account the probability to reach 12 million, which is here

> mean(db$COUTSIN>12e6)
[1] 0.02639296

So, if we summarize, we have on average 160 claims per year,

> p
[1] 159.4757

Only 2.6% will exceed 12 million

> mean(db$COUTSIN>12e6)
[1] 0.02639296

So, the yearly frequency of claism larger than 12 million is 4.2 claims

> p*mean(db$COUTSIN>12e6)
[1] 4.209036

And for a claim that exceed 12 million, the average repayment is

> E(15e6,50e6,gpd.PL[1],gpd.PL[2],12e6)
[1] 6058125

So, the pure premium should be close to

> p*mean(db$COUTSIN>12e6)*E(15e6,50e6,gpd.PL[1],gpd.PL[2],12e6)
[1] 25498867

which (hopefully) is close to the empirical value we got. Actually, it is also possible to look at the impact of the threshold parameter, since it is clearly and intermediate value that could be changed. I mean, why 12 and not 10? Consider

> esp=function(threshold=12e6,p=sum(pred2010)){
+  (gpd.PL <- gpd(db$COUTSIN,threshold)$par.ests)
+  return(p*mean(db$COUTSIN>threshold)*E(15e6,50e6,gpd.PL[1],gpd.PL[2],threshold))
+  }

We can plot the pure premium as a function of that threshold,

> seuils=seq(1e6,15e6,by=1e6)
> plot(seuils,Vectorize(esp)(seuils),type="b",col="red")

which is between 24 and 26 for large thresholds. Again, that is only the first step, and we can price a higher reinsurance layer, like a reinsurance contract with a deductible of 50 million (we have our previous reinsurance contract for claims below that threshold), and a cover of 50 million, for instance. For those high layers, it become interesting to have a parametric model, which should be more robust than the empirical average.

↧

Modèle de régression et interaction(s) entre facteurs

November 9, 2013, 9:54 pm

≫ Next: Claims reserving (introduction)

≪ Previous: Pricing reinsurance contracts, another case study

Dans un modèle de régression, on veut écrire

Quand on se limite à un modèle linéaire, on écrit

Mais on de doute que l’on rate quelque chose… en particulier, on va rater toutes les interactions possibles. On peut croiser les variables, et supposer que

qui peut s’étendre d’avantage, à l’ordre 3,

voire davantage.

Supposons que nos variables soient ici qualitatives, et plus précisément binaires. Prenons un exemple simple, avec des données (classiques) en risque de crédit¹. On peut trouver la base via

library(evtree)
db=GermanCredit

ou encore directement

myVariableNames = c("checking_status","duration","credit_history",
"purpose","credit_amount","savings","employment","installment_rate",
"personal_status","other_parties","residence_since","property_magnitude",
"age","other_payment_plans","housing","existing_credits","job",
"num_dependents","telephone","foreign_worker","class")

GermanCredit = read.table(
"http://archive.ics.uci.edu/ml/machine-learning-databases/statlog/german/german.data",
header=FALSE,col.names=myVariableNames)

Retenons pour commencer trois variables explicatives,

db=data.frame(Y=GermanCredit$class-1,
X1=GermanCredit$checking_status%in%c("A12","A13"),
X2=GermanCredit$credit_history%in%c("A30","A31"),
X3=GermanCredit$savings%in%c("A61","A62"))
reg=glm(Y~X1+X2+X3,data=db,family=binomial)
summary(reg)

La régression sans interaction donne ici

Call:
glm(formula = Y ~ X1 + X2 + X3, family = binomial, data = db)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5431  -0.8421  -0.6295   1.3994   1.9999  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.8544     0.1699 -10.915  < 2e-16 ***
X1TRUE        0.3363     0.1496   2.249   0.0245 *  
X2TRUE        1.3462     0.2347   5.735 9.76e-09 ***
X3TRUE        1.0001     0.1787   5.596 2.19e-08 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1221.7  on 999  degrees of freedom
Residual deviance: 1143.6  on 996  degrees of freedom
AIC: 1151.6

Number of Fisher Scoring iterations: 4

Il existe plusieurs interactions possibles ici (limitons nous aux paires). C’est ce que l’on observe quand on fait la régression

reg=glm(Y~X1+X2+X3+X1:X2+X1:X3+X2:X3,data=db,family=binomial)
summary(reg)

Call:
glm(formula = Y ~ X1 + X2 + X3 + X1:X2 + X1:X3 + X2:X3, family = binomial, 
    data = db)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.5369  -0.8281  -0.6439   1.3954   1.9638  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept)   -1.77109    0.20070  -8.825  < 2e-16 ***
X1TRUE         0.30296    0.33737   0.898 0.369186    
X2TRUE         0.88353    0.54255   1.628 0.103421    
X3TRUE         0.87709    0.22583   3.884 0.000103 ***
X1TRUE:X2TRUE -0.37917    0.49343  -0.768 0.442225    
X1TRUE:X3TRUE  0.09178    0.37278   0.246 0.805522    
X2TRUE:X3TRUE  0.80923    0.58185   1.391 0.164293    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1221.7  on 999  degrees of freedom
Residual deviance: 1141.0  on 993  degrees of freedom
AIC: 1155

Number of Fisher Scoring iterations: 4

On peut faire un dessin pour visualiser les interactions : on a trois sommets (nos trois variables), et on visualiser les interactions

indices=cbind(c(1,2,3),c(1,1,2),c(2,3,3))
k=3
theta=pi/2+2*pi*(0:(k-1))/k
sommetX=cos(theta)
sommetY=sin(theta)
plot(sommetX,sommetY,cex=1,axes=FALSE,xlab="",ylab="",
xlim=c(-1.5,1.5),ylim=c(-1.5,1.5))
for(i in 1:nrow(indices)){
segments(sommetX[indices[i,2]],sommetY[indices[i,2]],
sommetX[indices[i,3]],sommetY[indices[i,3]],col="grey")
text(mean(sommetX[indices[i,2:3]]),mean(sommetY[indices[i,2:3]]),
trunc(10000*coefficients(reg)[1+k+i])/10000)
}
points(sommetX,sommetY,cex=6,pch=19,col="yellow")
points(sommetX,sommetY,cex=6,pch=1)
text(sommetX,sommetY,1:k)

ce qui donne ici, pour nos trois variables

Ce modèle pourrait sembler incomplet, car on ne regarde que les interactions entre les modalités, par paires. En fait, c’est parce qu’il manque (visuellement) les variables non-croisées. On peut les rajouter si on veut (au risque de surcharger le dessin)

cercle=function(c,r,cl) lines(c[1]+r*cos(seq(0,2*pi,length=501)),
c[2]+r*sin(seq(0,2*pi,length=501)),col=cl)

reg=glm(Y~X1+X2+X3+X1:X2+X1:X3+X2:X3,data=db,family=binomial)
indices=cbind(c(1,2,3),c(1,1,2),c(2,3,3))
k=3
theta=pi/2+2*pi*(0:(k-1))/k
sommetX=cos(theta)
sommetY=sin(theta)
plot(sommetX,sommetY,cex=1,axes=FALSE,xlab="",ylab="",xlim=c(-1.5,1.5),ylim=c(-1.5,1.5))
for(i in 1:nrow(indices)){
segments(sommetX[indices[i,2]],sommetY[indices[i,2]],
sommetX[indices[i,3]],sommetY[indices[i,3]],col="grey")
text(mean(sommetX[indices[i,2:3]]),mean(sommetY[indices[i,2:3]]),
trunc(10000*coefficients(reg)[1+k+i])/10000)
}
for(i in 1:k){
cercle(c(cos(theta)[i]*1.18,sin(theta)[i]*1.18),.18,"grey")
text(cos(theta)[i]*1.35,sin(theta)[i]*1.35,
trunc(10000*coefficients(reg)[1+i])/10000)
}
points(sommetX,sommetY,cex=6,pch=19,col="yellow")
points(sommetX,sommetY,cex=6,pch=1)
text(sommetX,sommetY,1:k)

soit ici

Si on change le ‘sens‘ de nos variables (en recodant a l’envers, en permutant les vrais et les faux), on obtient le graphique suivant

dbinv=db
dbinv[,2:k]=1-dbinv[,2:k]
reg=glm(Y~X1+X2+X3+X1:X2+X1:X3+X2:X3,data=dbinv,family=binomial)
indices=cbind(c(1,2,3),c(1,1,2),c(2,3,3))
k=3
theta=pi/2+2*pi*(0:(k-1))/k
sommetX=cos(theta)
sommetY=sin(theta)
plot(sommetX,sommetY,cex=1,axes=FALSE,xlab="",ylab="",xlim=c(-1.5,1.5),ylim=c(-1.5,1.5))
for(i in 1:nrow(indices)){
segments(sommetX[indices[i,2]],sommetY[indices[i,2]],
sommetX[indices[i,3]],sommetY[indices[i,3]],col="grey")
text(mean(sommetX[indices[i,2:3]]),mean(sommetY[indices[i,2:3]]),
trunc(10000*coefficients(reg)[1+k+i])/10000)
}
for(i in 1:k){
cercle(c(cos(theta)[i]*1.18,sin(theta)[i]*1.18),.18,"grey")
text(cos(theta)[i]*1.35,sin(theta)[i]*1.35,
trunc(10000*coefficients(reg)[1+i])/10000)
}
points(sommetX,sommetY,cex=6,pch=19,col="yellow")
points(sommetX,sommetY,cex=6,pch=1)
text(sommetX,sommetY,1:k)

qui peut alors être comparé au graphique précédant

Avec 5 variables, on augmente les interactions possibles… même si beaucoup risquent d’être non-significatifs. On peut déjà se focaliser sur les paires possibles d’interactions croisées. Pour simplifier le code, on va utiliser deux fonctions locales,

vrepeach=function(x,e){
v=NULL
for(i in 1:length(e)){v=c(v,rep(x[i],each=e[i]))}
return(v)}
vreplength=function(x,l){
v=NULL
for(i in 1:length(l)){v=c(v,x[l[i]:length(x)])}
return(v)}

et ensuite, on adapte le code précédant

indices=cbind(1:(k*(k-1)/2),vrepeach(1:(k-1),(k-1):1),vreplength(2:k,1:(k-1)))
formule="Y~1"
for(i in 1:k) formule=paste(formule,"+X",i,sep="")
for(i in 1:nrow(indices)) formule=paste(formule,"+X",indices[i,2],":X",indices[i,3],sep="")
reg=glm(formule,data=db,family=binomial)
theta=pi/2+2*pi*(0:(k-1))/k
sommetX=cos(theta)
sommetY=sin(theta)
plot(sommetX,sommetY,cex=1,axes=FALSE,xlab="",ylab="",xlim=c(-1.5,1.5),ylim=c(-1.5,1.5))
for(i in 1:nrow(indices)){
segments(sommetX[indices[i,2]],sommetY[indices[i,2]],
sommetX[indices[i,3]],sommetY[indices[i,3]],col="grey")
text(mean(sommetX[indices[i,2:3]]),mean(sommetY[indices[i,2:3]]),
trunc(10000*coefficients(reg)[1+k+i])/10000)
}
for(i in 1:k){
cercle(c(cos(theta)[i]*1.18,sin(theta)[i]*1.18),.18,"grey")
text(cos(theta)[i]*1.35,sin(theta)[i]*1.35,
trunc(10000*coefficients(reg)[1+i])/10000)
}
points(sommetX,sommetY,cex=6,pch=19,col="yellow")
points(sommetX,sommetY,cex=6,pch=1)
text(sommetX,sommetY,1:k)

ce qui donne un schéma plus complexe,

On peut aussi prendre juste 2 variables, prenant 3 et 4 modalités respectivement. On va extraire deux variables indicatrices pour la première (la modalité restante sera la modalité de référence) et trois pour la seconde,

db=data.frame(Y=GermanCredit$class-1,
X1=GermanCredit$checking_status=="A12",
X2=GermanCredit$checking_status=="A13",
X3=GermanCredit$checking_status=="A14",
X4=GermanCredit$employment%in%c("A72","A73"),
X5=GermanCredit$employment%in%c("A74","A75"))
k=5
indices=cbind(1:(k*(k-1)/2),vrepeach(1:(k-1),(k-1):1),vreplength(2:k,1:(k-1)))
formule="Y~1"
for(i in 1:k) formule=paste(formule,"+X",i,sep="")
for(i in 1:nrow(indices)) formule=paste(formule,"+X",indices[i,2],":X",indices[i,3],sep="")
reg=glm(formule,data=db,family=binomial)
theta=pi/2+2*pi*(0:(k-1))/k
sommetX=cos(theta)
sommetY=sin(theta)
plot(sommetX,sommetY,cex=1,axes=FALSE,xlab="",ylab="",xlim=c(-1.5,1.5),ylim=c(-1.5,1.5))
for(i in 1:nrow(indices)){
if(!is.na(coefficients(reg)[1+k+i])){
segments(sommetX[indices[i,2]],sommetY[indices[i,2]],
sommetX[indices[i,3]],sommetY[indices[i,3]],col="grey")
text(mean(sommetX[indices[i,2:3]]),mean(sommetY[indices[i,2:3]]),
trunc(10000*coefficients(reg)[1+k+i])/10000)
}}
for(i in 1:k){
cercle(c(cos(theta)[i]*1.18,sin(theta)[i]*1.18),.18,"grey")
text(cos(theta)[i]*1.35,sin(theta)[i]*1.35,
trunc(10000*coefficients(reg)[1+i])/10000)
}
points(sommetX,sommetY,cex=6,pch=19,col="yellow")
points(sommetX,sommetY,cex=6,pch=1)
text(sommetX,sommetY,1:k)

On voit que plusieurs interactions ne sont alors plus possibles, sur la partie gauche (les trois modalités de la même variable) et sur la partie droite

On peut d’ailleurs simplifier les graphs, en ne visualisant que les interactions significatives.

indices=cbind(1:(k*(k-1)/2),vrepeach(1:(k-1),(k-1):1),vreplength(2:k,1:(k-1)))
formule="Y~1"
for(i in 1:k) formule=paste(formule,"+X",i,sep="")
for(i in 1:nrow(indices)) formule=paste(formule,"+X",indices[i,2],":X",indices[i,3],sep="")
reg=glm(formule,data=db,family=binomial)
theta=pi/2+2*pi*(0:(k-1))/k
sommetX=cos(theta)
sommetY=sin(theta)
plot(sommetX,sommetY,cex=1,axes=FALSE,xlab="",ylab="",xlim=c(-1.5,1.5),ylim=c(-1.5,1.5))
for(i in 1:nrow(indices)){
if(!is.na(coefficients(reg)[1+k+i])){
if(summary(reg)$coefficients[1+k+i,4]<.1){
segments(sommetX[indices[i,2]],sommetY[indices[i,2]],
sommetX[indices[i,3]],sommetY[indices[i,3]],col="grey")
text(mean(sommetX[indices[i,2:3]]),mean(sommetY[indices[i,2:3]]),
trunc(10000*coefficients(reg)[1+k+i])/10000)
}}}
for(i in 1:k){
if(summary(reg)$coefficients[1+i]<.1){
cercle(c(cos(theta)[i]*1.18,sin(theta)[i]*1.18),.18,"grey")
text(cos(theta)[i]*1.35,sin(theta)[i]*1.35,
trunc(10000*coefficients(reg)[1+i])/10000)
}}
points(sommetX,sommetY,cex=6,pch=19,col="yellow")
points(sommetX,sommetY,cex=6,pch=1)
text(sommetX,sommetY,1:k)

soit ici

Ici, une seule interactions croisée est significative, et presque toutes les variables le sont. Et si on reprend le modèle avec 5 facteurs,

db=data.frame(Y=GermanCredit$class-1,X1=GermanCredit$checking_status%in%c("A12","A13"),
X2=GermanCredit$credit_history%in%c("A30","A31"),
X3=GermanCredit$savings%in%c("A61","A62"),
X4=GermanCredit$employment%in%c("A71","A72"),
X5=GermanCredit$other_payment_plans=="A143")

indices=cbind(1:(k*(k-1)/2),vrepeach(1:(k-1),(k-1):1),vreplength(2:k,1:(k-1)))
formule="Y~1"
for(i in 1:k) formule=paste(formule,"+X",i,sep="")
for(i in 1:nrow(indices)) formule=paste(formule,"+X",indices[i,2],":X",indices[i,3],sep="")
reg=glm(formule,data=db,family=binomial)
theta=pi/2+2*pi*(0:(k-1))/k
sommetX=cos(theta)
sommetY=sin(theta)
plot(sommetX,sommetY,cex=1,axes=FALSE,xlab="",ylab="",xlim=c(-1.5,1.5),ylim=c(-1.5,1.5))
for(i in 1:nrow(indices)){
if(!is.na(coefficients(reg)[1+k+i])){
if(summary(reg)$coefficients[1+k+i,4]<.1){
segments(sommetX[indices[i,2]],sommetY[indices[i,2]],
sommetX[indices[i,3]],sommetY[indices[i,3]],col="grey")
text(mean(sommetX[indices[i,2:3]]),mean(sommetY[indices[i,2:3]]),
trunc(10000*coefficients(reg)[1+k+i])/10000)
}}}
for(i in 1:k){
if(summary(reg)$coefficients[1+i]<.1){
cercle(c(cos(theta)[i]*1.18,sin(theta)[i]*1.18),.18,"grey")
text(cos(theta)[i]*1.35,sin(theta)[i]*1.35,
trunc(10000*coefficients(reg)[1+i])/10000)
}}
points(sommetX,sommetY,cex=6,pch=19,col="yellow")
points(sommetX,sommetY,cex=6,pch=1)
text(sommetX,sommetY,1:k)

on obtient

Je ne sais pas si mes graphiques sont pertinents, ou pas. Mais je trouve ça joli. En fait, je suis tombé un peu par hasard² sur les Tables de Taguchi, développées par Gen’ichi Taguchi (田口玄一). Le soucis est que je n’ai rien compris… Enfin, disons que je croyais comprendre, puis j’ai continué à faire des dessins… Si quelqu’un pourrait m’expliquer sur mon exemple les graphiques de Taguchi, je suis preneur ! car je doute que ce soit ce que je fais depuis tout à l’heure…

^{1. Cette base est largement utilisée dans le quatrième chapitre de Computational Actuarial Science with R, à paraître dans les mois à venir.}

^{2.En l’occurence, le hasard est @Benavent qui a suscité ma curiosité ce matin en me parlant de ces tables, dont je n’avais alors jamais entendu parlé ! J’avais même lu rapidement Taniguchi (谷口ジロー) et je ne voyais pas le rapport avec les statistiques….}

↧

Claims reserving (introduction)

November 11, 2013, 6:31 am

≫ Next: Triangles et provisionnement

≪ Previous: Modèle de régression et interaction(s) entre facteurs

Mercredi, on commence la modélisation du passif des compagnies d’assurance IARD. Plus particulièrement, nous parlerons des provisions pour sinistres à payer, ou “provision for claims outstanding (PCO)” selon la terminologie anglaise, i.e. “the estimated total cost of ultimate settlement of all claims incurred before the date of record, whether reported or not, less any amounts already paid out in respect thereof.” Je renvoie à la lecture de Le contrôle de la solvabilité des compagnies d’assurance en ligne sur le site de l’OCDE, pour une vision globale des approches de ces provisions. La SOA avait publié un rapport en 2009, Comparison of Incurred But Not Reported IBNR Methods que j’encourage à lire.

Le livre de référence sur le sujet est celui de Mario Wüthrich et Michael Merz. Les premiers chapitres (correspondant a ce qui sera vu en cours) peuvent être téléchargés sur http://actuaries.ch/…

Nous aborderons mercredi les triangles. Parmi les triangles que nous manipulerons

> source("http://perso.univ-rennes1.fr/arthur.charpentier/bases.R")

qui contient plusieurs fichiers, dont

> PAID
     [,1] [,2] [,3] [,4] [,5] [,6]
[1,] 3209 4372 4411 4428 4435 4456
[2,] 3367 4659 4696 4720 4730   NA
[3,] 3871 5345 5338 5420   NA   NA
[4,] 4239 5917 6020   NA   NA   NA
[5,] 4929 6794   NA   NA   NA   NA
[6,] 5217   NA   NA   NA   NA   NA

ainsi que le triangle évoqué sur http://rworkingparty.wikidot.com/

> OthLiabData = read.csv("http://www.casact.org/research/reserve_data/othliab_pos.csv",header=TRUE, sep=",")
> library(ChainLadder)
> OL = SumData=ddply(OthLiabData,.(AccidentYear,DevelopmentYear,DevelopmentLag),summarise,IncurLoss=sum(IncurLoss_h1-BulkLoss_h1),
+ CumPaidLoss=sum(CumPaidLoss_h1), EarnedPremDIR=sum(EarnedPremDIR_h1))
> LossTri = as.triangle(OL, origin="AccidentYear",
+ dev = "DevelopmentLag", value="IncurLoss")
> Year = as.triangle(OL, origin="AccidentYear",
+ dev = "DevelopmentLag", value="DevelopmentYear")
> TRIANGLE=LossTri
> TRIANGLE[Year>1997]=NA
> TRIANGLE
      dev
origin      1      2      3      4      5      6      7      8      9     10
  1988 128747 195938 241180 283447 297402 308815 314126 317027 319135 319559
  1989 135147 208767 270979 304488 330066 339871 344742 347800 353245     NA
  1990 152400 238665 297495 348826 359413 364865 372436 372163     NA     NA
  1991 151812 266245 357430 400405 423172 442329 460713     NA     NA     NA
  1992 163737 269170 347469 381251 424810 451221     NA     NA     NA     NA
  1993 187756 358573 431410 476674 504667     NA     NA     NA     NA     NA
  1994 210590 351270 486947 581599     NA     NA     NA     NA     NA     NA
  1995 213141 351363 444272     NA     NA     NA     NA     NA     NA     NA
  1996 237162 378987     NA     NA     NA     NA     NA     NA     NA     NA
  1997 220509     NA     NA     NA     NA     NA     NA     NA     NA     NA

Les transparents sont en ligne,

Pour les lectures complémentaires, Best Estimates for Reserves de Glen Barnett et Ben Zehnwirth est téléchargeable sur http://casact.org/pubs/…. En 2004 , Ben Zehnwirth, Julie Sims et Mark Shapland ont publié Will Your Next Reserve Increase Be Your Last (en ligne sur http://contingencies.org/janfeb04/…) Lors des prochaines séances, nous aborderons les méthodes par simulations. J’encourage la lecture de about the bootstrap, http://insureware.com/Library/… Sinon, des compléments plus spécifiques sont téléchargeables,

The Chain Ladder and Tweedie Distributed Claims Data par Greg Taylor
Chain-Ladder Bias: Its Reason and Meaning par Leigh Joseph Halliwell
Estimating Predictive Distributions for Loss Reserve Models par Glenn Meyers
Using a Bayesian Approach for Claims Reserving par Mario Wuthrich
Obtaining Predictive Distributions for Reserves Which Incorporate Expert Opinion par Richard Verrall
Loss Reserve Estimates: A Statistical Approach for Determining “Reasonableness” par Mark Shapland
Munich Chain Ladder: A Reserving Method that Reduces the Gap between IBNR Projections Based on Paid Losses and IBNR Projections Based on Incurred Losses par Gerhard Quarg et Thomas Mack
The Bornhuetter-Ferguson Principle par Klaus Schmidt et Mathias Zocher
On the Importance of Dispersion Modeling for Claims Reserving: An Application with the Tweedie Distribution par Jean-Philippe Boucher et Danail Davidov
Quantifying Uncertainty in Reserve Estimates par Zia Rehman et Stuart Klugman
Robustifying Reserving par Gary Venter et Dumaria Rulina Tampubolon, voir aussi Robustifying Reserving
Bootstrap Estimation of the Predictive Distributions of Reserves Using Paid and Incurred Claims par Huijuan Liu et Richard Verrall
Predictive Distributions for Reserves Which Separate True IBNR and IBNER Claims par Huijuan Liu et Richard Verrall
The Retrospective Testing of Stochastic Loss Reserve Models, par Glenn Meyers
A GLM-Based Approach to Adjusting for Changes in Case Reserve Adequacy par Larry Decker
A Method for Modelling Varying Run-off Evolutions in Claims Reserving par Richard Verrall
A Nonlinear Regression Model of Incurred But Not Reported Losses par Scott Stelljes
Back-Testing the ODP Bootstrap of the Paid Chain-Ladder Model with Actual Historical Claims Data par Jessica (Weng Kah) Leong, Shaun Wang et Han Chen
A Practical Way to Estimate One-Year Reserve Risk par Ira Robbin
Closed-Form Distribution of Prediction Uncertainty in Chain Ladder Reserving by Bayesian Approach par Ji Yao
The Prediction Error of Bornhuetter-Ferguson par Thomas Mack
Stochastic Loss Reserving with the Collective Risk Model par Glenn Meyers
On the Accuracy of Loss Reserving Methodology par Tapio Boles et Andy Staudt
Bootstrapping Generalized Linear Models for Development Triangles Using Deviance Residuals par Thomas Hartl
Fitting a GLM to Incomplete Development Triangles par Thomas Hartl
Gauss—Markov Loss Prediction in a Linear Model par Alexander Ludwig et Klaus Schmidt
Anatomy of Actuarial Methods of Loss Reserving par Prakash Narayan
Bootstrap Modeling: Beyond the Basics par Mark R. Shapland et Jessica (Weng Kah) Leong
GLM Invariants par Fred Klinker
The Retrospective Testing of Stochastic Loss Reserve Models par Glenn Meyers et Peng Shi
A GLM-Based Approach to Adjusting for Changes in Case Reserve Adequacy par Larry Decker
Testing the Assumptions of Age-to-age Factors, par Gary Venter, voir aussi http://rationalargumentator.com/actuaryguide/…

↧

Triangles et provisionnement

November 14, 2013, 11:42 am

≫ Next: Binomial regression model

≪ Previous: Claims reserving (introduction)

Pour le quatrième devoir d’actuariat IARD 2, on va travailler sur des (vrais) triangles de paiements de compagnies d’assurance. Les données ont été collectées par Glenn Meyers et Peng Shi, et sont en ligne sur http://casact.org/research/…. A partir du ‘code’ de groupe, pour récupérer ses données, il faut utiliser la petite fonction écrite dans le fichier

> source("http://freakonometrics.free.fr/codeACT2040-4.txt")

Par exemple, pour le groupe 10,

> extract.triangle(10)
$triangle.increments
        0    1     2     3    4     5    6   7   8   9
1988 1249 2843  4801  6623 8290 10264    1 431 586 714
1989  946 1983  7024  7415 7771  4489 2431 913 638  NA
1990 1765 2978  5111 10617 6285  6898 9628   1  NA  NA
1991 1408 3818  4542 11869 3702  6235 6164  NA  NA  NA
1992 1647 5981  5220  5475 3620  1196   NA  NA  NA  NA
1993 7566 7425 14475 19624 8500    NA   NA  NA  NA  NA
1994 2299 4557 17098 37542   NA    NA   NA  NA  NA  NA
1995 4959 8582  9462    NA   NA    NA   NA  NA  NA  NA
1996 6063 3644    NA    NA   NA    NA   NA  NA  NA  NA
1997 6507   NA    NA    NA   NA    NA   NA  NA  NA  NA

$name
[1] "Federal Ins Co Grp"

$source
[1] "prodliab_pos"

Pour le rapport, je veux avoir le nom de la compagnie, ainsi que la ‘source’ (PP Auto, Workers Compensation,Commercial Auto,Medical Malpractice,Product Liability, ouOther Liability). Pour jouer avec les données, il suffit d’utiliser

> T=extract.triangle(10)$triangle.increments
> T
        0    1     2     3    4     5    6   7   8   9
1988 1249 2843  4801  6623 8290 10264    1 431 586 714
1989  946 1983  7024  7415 7771  4489 2431 913 638  NA
1990 1765 2978  5111 10617 6285  6898 9628   1  NA  NA
1991 1408 3818  4542 11869 3702  6235 6164  NA  NA  NA
1992 1647 5981  5220  5475 3620  1196   NA  NA  NA  NA
1993 7566 7425 14475 19624 8500    NA   NA  NA  NA  NA
1994 2299 4557 17098 37542   NA    NA   NA  NA  NA  NA
1995 4959 8582  9462    NA   NA    NA   NA  NA  NA  NA
1996 6063 3644    NA    NA   NA    NA   NA  NA  NA  NA
1997 6507   NA    NA    NA   NA    NA   NA  NA  NA  NA

Il s’agit de triangle d’incréments, la ‘colonne’ de gauche étant l’année de survenance. Je n’ai pas testé toutes les bases, merci de me dire rapidement s’il y a des soucis.

↧

Binomial regression model

November 18, 2013, 10:47 am

≫ Next: Données, examen final

≪ Previous: Triangles et provisionnement

Most of the time, when we introduce binomial models, such as the logistic or probit models, we discuss only Bernoulli variables, . This year (actually also the year before), I discuss extensions to multinomial regressions, where is a function on some simplex. The multinomial logistic model was mention here. The idea is to consider, for instance with three possible classes

the following model

and

Now, what about a real Binomial model, , where ‘s are known. How should we run such a regression model ? Consider the following dataset

> set.seed(1)
> n=100
> N=1+rpois(n,5)
> X1=runif(n)
> X2=rexp(n)
> s=X2-X1-2
> p=exp(s)/(1+exp(s))
> vY=NULL
> for(i in 1:n){
+ Y=rbinom(1,prob=p[i],size=N[i])
+ vY=c(vY,Y)
+ }
> db=data.frame(Y=vY,N=N,X1,X2)
> head(db,4)
  Y N        X1         X2
1 0 5 0.6547239 0.76318001
2 1 5 0.3531973 1.57271671
3 3 6 0.2702601 1.83564098
4 1 9 0.9926841 0.03715227

My first idea was to say that it should be simple since if (and only if)

where are i.i.d. random variables . So, a natural idea is to generate the dataset containing the ‘s

> vY=vX1=vX2=vN=NULL;
> for(i in 1:n){
+ vY=c(vY,c(rep(0,db$N[i]-db$Y[i]),rep(1,db$Y[i])))
+ vX1=c(vX1,rep(db$X1[i],db$N[i]))
+ vX2=c(vX2,rep(db$X2[i],db$N[i]))
+ }
> largedb=data.frame(Z=vY,X1=vX1,X2=vX2)
> head(largedb,16)
   Z        X1       X2
1  0 0.6547239 0.763180
2  0 0.6547239 0.763180
3  0 0.6547239 0.763180
4  0 0.6547239 0.763180
5  0 0.6547239 0.763180
6  0 0.3531973 1.572717
7  0 0.3531973 1.572717
8  0 0.3531973 1.572717
9  0 0.3531973 1.572717
10 1 0.3531973 1.572717
11 0 0.2702601 1.835641
12 0 0.2702601 1.835641
13 0 0.2702601 1.835641
14 1 0.2702601 1.835641
15 1 0.2702601 1.835641
16 1 0.2702601 1.835641

Then, we run a standard Bernoulli regression on those ‘s

> reg1=glm(Z~X1+X2,family=binomial,data=largedb)

But actually, if look around, on the internet, you can see (e.g. in Alan Agresti’s R_web.pdf chapter) that it is possible to run – directly – a binomial regression, using the following syntax,

> reg2=glm(Y/N~X1+X2,family=binomial,weights=N,data=db)

I was a bit scared because a few weeks ago, I tried two techniques to run a regression on contingency tables, and the output were different (on the standard deviation actually, not the estimation). Here we have the same thing

> coefficients(summary(reg1))
              Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -2.0547550  0.2875231 -7.146399 8.908380e-13
X1          -0.9159275  0.3970303 -2.306946 2.105781e-02
X2           1.0564059  0.1360305  7.765952 8.103448e-15
> coefficients(summary(reg2))
              Estimate Std. Error   z value     Pr(>|z|)
(Intercept) -2.0547550  0.2875234 -7.146392 8.908817e-13
X1          -0.9159275  0.3970313 -2.306941 2.105813e-02
X2           1.0564059  0.1360310  7.765923 8.105285e-15

almost if we take into account the fact that numerical algorithm might be ran with different starting point. But the going thing is that – theoretically – the output should be exactly the same. Simply because here, we solve the same first order conditions ! The likelihood in the first case was

which will be simplified as

which is what we have in the second case. The use of the weight function will insure that the variance here are equal.

↧

Données, examen final

November 26, 2013, 7:16 am

≫ Next: Examen ACT2040

≪ Previous: Binomial regression model

L’examen final du cours ACT2040 aura lieu, comme confirmé par courriel il y a plusieurs jours, le mercredi 11 décembre, pendant 3 heures. Il y aura surtout des sorties à commenter. Les sorties seront obtenues à partir des trois jeux de données suivants

> CORPOREL=read.table("http://freakonometrics.free.fr/corporel-2040.csv",header=TRUE,sep=";")
> tail(CORPOREL)
         degre age cat.age sexe vehicule anciennete alcool cat.alc
76336  indemne  45   40-49    M  voiture          6      0    0-20
76337 corporel  59   50-59    F  voiture          2      0    0-20
76338  indemne  34   30-39    F  voiture          2      0    0-20
76339  indemne  29   26-29    F  voiture          5      9    0-20
76340  indemne  64     60+    M  voiture          0      0    0-20
76341  indemne  57   50-59    F  voiture          1      0    0-20

Il s’agit d’observations d’accident automobiles, en Australie, la variable d’intérêt étant ici la gravité (le degré) de l’accident. Parmi les variables explicatives, l’âge du conducteur (en variable ‘continue‘ et en classes arbitraires) et son degré d’alcoolémie (en variable ‘continue‘ et en classes liées à des critères légaux, en g/10L), l’âge de la voiture, et le sexe du conducteur (M pour les hommes et F pour les femmes).

Le second jeu de données est un triangle de paiements, cumulés, avec la prime acquise, par année calendaire,

> load("http://freakonometrics.free.fr/triangle-intra.txt")
> intra
$triangle
        0     1     2     3     4     5     6     7     8     9
1988 5244  9228 10823 11352 11791 12082 12120 12199 12215 12215
1989 5984  9939 11725 12346 12746 12909 13034 13109 13113    NA
1990 7452 12421 14171 14752 15066 15354 15637 15720    NA    NA
1991 7115 11117 12488 13274 13662 13859 13872    NA    NA    NA
1992 5753  8969  9917 10697 11135 11282    NA    NA    NA    NA
1993 3937  6524  7989  8543  8757    NA    NA    NA    NA    NA
1994 5127  8212  8976  9325    NA    NA    NA    NA    NA    NA
1995 5046  8006  8984    NA    NA    NA    NA    NA    NA    NA
1996 5129  8202    NA    NA    NA    NA    NA    NA    NA    NA
1997 3689    NA    NA    NA    NA    NA    NA    NA    NA    NA

$primes
 [1] 15883 16689 18029 17858 16709 14212 15083 15131 15465 11217

Enfin, le dernier jeux de données sera un peu plus insolite pour ce cours. Il s’agit de données de tables de mortalité, au Canada

> DECES=read.table("http://freakonometrics.free.fr/DECES-CAN.csv",header=TRUE,sep=";")
> tail(DECES)
     D   E   A    Y
772 84 147 105 2010
773 39  76 106 2010
774 23  40 107 2010
775 15  20 108 2010
776  7   8 109 2010
777  5   5 110 2010

Par année Y, on observe un certain nombre de personnes en vie E, d’âge A au 1er janvier, et un certain nombre va décéder dans l’année, D. Par exemple, en 2010, il y avait 147 personnes de 105 ans en début d’année, 84 vont décéder dans l’année. L’idée sera d’utiliser les modèles de provisionnement pour modéliser les probabilités de décéder.

↧

Examen ACT2040

December 8, 2013, 2:25 pm

≫ Next: Actuariat IARD, correction

≪ Previous: Données, examen final

L’examen pour le cours ACT2040 aura lieu mercredi matin. Comme promis, les sorties informatiques qui seront utilisées sont maintenant en ligne [une coquille s'était glissée dans les dernières pages, une version corrigée a été mise ne ligne mardi a 15:10, a la place de la première version]. Comme je l’avais dit mercredi dernier, la régression binomiale sera utilisée, et la dernière partie portera sur de la construction de tables de mortalité prospective. Des détails sont évoqués dans Actuariat avec R. Sinon, pour ceux qui le souhaitent, les examens des années passées sont en ligne,

final, décembre 2011, un énoncé basé sur quelques sorties informatiques, avec des éléments de correction
intra, février 2013, énoncé de l’examen intra en pdf a partir des sorties ici, avec des éléments de correction en pdf (et un petit complément)
final, avril 2013, un énoncé est en ligne ainsi que des éléments de correction.

Les calculatrices sont fortement recommandées (un téléphone, aussi intelligent soit-il, n’est pas une calculatrice).

↧

Actuariat IARD, correction

December 18, 2013, 7:39 pm

≪ Previous: Examen ACT2040

Mercredi dernier avait lieu l’examen final d’ACT2040. La sortie informatique utilisée (avec quelques valeurs manquantes qu’il fallait retrouver, et qui comprenait des indications numérotées pour des questions spécifiques) et le questionnaire sont en ligne, ainsi que des éléments de réponse. Comme évoqué dans la page de garde de la ‘correction’, le barème sera proche de celui annoncé, mais un coefficient multiplicatif devrait être appliqué car je me suis rendu compte, en tapant la correction que le devoir était long. Je vais profiter des jours à venir pour corriger l’examen, avant d’attaquer la correction des devoirs, pendant les vacances. Toutes les remarques sur la ‘correction’ sont les bienvenues. Sinon, bonnes vacances…

↧