In a standard linear model, we assume that $\mathbb{E}(Y|X=x)=\beta_0+\beta_1 x$. Alternatives can be considered when the linear assumption is too strong.
- Polynomial regression
A natural extension might be to assume some polynomial function,
$$\mathbb{E}(Y|X=x)=\beta_0+\beta_1 x+\beta_2 x^2+\cdots+\beta_k x^k$$
Again, in the standard linear model approach (with a conditional normal distribution using the GLM terminology), parameters $\boldsymbol{\beta}=(\beta_0,\beta_1,\cdots,\beta_k)$ can be obtained using least squares, where a regression of $Y$ on $(1,X,X^2,\cdots,X^k)$ is considered.
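As a reminder of what least squares means here, the estimator can be written in matrix form. The notation below ($\mathbf{X}$ for the design matrix, $\mathbf{y}$ for the vector of observed responses, $(x_i,y_i)$ for the observations) is an assumption of this sketch, not notation used elsewhere in the post:

```latex
\widehat{\boldsymbol{\beta}}
= \underset{\boldsymbol{\beta}}{\operatorname{argmin}}
  \sum_{i=1}^n \left[y_i-\left(\beta_0+\beta_1 x_i+\cdots+\beta_k x_i^k\right)\right]^2
= (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y},
\qquad
\mathbf{X}=
\begin{pmatrix}
1 & x_1 & \cdots & x_1^k\\
\vdots & \vdots & & \vdots\\
1 & x_n & \cdots & x_n^k
\end{pmatrix}
```

The closed form requires $\mathbf{X}^\top\mathbf{X}$ to be invertible, which holds as soon as there are at least $k+1$ distinct values among the $x_i$.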
Even if this polynomial model is not the real one, it might still be a good approximation for $\mathbb{E}(Y|X=x)=m(x)$. Actually, from the Stone-Weierstrass theorem, if $m(\cdot)$ is continuous on some closed bounded interval, then there is a uniform approximation of $m(\cdot)$ by polynomial functions.
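To get a feeling for this approximation result, here is a minimal sketch that fits polynomials of increasing degree to a noise-free function and reports the largest absolute error on a grid. The target function, the grid and the name `sup_error` are illustrative choices. Note that least squares does not give the best uniform approximation, so this only illustrates the theorem; it does not prove anything.

```r
# Noise-free target on [0, 10]; the function and the grid are illustrative choices
x <- seq(0, 10, by = 0.1)
m <- sin(x / 2)

# Largest absolute error (on the grid) of the least squares polynomial fit of degree k
sup_error <- function(k) {
  fit <- lm(m ~ poly(x, k))
  max(abs(m - fitted(fit)))
}

degrees <- 1:8
errors <- sapply(degrees, sup_error)
print(round(errors, 4))
```

The error should become very small as the degree increases, since the target is smooth on this interval.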
Just to illustrate, consider the following (simulated) dataset
set.seed(1)
n=10
xr = seq(0,n,by=.1)
yr = sin(xr/2)+rnorm(length(xr))/2
db = data.frame(x=xr,y=yr)
plot(db)
with the standard regression line
reg = lm(y ~ x,data=db)
abline(reg,col="red")
Consider some polynomial regression. If the degree of the polynomial function is large enough, any kind of pattern can be obtained,
reg=lm(y~poly(x,5),data=db)
But if the degree is too large, then too many ‘oscillations’ are obtained,
reg=lm(y~poly(x,25),data=db)
and the estimation might be seen as no longer robust: if we change one point, there might be important (local) changes
plot(db)
attach(db)
lines(xr,predict(reg),col="red",lty=2)
yrm=yr;yrm[31]=yr[31]-2
regm=lm(yrm~poly(xr,25))
lines(xr,predict(regm),col="red")
- Local regression
Actually, if our interest is to have a good local approximation of $m(x)=\mathbb{E}(Y|X=x)$, why not use a local regression?
This can be done easily using a weighted regression, where, in the least squares formulation, we consider
$$\min_{\beta_0,\beta_1}\left\{\sum_{i=1}^n \omega_i\left[Y_i-(\beta_0+\beta_1 X_i)\right]^2\right\}$$
(it is possible to consider weights in the GLM framework, but let’s keep that for another post). Two comments here:
- here I consider a linear model, but any polynomial model can be considered. Even a constant one. In that case, the optimization problem is
$$\min_{\beta_0}\left\{\sum_{i=1}^n \omega_i\left[Y_i-\beta_0\right]^2\right\}$$
which can be solved explicitly, since
$$\widehat{\beta}_0=\frac{\sum_{i=1}^n \omega_i Y_i}{\sum_{i=1}^n \omega_i}$$
- so far, nothing was mentioned about the weights. The idea is simple, here: if you want a good prediction at point $x_0$, then the weight $\omega_i$ should decrease with the distance between $X_i$ and $x_0$: if $X_i$ is too far from $x_0$, then it should not have too much influence on the prediction.
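The explicit solution in the constant case (a weighted average of the responses) can be checked numerically. This is a minimal sketch on simulated data; the weights used here are an arbitrary decreasing function of the distance to `x0`, chosen only for illustration.

```r
set.seed(1)
x <- seq(0, 10, by = 0.1)
y <- sin(x / 2) + rnorm(length(x)) / 2

x0 <- 2                        # point where a prediction is wanted (arbitrary)
w <- 1 / (1 + abs(x - x0))     # illustrative weights, decreasing with |x - x0|

# Weighted least squares with an intercept only
beta0_lm <- unname(coef(lm(y ~ 1, weights = w)))

# Closed-form solution: weighted average of the responses
beta0_closed <- sum(w * y) / sum(w)

print(all.equal(beta0_lm, beta0_closed))
```

Both numbers should agree up to numerical precision, and `weighted.mean(y, w)` gives the same value.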
For instance, if we want to have a prediction at some point $x_0$, consider $\omega_i\propto\mathbf{1}(|X_i-x_0|<1)$. With this model, we remove observations too far away,
Actually, here, it is the same as
reg=lm(yr~xr,subset=which(abs(xr-x0)<1))
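The equivalence between indicator weights and a regression on a subset can be verified directly. A minimal sketch, which regenerates the simulated data and assumes `x0 = 2` (an arbitrary choice):

```r
set.seed(1)
xr <- seq(0, 10, by = 0.1)
yr <- sin(xr / 2) + rnorm(length(xr)) / 2
x0 <- 2   # arbitrary point of interest

# Indicator weights: 1 within distance 1 of x0, 0 otherwise
w <- as.numeric(abs(xr - x0) < 1)

# lm() ignores observations with zero weight when fitting
reg_weights <- lm(yr ~ xr, weights = w)
reg_subset  <- lm(yr ~ xr, subset = which(abs(xr - x0) < 1))

print(all.equal(coef(reg_weights), coef(reg_subset)))
```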
A more general idea is to consider some kernel function $K(\cdot)$ that gives the shape of the weight function, and some bandwidth (usually denoted $h$) that gives the length of the neighborhood, so that
$$\omega_i=K\left(\frac{X_i-x_0}{h}\right)$$
With a constant local model, this is actually the so-called Nadaraya-Watson estimator of the regression function $x\mapsto\mathbb{E}(Y|X=x)$.
In the previous case, we did consider a uniform kernel $K(u)=\mathbf{1}(|u|<1)$, with bandwidth $h=1$.
But using this weight function, with a strong discontinuity, may not be the best idea… Why not a Gaussian kernel,
$$K(u)=\frac{1}{\sqrt{2\pi}}\exp\left(-\frac{u^2}{2}\right)$$
This can be done using
fitloc0 = function(x0){
  w=dnorm((xr-x0))
  reg=lm(y~1,data=db,weights=w)
  return(predict(reg,newdata=data.frame(x=x0)))
}
On our dataset, we can plot
ul=seq(0,10,by=.01)
vl0=Vectorize(fitloc0)(ul)
u0=seq(-2,7,by=.01)
linearlocalconst=function(x0){
  w=dnorm((xr-x0))
  plot(db,cex=abs(w)*4)
  lines(ul,vl0,col="red")
  axis(3)
  axis(2)
  reg=lm(y~1,data=db,weights=w)
  u=seq(0,10,by=.02)
  v=predict(reg,newdata=data.frame(x=u))
  lines(u,v,col="red",lwd=2)
  abline(v=c(0,x0,10),lty=2)
}
linearlocalconst(2)
Here, we want a local regression at point 2. The horizontal line is the regression (the size of each point is proportional to its weight). The curve, in red, is the evolution of the local regression
Let us use an animation to visualize the construction of the curve. One can use
library(animation)
but for some reason, I cannot install the package easily on Linux. And it is not a big deal. We can still use a loop to generate some graphs
vx0=seq(1,9,by=.1)
vx0=c(vx0,rev(vx0))
graphloc=function(i){
  name=paste("local-reg-",100+i,".png",sep="")
  png(name,600,400)
  linearlocalconst(vx0[i])
  dev.off()
}
for(i in 1:length(vx0)) graphloc(i)
and then, in a terminal, I simply use
convert -delay 25 /home/freak/local-reg-1*.png /home/freak/local-reg.gif
Of course, it is possible to consider a linear model, locally,
fitloc1 = function(x0){
  w=dnorm((xr-x0))
  reg=lm(y~poly(x,degree=1),data=db,weights=w)
  return(predict(reg,newdata=data.frame(x=x0)))
}
or even a quadratic (local) regression,
fitloc2 = function(x0){
  w=dnorm((xr-x0))
  reg=lm(y~poly(x,degree=2),data=db,weights=w)
  return(predict(reg,newdata=data.frame(x=x0)))
}
Of course, we can change the bandwidth
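One way to change the bandwidth is to rescale the distances before applying the kernel. Here is a minimal sketch, assuming a Gaussian kernel and a local linear fit. The name `fitloc_h` and the three values of `h` are illustrative and do not come from the original code; `h = 1` corresponds to the weights used so far.

```r
set.seed(1)
xr <- seq(0, 10, by = 0.1)
yr <- sin(xr / 2) + rnorm(length(xr)) / 2
db <- data.frame(x = xr, y = yr)

# Local linear fit at x0 with a Gaussian kernel and bandwidth h
# (fitloc_h is a hypothetical helper, not part of the original post)
fitloc_h <- function(x0, h = 1) {
  w <- dnorm((db$x - x0) / h)
  reg <- lm(y ~ x, data = db, weights = w)
  unname(predict(reg, newdata = data.frame(x = x0)))
}

# Compare a few bandwidths on the same plot
ul <- seq(0, 10, by = 0.1)
hs <- c(0.5, 1, 2)
plot(db)
for (j in seq_along(hs)) {
  lines(ul, sapply(ul, fitloc_h, h = hs[j]), col = "red", lty = j)
}
```

A small `h` follows the data closely, while a very large `h` gives almost equal weights to all observations, so the fit gets close to the global regression line.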
To conclude the technical part of this post, observe that, in practice, we have to choose the shape of the weight function (the so-called kernel). But there are (simple) techniques to select the “optimal” bandwidth $h$. The idea of cross validation is to consider
$$\min_{h}\left\{\sum_{i=1}^n \left[Y_i-\widehat{Y}_i(h)\right]^2\right\}$$
where $\widehat{Y}_i(h)$ is the prediction obtained using a local regression technique, with bandwidth $h$. To get a more accurate (and optimal) bandwidth, $\widehat{Y}_i(h)$ is obtained using a model estimated on a sample where the $i$th observation was removed. But again, that is not the main point in this post, so let’s keep that for another one…
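The leave-one-out idea can be sketched in a few lines on the simulated data. This is only an illustration under stated assumptions: a local constant (Nadaraya-Watson) fit with a Gaussian kernel, and an arbitrary grid of candidate bandwidths. The names `nw`, `cv`, `hs` and `h_opt` are hypothetical.

```r
set.seed(1)
xr <- seq(0, 10, by = 0.1)
yr <- sin(xr / 2) + rnorm(length(xr)) / 2

# Local constant prediction at x0 from the sample (x, y),
# with a Gaussian kernel and bandwidth h
nw <- function(x0, x, y, h) {
  w <- dnorm((x - x0) / h)
  sum(w * y) / sum(w)
}

# Leave-one-out criterion: predict Y_i from all observations except the i-th
cv <- function(h) {
  pred <- sapply(seq_along(xr), function(i) nw(xr[i], xr[-i], yr[-i], h))
  sum((yr - pred)^2)
}

hs <- seq(0.1, 3, by = 0.1)   # candidate bandwidths (an arbitrary grid)
crit <- sapply(hs, cv)
h_opt <- hs[which.min(crit)]

plot(hs, crit, type = "b", xlab = "h", ylab = "CV(h)")
abline(v = h_opt, lty = 2)
```

Removing the $i$th observation matters: without it, the criterion would favor a bandwidth close to zero, since each point would then be predicted almost entirely by itself.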
Perhaps we can try on some real data? Inspired by a great post on http://f.briatte.org/teaching/ida/092_smoothing.html, by François Briatte, consider the Global Episode Opinion Survey, for some TV show, http://geos.tv/index.php/index?sid=189 , like Dexter.
library(XML)
library(downloader)
file = "geos-tww.csv"
html = htmlParse("http://www.geos.tv/index.php/list?sid=189&collection=all")
html = xpathApply(html, "//table[@id='collectionTable']")[[1]]
data = readHTMLTable(html)
data = data[,-3]
names(data)=c("no",names(data)[-1])
data=data[-(61:64),]
Let us reshape the dataset,
data$no = 1:96
data$mu = as.numeric(substr(as.character(data$Mean), 0, 4))
data$se = sd(data$mu,na.rm=TRUE)/sqrt(as.numeric(as.character(data$Count)))
data$season = 1 + (data$no - 1)%/%12
data$season = factor(data$season)
plot(data$no,data$mu,ylim=c(6,10))
segments(data$no,data$mu-1.96*data$se,
  data$no,data$mu+1.96*data$se,col="light blue")
As done by François, we compute some kind of standard error, just to reflect uncertainty. But we won’t really use it.
plot(data$no,data$mu,ylim=c(6,10))
abline(v=12*(0:8)+.5,lty=2)
for(s in 1:8){
  reg=lm(mu~no,data=data,subset=season==s)
  lines((s-1)*12+1:12,predict(reg)[1:12],col="red")
}
Here, we assume that all seasons should be considered as completely independent… which might not be a great assumption.
db = data
NW = ksmooth(db$no,db$mu,kernel = "normal",bandwidth=5)
plot(data$no,data$mu)
lines(NW,col="red")
We can try to look at the curve with a larger bandwidth. The problem is that there is a missing value, at the end. If we (arbitrarily) fill it, we can run a kernel regression,
db$mu[95]=7
NW = ksmooth(db$no,db$mu,kernel = "normal",bandwidth=12)
plot(data$no,data$mu,ylim=c(6,10))
lines(NW,col="red")
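An alternative to filling the missing value arbitrarily is to drop it before smoothing, and to ask for predictions at all episode numbers. This is a minimal sketch on a toy stand-in for the ratings (the values in `toy` are simulated and hypothetical, not the survey data), using the `x.points` argument of `ksmooth`:

```r
# Toy stand-in for the ratings (hypothetical values), with one missing mean
set.seed(1)
toy <- data.frame(no = 1:96, mu = 8 + rnorm(96) / 4)
toy$mu[95] <- NA

# Smooth using the observed ratings only, but evaluate at every episode
ok <- !is.na(toy$mu)
NW <- ksmooth(toy$no[ok], toy$mu[ok], kernel = "normal",
              bandwidth = 12, x.points = toy$no)

plot(toy$no, toy$mu)
lines(NW, col = "red")
```

With this approach the smoothed value at episode 95 comes only from its neighbors, so no arbitrary rating has to be chosen.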