Quantcast
Viewing all articles
Browse latest Browse all 57

Some heuristics about local regression and kernel smoothing

In a standard linear model, we assume that Image may be NSFW.
Clik here to view.
. Alternatives can be considered, when the linear assumption is too strong.

  • Polynomial regression

A natural extension might be to assume some polynomial function,

Image may be NSFW.
Clik here to view.

Again, in the standard linear model approach (with a conditional normal distribution using the GLM terminology), parameters Image may be NSFW.
Clik here to view.
can be obtained using least squares, where a regression of Image may be NSFW.
Clik here to view.
 on Image may be NSFW.
Clik here to view.
 is considered.

Even if this polynomial model is not the real one, it might still be a good approximation for Image may be NSFW.
Clik here to view.
. Actually, from Stone-Weierstrass theorem, if Image may be NSFW.
Clik here to view.
 is continuous on some interval, then there is a uniform approximation of Image may be NSFW.
Clik here to view.
 by polynomial functions.

Just to illustrate, consider the following (simulated) dataset

set.seed(1)
n=10
xr = seq(0,n,by=.1)
yr = sin(xr/2)+rnorm(length(xr))/2
db = data.frame(x=xr,y=yr)
plot(db)

Image may be NSFW.
Clik here to view.

with the standard regression line

reg = lm(y ~ x,data=db)
abline(reg,col="red")

Image may be NSFW.
Clik here to view.

Consider some polynomial regression. If the degree of the polynomial function is large enough, any kind of pattern can be obtained,

reg=lm(y~poly(x,5),data=db)

Image may be NSFW.
Clik here to view.

But if the degree is too large, then too many ‘oscillations’ are obtained,

reg=lm(y~poly(x,25),data=db)

Image may be NSFW.
Clik here to view.

and the estimation might be be seen as no longer robust: if we change one point, there might be important (local) changes

plot(db)
attach(db)
lines(xr,predict(reg),col="red",lty=2)
yrm=yr;yrm[31]=yr[31]-2 
regm=lm(yrm~poly(xr,25)) 
lines(xr,predict(regm),col="red")
Image may be NSFW.
Clik here to view.
  • Local regression

Actually, if our interest is to have locally a good approximation of  Image may be NSFW.
Clik here to view.
, why not use a local regression?

This can be done easily using a weighted regression, where, in the least square formulation, we consider

Image may be NSFW.
Clik here to view.

(it is possible to consider weights in the GLM framework, but let’s keep that for another post). Two comments here:

  • here I consider a linear model, but any polynomial model can be considered. Even a constant one. In that case, the optimization problem is

Image may be NSFW.
Clik here to view.
which can be solve explicitly, since

Image may be NSFW.
Clik here to view.

  • so far, nothing was mentioned about the weights. The idea is simple, here: if you can a good prediction at point Image may be NSFW.
    Clik here to view.
    , then Image may be NSFW.
    Clik here to view.
     should be proportional to some distance between Image may be NSFW.
    Clik here to view.
     and Image may be NSFW.
    Clik here to view.
    : if Image may be NSFW.
    Clik here to view.
     is too far from Image may be NSFW.
    Clik here to view.
    , then it should not have to much influence on the prediction.

For instance, if we want to have a prediction at some point Image may be NSFW.
Clik here to view.
, consider Image may be NSFW.
Clik here to view.
. With this model, we remove observations too far away,

Image may be NSFW.
Clik here to view.

Actually, here, it is the same as

reg=lm(yr~xr,subset=which(abs(xr-x0)<1)

A more general idea is to consider some kernel function Image may be NSFW.
Clik here to view.
that gives the shape of the weight function, and some bandwidth (usually denoted h) that gives the length of the neighborhood, so that

Image may be NSFW.
Clik here to view.

This is actually the so-called Nadaraya-Watson estimator of function Image may be NSFW.
Clik here to view.
.
In the previous case, we did consider a uniform kernel Image may be NSFW.
Clik here to view.
, with bandwith Image may be NSFW.
Clik here to view.
,

But using this weight function, with a strong discontinuity may not be the best idea… Why not a Gaussian kernel,

Image may be NSFW.
Clik here to view.

This can be done using

fitloc0 = function(x0){
w=dnorm((xr-x0))
reg=lm(y~1,data=db,weights=w)
return(predict(reg,newdata=data.frame(x=x0)))}

On our dataset, we can plot

ul=seq(0,10,by=.01)
vl0=Vectorize(fitloc0)(ul)
u0=seq(-2,7,by=.01)
linearlocalconst=function(x0){
w=dnorm((xr-x0))
plot(db,cex=abs(w)*4)
lines(ul,vl0,col="red")
axis(3)
axis(2)
reg=lm(y~1,data=db,weights=w)
u=seq(0,10,by=.02)
v=predict(reg,newdata=data.frame(x=u))
lines(u,v,col="red",lwd=2)
abline(v=c(0,x0,10),lty=2)
}
linearlocalconst(2)

Here, we want a local regression at point 2. The horizonal line below is the regression (the size of the point is proportional to the wieght). The curve, in red, is the evolution of the local regression

Image may be NSFW.
Clik here to view.

Let us use an animation to visualize the construction of the curve. One can use

library(animate)

but for some reasons, I cannot install the package easily on Linux. And it is not a big deal. We can still use a loop to generate some graphs

vx0=seq(1,9,by=.1)
vx0=c(vx0,rev(vx0))
graphloc=function(i){
name=paste("local-reg-",100+i,".png",sep="")
png(name,600,400)
linearlocalconst(vx0[i])
dev.off()}

for(i in 1:length(vx0)) graphloc(i)

and then, in a terminal, I simply use

    convert -delay 25 /home/freak/local-reg-1*.png /home/freak/local-reg.gif

Image may be NSFW.
Clik here to view.

Of course, it is possible to consider a linear model, locally,

fitloc1 = function(x0){
w=dnorm((xr-x0))
reg=lm(y~poly(x,degree=1),data=db,weights=w)
return(predict(reg,newdata=data.frame(x=x0)))}

Image may be NSFW.
Clik here to view.

or even a quadratic (local) regression,

fitloc2 = function(x0){
w=dnorm((xr-x0))
reg=lm(y~poly(x,degree=2),data=db,weights=w)
return(predict(reg,newdata=data.frame(x=x0)))}

Image may be NSFW.
Clik here to view.

Of course, we can change the bandwidth

Image may be NSFW.
Clik here to view.

To conclude the technical part this post, observe that, in practise, we have to choose the shape of the weight function (the so-called kernel). But there are (simple) technique to select the “optimal” bandwidth h. The idea of cross validation is to consider

Image may be NSFW.
Clik here to view.

where Image may be NSFW.
Clik here to view.
 is the prediction obtained using a local regression technique, with bandwidth Image may be NSFW.
Clik here to view.
. And to get a more accurate (and optimal) bandwith Image may be NSFW.
Clik here to view.
 is obtained using a model estimated on a sample where the ith observation was removed. But again, that is not the main point in this post, so let’s keep that for another one…

Perhaps we can try on some real data? Inspired from a great post on http://f.briatte.org/teaching/ida/092_smoothing.html, by François Briatte, consider the Global Episode Opinion Survey, from some TV show, http://geos.tv/index.php/index?sid=189 , like Dexter.

library(XML)
library(downloader)
file = "geos-tww.csv"
html = htmlParse("http://www.geos.tv/index.php/list?sid=189&collection=all")
html = xpathApply(html, "//table[@id='collectionTable']")[[1]]
data = readHTMLTable(html)
data = data[,-3]
names(data)=c("no",names(data)[-1])
data=data[-(61:64),]

Let us reshape the dataset,

data$no = 1:96
data$mu = as.numeric(substr(as.character(data$Mean), 0, 4))
data$se =  sd(data$mu,na.rm=TRUE)/sqrt(as.numeric(as.character(data$Count)))
data$season = 1 + (data$no - 1)%/%12
data$season = factor(data$season)
plot(data$no,data$mu,ylim=c(6,10))
segments(data$no,data$mu-1.96*data$se,
data$no,data$mu+1.96*data$se,col="light blue")

Image may be NSFW.
Clik here to view.

As done by François, we compute some kind of standard error, just to reflect uncertainty. But we won’t really use it.

plot(data$no,data$mu,ylim=c(6,10))
abline(v=12*(0:8)+.5,lty=2)
for(s in 1:8){reg=lm(mu~no,data=db,subset=season==s)
lines((s-1)*12+1:12,predict(reg)[1:12],col="red") }

Image may be NSFW.
Clik here to view.

Henre, we assume that all seasons should be considered as completely independent… which might not be a great assumption.

db = data
NW = ksmooth(db$no,db$mu,kernel = "normal",bandwidth=5)
plot(data$no,data$mu)
lines(NW,col="red")

Image may be NSFW.
Clik here to view.

We can try to look the curve with a larger bandwidth. The problem is that there is a missing value, at the end. If we (arbitrarily) fill it, we can run a kernel regression,

db$mu[95]=7
NW = ksmooth(db$no,db$mu,kernel = "normal",bandwidth=12) 
plot(data$no,data$mu,ylim=c(6,10)) 
lines(NW,col="red")

Image may be NSFW.
Clik here to view.


Viewing all articles
Browse latest Browse all 57