R - Bootstrap

About

Bootstrap lets you get a look at the sampling distribution of statistics, for which it's really hard to develop theoretical versions. Bootstrap gives us a really easy way of doing statistics when the theory is very hard.

For example, if we calculate a statistics <math>\alpha</math> through a non-linear formula, the theory in this case is very hard and it's therefore difficult to answer the following questions:

what is the sampling variability of <math>\alpha</math> ?
what is the standard error of <math>\alpha</math> ?
How variable is it going to be?

This is a case where the bootstrap really helps out in order to answer this questions.

Bootstrap is very handy way of getting very good reliable estimates of standard error for nasty statistics.

The Standard Bootstrap

The non-linear function

alpha=function(x,y){
  vx=var(x)
  vy=var(y)
  cxy=cov(x,y)
  (vy-cxy)/(vx+vy-2*cxy)
}

The <math>\alpha</math> function will return the last line that was evaluated.

Bootstrap Wrapper Function

alpha.fn=function(data, index){
  with(data[index,],alpha(X,Y))
}

where:

Index represents a bootstrap sample index with number between 1 and N. As the bootstrap does a re-sample of our training observations and some observations can be represented more than once and some not at all, the index may have the same observation index more than once, once or not at all.
the function, with takes first argument of data frame and then some commands. Using the data in the data frame, execute the commands. And the main value of with is that you can use the named variables x and y that are in the data frame.

Bootstrap

Since a bootstrap involves random sampling and that we want reproducible results we set the random number seed.

require(boot)
set.seed(1)

#  1,000 bootstraps.
boot.out=boot(myDataFrame,alpha.fn,R=1000)
boot.out

ORDINARY NONPARAMETRIC BOOTSTRAP


Call:
boot(data = myDataFrame, statistic = alpha.fn, R = 1000)


Bootstrap Statistics :
     original        bias    std. error
t1* 0.5758321 -7.315422e-05  0.08861826

This is really a fast function.

Plot

plot(boot.out)

The first plot is a histogram of the bootstrap sampling distribution.
The second plot is a qqplot, which plots the ordered values against the ordered statistics of a Gaussian. If it lines up on a straight line, you may say that the first plot looks like a pretty nice symmetric gaussian distribution

The block bootstrap

Correlation

A block bootstrap must be applied when there is a strong correlation between consecutive rows of the data matrix.

To see this correlation, you can plot the matrix.

matplot(dataFrame,type="l")

Effect of correlation on the standard error

When there is a strong correlation and that the data points are repeating, the sample size is in reality much smaller than the number of rows.

Which means that the standard error is underestimate because standard errors is divided by the sample size.

The wrapper function

getBetaX1=function(myData){
  fit=lm(y~X1+X2,data=myData)
  fit$coefficients["X1"]
}

From a data frame, it fits a linear model and return the regression coefficient of the variable X1.

The time series Bootstrapping function

Block bootstrap with a block size of 100

require(boot)
set.seed(1)

tsboot(myDataFrame,getBetaX1,R=50,sim="fixed",l=100)

BLOCK BOOTSTRAP FOR TIME SERIES

Fixed Block Length of 100 

Call:
tsboot(tseries = Xy, statistic = getBetaX1, R = 50, l = 100, 
    sim = "fixed")


Bootstrap Statistics :
     original     bias    std. error
t1* 0.1453263 0.06305497   0.1955884