Data Mining - ReSampling Validation

1 - About

Resampling methods are a class of methods that estimate the test error by holding out a subset of the training set from the fitting process and then applying the statistical learning method to those held-out observations.

For supervised learning problems, separate data sets are required for building (training) and testing the predictive models.

The training error is too optimistic. The more closely we fit the data, the lower the training error becomes, but the test error can rise if we overfit, and it often will. For this reason, models are fitted on part of the data and then evaluated on a holdout set (test set).

Generally, the Build Activity splits the data into two mutually exclusive subsets:

  • the training set, for building the model. Performance estimates obtained by evaluating the model on the training set are overly optimistic because of overfitting.
  • the test set, for testing the model. The model is evaluated on this separate set, which it has not seen during building.

However, if the data is already split into Build and Test subsets, you can run a Build activity and skip the Split step.

Of course, the build data (training data) and test data must have the same column structure.
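
The optimism of the training error described above can be seen in a small experiment. The sketch below is only an illustration under stated assumptions: the data are synthetic (y = x plus Gaussian noise), the model is a simple k-nearest-neighbour regressor, and the names knn_predict, mean_squared_error and results are hypothetical helpers, not part of any tool mentioned on this page.

```python
import random


def knn_predict(train, x, k):
    """Predict y at x as the mean y of the k nearest training points."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mean_squared_error(train, data, k):
    """Mean squared error on `data` of a k-NN model built from `train`."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in data) / len(data)


# Assumption: synthetic data, y = x + Gaussian noise with standard deviation 1.
rng = random.Random(0)
points = []
for _ in range(200):
    x = rng.uniform(0, 10)
    points.append((x, x + rng.gauss(0, 1)))

# Two mutually exclusive subsets: 100 rows for building, 100 for testing.
train, test = points[:100], points[100:]

results = {}
for k in (1, 10):
    train_mse = mean_squared_error(train, train, k)  # evaluated on data already seen
    test_mse = mean_squared_error(train, test, k)    # evaluated on held-out data
    results[k] = (train_mse, test_mse)
    print(f"k={k:2d}  training MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```

With k = 1 the model memorizes the training set, so its training error is zero, while its error on the held-out test set is not. Only the test-set figure says anything about performance on new data.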

3 - Methods

Practical rule of thumb for resampling methods:

  • Cross-validation is a very important tool to get a good idea of the test set error of a model.
  • Bootstrap, on the other hand, is most useful to get an idea of the variability or standard deviation of an estimate and its bias.

3.1 - Two-fold Validation

Two-fold validation: randomly divide the available set of samples into two parts: a training set and a validation or hold-out set.

3.2 - Percentage Split

Percentage Split (Fixed or Holdout): leave out a random N% of the data. For example, you might select 60% of the rows for building the model and 40% for testing it. The algorithm is trained on the training data, and the accuracy is calculated on the held-out test data.
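
A 60/40 split like the one in the example can be sketched in a few lines. This is a minimal illustration, not the implementation of any particular tool: the function name percentage_split, its parameters and the fixed seed are assumptions made for the example.

```python
import random


def percentage_split(rows, train_fraction=0.6, seed=42):
    """Randomly split rows into mutually exclusive training and test sets."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be strictly between 0 and 1")
    shuffled = list(rows)                   # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)   # seeded shuffle for reproducibility
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data set of 100 rows, identified here by their row number.
rows = list(range(100))
train_rows, test_rows = percentage_split(rows, train_fraction=0.6)
print(len(train_rows), len(test_rows))
```

Shuffling before cutting matters: if the rows are stored in some meaningful order (by date or by class, for instance), taking the first 60% without shuffling would give a training set that is not representative of the test set.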

3.3 - Cross-validation

Also known as validation:
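
As a rough sketch of how k-fold cross-validation estimates the test error: the data are divided into k folds, each fold is held out once while the model is built on the remaining folds, and the k hold-out errors are averaged. The helper names k_fold_indices and cross_validate, and the trivial "predict the training mean" model, are assumptions chosen to keep the example short.

```python
import random
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Shuffle the row indices 0..n-1 and deal them into k disjoint folds."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(rows, k, fit, loss, seed=0):
    """Average hold-out loss over k folds; each row is held out exactly once."""
    scores = []
    for held_out in k_fold_indices(len(rows), k, seed):
        held = set(held_out)
        train = [rows[i] for i in range(len(rows)) if i not in held]
        valid = [rows[i] for i in held_out]
        model = fit(train)  # build on the other k-1 folds only
        scores.append(mean(loss(model, row) for row in valid))
    return mean(scores)


# Hypothetical example: the "model" is just the mean of the training values,
# and the loss is the squared error of that prediction.
values = [float(v) for v in range(20)]
cv_mse = cross_validate(values, k=5, fit=mean, loss=lambda m, y: (y - m) ** 2)
print(f"5-fold cross-validated MSE: {cv_mse:.2f}")
```

Any model can be plugged in through fit and loss; the important point is that the model scored on a fold never saw that fold while it was being built.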

3.4 - Bootstrap

Bootstrap: generate new training sets by sampling with replacement.

Bootstrap is a very clever device for using the one, single training sample you have to estimate things like standard deviations.
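
A minimal sketch of that idea, assuming a synthetic sample and a hypothetical helper named bootstrap_std_error: the single sample is resampled with replacement many times, the statistic is recomputed on each resample, and the spread of those values serves as an estimate of the statistic's standard deviation.

```python
import random
from statistics import mean, median, stdev


def bootstrap_std_error(sample, statistic, n_resamples=1000, seed=0):
    """Estimate the standard deviation of `statistic` by bootstrapping `sample`."""
    rng = random.Random(seed)
    n = len(sample)
    estimates = [
        statistic(rng.choices(sample, k=n))  # draw n values with replacement
        for _ in range(n_resamples)
    ]
    return stdev(estimates)


# Assumption: a synthetic sample of 100 draws from a normal distribution.
data_rng = random.Random(1)
sample = [data_rng.gauss(50, 10) for _ in range(100)]

se_mean = bootstrap_std_error(sample, mean)
se_median = bootstrap_std_error(sample, median)
print(f"bootstrap SE of the mean:   {se_mean:.2f}")
print(f"bootstrap SE of the median: {se_median:.2f}")
```

For the mean, the result can be checked against the usual formula s / sqrt(n). The bootstrap is most valuable for statistics such as the median, where no equally simple formula is available.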

data_mining/resampling.txt · Last modified: 2014/05/11 09:32 by gerardnico