Data Mining - ReSampling Validation
1 - About
Resampling methods are a class of methods that estimate the test error by holding out a subset of the training set from the fitting process and then applying the statistical learning method to those held-out observations.
The training error is too optimistic. The more closely we fit the data, the lower the training error becomes, but the test error can rise if we overfit, and it often will. For this reason, models are fitted on part of the data and then evaluated on a holdout set (test set).
- the test set, for testing the model: the model is evaluated on a separate test set
However, if the data is already split into Build and Test subsets, you can run a Build activity and skip the Split step.
Of course, the build data (training data) and test data must have the same column structure.
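The gap between training error and test error can be illustrated with a small sketch. Everything in it is an assumption made for illustration: the synthetic data, the 20% label noise, and the choice of a 1-nearest-neighbour classifier, which memorises its training points and so reaches zero training error as long as no two training points share the same x value.

```python
import random

def one_nn_predict(train, x):
    """Predict the label of x as the label of the closest training point."""
    nearest_x, nearest_label = min(train, key=lambda point: abs(point[0] - x))
    return nearest_label

def error_rate(train, data):
    """Fraction of (x, label) pairs in data that the 1-NN model gets wrong."""
    wrong = sum(1 for x, label in data if one_nn_predict(train, x) != label)
    return wrong / len(data)

def make_data(n, rng, noise=0.2):
    """Synthetic data: label is 1 when x > 0.5, flipped with probability `noise`."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < noise:
            label = 1 - label
        data.append((x, label))
    return data

rng = random.Random(0)
train_set = make_data(100, rng)
test_set = make_data(200, rng)

# On the training set each point is its own nearest neighbour.
train_error = error_rate(train_set, train_set)
# Held-out points were not memorised, so the noise shows up as errors.
test_error = error_rate(train_set, test_set)
print(f"training error: {train_error:.2f}, test error: {test_error:.2f}")
```

The training error says nothing about how the model behaves on new rows, which is why the error is measured on the held-out set instead.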
2 - Articles Related
3 - Methods
Practical rule of thumb:
- Lots of data? – use percentage split
- Cross-validation is a very important tool to get a good idea of the test set error of a model.
3.1 - Two-fold Validation
Two-fold validation: randomly divide the available set of samples into two parts: a training set and a validation or hold-out set.
3.2 - Percentage Split
Percentage split (fixed or holdout): leave out a random N% of the data. For example, you might select 60% of the rows for building the model and 40% for testing the model. The algorithm is trained on the training data and the accuracy is calculated on the held-out test data.
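A minimal sketch of a 60/40 percentage split, assuming the rows fit in a Python list. The function name `percentage_split` and its parameters are hypothetical and do not come from any particular tool.

```python
import random

def percentage_split(rows, train_fraction=0.6, seed=0):
    """Shuffle the rows, then cut them into a build part and a test part."""
    shuffled = list(rows)                    # copy, so the input is not modified
    random.Random(seed).shuffle(shuffled)    # random order, reproducible via seed
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

rows = list(range(10))
build_rows, test_rows = percentage_split(rows, train_fraction=0.6)
print(len(build_rows), len(test_rows))
```

Shuffling before cutting matters when the rows are stored in a meaningful order (for example sorted by class), because a plain head/tail cut would then give unrepresentative subsets.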
3.3 - Cross-validation
Also known as validation:
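The procedure is not spelled out here, so the following is only a sketch of the common k-fold variant: the rows are divided into k parts, and each part serves once as the test set while the remaining parts are used for training. The helper name `k_fold_splits` and the default k=5 are assumptions.

```python
import random

def k_fold_splits(n_rows, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs, one pair per fold."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of roughly equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold

splits = list(k_fold_splits(10, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), "training rows,", len(test_idx), "test rows")
```

The test errors measured on the k folds are then averaged to give the cross-validation estimate of the test error.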
3.4 - Bootstrap
Bootstrap: generate new training sets by sampling with replacement.
Bootstrap is a very clever device for using the one, single training sample you have to estimate things like standard deviations.
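As a rough sketch of the idea, the standard deviation of a statistic (here the mean) can be estimated by recomputing it on many resamples drawn with replacement from the single sample. The function name `bootstrap_std`, the number of resamples, and the sample values are illustrative assumptions.

```python
import random
import statistics

def bootstrap_std(sample, statistic=statistics.mean, n_resamples=1000, seed=0):
    """Estimate the standard deviation of `statistic` by bootstrapping."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_resamples):
        # Same size as the original sample, drawn with replacement.
        resample = rng.choices(sample, k=len(sample))
        estimates.append(statistic(resample))
    return statistics.stdev(estimates)

sample = [2.1, 3.4, 1.9, 5.0, 4.2, 3.3, 2.8, 4.7]
std_of_mean = bootstrap_std(sample)
print(f"bootstrap standard deviation of the mean: {std_of_mean:.3f}")
```

Any other statistic, such as `statistics.median`, can be passed in the same way, which is useful when no simple formula for its standard deviation is available.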