Machine Learning - K-Nearest Neighbors (KNN) algorithm - Instance based learning
Table of Contents
1 - About
“Nearest‐neighbor” learning is also known as “Instance‐based” learning.
K-Nearest Neighbors, or KNN, is a family of simple:
- and regression algorithms
based on Similarity (Distance) calculation between instances.
Nearest Neighbor implements rote learning. It's based on a local average calculation.
It's a smoother algorithm.
Some experts have written that k-nearest neighbours do the best about one third of the time. It's so simple that, in the game of doing classification, you always want to have it in your toolbox.
If you use the nearest neighbor algorithm, take into account the fact that points near the boundary have fewer neighbors because some neighbors may be outside the boundary. You need to correct this bias.
2 - Articles Related
3 - Type of Model
3.1 - Regression
3.1.1 - Mean
3.1.2 - Neighbourhood
When they are too few data points for a X value in order to calculate the mean, the neighbourhood can be used.
3.2 - Classification
4 - Decision boundary
The nearest neighbor methode produces a linear decision boundary. It's a little bit more complicate as it produces a piece-wise linear decision of the decision boundary with sometimes a bunch of little linear pieces.
The perpendicular bisector of the line that joins the two closest points:
5 - Remedies against
5.1 - All attributes are equally important
Knn assumes that all attributes are equally important. Remedy: attribute selection or weights
5.2 - Noisy instances
Noisy instance are instance with a bad target class.
If the dataset is noisy , then by accident we might find an incorrectly classified training instance as the nearest one to our test instance.
You can guard against that by using:
- Weight instances according to prediction accuracy
- Identification of reliable “prototypes” for each class
6 - K Parameter
The K parameter is the first method to guard this method against noisy dataset.
6.1 - Best k value
An obvious issue with k nearest neighbour is how to choose a suitable value for the number of nearest neighbours used.
- If it's too small, the method is susceptible to noise in the data.
- If it's too large, the decision is smeared out, covering too great an area of instance space
The X axis shows the formula 1/K because since K is the number of neighbours, 1/k ⇐ 1.
6.2 - With a K close to the size of the whole data set
If we set k to be an extreme value, close to the size of the whole data set, then we're taking the distance of the test instance to all of points in the dataset and averaging those which will probably gives us something close to the baseline accuracy.
7 - Advantage/Inconvenient
7.1 - Curse of dimensionality
There is a theoretical guarantee that with a huge dataset and large values of k, you're going to get good results from nearest neighbour learning.
Nearest neighbourhood methods can be lousy when p (the number of variable) is large because of the curse of dimensionality. In high dimension, it's really difficult to stay local.
7.2 - Slow
slow as you need to scan entire training data to make each prediction.
8 - How to
8.1 - Discover if the dataset is noisy
By changing the k data set if the accuracy changes a lot between 1 and 5 k it's may be a noisy data set If the data set is noisy, the accuracy figures improves as k got little bit larger but then it would be starting to decrease again.
9 - Implementation
9.1 - Weka
In weka it's called IBk (instance-bases learning with parameter k) and it's in the lazy class folder. KNN is the K parameter. IBk's KNN parameter specifies the number of nearest neighbors to use when classifying a test instance, and the outcome is determined by majority vote.
Weka's IBk implementation has the “cross-validation” option that can help by choosing the best value automatically Weka uses cross-validation to select the best value for KNN (which is the same as k).