Data Mining - Outlier Cases

1 - About

Outliers are cases that are unusual because they fall outside the distribution that is considered normal for the data.

The distance from the centre of a normal distribution indicates how typical a given point is with respect to the distribution of the data. Each case can be ranked according to the probability that it is either typical or atypical.
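As a rough illustration of this ranking, the sketch below expresses the distance from the centre as a z-score and converts it into the probability of seeing a value at least that far from the mean. It assumes the data really are normally distributed; the function names `z_score` and `two_sided_tail_probability` and the sample figures are hypothetical and chosen only for illustration.

```python
import math


def z_score(x, mean, std):
    """Distance of x from the mean, measured in standard deviations."""
    return (x - mean) / std


def two_sided_tail_probability(z):
    """P(|Z| >= |z|) for a standard normal variable Z.

    A small value means the case is atypical under the normal assumption.
    """
    return math.erfc(abs(z) / math.sqrt(2))


# Illustrative figures only: a mean of 80,000 and a standard deviation of 40,000.
z = z_score(200_000, mean=80_000, std=40_000)
p = two_sided_tail_probability(z)
print(f"z = {z:.1f}, tail probability = {p:.4f}")
```

Sorting cases by this tail probability, smallest first, gives one possible ranking from most atypical to most typical.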

The presence of outliers can have a deleterious effect on many forms of data mining. Anomaly detection can be used to identify outliers before mining the data.

In a multidimensional dataset, outliers may only become apparent when several dimensions are considered together, whereas in any single dimension they may lie close to the mean or median.
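The following sketch illustrates this with made-up two-dimensional data in which y roughly follows x. The last point looks ordinary on each axis taken alone, but breaks the relationship between the two. The Mahalanobis distance is used here as one possible multivariate measure; it is an assumption of this example, not a method named by the text, and the data and helper names are hypothetical.

```python
import math
import statistics

# Hypothetical data: y is close to x for every point except the last one.
points = [
    (1, 1.2), (2, 1.9), (3, 3.1), (4, 4.0), (5, 5.2),
    (6, 5.8), (7, 7.1), (8, 8.0), (9, 9.1), (10, 9.9),
    (3, 8.0),  # unremarkable x, unremarkable y, unusual combination
]


def z(value, values):
    """Per-dimension z-score using the sample standard deviation."""
    return (value - statistics.mean(values)) / statistics.stdev(values)


def mahalanobis_2d(point, data):
    """Mahalanobis distance of a 2-D point from the centre of data."""
    xs = [p[0] for p in data]
    ys = [p[1] for p in data]
    mx, my = statistics.mean(xs), statistics.mean(ys)
    vxx = statistics.variance(xs)
    vyy = statistics.variance(ys)
    vxy = sum((x - mx) * (y - my) for x, y in data) / (len(data) - 1)
    det = vxx * vyy - vxy ** 2  # determinant of the 2x2 covariance matrix
    dx, dy = point[0] - mx, point[1] - my
    # Quadratic form d^T * inverse(covariance) * d, written out for the 2x2 case.
    d2 = (vyy * dx * dx - 2 * vxy * dx * dy + vxx * dy * dy) / det
    return math.sqrt(d2)


suspect = points[-1]
z_x = z(suspect[0], [p[0] for p in points])
z_y = z(suspect[1], [p[1] for p in points])
distances = [mahalanobis_2d(p, points) for p in points]

print(f"z-score on x: {z_x:.2f}, z-score on y: {z_y:.2f}")
print(f"Mahalanobis distance of suspect: {distances[-1]:.2f}")
print(f"Largest distance among the other points: {max(distances[:-1]):.2f}")
```

On each axis the suspect point is less than one standard deviation from the mean, yet its Mahalanobis distance is the largest in the dataset.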

3 - Example

For example, census data might show:

  • a median household income of $70,000
  • and a mean household income of $80,000,

but one or two households might have an income of $200,000. These cases would probably be identified as outliers.

4 - How to

4.1 - find them

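A common rule of thumb treats cases that lie more than three standard deviations from the mean as outliers. In a normal distribution, about 99.7% of the data falls within three standard deviations of the mean, so only about 0.3% falls beyond that threshold.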
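A minimal sketch of the three-standard-deviation rule follows. The function name `three_sigma_outliers`, its `k` parameter, and the income figures are hypothetical; the sample loosely echoes the household-income example and is not real census data.

```python
import statistics


def three_sigma_outliers(values, k=3.0):
    """Return the values lying more than k standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - mean) > k * std]


# Hypothetical incomes: twenty households between 60,000 and 79,000,
# plus a single household at 200,000.
incomes = list(range(60_000, 80_000, 1_000)) + [200_000]
outliers = three_sigma_outliers(incomes)
print(outliers)
```

Note that an extreme value inflates the mean and standard deviation it is compared against, so on very small samples this rule may fail to flag anything. A median-based measure of spread is a commonly used alternative in that situation.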

data_mining/outlier.txt · Last modified: 2015/04/14 16:04 by gerardnico