Data Mining - Outliers Cases


About

Outliers are cases that are unusual because they fall outside the distribution that is considered normal for the data.

The distance from the centre of a normal distribution indicates how typical a given point is with respect to the distribution of the data. Each case can be ranked according to the probability that it is either typical or atypical.
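As a minimal sketch of this ranking, the snippet below standardises each value into a z-score (its distance from the mean in standard deviations) and converts it into a two-sided tail probability under a normal model. The function name `rank_by_atypicality` and the sample values are illustrative assumptions, not part of any particular tool, and the probabilities are only meaningful if the data is roughly normal.

```python
from statistics import NormalDist, mean, stdev

def rank_by_atypicality(values):
    """Return (value, z_score, tail_probability) tuples, most atypical first.

    Hypothetical helper: assumes the values are roughly normally distributed.
    """
    m = mean(values)
    s = stdev(values)  # sample standard deviation
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    ranked = []
    for v in values:
        z = (v - m) / s
        # Probability of seeing a value at least this far from the mean.
        tail = 2 * (1 - standard_normal.cdf(abs(z)))
        ranked.append((v, z, tail))
    # The largest absolute z-score is the least typical case.
    return sorted(ranked, key=lambda item: abs(item[1]), reverse=True)

# Made-up sample: four similar values and one distant one.
ranking = rank_by_atypicality([10, 11, 9, 10, 30])
```

A small tail probability marks a case as atypical; a value close to 1 marks it as typical.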

The presence of outliers can have a deleterious effect on many forms of data mining. Anomaly detection can be used to identify outliers before mining the data.

In a multidimensional dataset, outliers may only appear when several dimensions are considered together, whereas on any single dimension they are not far from the mean or median.
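One way to illustrate this is the Mahalanobis distance, which measures distance from the centre while taking the correlation between dimensions into account. The sketch below is a two-dimensional version written with the standard library only; the function name `mahalanobis_2d` and the points are made up for illustration, and this is one possible technique rather than the only one.

```python
from statistics import mean, stdev

def mahalanobis_2d(points):
    """Mahalanobis distance of each (x, y) point from the centre of the data."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    n = len(points)
    mx, my = mean(xs), mean(ys)
    # Sample variances and covariance.
    sxx = sum((x - mx) ** 2 for x in xs) / (n - 1)
    syy = sum((y - my) ** 2 for y in ys) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in points) / (n - 1)
    det = sxx * syy - sxy ** 2  # determinant of the 2x2 covariance matrix
    distances = []
    for x, y in points:
        dx, dy = x - mx, y - my
        # Squared distance: [dx dy] * inverse(covariance) * [dx dy]^T
        d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
        distances.append(d2 ** 0.5)
    return distances

# Hypothetical data: y follows x closely, except for the last point (3, 7).
points = [(1, 1.1), (2, 1.9), (3, 3.2), (4, 3.8), (5, 5.1),
          (6, 5.9), (7, 7.2), (8, 7.8), (9, 9.1), (3, 7)]

distances = mahalanobis_2d(points)
most_unusual = max(range(len(points)), key=lambda i: distances[i])

# Per-dimension z-scores of that same point stay below 1 in absolute value.
xs = [p[0] for p in points]
ys = [p[1] for p in points]
x, y = points[most_unusual]
z_x = (x - mean(xs)) / stdev(xs)
z_y = (y - mean(ys)) / stdev(ys)
```

Here the point (3, 7) is unremarkable on each axis taken separately, yet it has the largest Mahalanobis distance because it breaks the relationship between the two dimensions.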

Example

For example, census data might show:

  • a median household income of 70,000
  • and a mean household income of 80,000,

but one or two households might have an income of 200,000. These cases would probably be identified as outliers.
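The effect of such a household on the summary statistics can be sketched with a small made-up sample. The incomes below are hypothetical, chosen only so that the median is 70,000 and the mean is 80,000 as in the example above.

```python
from statistics import mean, median

# Hypothetical household incomes, including one household at 200,000.
incomes = [45_000, 50_000, 55_000, 60_000, 70_000,
           75_000, 80_000, 85_000, 200_000]

with_outlier = (mean(incomes), median(incomes))       # (80000, 70000)

# Drop the extreme household and recompute.
trimmed = [v for v in incomes if v != 200_000]
without_outlier = (mean(trimmed), median(trimmed))    # (65000, 65000.0)
```

In this sample, removing the single extreme household moves the mean by 15,000 but the median by only 5,000.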

How to find them

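A common rule of thumb treats values that lie more than three standard deviations from the mean as outliers. In a normal distribution, about 99.7% of the data falls within three standard deviations of the mean, so only about 0.3% falls outside that threshold.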
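A minimal sketch of the three-standard-deviation rule, using only the Python standard library, might look like this. The function name `three_sigma_outliers`, the parameter `k`, and the sample values are assumptions made for illustration. The last lines check how much of a normal distribution lies within three standard deviations of the mean.

```python
from statistics import NormalDist, mean, stdev

def three_sigma_outliers(values, k=3.0):
    """Return the values lying more than k standard deviations from the mean."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation
    return [v for v in values if abs(v - m) > k * s]

# Made-up sample: thirty values close to 10 and one extreme value.
values = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10] * 3 + [100]
outliers = three_sigma_outliers(values)

# Share of a normal distribution within three standard deviations of the mean.
standard_normal = NormalDist()
coverage = standard_normal.cdf(3) - standard_normal.cdf(-3)  # roughly 0.997
```

Because the mean and the standard deviation are themselves not resistant, an extreme value inflates both, so in very small samples this rule may fail to flag anything.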







