# Statistics - Correlation (Coefficient analysis)


## 1 - About

Correlation is a statistical analysis used to measure and
describe the relationship between **two** variables.

The correlation coefficient is a statistic that ranges between -1 and +1:

- +1 is a perfect positive correlation: when the score on one variable goes up, the score on the other goes up as well.
- above 0.8 (in absolute value) is a strong correlation
- above 0.4 is a high correlation
- above 0.2 indicates some correlation
- below 0.2 is not a strong correlation
- below 0.1 indicates essentially no correlation
- 0 is no correlation (independence)
- -1 is a perfect negative correlation: when the score on one variable goes up, the score on the other goes down.
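As a minimal sketch of how this coefficient behaves at the two extremes (pure Python with made-up data; the function name `pearson_r` is illustrative, not from any library):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of the cross products of the deviations from the means
    cross = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    # Square roots of the sums of squared deviations
    sx = math.sqrt(sum((xi - mean_x) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return cross / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))   # ≈ +1: perfect positive correlation
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))   # ≈ -1: perfect negative correlation
```

When one list is an exact linear function of the other, the coefficient hits the boundary of the range; real data falls somewhere in between.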

Correlation is used:

- mainly to describe a relationship,
- and then for prediction, which leads to regression: when two variables are correlated, one variable can be used to predict the other. In other words, if the variables X and Y are correlated, a regression can be used to predict scores on Y from the scores on X.

Correlation demonstrates the relationship between two variables whereas regression provides an **equation** (with two or more variables) which is used to predict scores on an outcome variable.
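To make the link between correlation and regression concrete, here is a minimal sketch (pure Python, made-up data; the function name `fit_line` is illustrative) that fits the least-squares line and uses its equation to predict scores on Y from X:

```python
def fit_line(x, y):
    """Least-squares line y = a + b*x; the slope b equals r * (sd_y / sd_x)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b = sxy / sxx                # slope
    a = mean_y - b * mean_x      # intercept
    return a, b

a, b = fit_line([1, 2, 3], [3, 5, 7])   # perfectly correlated toy data
print(a, b)                              # intercept 1.0, slope 2.0
predicted = a + b * 10                   # predict the Y score for a new X score
print(predicted)                         # 21.0
```

The fitted equation `y = a + b*x` is exactly the "equation used to predict scores on an outcome variable" mentioned above.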

A positive correlation only means that the slope of the univariate regression is positive. In a multiple regression, the sign (positive or negative) of a coefficient depends on the other variables.

## 2 - Articles Related

## 3 - Assumptions

## 4 - Correlation does not imply causation

Correlation does not imply causation, but correlations are still useful because they can be used to assess the strength and direction of the relationship between two variables.

## 5 - Type

There are several types of correlation coefficients for different variable types:

- Pearson product-moment correlation coefficient (r) (when both variables are continuous)
- Point-biserial correlation (when one variable is continuous and one is dichotomous)
- Phi coefficient (when both variables are dichotomous)
- Spearman rank correlation (when both variables are ordinal)
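As an illustration of the last entry, a Spearman rank correlation can be computed as a Pearson correlation on the ranks of the data. A minimal pure-Python sketch (assuming no tied values, for simplicity; all names are illustrative):

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cross = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cross / (sx * sy)

def to_ranks(values):
    """Rank 1 = smallest value; assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        ranks[i] = rank
    return ranks

def spearman_rho(x, y):
    return pearson_r(to_ranks(x), to_ranks(y))

# A monotonic but non-linear relationship: Spearman sees a perfect correlation
print(spearman_rho([1, 2, 3, 4], [1, 8, 27, 64]))
```

Because Spearman works on ranks, it only asks whether the relationship is monotonic, which is why it suits ordinal variables.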

## 6 - Venn diagrams

A Venn diagram can represent the correlation between two variables X and Y.

The Venn diagram represents:

- all the variance in X,
- all the variance in Y,
- and the overlap (the covariance). The overlap can also be expressed as:
  - the sum of the cross products between X and Y (i.e. the definition of the covariance),
  - the sum of squares for the model.

The degree to which X and Y correlate is represented by the degree to which the two variance circles overlap. The correlation coefficient captures the systematic variance in Y that is explained by X.

The correlation approaches:

- one for a high degree of overlap,
- zero for no overlap.

The residual is the unexplained variance in Y. Some of the variance in Y is explained by the model; some of it is unexplained, and that is the residual.
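The overlap picture can be checked numerically: with a least-squares fit, the total sum of squares in Y splits into a model (explained) part and a residual (unexplained) part, and the ratio SS_model / SS_total equals r squared. A sketch on made-up data (all variable names are illustrative):

```python
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
mean_x, mean_y = sum(x) / n, sum(y) / n

# Least-squares fit of y = a + b*x
sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
sxx = sum((xi - mean_x) ** 2 for xi in x)
b = sxy / sxx
a = mean_y - b * mean_x
fitted = [a + b * xi for xi in x]

ss_total = sum((yi - mean_y) ** 2 for yi in y)               # all the variance in Y
ss_model = sum((fi - mean_y) ** 2 for fi in fitted)          # explained: the overlap
ss_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained: the residual

print(ss_model + ss_resid)   # ≈ ss_total: the two parts add up
print(ss_model / ss_total)   # r squared: the share of Y's variance explained by X
```

The decomposition SS_total = SS_model + SS_residual is the algebraic counterpart of the two circles: the overlap plus the non-overlapping part of Y's circle make up all of Y's variance.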