# Statistics - Fisher (Multiple Linear Discriminant Analysis | Multivariate Gaussian)

Gaussian LDA models each class with a multivariate Gaussian density.

Fisher first described this analysis with his Iris data set.

Fisher's linear discriminant analysis (Gaussian LDA) classifies an observation to the class whose centroid is closest, where the distance calculation takes the covariance of the variables into account.
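A minimal NumPy sketch of the "covariance-aware distance" idea, using the Mahalanobis distance (the centroids and covariance values below are made up for illustration):

```python
import numpy as np

# Hypothetical shared covariance matrix: the first variable has a
# much larger variance than the second.
Sigma = np.array([[4.0, 0.0],
                  [0.0, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)

def mahalanobis(x, mu):
    """Distance from x to centroid mu, weighted by the inverse covariance."""
    d = x - mu
    return float(np.sqrt(d @ Sigma_inv @ d))

x = np.array([2.0, 0.0])
mu_a = np.array([0.0, 0.0])
mu_b = np.array([2.0, 2.0])

# Both centroids are at the same Euclidean distance from x...
print(np.linalg.norm(x - mu_a), np.linalg.norm(x - mu_b))
# ...but the large variance along the first axis makes mu_a closer
# in the Mahalanobis sense, so LDA would favour class a.
print(mahalanobis(x, mu_a) < mahalanobis(x, mu_b))
```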

## 3 - Steps

### 3.1 - Density Function

The figures of the probability density function (PDF) were made with R.

Formula:

$$f(x)=\frac{1}{(2\pi)^{p/2}|\Sigma|^{1/2}} \, e^{-\frac{1}{2}(x-\mu)^T \Sigma^{-1}(x-\mu)}$$

This formula is a generalization of the single-variable Gaussian density to $p$ variables: $\mu$ is the mean vector and $\Sigma$ is the covariance matrix.
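The density formula can be evaluated directly in NumPy (the mean vector and covariance matrix below are made-up example values):

```python
import numpy as np

# Made-up parameters for a p = 2 multivariate Gaussian.
mu = np.array([0.0, 0.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

def gaussian_pdf(x, mu, Sigma):
    """Multivariate Gaussian density f(x), computed straight from the formula."""
    p = len(mu)
    d = x - mu
    norm = (2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma))
    return float(np.exp(-0.5 * d @ np.linalg.inv(Sigma) @ d) / norm)

# The density peaks at the mean and decays away from it.
print(gaussian_pdf(mu, mu, Sigma))
print(gaussian_pdf(np.array([2.0, 2.0]), mu, Sigma))
```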

### 3.2 - Discriminant Function

The discriminant functions tell you how to classify.

The idea is to compute one discriminant function for each class, and then classify the observation to the class for which the discriminant function is largest.

If you go through the simplifications with linear algebra, the terms cancel and you get the formula below:

$$\delta_k(x) = x^T \Sigma^{-1} \mu_k - \frac{1}{2} \mu^T_k \Sigma^{-1} \mu_k + \log \pi_k$$

where:

• $x^T$ is the transpose of the vector $x$ containing all the variables
• $\mu^T_k$ is the transpose of the mean vector $\mu_k$ of class $k$
• $\pi_k$ is the prior probability of class $k$

Despite its complex form, $\delta_k(x)$ is still a linear function of $x$, whose coefficients form a vector.

Simplified form:

$$\delta_k(x) = c_{k0} + c_{k1}x_1 + c_{k2}x_2 + c_{k3} x_3 + \dots + c_{kp} x_p$$

where:

• $x$ is no longer written as a vector; the previous vector expression is expanded term by term: $c_{k1}x_1 + \dots + c_{kp} x_p = x^T \Sigma^{-1} \mu_k$
• $c_{k0} = - \frac{1}{2} \mu^T_k \Sigma^{-1} \mu_k + \log \pi_k$
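The discriminant rule above can be sketched in NumPy (the means, shared covariance matrix, and priors below are hypothetical example values):

```python
import numpy as np

def discriminant(x, mu_k, Sigma_inv, pi_k):
    """Linear discriminant score delta_k(x) for class k (shared covariance)."""
    return x @ Sigma_inv @ mu_k - 0.5 * mu_k @ Sigma_inv @ mu_k + np.log(pi_k)

# Hypothetical two-class, two-variable setup.
Sigma = np.array([[1.0, 0.3],
                  [0.3, 1.0]])
Sigma_inv = np.linalg.inv(Sigma)
mus = [np.array([0.0, 0.0]), np.array([2.0, 2.0])]
priors = [0.5, 0.5]

# Classify a point near the second centroid: pick the largest delta_k(x).
x = np.array([1.8, 1.9])
scores = [discriminant(x, mu, Sigma_inv, p) for mu, p in zip(mus, priors)]
print(int(np.argmax(scores)))  # index of the winning class
```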

### 3.3 - Probabilities

Once we have estimates $\hat{\delta}_k(x)$, we can turn these into estimates for class probabilities:

$$\hat{Pr}(Y=k|X=x) = \frac{e^{\hat{\delta}_k(x)}}{\sum_{l=1}^{K} e^{\hat{\delta}_l(x)}}$$
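This is the softmax transformation of the discriminant scores; a small sketch (the score values are made up):

```python
import numpy as np

# Hypothetical discriminant scores delta_hat_k(x) for K = 3 classes.
deltas = np.array([1.2, 0.4, -0.7])

# Softmax turns scores into probabilities; subtracting the max first
# keeps the exponentials numerically stable without changing the result.
exp_d = np.exp(deltas - deltas.max())
probs = exp_d / exp_d.sum()

print(probs)
print(probs.sum())          # the probabilities sum to 1
print(int(probs.argmax()))  # same class as the largest delta
```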

### 3.4 - Classification

So classifying to the largest $\hat{\delta}_k(x)$ amounts to classifying to the class for which $\hat{Pr}(Y=k|X=x)$ is largest.

When $K = 2$, we classify to class 2 if $\hat{Pr}(Y=2|X=x) \ge 0.5$, and to class 1 otherwise.
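The two-class rule is a one-liner; a sketch with hypothetical posterior values:

```python
def classify_binary(p2):
    """Classify to class 2 when its posterior probability is at least 0.5,
    otherwise to class 1 (the K = 2 rule from the text)."""
    return 2 if p2 >= 0.5 else 1

print(classify_binary(0.73))  # a posterior of 0.73 for class 2 -> class 2
print(classify_binary(0.31))  # a posterior of 0.31 for class 2 -> class 1
```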

## 4 - Illustration

### 4.1 - p = 2 and K = 3 classes

Here $\pi_1 = \pi_2 = \pi_3 = \frac{1}{3}$.

The dashed lines are known as the Bayes decision boundaries. If they were known, they would yield the fewest misclassification errors among all possible classifiers.