Statistics - (Factor Variable|Qualitative Predictor)

1 - About

A factor is a qualitative explanatory variable.

Each factor has two or more levels, i.e., different values of the factor.

Combinations of factor levels are called treatments.


  • character variable,
  • or a string variable

3 - Modelling a factor

A regression analysis function cannot take a categorical predictor directly. The predictor must first be converted into a numeric variable in some way. That's where dummy coding comes in.

3.1 - Two levels

Example with gender, which has two levels (male or female). We create a new variable:

<MATH> X = \left\{\begin{array}{ll} 1 & \text{ if the person is a male} \\ 0 & \text{ if the person is a female} \end{array}\right. </MATH>

Resulting model:

<MATH> Y_i = B_0 + B_1 X_i + \epsilon_i = \left\{\begin{array}{ll} B_0 + B_1 + \epsilon_i & \text{ if the ith person is male} \\ B_0 + \epsilon_i & \text{ if the ith person is female} \end{array}\right. </MATH>
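
The two-level model can be checked numerically. The sketch below is only an illustration: the `gender` labels and `y` values are made-up data, and the closed-form least squares fit for a single predictor is written out by hand. Under these assumptions, the fitted B_0 equals the mean response of the level coded 0 (female) and B_0 + B_1 equals the mean response of the level coded 1 (male).

```python
from statistics import mean

# Hypothetical data (not from the source): one label and one response per person.
gender = ["male", "female", "female", "male", "female", "male"]
y = [72.0, 60.0, 64.0, 80.0, 62.0, 76.0]

# Dummy coding: X = 1 if the person is male, 0 if female.
x = [1 if g == "male" else 0 for g in gender]

# Closed-form ordinary least squares for a single predictor.
x_bar, y_bar = mean(x), mean(y)
s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
s_xx = sum((xi - x_bar) ** 2 for xi in x)
b1 = s_xy / s_xx           # slope: difference between the two group means
b0 = y_bar - b1 * x_bar    # intercept: mean of the group coded 0

# Group means, for comparison with the fitted coefficients.
mean_female = mean([yi for xi, yi in zip(x, y) if xi == 0])
mean_male = mean([yi for xi, yi in zip(x, y) if xi == 1])

print(b0, mean_female)       # intercept vs. female mean
print(b0 + b1, mean_male)    # intercept + slope vs. male mean
```

Swapping the coding (1 for female, 0 for male) would flip the sign of B_1 and make male the reference level. The fitted group means stay the same.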

3.2 - More than two

With more than two levels, we create additional dummy variables.

For example, for a colour variable with three levels (blue, red, green), we create two dummy variables.

<MATH> \begin{array}{lll} X_1 & = & \left\{\begin{array}{ll} 1 & \text{ if the colour is blue} \\ 0 & \text{ if the colour is not blue} \end{array}\right. \\ X_2 & = & \left\{\begin{array}{ll} 1 & \text{ if the colour is red} \\ 0 & \text{ if the colour is not red} \end{array}\right. \\ \end{array} </MATH>

Then both of these variables can be used in the regression equation, in order to obtain the following model:

<MATH> Y_i = B_0 + B_1 X_{i1} + B_2 X_{i2} + \epsilon_i = \left\{\begin{array}{ll} B_0 + B_1 + \epsilon_i & \text{ if the ith colour is blue} \\ B_0 + B_2 + \epsilon_i & \text{ if the ith colour is red} \\ B_0 + \epsilon_i & \text{ if the ith colour is green} \\ \end{array}\right. </MATH>

There is always one fewer dummy variable than the number of levels. The level with no dummy variable (green in this example) is known as the baseline.
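
The coding scheme for more than two levels can be sketched in a few lines. The helper `dummy_code` and the sample `colours` list are hypothetical names introduced for illustration. The function builds one 0/1 column per non-baseline level, so a factor with k levels yields k - 1 columns and the baseline is the row of all zeros.

```python
def dummy_code(values, baseline):
    """Return (levels, rows) with one 0/1 column per non-baseline level.

    Hypothetical helper for illustration; columns follow the sorted level names.
    """
    all_levels = sorted(set(values))
    if baseline not in all_levels:
        raise ValueError(f"baseline {baseline!r} is not a level of the factor")
    # Every level except the baseline gets its own dummy variable.
    levels = [lv for lv in all_levels if lv != baseline]
    rows = [[1 if v == lv else 0 for lv in levels] for v in values]
    return levels, rows


# Made-up observations of the three-level colour factor.
colours = ["blue", "red", "green", "red", "blue"]
levels, rows = dummy_code(colours, baseline="green")

print(levels)  # names of the dummy variables (X_1, X_2)
for colour, row in zip(colours, rows):
    print(colour, row)
```

With green as the baseline, B_1 and B_2 in the model above are each read as a difference from green. Choosing a different baseline changes which differences the coefficients express, not the fitted values.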