Data Mining - Decision Tree (DT) Algorithm

1 - About

Decision Trees (DT) are supervised classification algorithms.

They are:

  • easy to interpret (due to the tree structure),
  • reliable and robust,
  • simple to implement.

Decision trees extract predictive information in the form of human-understandable tree rules. A decision tree is useful for many classification problems because it can explain the model's logic with human-readable "If… then…" rules.

They can:

  • work on categorical attributes,
  • handle many attributes, including cases with many more attributes than observations (large p, small n).

Each decision in the tree can be seen as a test on a feature.

3 - Algorithm

The creation of a tree is a quest for:

  • purity (leaves should be pure: only "yes" or only "no"),
  • the smallest tree

At each level, choose the attribute that produces the "purest" child nodes (i.e. the attribute with the highest information gain).

Algorithm:
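A minimal sketch of the attribute-selection step, using entropy-based information gain. The function names and the dict-per-row data layout are illustrative assumptions, not from the original page:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting on one categorical attribute."""
    n = len(labels)
    # Partition the labels by the attribute's value in each row
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(part) / n * entropy(part) for part in partitions.values())
    return entropy(labels) - remainder

def best_attribute(rows, labels, attributes):
    """Pick the attribute with the highest information gain (the 'purest' split)."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))
```

In ID3-style tree building, `best_attribute` is applied recursively to each partition until the nodes are pure or the attributes are exhausted.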

4 - Overfitting

Decision Trees are prone to overfitting:

  • whereas ensembles of trees are much less so; see random forest.
  • Pruning can help: remove or aggregate sub-trees that provide little discriminatory power

Decision Trees can overfit badly because of the highly complex decision boundaries they can produce; pruning ameliorates, but rarely completely eliminates, the effect.
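A toy sketch of one simple pruning rule: collapse any sub-tree whose children are all leaves predicting the same class, since such a test provides no discriminatory power. The nested-dict tree encoding is an assumption made for illustration, not a standard API:

```python
def prune(node):
    """Recursively collapse sub-trees that provide no discriminatory power.

    A node is either a leaf (a class-label string) or a dict of the form
    {"attribute": ..., "branches": {value: child, ...}}.
    """
    if not isinstance(node, dict):
        return node  # already a leaf
    # Prune bottom-up: simplify the children first
    node["branches"] = {v: prune(child) for v, child in node["branches"].items()}
    children = list(node["branches"].values())
    # If every branch is a leaf with the same prediction, the test is useless
    if all(not isinstance(c, dict) for c in children) and len(set(children)) == 1:
        return children[0]
    return node
```

Real implementations prune on a validation-set criterion rather than exact agreement, but the bottom-up traversal is the same.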

5 - Example

5.1 - Titanic (Survive Yes or No)

Titanic Data Set

if Ticket Class = "1" then
   if Sex = "female" then Survive = "yes"
   if Sex = "male" and age < 5 then Survive = "yes"
if Ticket Class = "2" then
   if Sex = "female" then Survive = "yes"
   if Sex = "male" then Survive = "no"
if Ticket Class = "3" then
   if Sex = "male" then Survive = "no"
   if Sex = "female" then 
      if Age < 4  then Survive = "yes"
      if Age >= 4 then Survive = "no"

Every path from the root to a leaf is a rule.
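The rule set above can be sketched as a plain function, assuming the three ticket classes are 1, 2 and 3; cases the rules do not cover return None:

```python
def predict_survival(ticket_class, sex, age):
    """Apply the example tree's rules; returns "yes", "no", or None if no rule fires."""
    if ticket_class == 1:
        if sex == "female":
            return "yes"
        if sex == "male" and age < 5:
            return "yes"
    if ticket_class == 2:
        return "yes" if sex == "female" else "no"
    if ticket_class == 3:
        if sex == "male":
            return "no"
        return "yes" if age < 4 else "no"
    return None
```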

6 - Type

6.1 - Univariate

A single attribute is tested at each node.

6.2 - Multivariate

Compound tests at the nodes

7 - Documentation / Reference

data_mining/decision_tree.txt · Last modified: 2016/06/04 11:56 by gerardnico