Data Mining - Decision Tree (DT) Algorithm

1 - About

Decision trees (DT) are supervised classification algorithms.

They are:

  • easy to interpret (due to the tree structure),
  • reliable and robust,
  • simple to implement.

Decision trees extract predictive information in the form of human-understandable tree rules. They are useful for many classification problems and help explain the model's logic using human-readable "If… Then…" rules.

They can:

  • work on categorical attributes,
  • handle many attributes, including cases with more attributes (p) than observations (n), i.e. big p, small n.

Each decision in the tree can be seen as a feature.

3 - Algorithm

The creation of a tree is a quest for:

  • purity (only pure nodes: every instance in a node has the same class, only yes or only no)
  • the smallest tree

At each level, choose the attribute that produces the "purest" nodes (i.e. the attribute with the highest information gain).

Algorithm:
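
A minimal sketch of the attribute choice described above, assuming Shannon entropy as the purity measure. The function names (entropy, information_gain, best_attribute) and the four toy rows are invented for illustration and are not part of the original page.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels; 0 for a pure node."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Entropy of the target before the split minus the weighted entropy after it."""
    before = entropy([row[target] for row in rows])
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    after = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return before - after


def best_attribute(rows, attributes, target):
    """Greedy step: pick the attribute whose split gives the purest child nodes."""
    return max(attributes, key=lambda a: information_gain(rows, a, target))


# Hypothetical toy data, invented for illustration only.
rows = [
    {"sex": "female", "class": "1", "survive": "yes"},
    {"sex": "female", "class": "3", "survive": "yes"},
    {"sex": "male", "class": "1", "survive": "no"},
    {"sex": "male", "class": "3", "survive": "no"},
]

print(best_attribute(rows, ["class", "sex"], "survive"))
```

In this toy data, splitting on sex yields two pure nodes, while splitting on class leaves both nodes evenly mixed, so the greedy step selects sex. A full tree builder would repeat this step recursively on each child node until the nodes are pure or no attributes remain.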

4 - Overfitting

Decision Trees are prone to overfitting:

  • Ensembles of trees are much less so. See random forest.
  • Pruning can help: remove or aggregate sub-trees that provide little discriminatory power

Decision trees can overfit badly because of the highly complex decision boundaries they can produce; the effect is ameliorated, but rarely completely eliminated, by pruning.
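
One way to make pruning concrete is reduced-error pruning, sketched below under stated assumptions: the tree is a nested dict with hypothetical keys "attribute", "children" and "majority", leaves are plain labels, and a separate validation set is available. This is one pruning technique among several, not necessarily the one the page has in mind.

```python
def predict(node, row):
    """Follow the tests down the tree until a leaf label is reached."""
    while isinstance(node, dict):
        # Unseen attribute values fall back to the node's majority class.
        node = node["children"].get(row[node["attribute"]], node["majority"])
    return node


def prune(node, rows, target):
    """Bottom-up reduced-error pruning against validation rows.

    A subtree is replaced by its majority-class leaf when the leaf classifies
    the validation rows reaching it at least as well as the subtree does.
    """
    if not isinstance(node, dict) or not rows:
        return node
    for value, child in list(node["children"].items()):
        subset = [r for r in rows if r[node["attribute"]] == value]
        node["children"][value] = prune(child, subset, target)
    leaf_hits = sum(r[target] == node["majority"] for r in rows)
    tree_hits = sum(predict(node, r) == r[target] for r in rows)
    return node["majority"] if leaf_hits >= tree_hits else node


# Hypothetical over-grown tree: the split on "cabin_side" only fits noise.
tree = {
    "attribute": "sex",
    "majority": "no",
    "children": {
        "female": "yes",
        "male": {
            "attribute": "cabin_side",
            "majority": "no",
            "children": {"port": "yes", "starboard": "no"},
        },
    },
}

# Hypothetical validation rows, invented for illustration only.
validation = [
    {"sex": "male", "cabin_side": "port", "survive": "no"},
    {"sex": "male", "cabin_side": "starboard", "survive": "no"},
    {"sex": "female", "cabin_side": "port", "survive": "yes"},
]

pruned = prune(tree, validation, "survive")
print(pruned["children"])
```

Here the noisy sub-tree under male is collapsed into a single "no" leaf, while the split on sex is kept because it still separates the validation rows better than a single leaf would.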

5 - Library

  • FFTrees - Create, visualize, and test fast-and-frugal decision trees (FFTs). FFTs are very simple decision trees for binary classification problems. FFTs can be preferable to more complex algorithms because they are easy to communicate, require very little information, and are robust against overfitting.

6 - Example

6.1 - Titanic (Survive Yes or No)

Titanic Data Set

if Ticket Class = "1" then
   if Sex = "female" then Survive = "yes"
   if Sex = "male" and Age < 5 then Survive = "yes"
if Ticket Class = "2" then
   if Sex = "female" then Survive = "yes"
   if Sex = "male" then Survive = "no"
if Ticket Class = "3" then
   if Sex = "male" then Survive = "no"
   if Sex = "female" then
      if Age < 4  then Survive = "yes"
      if Age >= 4 then Survive = "no"

Every path from the root to a leaf is a rule.
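
The path-to-rule reading can be sketched as a short tree walk. The nested-dict representation and the function name rules are assumptions made for illustration; the fragment encodes only the third-class branch of the example above.

```python
def rules(node, conditions=()):
    """Yield one (conditions, outcome) pair per root-to-leaf path."""
    if not isinstance(node, dict):
        # A leaf: the tests collected on the way down form one rule.
        yield conditions, node
        return
    for test, child in node.items():
        yield from rules(child, conditions + (test,))


# Assumed representation: keys are tests, values are sub-trees or leaf labels.
tree = {
    'Ticket Class = "3"': {
        'Sex = "male"': "no",
        'Sex = "female"': {
            "Age < 4": "yes",
            "Age >= 4": "no",
        },
    },
}

for conditions, outcome in rules(tree):
    print("if " + " and ".join(conditions) + f' then Survive = "{outcome}"')
```

Each printed line joins the tests along one path with "and", so a tree with three leaves produces exactly three rules.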

7 - Type

7.1 - Univariate

Each node tests a single attribute.

7.2 - Multivariate

Each node applies a compound test that combines several attributes.
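
The difference between the two node types can be sketched as follows. This assumes that a compound test means a weighted (linear) combination of attributes, which is one common reading; the attribute names, weights and thresholds are hypothetical.

```python
def univariate_test(row):
    """Univariate node: compares a single attribute with a threshold."""
    return row["age"] < 5


def multivariate_test(row, weights, threshold):
    """Multivariate node: compares a weighted sum of several attributes
    with a threshold (an oblique split)."""
    return sum(w * row[name] for name, w in weights.items()) < threshold


# Hypothetical passenger, invented for illustration only.
row = {"age": 4, "fare": 30.0}

print(univariate_test(row))
print(multivariate_test(row, {"age": 1.0, "fare": 0.1}, 10.0))
```

A univariate test splits the attribute space with a boundary parallel to one axis, whereas the weighted sum allows a slanted boundary at the cost of a test that is harder to read as a rule.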

data_mining/decision_tree.txt · Last modified: 2018/06/05 10:26 by 162.158.63.206