About
Desicion Tree (DT) are supervised Classification algorithms.
They are:
- easy to interpret (due to the tree structure)
- a boolean function (If each decision is binary ie false or true)
Decision trees extract predictive information in the form of human-understandable tree-rules. Decision Tree is a algorithm useful for many classification problems that that can help explain the model’s logic using human-readable “If…. Then…” rules.
- reliable and robust algorithm.
- simple to implement.
They can:
- work on categorical attributes,
- handle many attributes, so big p smaller n cases.
Each decision in the tree can be seen as an feature.
Articles Related
Algorithm
The creation of a tree is a quest for:
- purity (only pure node: only yes or no)
- the smallest tree
At each level, choose the attribute that produces the “purest” nodes (ie choosing the attribute with the highest information gain)
Algorithm:
Overfitting
Decision Trees are prone to overfitting:
- whereas ensemble of tree are not. See random forest
- Pruning can help: remove or aggregate sub-trees that provide little discriminatory power
Decision Trees can overfit badly because of the highly complex decision boundaries it can produce; the effect is ameliorated, but rarely completely eliminated with Pruning.
Library
- package=FFTrees - Create, visualize, and test fast-and-frugal decision trees (FFTs). FFTs are very simple decision trees for binary classification problems. FFTs can be preferable to more complex algorithms because they are easy to communicate, require very little information, and are robust against overfitting.
Example
Titanic (Survive Yes or No)
if Ticket Class = "1" then
if Sex = "female" then Survive = "yes"
if Sex = "male" and age < 5 then Survive = "yes"
if Ticket Class = "1" then
if Sex = "female" then Survive = "yes"
if Sex = "male" then Survive = "no"
if Ticket Class = "3"
if Sex = "male" then Survive = "no"
if Sex = "female" then
if Age < 4 then Survive = "yes"
if Age >= 4 then Survive = "no"
Every path from the root is a rule
Type
Univariate
Single tests at the nodes
multivariate
Compound tests at the nodes