Ellis Valentiner
April 9, 2015
Classification trees are a method commonly used in data mining. They are popular because they are easy to interpret and reflect how (some) decisions are made.
The basic idea is to (recursively) divide the feature space into regions and fit a simple ‘model’ for each region.
Details:
We can estimate \(\hat{p}_{mk}\), the probability that the class label is \(k\) given that the feature vector lies in region \(m\).
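To make \(\hat{p}_{mk}\) concrete: it is simply the class proportion among the training observations that fall in region \(m\). As an illustrative sketch (the region X1 > 0 is a made-up example, using the train_data set summarized later in this post):

# Hypothetical region m: observations with X1 > 0 (an arbitrary example split)
in_region <- train_data$X1 > 0
# Estimated p_mk: the proportion of each class within that region
prop.table(table(train_data$Group[in_region]))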
How do we choose the best splits?
We want splits that best separate our class labels.
How large do we grow the tree (i.e., when do we stop splitting)?
We want the number of observations in each terminal node to be reasonably large; otherwise we are just modeling noise.
Stopping rules:
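Typical stopping rules require a minimum number of observations before a node may be split, a minimum size for the resulting child nodes, and a minimum reduction in deviance. In the tree package these are set through tree.control(); the values below are the package defaults, written out explicitly for illustration using the train_data fit that appears later in the post:

library(tree)
# Stopping parameters (these are the tree package defaults, shown explicitly):
#   mincut  - minimum observations allowed in either child of a split
#   minsize - smallest node that may still be split
#   mindev  - minimum within-node deviance, as a fraction of the root deviance,
#             for a node to be considered for splitting
ctrl <- tree.control(nobs = nrow(train_data), mincut = 5, minsize = 10, mindev = 0.01)
stopped_fit <- tree(Group ~ X1 + X2, data = train_data, control = ctrl)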
There are several different methods for choosing the best split.
Classification error rate: the proportion of observations in the region that do not belong to the most common class.
\[\text{Error} = 1 - \max_{k}(\hat{p}_{mk})\]
Gini index: measures the total variance across the \(K\) classes.
Small values indicate that a node contains predominantly observations from a single class (i.e. little variation).
The Gini criterion attempts to separate classes by focusing on one class at a time. It will always favor working on the largest or, if you use costs or weights, the most “important” class in a node.
\[\text{Gini} = \sum_{k = 1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk})\]
Cross-entropy (deviance): a similar measure of node purity; as with the Gini index, small values indicate that a node contains predominantly observations from a single class.
Instead of separating one class at a time, entropy tends to separate groups of classes, each accounting for roughly 50% of the observations.
This results in more evenly balanced trees.
\[\text{Deviance} = -\sum_{k = 1}^{K} \hat{p}_{mk} \log \hat{p}_{mk}\]
In practice the choice rarely matters; different split criteria usually give similar performance.
The tree package defaults to deviance; random forests use the Gini index.
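To make the three criteria concrete, here is a small sketch (the function names are mine, not part of any package) that evaluates each one for a node's vector of class proportions:

# Impurity measures for a vector of class proportions p (non-negative, sums to 1)
misclass_error <- function(p) 1 - max(p)
gini_index     <- function(p) sum(p * (1 - p))
cross_entropy  <- function(p) { p <- p[p > 0]; -sum(p * log(p)) }

# Compare a nearly pure node with an evenly mixed one
pure  <- c(0.95, 0.05)
mixed <- c(0.50, 0.50)
sapply(list(pure = pure, mixed = mixed),
       function(p) c(error    = misclass_error(p),
                     gini     = gini_index(p),
                     deviance = cross_entropy(p)))

All three measures are smallest for the nearly pure node; the Gini index and cross-entropy reward proportions near 0 or 1 more smoothly than the error rate does.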
# Summary of dataset
summary(train_data)
##     Group          X1                 X2
##  Orange:297   Min.   :-0.9804   Min.   :-0.96966
##  Blue  :203   1st Qu.:-0.3941   1st Qu.:-0.34460
##               Median : 0.1148   Median : 0.09573
##               Mean   : 0.1261   Mean   : 0.07507
##               3rd Qu.: 0.6584   3rd Qu.: 0.52230
##               Max.   : 1.1442   Max.   : 1.11792
# Print 6 rows in random order
head(train_data[sample(nrow(train_data)),])
##        Group          X1         X2
## 110     Blue  0.35868128 -0.1638366
## 45    Orange -0.34901851  0.1246819
## 66    Orange  0.90250625  0.5254130
## 438   Orange  0.05070714  0.6575863
## 324   Orange  0.76363161  0.1595582
## 14    Orange  0.81636073  0.2001557
# Load the `tree` package
library(tree)
# Fit the classification tree
tree_fit <- tree(Group ~ X1 + X2, data = train_data)
# Model summary
summary(tree_fit)
##
## Classification tree:
## tree(formula = Group ~ X1 + X2, data = train_data)
## Number of terminal nodes: 17
## Residual mean deviance: 0.242 = 116.9 / 483
## Misclassification error rate: 0.042 = 21 / 500
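The 21 training errors reported by summary() can also be inspected directly. A quick check (on the training data, so this is a resubstitution estimate, not out-of-sample performance):

# Predicted class for each training observation
pred <- predict(tree_fit, newdata = train_data, type = "class")
# Confusion matrix: rows are observed labels, columns are predictions
table(Observed = train_data$Group, Predicted = pred)
# Training misclassification rate (should agree with the summary above)
mean(pred != train_data$Group)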
# Plot the dendrogram
plot(tree_fit, type = 'uniform')
# Add split labels
text(tree_fit)
# Plot the dendrogram
plot(tree_fit, type = 'uniform')
# Add split labels, labeling all nodes
text(tree_fit, all = TRUE)
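Since this tree uses only two predictors, the regions it carves out of the feature space can be drawn directly with partition.tree() from the tree package. The plotting choices below (colours, point style) are my own, not taken from the original figures:

# Scatter plot of the training data, coloured by class
plot(train_data$X1, train_data$X2,
     col = ifelse(train_data$Group == "Orange", "orange", "blue"),
     pch = 19, xlab = "X1", ylab = "X2")
# Overlay the rectangular regions implied by the fitted tree
partition.tree(tree_fit, add = TRUE)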