
Decision trees - Classification trees

Dr. D’Agostino McGowan

1 / 11

Classification Trees

  • Very similar to regression trees except it is used to predict a qualitative response rather than a quantitative one
  • We predict that each observation belongs to the most commonly occurring class of the training observations in a given region
2 / 11
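The majority-vote rule above can be illustrated in a few lines of base R (a toy sketch with made-up region data, not from the Heart example):

```r
# Toy data (assumed): classes of the training observations that fall in one region
region_classes <- c("Yes", "Yes", "No", "Yes")

# Predict the most commonly occurring class in that region
prediction <- names(which.max(table(region_classes)))
prediction  # "Yes"
```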

Fitting classification trees

  • We use recursive binary splitting to grow the tree
  • Instead of RSS, we can use:
  • Gini index:

$$G = \sum_{k=1}^{K} \hat{p}_{mk}(1 - \hat{p}_{mk})$$

  • This is a measure of total variance across the K classes; it will be small if all of the $\hat{p}_{mk}$ values are close to zero or one
  • The Gini index is a measure of node purity: small values indicate that a node contains predominantly observations from a single class
  • In R, this can be estimated using the gain_capture() function.
3 / 11
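As a sanity check on the formula, the Gini index for a single node can be computed directly in base R (a minimal sketch; `gini` here is a hypothetical helper, not a tidymodels function):

```r
# Gini index for one node: G = sum over k of p_mk * (1 - p_mk)
gini <- function(p) sum(p * (1 - p))

gini(c(0.5, 0.5))  # maximally impure two-class node: 0.5
gini(c(1, 0))      # pure node: 0
```

Note how the pure node attains the minimum value of zero, matching the interpretation of small values as high node purity.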

Classification tree - Heart Disease Example

  • Classifying whether 303 patients have heart disease based on 13 predictors (Age, Sex, Chol, etc.)
4 / 11

1. Split the data into a cross-validation set

heart_cv <- vfold_cv(heart, v = 5)

How many folds do I have?

5 / 11

2. Create a model specification that tunes based on complexity, α

tree_spec <- decision_tree(
  cost_complexity = tune(),
  tree_depth = 10,
  mode = "classification") %>%
  set_engine("rpart")
6 / 11

3. Fit the model on the cross-validation set

grid <- expand_grid(cost_complexity = seq(0.01, 0.05, by = 0.01))
model <- tune_grid(tree_spec,
  HD ~ Age + Sex + ChestPain + RestBP + Chol + Fbs +
    RestECG + MaxHR + ExAng + Oldpeak + Slope + Ca,
  grid = grid,
  resamples = heart_cv,
  metrics = metric_set(gain_capture, accuracy))

What αs am I trying?

7 / 11

5. Choose α that minimizes the Gini Index

best <- model %>%
  select_best(metric = "gain_capture") %>%
  pull()
8 / 11

6. Fit the final model

final_spec <- decision_tree(
  cost_complexity = best,
  tree_depth = 10,
  mode = "classification") %>%
  set_engine("rpart")

final_model <- fit(final_spec,
  HD ~ Age + Sex + ChestPain + RestBP + Chol + Fbs +
    RestECG + MaxHR + ExAng + Oldpeak + Slope + Ca + Thal,
  data = heart)
9 / 11

7. Examine how the final model does on the full sample

final_model %>%
  predict(new_data = heart) %>%
  bind_cols(heart) %>%
  conf_mat(truth = HD, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

10 / 11

Decision trees

Pros

  • simple
  • easy to interpret

Cons

  • not often competitive in terms of predictive accuracy
  • Next class we will discuss how to combine multiple trees to improve accuracy
11 / 11
