class: center, middle, inverse, title-slide

# Trade-offs: Accuracy and interpretability, bias and variance

### Dr. D’Agostino McGowan

---

layout: true

<div class="my-footer">
<span>
Dr. Lucy D'Agostino McGowan <i>adapted from slides by Hastie & Tibshirani</i>
</span>
</div>

---

class: center, middle

# Classification

---

## Notation

* `\(Y\)` is the response variable. It is **qualitative**
* `\(\mathcal{C}(X)\)` is the classifier that assigns a class from `\(\mathcal{C}\)` to some future unlabeled observation, `\(X\)`

--

* Examples:
  * An email can be classified as `\(\mathcal{C} = \{\texttt{spam, not spam}\}\)`
  * A written number is one of `\(\mathcal{C} = \{0, 1, 2, \dots, 9\}\)`

---

## Classification Problem

.question[
What is the goal?
]

--

* Build a classifier `\(\mathcal{C}(X)\)` that assigns a class label from `\(\mathcal{C}\)` to a future unlabeled observation `\(X\)`
* Assess the uncertainty in each classification
* Understand the roles of the different predictors among `\(X = (X_1, X_2, \dots, X_p)\)`

---

<center>
<img src = "img/03/class1.png" height = 275></img>
</center>

Suppose there are `\(K\)` elements in `\(\mathcal{C}\)`, numbered `\(1, 2, \dots, K\)`

`$$p_k(x) = P(Y = k|X = x),\ k = 1, 2, \dots, K$$`

These are the **conditional class probabilities** at `\(x\)`

--

.question[
How do you think we could calculate this?
]

--

* In the plot, you could examine the mini-barplot at `\(x = 5\)`

---

<center>
<img src = "img/03/class1.png" height = 275></img>
</center>

Suppose there are `\(K\)` elements in `\(\mathcal{C}\)`, numbered `\(1, 2, \dots, K\)`

`$$p_k(x) = P(Y = k|X = x),\ k = 1, 2, \dots, K$$`

These are the **conditional class probabilities** at `\(x\)`

* The **Bayes optimal classifier** at `\(x\)` is

`$$\mathcal{C}(x) = j \textrm{ if } p_j(x) = \textrm{max}\{p_1(x), p_2(x), \dots, p_K(x)\}$$`

???

* Notice that this probability is a **conditional** probability
* It is the probability that `\(Y\)` equals `\(k\)` given the observed predictor vector, `\(x\)`
* Say we were using a Bayes classifier for a two-class problem, where `\(Y\)` is 1 or 2. We would predict class 1 if `\(P(Y = 1|X = x_0) > 0.5\)` and class 2 otherwise

---

<center>
<img src = "img/03/class2.png" height = 275></img>
</center>

.question[
What if this were our data and there were no points at exactly `\(x = 5\)`? How could we calculate this then?
]

--

* Nearest-neighbor averaging, like before!

--

* This does break down as the number of dimensions grows, but the impact on `\(\mathcal{\hat{C}}(x)\)` is less than on `\(\hat{p}_k(x),\ k = 1, 2, \dots, K\)`

---

## Accuracy

* Misclassification error rate

`$$Err_{\texttt{test}} = \textrm{Ave}_{i\in\texttt{test}}I[y_i\neq \mathcal{\hat{C}}(x_i)]$$`

--

* The **Bayes classifier**, which uses the true `\(p_k(x)\)`, has the smallest error

--

* Some of the methods we will learn build structured models for `\(\mathcal{C}(x)\)` (support vector machines, for example)

--

* Some build structured models for `\(p_k(x)\)` (logistic regression, for example)

???

* The test error rate `\(\textrm{Ave}_{i\in\texttt{test}}I[y_i\neq \mathcal{\hat{C}}(x_i)]\)` is minimized, on average, by a very simple classifier that assigns each observation to its most likely class, given its predictor values (that's the Bayes classifier)
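---

## Sketch: classifying from estimated probabilities

A minimal R sketch of the Bayes rule above, assuming we already have estimated class probabilities; the matrix `p_hat` is a hypothetical name, not from the slides.

```r
# `p_hat` is a hypothetical n x K matrix where p_hat[i, k]
# estimates p_k(x_i), the conditional class probability
p_hat <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.3, 0.6),
  nrow = 2, byrow = TRUE
)

# Assign each observation to the class j with the maximum p_j(x)
C_hat <- apply(p_hat, 1, which.max)
C_hat
#> [1] 1 3
```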
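---

## Sketch: misclassification error rate

A minimal R sketch of `\(Err_{\texttt{test}}\)` as defined above; `y_test` and `pred_test` are hypothetical label vectors, not from the slides.

```r
# True and predicted test-set labels (made-up values for illustration)
y_test    <- c("spam", "spam", "not spam", "not spam")
pred_test <- c("spam", "not spam", "not spam", "not spam")

# Ave_{i in test} I[y_i != C_hat(x_i)]: the mean of the 0/1 indicators
err_test <- mean(y_test != pred_test)
err_test
#> [1] 0.25
```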
---

## K-Nearest-Neighbors example

<img src = "img/03/knn1.png" height = 450></img>

???

* Here is a simulated dataset of 100 observations in two groups, blue and orange
* The purple dashed line represents the Bayes decision boundary
* The orange background grid indicates the region where test observations will be classified as orange, and the blue grid where they will be classified as blue
* We'd love to be able to use the Bayes classifier, but for real data we don't know the conditional distribution of `\(Y\)` given `\(X\)`, so computing the Bayes classifier is impossible
* A lot of methods try to estimate the conditional distribution of `\(Y\)` given `\(X\)` and then classify a given observation to the class with the highest **estimated** probability
* One method that does this is K-nearest neighbors

---

## KNN (K = 10)

<img src = "img/03/knn2.png" height = 450></img>

???

* Again, the way KNN works: with K = 10, it finds the 10 closest observations, calculates the probability of the point being orange or blue, and classifies it accordingly
* So here is an example of K-nearest neighbors where K is 10

---

## KNN

![](img/03/knn3.png)

???

* Because this dataset has 100 data points, K can range from 1 to 100. At K = 1, the error rate in the TRAINING data will be 0, but the test error rate may be really high, so we are trying to find the happy medium. The test error is going to have that same U-shaped relationship; you want to find the bottom of that U

---

## Trade-offs

![](img/03/errors.png)
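---

## Sketch: KNN and the choice of K

A sketch of fitting KNN with `class::knn()` and tracing test error across K; the simulated data and variable names here are illustrative assumptions, not the dataset from the figures.

```r
library(class)

# Simulate a small two-class dataset (an assumption for illustration)
set.seed(1)
n <- 100
X <- matrix(rnorm(2 * n), ncol = 2)
y <- factor(ifelse(X[, 1] + X[, 2] + rnorm(n) > 0, "orange", "blue"))
train <- sample(n, 50)

# Test error for each K: small K overfits, large K underfits
ks  <- 1:50
err <- sapply(ks, function(k) {
  pred <- knn(X[train, ], X[-train, ], y[train], k = k)
  mean(pred != y[-train])
})
ks[which.min(err)]  # K at the bottom of the U-shaped test error curve
```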