
Ridge Regression

Dr. D’Agostino McGowan

1 / 38

Linear Regression Review

In linear regression, what are we minimizing? How can I write this in matrix form?

  • RSS!

$(y - X\hat{\beta})^T(y - X\hat{\beta})$

What is the solution ($\hat{\beta}$) to this?

$\hat{\beta} = (X^TX)^{-1}X^Ty$

2 / 38

Linear Regression Review

What is X?

  • the design matrix!
3 / 38

Linear Regression Review

How did we get $(X^TX)^{-1}X^Ty$? Take the derivative of the RSS with respect to $\beta$, set it equal to zero, and solve:

$$\begin{aligned}
-2X^Ty + 2X^TX\hat{\beta} &= 0 \\
2X^TX\hat{\beta} &= 2X^Ty \\
X^TX\hat{\beta} &= X^Ty \\
(X^TX)^{-1}X^TX\hat{\beta} &= (X^TX)^{-1}X^Ty \\
\underbrace{(X^TX)^{-1}X^TX}_{I}\hat{\beta} &= (X^TX)^{-1}X^Ty \\
I\hat{\beta} &= (X^TX)^{-1}X^Ty \\
\hat{\beta} &= (X^TX)^{-1}X^Ty
\end{aligned}$$

4 / 38

Linear Regression Review

Let's try to find an X for which it would be impossible to calculate $\hat{\beta}$

5 / 38

Calculating in R

y x
4 1
3 2
1 5
3 1
5 5
6 / 38

Creating a vector in R

y x
4 1
3 2
1 5
3 1
5 5
y <- c(4, 3, 1, 3, 5)
7 / 38

Creating a Design matrix in R

y x
4 1
3 2
1 5
3 1
5 5
(X <- matrix(c(rep(1, 5),
               c(1, 2, 5, 1, 5)),
             ncol = 2))
## [,1] [,2]
## [1,] 1 1
## [2,] 1 2
## [3,] 1 5
## [4,] 1 1
## [5,] 1 5
8 / 38

Taking a transpose in R

t(X)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 1 1 1 1
## [2,] 1 2 5 1 5
9 / 38

Taking an inverse in R

XTX <- t(X) %*% X
solve(XTX)
## [,1] [,2]
## [1,] 0.6666667 -0.16666667
## [2,] -0.1666667 0.05952381
10 / 38

Put it all together

solve(t(X) %*% X) %*% t(X) %*% y
## [,1]
## [1,] 3.5000000
## [2,] -0.1071429
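
As a quick sanity check (not part of the original slides), the same estimates come from R's built-in lm(); X[, 2] is the predictor column of the design matrix above:

coef(lm(y ~ X[, 2]))  # lm() adds the intercept itself; should match 3.5 and -0.1071429 above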
11 / 38

Application Exercise

  • In R, find a design matrix X where it is not possible to calculate $\hat{\beta}$ with
solve(t(X) %*% X) %*% t(X) %*% y
12 / 38

Estimating $\hat{\beta}$

$\hat{\beta} = (X^TX)^{-1}X^Ty$

Under what circumstances is this equation not estimable?

  • when we can't invert $X^TX$
      • $p > n$
      • multicollinearity

A guaranteed way to check whether a square matrix is not invertible is to check whether the determinant is equal to zero

13 / 38

Estimating $\hat{\beta}$

$$X = \begin{bmatrix} 1 & 2 & 3 & 1 \\ 1 & 3 & 4 & 0 \end{bmatrix}$$

What is n here? What is p?

Is $X^TX$ going to be invertible?

X <- matrix(c(1, 1, 2, 3, 3, 4, 1, 0), nrow = 2)
det(t(X) %*% X)
## [1] 0
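With this X the closed form actually errors out. A quick demonstration (the c(1, 2) here is an arbitrary stand-in outcome, since the slides define no y for this X):

try(solve(t(X) %*% X) %*% t(X) %*% c(1, 2))  # solve() throws an error: t(X) %*% X is singular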
14 / 38

Estimating $\hat{\beta}$

$$X = \begin{bmatrix} 1 & 3 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$$

Is $X^TX$ going to be invertible?

X <- matrix(c(1, 1, 1, 1, 3, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
det(t(X) %*% X)
## [1] 0
cor(X[, 2], X[, 3])
## [1] 1
15 / 38

Estimating $\hat{\beta}$

$$X = \begin{bmatrix} 1 & 3 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$$

What was the problem this time?

X <- matrix(c(1, 1, 1, 1, 3, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
det(t(X) %*% X)
## [1] 0
cor(X[, 2], X[, 3])
## [1] 1
16 / 38

Estimating $\hat{\beta}$

What is a sure-fire way to tell whether $X^TX$ will be invertible?

  • Take the determinant!

$|A|$ means the determinant of matrix $A$

  • For a 2x2 matrix:

$$A = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \qquad |A| = ad - bc$$

17 / 38

Estimating $\hat{\beta}$

What is a sure-fire way to tell whether $X^TX$ will be invertible?

  • Take the determinant!

$|A|$ means the determinant of matrix $A$

  • For a 3x3 matrix:

$$A = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \qquad |A| = a(ei - fh) - b(di - fg) + c(dh - eg)$$

18 / 38

Determinants

It looks funky, but it follows a nice pattern!

$$A = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \qquad |A| = a(ei - fh) - b(di - fg) + c(dh - eg)$$

  • (1) multiply $a$ by the determinant of the portion of the matrix that is not in $a$'s row or column
  • do the same for $b$ (2) and $c$ (3)
  • put it together as plus (1) minus (2) plus (3)

$$|A| = a\begin{vmatrix} e & f \\ h & i \end{vmatrix} - b\begin{vmatrix} d & f \\ g & i \end{vmatrix} + c\begin{vmatrix} d & e \\ g & h \end{vmatrix}$$
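
A quick numeric check (not from the original slides) that this expansion matches R's det() on an arbitrary 3x3 matrix:

A <- matrix(c(2, 4, 1,
              0, 3, 5,
              1, 2, 2), nrow = 3, byrow = TRUE)
# plus (1), minus (2), plus (3): A[-1, -j] is the 2x2 minor left after deleting row 1 and column j
A[1, 1] * det(A[-1, -1]) - A[1, 2] * det(A[-1, -2]) + A[1, 3] * det(A[-1, -3])
det(A)  # same value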

19 / 38

Application Exercise

  • Calculate the determinant of the following matrices in R using the det() function:

$$A = \begin{bmatrix} 1 & 2 \\ 4 & 5 \end{bmatrix}$$

$$B = \begin{bmatrix} 1 & 2 & 3 \\ 3 & 6 & 9 \\ 2 & 5 & 7 \end{bmatrix}$$

  • Are these both invertible?
20 / 38

Estimating $\hat{\beta}$

$$X = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$$

Is $X^TX$ going to be invertible?

X <- matrix(c(1, 1, 1, 1, 3.01, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
det(t(X) %*% X)
## [1] 0.0056
cor(X[, 2], X[, 3])
## [1] 0.999993
21 / 38

Estimating $\hat{\beta}$

$$X = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$$

Is $X^TX$ going to be invertible?

y <- c(1, 2, 3, 2)
solve(t(X) %*% X) %*% t(X) %*% y
## [,1]
## [1,] 1.285714
## [2,] -114.285714
## [3,] 57.285714
22 / 38

Estimating $\hat{\beta}$

$$X = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$$

Is $X^TX$ going to be invertible?

$$\begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix} = \begin{bmatrix} 1.28 \\ -114.29 \\ 57.29 \end{bmatrix}$$

23 / 38

Estimating $\hat{\beta}$

$$X = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$$

What is the equation for the variance of $\hat{\beta}$?

$\textrm{var}(\hat{\beta}) = \hat{\sigma}^2(X^TX)^{-1}$

  • $\hat{\sigma}^2 = \frac{RSS}{n - (p + 1)}$

$$\textrm{var}(\hat{\beta}) = \begin{bmatrix} 0.91835 & -24.489 & 12.132 \\ -24.48943 & 4081.571 & -2038.745 \\ 12.13247 & -2038.745 & 1018.367 \end{bmatrix}$$
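
A quick sketch to reproduce this matrix in R, reusing the X above and the y = c(1, 2, 3, 2) from the earlier slide:

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
RSS <- sum((y - X %*% beta_hat)^2)
sigma2_hat <- RSS / (nrow(X) - ncol(X))  # n - (p + 1); X already contains the intercept column
sigma2_hat * solve(t(X) %*% X)           # var(beta_hat)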

24 / 38

Estimating $\hat{\beta}$

$$X = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$$

$$\textrm{var}(\hat{\beta}) = \begin{bmatrix} 0.91835 & -24.489 & 12.132 \\ -24.48943 & 4081.571 & -2038.745 \\ 12.13247 & -2038.745 & 1018.367 \end{bmatrix}$$

What is the variance for $\hat{\beta}_0$?

  • the first diagonal element: 0.91835

26 / 38

Estimating $\hat{\beta}$

$$X = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$$

$$\textrm{var}(\hat{\beta}) = \begin{bmatrix} 0.91835 & -24.489 & 12.132 \\ -24.48943 & 4081.571 & -2038.745 \\ 12.13247 & -2038.745 & 1018.367 \end{bmatrix}$$

What is the variance for $\hat{\beta}_1$? 😱

  • the second diagonal element: 4081.571

28 / 38

What's the problem?

  • Sometimes we can't solve for $\hat{\beta}$

Why?

29 / 38

What's the problem?

  • Sometimes we can't solve for $\hat{\beta}$
      • $X^TX$ is not invertible
          • We have more variables than observations ($p > n$)
          • The variables are linear combinations of one another
  • Even when we can invert $X^TX$, things can go wrong
      • The variance can blow up, like we just saw!
30 / 38

What can we do about this?

31 / 38

Ridge Regression

  • What if we add an additional penalty to keep the $\hat{\beta}$ coefficients small (this will keep the variance from blowing up!)
  • Instead of minimizing RSS, like we do with linear regression, let's minimize RSS PLUS some penalty function

$$RSS + \underbrace{\lambda\sum_{j=1}^p \beta_j^2}_{\textrm{shrinkage penalty}}$$

What happens when $\lambda = 0$? What happens as $\lambda \rightarrow \infty$?

32 / 38

Ridge Regression

Let's solve for the $\hat{\beta}$ coefficients using Ridge Regression. What are we minimizing?

$(y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta$

33 / 38

Try it!

  • Find $\hat{\beta}$ that minimizes this:

$(y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta$
34 / 38

Ridge Regression

$\hat{\beta}_{\textrm{ridge}} = (X^TX + \lambda I)^{-1}X^Ty$

  • Not only does this help with the variance, it solves our problem when $X^TX$ isn't invertible!
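
Filling in the exercise above: take the derivative of $(y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta$ with respect to $\beta$, set it to zero, and solve, mirroring the least-squares derivation:

$$\begin{aligned}
-2X^Ty + 2X^TX\hat{\beta} + 2\lambda\hat{\beta} &= 0 \\
(X^TX + \lambda I)\hat{\beta} &= X^Ty \\
\hat{\beta}_{\textrm{ridge}} &= (X^TX + \lambda I)^{-1}X^Ty
\end{aligned}$$

A sketch in R with the nearly-collinear X and y from before ($\lambda$ = 1 is an arbitrary illustration, not a tuned value, and this simple version penalizes the intercept too, which standard implementations avoid):

lambda <- 1
solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X) %*% y  # far tamer than 1.28, -114.29, 57.29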
35 / 38

Choosing λ

  • λ is known as a tuning parameter and is selected using cross validation
  • For example, choose the λ that results in the smallest estimated test error
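
For example, with the glmnet package (alpha = 0 gives the ridge penalty); the data below is simulated purely for illustration:

library(glmnet)
set.seed(1)
X_sim <- matrix(rnorm(100 * 5), ncol = 5)                # 100 observations, 5 predictors
y_sim <- drop(X_sim %*% c(2, -1, 0, 0, 1) + rnorm(100))  # outcome with known coefficients
cv_fit <- cv.glmnet(X_sim, y_sim, alpha = 0)             # 10-fold CV over a grid of lambdas
cv_fit$lambda.min                                        # lambda with the smallest estimated test error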
36 / 38

Bias-variance tradeoff

How do you think ridge regression fits into the bias-variance tradeoff?

  • As $\lambda$ ☝️, bias ☝️, variance 👇
  • Bias($\hat{\beta}_{\textrm{ridge}}$) $= -\lambda(X^TX + \lambda I)^{-1}\beta$

    What would this be if $\lambda$ was 0?

  • Var($\hat{\beta}_{\textrm{ridge}}$) $= \sigma^2(X^TX + \lambda I)^{-1}X^TX(X^TX + \lambda I)^{-1}$

    Is this bigger or smaller than $\sigma^2(X^TX)^{-1}$? What is this when $\lambda = 0$? As $\lambda \rightarrow \infty$ does this go up or down?
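
To make the variance claim concrete, a small sketch (assuming the nearly-collinear X from earlier and, arbitrarily, $\sigma^2 = 1$):

ridge_var <- function(X, lambda, sigma2 = 1) {
  XtX <- t(X) %*% X
  M <- solve(XtX + lambda * diag(ncol(X)))
  sigma2 * M %*% XtX %*% M  # the sandwich form above
}
diag(ridge_var(X, 0.001))  # enormous variances, essentially least squares
diag(ridge_var(X, 10))     # dramatically smaller as lambda grows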

37 / 38

Ridge Regression

  • IMPORTANT: When doing ridge regression, it is important to standardize your variables (divide by the standard deviation)

Why?
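
Because the penalty $\lambda\sum_{j=1}^p \beta_j^2$ treats every coefficient the same, the solution is not scale-equivariant: changing a predictor's units arbitrarily changes how much its coefficient is shrunk, so all predictors should be put on a common scale first. A minimal sketch in R (X_raw is a hypothetical predictor matrix, without the intercept column):

X_std <- scale(X_raw)  # scale() centers each column and divides it by its standard deviation

Packages like glmnet do this internally by default (standardize = TRUE).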

38 / 38
