
Ridge Regression

Dr. D’Agostino McGowan

1 / 38

Linear Regression Review

In linear regression, what are we minimizing? How can I write this in matrix form?

  • RSS!

$(y - X\hat{\beta})^T(y - X\hat{\beta})$

What is the solution ($\hat{\beta}$) to this?

$\hat{\beta} = (X^TX)^{-1}X^Ty$

2 / 38

Linear Regression Review

What is X?

  • the design matrix!
3 / 38

Linear Regression Review

How did we get $(X^TX)^{-1}X^Ty$? Take the derivative of the RSS with respect to $\beta$, set it equal to zero, and solve:

$$\begin{aligned}
-2X^Ty + 2X^TX\hat{\beta} &= 0 \\
2X^TX\hat{\beta} &= 2X^Ty \\
X^TX\hat{\beta} &= X^Ty \\
(X^TX)^{-1}X^TX\hat{\beta} &= (X^TX)^{-1}X^Ty \\
\underbrace{(X^TX)^{-1}X^TX}_{I}\hat{\beta} &= (X^TX)^{-1}X^Ty \\
I\hat{\beta} &= (X^TX)^{-1}X^Ty \\
\hat{\beta} &= (X^TX)^{-1}X^Ty
\end{aligned}$$

4 / 38

Linear Regression Review

Let's try to find an X for which it would be impossible to calculate $\hat{\beta}$

5 / 38

Calculating in R

y x
4 1
3 2
1 5
3 1
5 5
6 / 38

Creating a vector in R

y x
4 1
3 2
1 5
3 1
5 5
y <- c(4, 3, 1, 3, 5)
7 / 38

Creating a Design matrix in R

y x
4 1
3 2
1 5
3 1
5 5
(X <- matrix(c(rep(1, 5),
               c(1, 2, 5, 1, 5)),
             ncol = 2))
## [,1] [,2]
## [1,] 1 1
## [2,] 1 2
## [3,] 1 5
## [4,] 1 1
## [5,] 1 5
8 / 38

Taking a transpose in R

t(X)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 1 1 1 1
## [2,] 1 2 5 1 5
9 / 38

Taking an inverse in R

XTX <- t(X) %*% X
solve(XTX)
## [,1] [,2]
## [1,] 0.6666667 -0.16666667
## [2,] -0.1666667 0.05952381
10 / 38

Put it all together

solve(t(X) %*% X) %*% t(X) %*% y
## [,1]
## [1,] 3.5000000
## [2,] -0.1071429
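
As a quick sanity check (not part of the original slides), the same estimates come from R's built-in lm(); X[, 2] is the predictor column of the design matrix above:

coef(lm(y ~ X[, 2]))  # lm() adds the intercept itself; should match 3.5 and -0.1071429 above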
11 / 38

Application Exercise

  • In R, find a design matrix X where it is not possible to calculate $\hat{\beta}$ with
solve(t(X) %*% X) %*% t(X) %*% y
12 / 38

Estimating $\hat{\beta}$

$\hat{\beta} = (X^TX)^{-1}X^Ty$

Under what circumstances is this equation not estimable?

  • when we can't invert $X^TX$
      • $p > n$
      • multicollinearity

A guaranteed way to check whether a square matrix is not invertible is to check whether the determinant is equal to zero

13 / 38

Estimating $\hat{\beta}$

$$X = \begin{bmatrix} 1 & 2 & 3 & 1 \\ 1 & 3 & 4 & 0 \end{bmatrix}$$

What is n here? What is p?

Is $X^TX$ going to be invertible?

X <- matrix(c(1, 1, 2, 3, 3, 4, 1, 0), nrow = 2)
det(t(X) %*% X)
## [1] 0
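With this X the closed form actually errors out. A quick demonstration (the c(1, 2) here is an arbitrary stand-in outcome, since the slides define no y for this X):

try(solve(t(X) %*% X) %*% t(X) %*% c(1, 2))  # solve() throws an error: t(X) %*% X is singular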
14 / 38

Estimating $\hat{\beta}$

$$X = \begin{bmatrix} 1 & 3 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$$

Is $X^TX$ going to be invertible?

X <- matrix(c(1, 1, 1, 1, 3, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
det(t(X) %*% X)
## [1] 0
cor(X[, 2], X[, 3])
## [1] 1
15 / 38

Estimating $\hat{\beta}$

$$X = \begin{bmatrix} 1 & 3 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$$

What was the problem this time?

X <- matrix(c(1, 1, 1, 1, 3, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
det(t(X) %*% X)
## [1] 0
cor(X[, 2], X[, 3])
## [1] 1
16 / 38

Estimating $\hat{\beta}$

What is a sure-fire way to tell whether $X^TX$ will be invertible?

  • Take the determinant!

$|A|$ means the determinant of matrix $A$

  • For a 2x2 matrix:

$$A = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \qquad |A| = ad - bc$$

17 / 38

Estimating $\hat{\beta}$

What is a sure-fire way to tell whether $X^TX$ will be invertible?

  • Take the determinant!

$|A|$ means the determinant of matrix $A$

  • For a 3x3 matrix:

$$A = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \qquad |A| = a(ei - fh) - b(di - fg) + c(dh - eg)$$

18 / 38

Determinants

It looks funky, but it follows a nice pattern!

$$A = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \qquad |A| = a(ei - fh) - b(di - fg) + c(dh - eg)$$

  • (1) multiply $a$ by the determinant of the portion of the matrix that is not in $a$'s row or column
  • do the same for $b$ (2) and $c$ (3)
  • put it together as plus (1) minus (2) plus (3)

$$|A| = a\begin{vmatrix} e & f \\ h & i \end{vmatrix} - b\begin{vmatrix} d & f \\ g & i \end{vmatrix} + c\begin{vmatrix} d & e \\ g & h \end{vmatrix}$$
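
A quick numeric check (not from the original slides) that this expansion matches R's det() on an arbitrary 3x3 matrix:

A <- matrix(c(2, 4, 1,
              0, 3, 5,
              1, 2, 2), nrow = 3, byrow = TRUE)
# plus (1), minus (2), plus (3): A[-1, -j] is the 2x2 minor left after deleting row 1 and column j
A[1, 1] * det(A[-1, -1]) - A[1, 2] * det(A[-1, -2]) + A[1, 3] * det(A[-1, -3])
det(A)  # same value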

19 / 38

Application Exercise

  • Calculate the determinant of the following matrices in R using the det() function:

$$A = \begin{bmatrix} 1 & 2 \\ 4 & 5 \end{bmatrix}$$

$$B = \begin{bmatrix} 1 & 2 & 3 \\ 3 & 6 & 9 \\ 2 & 5 & 7 \end{bmatrix}$$

  • Are these both invertible?
20 / 38

Estimating $\hat{\beta}$

$$X = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$$

Is $X^TX$ going to be invertible?

X <- matrix(c(1, 1, 1, 1, 3.01, 4, 5, 2, 6, 8, 10, 4), nrow = 4)
det(t(X) %*% X)
## [1] 0.0056
cor(X[, 2], X[, 3])
## [1] 0.999993
21 / 38

Estimating $\hat{\beta}$

$$X = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$$

Is $X^TX$ going to be invertible?

y <- c(1, 2, 3, 2)
solve(t(X) %*% X) %*% t(X) %*% y
## [,1]
## [1,] 1.285714
## [2,] -114.285714
## [3,] 57.285714
22 / 38

Estimating $\hat{\beta}$

$$X = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$$

Is $X^TX$ going to be invertible?

$$\begin{bmatrix} \hat{\beta}_0 \\ \hat{\beta}_1 \\ \hat{\beta}_2 \end{bmatrix} = \begin{bmatrix} 1.28 \\ -114.29 \\ 57.29 \end{bmatrix}$$

23 / 38

Estimating $\hat{\beta}$

$$X = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$$

What is the equation for the variance of $\hat{\beta}$?

$\textrm{var}(\hat{\beta}) = \hat{\sigma}^2(X^TX)^{-1}$

  • $\hat{\sigma}^2 = \frac{RSS}{n - (p + 1)}$

$$\textrm{var}(\hat{\beta}) = \begin{bmatrix} 0.91835 & -24.489 & 12.132 \\ -24.48943 & 4081.571 & -2038.745 \\ 12.13247 & -2038.745 & 1018.367 \end{bmatrix}$$
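
A quick sketch to reproduce this matrix in R, reusing the X above and the y = c(1, 2, 3, 2) from the earlier slide:

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
RSS <- sum((y - X %*% beta_hat)^2)
sigma2_hat <- RSS / (nrow(X) - ncol(X))  # n - (p + 1); X already contains the intercept column
sigma2_hat * solve(t(X) %*% X)           # var(beta_hat)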

24 / 38

Estimating $\hat{\beta}$

$$X = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$$

$$\textrm{var}(\hat{\beta}) = \begin{bmatrix} 0.91835 & -24.489 & 12.132 \\ -24.48943 & 4081.571 & -2038.745 \\ 12.13247 & -2038.745 & 1018.367 \end{bmatrix}$$

What is the variance for $\hat{\beta}_0$?

  • the first diagonal element: 0.91835

26 / 38

Estimating $\hat{\beta}$

$$X = \begin{bmatrix} 1 & 3.01 & 6 \\ 1 & 4 & 8 \\ 1 & 5 & 10 \\ 1 & 2 & 4 \end{bmatrix}$$

$$\textrm{var}(\hat{\beta}) = \begin{bmatrix} 0.91835 & -24.489 & 12.132 \\ -24.48943 & 4081.571 & -2038.745 \\ 12.13247 & -2038.745 & 1018.367 \end{bmatrix}$$

What is the variance for $\hat{\beta}_1$? 😱

  • the second diagonal element: 4081.571

28 / 38

What's the problem?

  • Sometimes we can't solve for $\hat{\beta}$

Why?

29 / 38

What's the problem?

  • Sometimes we can't solve for $\hat{\beta}$
      • $X^TX$ is not invertible
          • We have more variables than observations ($p > n$)
          • The variables are linear combinations of one another
  • Even when we can invert $X^TX$, things can go wrong
      • The variance can blow up, like we just saw!
30 / 38

What can we do about this?

31 / 38

Ridge Regression

  • What if we add an additional penalty to keep the $\hat{\beta}$ coefficients small (this will keep the variance from blowing up!)
  • Instead of minimizing RSS, like we do with linear regression, let's minimize RSS PLUS some penalty function

$$RSS + \underbrace{\lambda\sum_{j=1}^p \beta_j^2}_{\textrm{shrinkage penalty}}$$

What happens when $\lambda = 0$? What happens as $\lambda \rightarrow \infty$?

32 / 38

Ridge Regression

Let's solve for the $\hat{\beta}$ coefficients using Ridge Regression. What are we minimizing?

$(y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta$

33 / 38

Try it!

  • Find $\hat{\beta}$ that minimizes this:

$(y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta$
34 / 38

Ridge Regression

$\hat{\beta}_{\textrm{ridge}} = (X^TX + \lambda I)^{-1}X^Ty$

  • Not only does this help with the variance, it solves our problem when $X^TX$ isn't invertible!
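
Filling in the exercise above: take the derivative of $(y - X\beta)^T(y - X\beta) + \lambda\beta^T\beta$ with respect to $\beta$, set it to zero, and solve, mirroring the least-squares derivation:

$$\begin{aligned}
-2X^Ty + 2X^TX\hat{\beta} + 2\lambda\hat{\beta} &= 0 \\
(X^TX + \lambda I)\hat{\beta} &= X^Ty \\
\hat{\beta}_{\textrm{ridge}} &= (X^TX + \lambda I)^{-1}X^Ty
\end{aligned}$$

A sketch in R with the nearly-collinear X and y from before ($\lambda$ = 1 is an arbitrary illustration, not a tuned value, and this simple version penalizes the intercept too, which standard implementations avoid):

lambda <- 1
solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X) %*% y  # far tamer than 1.28, -114.29, 57.29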
35 / 38

Choosing λ

  • λ is known as a tuning parameter and is selected using cross validation
  • For example, choose the λ that results in the smallest estimated test error
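
For example, with the glmnet package (alpha = 0 gives the ridge penalty); the data below is simulated purely for illustration:

library(glmnet)
set.seed(1)
X_sim <- matrix(rnorm(100 * 5), ncol = 5)                # 100 observations, 5 predictors
y_sim <- drop(X_sim %*% c(2, -1, 0, 0, 1) + rnorm(100))  # outcome with known coefficients
cv_fit <- cv.glmnet(X_sim, y_sim, alpha = 0)             # 10-fold CV over a grid of lambdas
cv_fit$lambda.min                                        # lambda with the smallest estimated test error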
36 / 38

Bias-variance tradeoff

How do you think ridge regression fits into the bias-variance tradeoff?

  • As $\lambda$ ☝️, bias ☝️, variance 👇
  • Bias($\hat{\beta}_{\textrm{ridge}}$) $= -\lambda(X^TX + \lambda I)^{-1}\beta$

    What would this be if $\lambda$ was 0?

  • Var($\hat{\beta}_{\textrm{ridge}}$) $= \sigma^2(X^TX + \lambda I)^{-1}X^TX(X^TX + \lambda I)^{-1}$

    Is this bigger or smaller than $\sigma^2(X^TX)^{-1}$? What is this when $\lambda = 0$? As $\lambda \rightarrow \infty$ does this go up or down?
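
To make the variance claim concrete, a small sketch (assuming the nearly-collinear X from earlier and, arbitrarily, $\sigma^2 = 1$):

ridge_var <- function(X, lambda, sigma2 = 1) {
  XtX <- t(X) %*% X
  M <- solve(XtX + lambda * diag(ncol(X)))
  sigma2 * M %*% XtX %*% M  # the sandwich form above
}
diag(ridge_var(X, 0.001))  # enormous variances, essentially least squares
diag(ridge_var(X, 10))     # dramatically smaller as lambda grows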

37 / 38

Ridge Regression

  • IMPORTANT: When doing ridge regression, it is important to standardize your variables (divide by the standard deviation)

Why?
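
Because the penalty $\lambda\sum_{j=1}^p \beta_j^2$ treats every coefficient the same, the solution is not scale-equivariant: changing a predictor's units arbitrarily changes how much its coefficient is shrunk, so all predictors should be put on a common scale first. A minimal sketch in R (X_raw is a hypothetical predictor matrix, without the intercept column):

X_std <- scale(X_raw)  # scale() centers each column and divides it by its standard deviation

Packages like glmnet do this internally by default (standardize = TRUE).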

38 / 38
