Let's look at a sample of 116 sparrows from Kent Island. We are interested in the relationship between Weight and WingLength.
How can we quantify how much we'd expect the slope to differ from one random sample to another?
linear_reg() %>%
  set_engine("lm") %>%
  fit(Weight ~ WingLength, data = Sparrows) %>%
  tidy()

## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    1.37     0.957       1.43 1.56e- 1
## 2 WingLength     0.467    0.0347     13.5  2.62e-25
How do we interpret this?
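A quick sanity check on where the statistic column comes from: for each coefficient it is the estimate divided by its standard error (a sketch using the rounded values printed in the output, so the result is approximate):

```r
# t-statistic for the WingLength slope: estimate / standard error
# (values rounded as printed in the tidy() output)
0.467 / 0.0347
## roughly 13.5, matching the statistic column
```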
How do we know what values of this statistic are worth paying attention to?
linear_reg() %>%
  set_engine("lm") %>%
  fit(Weight ~ WingLength, data = Sparrows) %>%
  tidy(conf.int = TRUE)

## # A tibble: 2 x 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)    1.37     0.957       1.43 1.56e- 1   -0.531     3.26
## 2 WingLength     0.467    0.0347     13.5  2.62e-25    0.399     0.536
Application Exercise

Using the mtcars data frame, fit a linear model predicting miles per gallon (mpg) from weight (wt). Examine the output with the tidy() function demonstrated above. How do you interpret these estimates?
How are these statistics distributed under the null hypothesis?
The distribution of test statistics we would expect given that the null hypothesis is true, \(\beta_1 = 0\), is a t-distribution with n - 2 degrees of freedom (here, 116 - 2 = 114).
How can we compare this line to the distribution under the null?
The probability of getting a statistic as extreme or more extreme than the observed test statistic given the null hypothesis is true
The proportion of area less than 1.5
pt(1.5, df = 18)
## [1] 0.9245248
The proportion of area greater than 1.5
pt(1.5, df = 18, lower.tail = FALSE)
## [1] 0.07547523
The proportion of area greater than 1.5 or less than -1.5.
pt(1.5, df = 18, lower.tail = FALSE) * 2
## [1] 0.1509505
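The same two-sided calculation applies to the sparrows slope: the observed t-statistic is 13.5, with 116 - 2 = 114 degrees of freedom (a sketch using the rounded statistic from the tidy() output, so the result is only accurate to its order of magnitude):

```r
# two-sided p-value for the WingLength slope
# t = 13.5 (rounded, from the tidy() output), df = 116 - 2 = 114
pt(13.5, df = 114, lower.tail = FALSE) * 2
## on the order of 1e-25, consistent with the reported p.value
```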
Application Exercise
Using the linear model you fit previously (mpg from wt using the mtcars data), calculate the p-value for the coefficient for weight. Interpret this value. What is the null hypothesis? What is the alternative hypothesis? Do you reject the null?
If we use the same sampling method to select different samples and compute an interval estimate for each sample, we would expect the true population parameter ( \(\beta_1\) ) to fall within the interval estimates 95% of the time.
\(\Huge \hat\beta_1 \pm t^* \times SE_{\hat\beta_1}\)
linear_reg() %>%
  set_engine("lm") %>%
  fit(Weight ~ WingLength, data = Sparrows) %>%
  tidy(conf.int = TRUE)

## # A tibble: 2 x 7
##   term        estimate std.error statistic  p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
## 1 (Intercept)    1.37     0.957       1.43 1.56e- 1   -0.531     3.26
## 2 WingLength     0.467    0.0347     13.5  2.62e-25    0.399     0.536
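The slope's interval can be reproduced by hand from the formula \(\hat\beta_1 \pm t^* \times SE_{\hat\beta_1}\) (a sketch using the rounded estimate and standard error from the output, with \(t^*\) the 0.975 quantile of a t-distribution with 116 - 2 = 114 degrees of freedom, so the endpoints are approximate):

```r
# 95% confidence interval for the WingLength slope, by hand
t_star <- qt(0.975, df = 114)        # critical value, about 1.98
0.467 + c(-1, 1) * t_star * 0.0347   # estimate +/- t* x SE
## approximately (0.40, 0.54), matching conf.low and conf.high
```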
Using the information here, how could I predict a new sparrow's weight if I knew the wing length was 30?
linear_reg() %>%
  set_engine("lm") %>%
  fit(Weight ~ WingLength, data = Sparrows) %>%
  tidy()

## # A tibble: 2 x 5
##   term        estimate std.error statistic  p.value
##   <chr>          <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept)    1.37     0.957       1.43 1.56e- 1
## 2 WingLength     0.467    0.0347     13.5  2.62e-25
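One way: plug 30 into the fitted line, intercept plus slope times wing length (a sketch using the rounded coefficients from the tidy() output; the predict() call assumes the fitted parsnip object has been saved as lm_fit, as it is later in these slides):

```r
# prediction by hand, using the rounded coefficients
1.37 + 0.467 * 30
## approximately 15.4

# or let the fitted model do it (lm_fit as defined on the glance() slide)
predict(lm_fit, new_data = data.frame(WingLength = 30))
```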
What is the residual sum of squares again?
$$RSS = \sum(y_i - \hat{y}_i)^2$$
$$TSS = \sum(y_i - \bar{y})^2$$
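Both quantities can be computed directly in R (a minimal sketch, assuming the fitted parsnip object is stored as lm_fit, as it is later in these slides; lm_fit$fit is the underlying lm object):

```r
# residual sum of squares: squared vertical distances to the fitted line
rss <- sum(residuals(lm_fit$fit)^2)

# total sum of squares: squared distances from Weight to its mean
tss <- sum((Sparrows$Weight - mean(Sparrows$Weight))^2)
```

\(R^2\) is then 1 - rss / tss, the proportion of variability in Weight explained by the model.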
What could we use to determine whether at least one predictor is useful?
We can use an F-statistic! $$F = \frac{(TSS - RSS)/p}{RSS/(n-p-1)} \sim F_{p,\,n-p-1}$$
lm_fit <- linear_reg() %>%
  set_engine("lm") %>%
  fit(Weight ~ WingLength, data = Sparrows)
glance(lm_fit$fit)

## # A tibble: 1 x 12
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.614         0.611  1.40      181. 2.62e-25     1  -203.  411.  419.
## # … with 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
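A consistency check: dividing the numerator and denominator of the F formula by TSS shows that \(F = \frac{R^2/p}{(1-R^2)/(n-p-1)}\), so the reported F-statistic can be recovered from \(R^2\) (a sketch using the rounded r.squared from the glance() output, with p = 1 predictor and n = 116 sparrows, so the result is approximate):

```r
# F-statistic from R^2 for the simple regression (p = 1, n = 116)
r2 <- 0.614
(r2 / 1) / ((1 - r2) / (116 - 1 - 1))
## roughly 181, matching the statistic column
```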
Application Exercise
Using the model previously fit (using the mtcars data, predicting miles per gallon from weight), pull out the F-statistic and \(R^2\) using the glance() function. Interpret these values.
Refer to Chapter 3 for more details on these topics if you need a refresher.