Linear Regression, Clearly Explained!!!

StatQuest with Josh Starmer
18 Nov 2022 Β· 27:26

Summary

TL;DR: This video, part of the 'StatQuest' series, offers a detailed explanation of linear regression, a powerful statistical method used to fit a line to data and predict outcomes. The video covers the key concepts of linear regression, including least squares, R-squared, and the calculation of p-values, which help assess the strength and significance of the model. Through engaging examples, the video illustrates how these concepts work together to quantify relationships in data, making complex statistical ideas accessible and understandable.

Takeaways

  • πŸ“Š **Linear Regression Basics**: The video introduces linear regression, explaining that the primary steps involve using least squares to fit a line to data, calculating R-squared, and then determining a p-value for R-squared.
  • πŸ“‰ **Fitting a Line**: Linear regression uses the least squares method to find the best-fitting line for the data, minimizing the sum of squared residuals, which are the differences between observed and predicted values.
  • πŸ” **Understanding R-squared**: R-squared is a statistical measure that represents the proportion of variance for a dependent variable that's explained by an independent variable or variables in a regression model.
  • 🐭 **Example with Mice**: The video uses a dataset of mice, where the aim is to predict mouse size based on weight, showcasing how linear regression can be applied to real-world data.
  • βš–οΈ **Sum of Squares**: The sum of squares around the mean is compared to the sum of squares around the fitted line, which helps calculate R-squared by showing how much variance is explained by the model.
  • πŸ”„ **Multiple Parameters**: When multiple predictors are used (e.g., mouse weight and tail length), linear regression fits a plane rather than a line, showing that more complex relationships can be modeled.
  • πŸ“ˆ **Adjusted R-squared**: The concept of adjusted R-squared is introduced, which accounts for the number of predictors in the model, providing a more accurate measure when multiple variables are involved.
  • πŸ€” **Limitations of R-squared**: R-squared alone doesn't indicate whether a model is statistically significant, especially with small datasets or when fitting models to random data points.
  • πŸ”¬ **Calculating p-values**: The video explains the process of calculating p-values using the F-statistic, which compares the variance explained by the model to the variance not explained, helping determine statistical significance.
  • πŸ§ͺ **Practical Application**: Linear regression is highlighted as a powerful tool for quantifying relationships in data, but it requires both a high R-squared value and a low p-value to confirm the reliability and significance of results.

Q & A

  • What is the main topic of the video script?

    -The main topic of the video script is linear regression, also known as general linear models; the video also extends the idea to multiple regression, where several predictors are used at once.

  • What are the three most important concepts behind linear regression mentioned in the script?

    -The three most important concepts behind linear regression mentioned in the script are: fitting a line to the data using least squares, calculating the R-squared value, and calculating a p-value for R-squared.

  • What is the purpose of using least squares in linear regression?

    -The purpose of using least squares in linear regression is to find the best-fitting line for the data by minimizing the sum of the squares of the vertical distances (residuals) between the observed data points and the fitted line.

  • What does the R-squared value represent in the context of linear regression?

    -The R-squared value represents the proportion of the variance in the dependent variable that is predictable from the independent variable(s). It indicates the strength of the relationship between the variables.

  • How is the R-squared value calculated in the script's example with mouse size and weight?

    -The R-squared value is calculated by taking the difference between the variation around the mean and the variation around the fit, then dividing that by the variation around the mean. In the script's example, the R-squared value is 0.6, indicating that 60% of the variation in mouse size can be explained by mouse weight.
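As a quick sanity check on that arithmetic (the numbers are the script's, the variable names are ours):

```python
var_mean, var_fit = 11.1, 4.4  # variation around the mean and around the fit
r_squared = (var_mean - var_fit) / var_mean
print(round(r_squared, 2))     # 0.6, i.e. 60% of the variation explained
```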

  • What is the significance of calculating a p-value for R-squared in linear regression?

    -Calculating a p-value for R-squared is important to determine if the observed R-squared value is statistically significant, which helps to assess whether the relationship between the variables is likely due to chance or a true association.

  • What does the term 'residual' mean in the context of linear regression?

    -In the context of linear regression, a 'residual' refers to the difference between the observed value and the value predicted by the regression line for a given data point.

  • How does the script explain the concept of degrees of freedom in the context of calculating the p-value for R-squared?

    -The script explains that degrees of freedom turn the sums of squares into variances. They reflect the number of parameters in each model: the fit line has two (intercept and slope) while the mean line has one (the intercept), and together with the sample size they adjust the sums of squares for the number of data points and parameters.

  • What is the role of the F-distribution in calculating the p-value for R-squared?

    -The F-distribution is used to approximate the histogram of F-scores that would be obtained if many random datasets were generated and analyzed in the same way. The p-value is then determined by comparing the F-score from the original dataset to this distribution to see how extreme it is relative to the expected distribution of F-scores.
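A small simulation makes this concrete. The sketch below is an illustration, assuming standard-normal noise for the random datasets and borrowing the video's n = 9 mice and observed F = 6; it compares the brute-force histogram approach with the analytic F-distribution:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p_fit, p_mean = 9, 2, 1

def f_score(x, y):
    # F = variance explained by the fit / variance not explained by it
    slope, intercept = np.polyfit(x, y, 1)
    ss_mean = np.sum((y - y.mean()) ** 2)
    ss_fit = np.sum((y - (intercept + slope * x)) ** 2)
    return ((ss_mean - ss_fit) / (p_fit - p_mean)) / (ss_fit / (n - p_fit))

# Brute force: F scores from many random, unrelated datasets...
random_fs = np.array([f_score(rng.normal(size=n), rng.normal(size=n))
                      for _ in range(10_000)])

f_obs = 6.0                                # the script's example F score
p_empirical = np.mean(random_fs >= f_obs)  # fraction at least as extreme
# ...versus the analytic F distribution with the same degrees of freedom.
p_analytic = stats.f.sf(f_obs, p_fit - p_mean, n - p_fit)
print(p_empirical, p_analytic)             # the two should roughly agree
```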

  • How does the script illustrate the potential issue with adding too many parameters to a regression model?

    -The script illustrates the potential issue by pointing out that adding more parameters to a model can lead to a higher R-squared value due to random chance, even if those parameters do not have a true relationship with the dependent variable. This is why adjusted R-squared and p-values are important to ensure the model's validity.

  • What is the adjusted R-squared and why is it used in regression analysis?

    -Adjusted R-squared is a modified version of R-squared that adjusts for the number of predictors in the model. It is used to provide a more accurate measure of the model's goodness of fit, especially when comparing models with different numbers of predictors, as it penalizes the addition of unnecessary predictors.
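The video does not spell the adjustment out; the standard formula, sketched below, shrinks R-squared as the number of predictors p grows relative to the sample size n:

```python
def adjusted_r_squared(r2, n, p):
    """Penalize r2 for the number of predictors p, given n data points."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# With the video's example values (r2 = 0.6, n = 9 mice, one predictor):
print(adjusted_r_squared(0.6, n=9, p=1))  # ~0.543, slightly below 0.6
```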

Outlines

00:00

🚒 Introduction to Linear Regression and Its Core Concepts

This section introduces the topic of linear regression and provides a brief overview of the StatQuest video series. The speaker begins by setting the stage with a metaphorical journey to StatQuest and then delves into the essentials of linear regression, focusing on fitting a line to data using the least squares method, calculating R-squared, and determining the p-value for R-squared. The importance of understanding these concepts for linear regression is emphasized, with a promise of detailed explanations and examples throughout the video.

05:02

πŸ“ Understanding Variance and Sum of Squares in Linear Regression

This paragraph explains the concept of variance in the context of linear regression, particularly how variance is calculated around the mean and the fitted line. It introduces the terms SS (Sum of Squares) and variance, explaining how they relate to the overall variation in a data set. The section emphasizes the importance of understanding these concepts as they are fundamental to assessing the fit of a regression line and calculating key metrics like R-squared.

10:06

πŸ”„ Examples of R-squared in Linear Regression

Here, the speaker provides three distinct examples to illustrate the concept of R-squared in linear regression. Each example shows a different level of predictive power of mouse weight for mouse size. The first example demonstrates a 60% reduction in variance, the second shows a perfect prediction with an R-squared of 100%, and the third illustrates a scenario where mouse weight does not help in predicting mouse size at all, resulting in an R-squared of 0%. These examples help in understanding how R-squared quantifies the proportion of variation explained by the model.

15:07

πŸ“Š The Impact of Adding Parameters and Adjusted R-squared

This section explores the effects of adding more parameters to a regression model, emphasizing that while adding parameters can improve R-squared, it may not always be meaningful. The concept of adjusted R-squared is introduced, which accounts for the number of parameters, thus providing a more accurate measure of model performance. The discussion also touches on the potential for random factors to influence the sum of squares, leading to the need for adjusted metrics.

20:08

πŸ“‰ Degrees of Freedom and F-statistic in Linear Regression

This paragraph explains the concepts of degrees of freedom and the F-statistic, which are crucial for determining the significance of an R-squared value. The degrees of freedom are related to the number of parameters in the model, and the F-statistic compares the explained variance to the unexplained variance. This section breaks down the calculation of the F-statistic and how it is used to determine the p-value, which indicates the reliability of the regression model.
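In the notation the video builds up (SS for sums of squares, p for parameter counts, n for the sample size), the F score is:

```latex
F = \frac{\left( SS(\text{mean}) - SS(\text{fit}) \right) / (p_{\text{fit}} - p_{\text{mean}})}
         {SS(\text{fit}) / (n - p_{\text{fit}})}
```

For a simple line, p_fit = 2 and p_mean = 1, so the numerator has one degree of freedom and the denominator has n βˆ’ 2.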

25:09

βœ… Final Review and Importance of R-squared and P-values

In this final section, the speaker reviews the key concepts of R-squared and p-values in linear regression. The importance of having both a high R-squared and a low p-value for a meaningful and reliable model is stressed. The section concludes by summarizing the process of calculating R-squared and p-values, and the overall significance of these metrics in evaluating the strength and reliability of a linear regression model.

Keywords

πŸ’‘Linear Regression

Linear regression is a statistical method for modeling the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. In the video, it is the primary focus, with the process of fitting a line to data points using the least squares method being a core concept. The script describes how linear regression is used to predict mouse size based on weight, illustrating the technique's application in a real-world context.

πŸ’‘Least Squares

Least squares is a mathematical procedure used to find the line of best fit by minimizing the sum of the squares of the vertical distances (residuals) of the data points from the line. The script explains that this method is used to determine the best-fitting line for a set of data, which is crucial for linear regression analysis.
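The video finds this minimum by trying rotations; in practice the least squares line has a textbook closed form (a standard result, not derived in the video). Minimizing the sum of squared residuals over intercept a and slope b gives:

```latex
\hat{b} = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2},
\qquad
\hat{a} = \bar{y} - \hat{b}\,\bar{x}
```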

πŸ’‘R-squared

R-squared, or the coefficient of determination, is a statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable or variables in a regression model. The script discusses how to calculate R-squared to determine the goodness of fit for the regression model, using it to quantify how much of the variation in mouse size can be explained by mouse weight.

πŸ’‘P-value

A p-value is the probability of obtaining a result at least as extreme as the one observed if chance alone were at work, that is, under the null hypothesis. In the context of the video, the p-value for R-squared is calculated to assess the statistical significance of the relationship between mouse weight and size. The script explains that a small p-value indicates strong evidence against the null hypothesis, suggesting that the relationship is statistically significant.

πŸ’‘Residual

In regression analysis, a residual is the difference between the observed values and the values predicted by a model. The script mentions residuals as the distances from the data points to the fitted line, which are squared and summed to find the best-fitting line using the least squares method.

πŸ’‘Variance

Variance is a measure of the dispersion of a set of data points around their mean. The script uses variance to describe the spread of the data points and how much they vary from the mean or the fitted line. It explains how variance is calculated and used in determining R-squared and the p-value.

πŸ’‘Fitting a Line

Fitting a line refers to the process of determining the best line that represents the data in a scatter plot, typically done through linear regression. The script describes the iterative process of rotating the line to minimize the sum of squared residuals, which is essential for finding the least squares fit.

πŸ’‘Sum of Squares

The sum of squares, particularly the sum of squares around the mean (SS mean) and around the fit (SS fit), is a key component in calculating R-squared. The script explains that these sums are used to measure the total variability of the data and the unexplained variability after fitting the regression line, respectively.
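Written out for observations y_1, ..., y_n with fitted values Ε·_i, the quantities the video works with are:

```latex
SS(\text{mean}) = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
SS(\text{fit}) = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
R^2 = \frac{SS(\text{mean}) - SS(\text{fit})}{SS(\text{mean})}
```

and each variance is simply the corresponding SS divided by n, the video's "average sum of squares per mouse".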

πŸ’‘Degrees of Freedom

Degrees of freedom in statistics refer to the number of values that are free to vary in a calculation. In the script, degrees of freedom are used in the context of calculating variance and are important for understanding how the sums of squares are converted into variances, which are then used to calculate the F statistic and the p-value.

πŸ’‘F Statistic

The F statistic is used in hypothesis testing to compare the variance a model explains to the variance it leaves unexplained. In the context of the video, it is the ratio of the variance in mouse size explained by weight to the variance not explained, and it is used to calculate the p-value for R-squared, helping to assess the significance of the regression model.

Highlights

StatQuest introduces the concept of linear regression, also known as general linear models, as a powerful statistical tool.

The primary objective of linear regression is to fit a line to data using the least squares method.

Calculating R-squared is essential to measure the goodness of fit of the regression line.

R-squared quantifies the percentage of variance explained by the model.

The video explains the concept of residuals, which are the distances between the data points and the regression line.

Least squares fitting involves rotating the line to minimize the sum of squared residuals.

The equation of the regression line is derived from the least squares method, estimating parameters like slope and y-intercept.

R-squared can be calculated using both the sum of squares around the mean and the sum of squares around the fit.

An R-squared value of 1 indicates a perfect prediction model, while 0 suggests no explanatory power.

The video demonstrates how to apply R-squared to more complex models involving multiple variables.
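For instance, a plane with two predictors can be fit with the same machinery. In the sketch below only the first mouse (weight 2.1, tail length 1.3, body length 2.5) uses the video's numbers; the rest are invented:

```python
import numpy as np

# Columns of X are weight and tail length; y is body length.
X = np.array([[2.1, 1.3], [2.9, 2.0], [3.7, 1.6], [4.4, 2.8], [5.2, 2.2]])
y = np.array([2.5, 3.1, 3.4, 4.6, 4.9])

# Fit a plane: prepend a column of ones for the intercept, then least squares.
A = np.column_stack([np.ones(len(y)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)  # [intercept, b_weight, b_tail]

predicted = A @ coef
ss_mean = np.sum((y - y.mean()) ** 2)
ss_fit = np.sum((y - predicted) ** 2)
print((ss_mean - ss_fit) / ss_mean)  # R-squared for the plane
```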

Least squares fitting can ignore irrelevant parameters by setting their coefficients to zero.

Adjusted R-squared is introduced to account for the number of parameters in the model and avoid overfitting.

The concept of degrees of freedom is discussed in the context of turning sums of squares into variances.

The video explains how to calculate the F-statistic, which is used to determine the statistical significance of R-squared.

The F-distribution is used to approximate the histogram of F-statistics for calculating p-values.

A small p-value indicates that the observed R-squared is statistically significant and not due to random chance.

The video concludes by emphasizing the importance of both a high R-squared and a low p-value for a meaningful linear regression model.

Transcripts

play00:01

sailing on a boat headed towards

play00:04

statquest

play00:06

join me on this boat let's go to stat

play00:09

Quest it's super cool

play00:13

hello and welcome to stat Quest

play00:16

stat Quest is brought to you by the

play00:18

friendly folks in the genetics

play00:19

department at the University of North

play00:21

Carolina at Chapel Hill

play00:23

today we're going to be talking about

play00:25

linear regression AKA General linear

play00:28

models part one there's a lot of parts

play00:31

to linear models but it's a really cool

play00:33

and Powerful concept so let's get right

play00:36

down to it

play00:38

I promise you I have lots and lots of

play00:40

slides that talk about all the Nitty

play00:42

Gritty details behind linear regression

play00:44

but first let's talk about the main

play00:47

ideas behind it

play00:49

the first thing you do in linear

play00:50

regression is use least squares to fit a

play00:53

line to the data

play00:55

the second thing you do is calculate r

play00:58

squared

play00:59

lastly calculate a p-value for r squared

play01:04

there are lots of other little things

play01:06

that come up along the way but these are

play01:08

the three most important Concepts behind

play01:11

linear regression

play01:12

in the stat Quest fitting a line to data

play01:16

we talked about

play01:18

fitting a line to data duh

play01:21

but let's do a quick review

play01:23

I'm going to introduce some new

play01:25

terminology in this part of the video so

play01:27

it's worth watching even if you've

play01:29

already seen the earlier stat Quest

play01:31

that said if you need more details check

play01:35

that stat Quest out

play01:37

for this review we're going to be

play01:39

talking about a data set where we took a

play01:41

bunch of mice and we measured their size

play01:43

and we measured their weight

play01:46

our goal is to use mouse weight as a way

play01:50

to predict Mouse size

play01:53

first draw a line through the data

play01:57

second measure the distance from the

play02:00

line to the data Square each distance

play02:02

and then add them up

play02:04

terminology alert

play02:06

the distance from the line to the data

play02:09

point is called a residual

play02:12

third rotate the line a little bit

play02:16

with the new line measure the residuals

play02:19

Square them and then sum up the squares

play02:23

now rotate the line a little bit more

play02:27

sum up the squared residuals

play02:30

etc etc etc we rotate and then sum up

play02:33

the squared residuals rotate then sum up

play02:36

the squared residuals just keep doing

play02:37

that

play02:39

after a bunch of rotations you can plot

play02:42

the sum of squared residuals and

play02:44

corresponding rotation

play02:46

so in this graph we have the sum of

play02:48

squared residuals on the y-axis and the

play02:51

different rotations on the x-axis

play02:55

lastly you find the rotation that has

play02:58

the least sum of squares

play03:00

more details about how this is actually

play03:03

done in practice are provided in the

play03:06

stat Quest on fitting a line to data

play03:08

so we see that this rotation is the one

play03:12

with the least squares so it will be the

play03:15

one to fit to the data

play03:18

this is our least squares rotation

play03:21

superimposed on the original data

play03:24

bam now we know why the method for

play03:27

fitting a line is called least squares

play03:31

now we have fit a line to the data this

play03:34

is awesome

play03:36

here's the equation for the line

play03:38

least squares estimated two parameters

play03:42

a y-axis intercept

play03:46

and a slope

play03:48

since the slope is not zero it means

play03:51

that knowing a mouse's weight will help

play03:53

us make a guess about that Mouse's size

play03:57

how good is that guess

play04:00

calculating r squared is the first step

play04:03

in determining how good that guess will

play04:06

be

play04:07

the stat Quest r squared explained talks

play04:11

about you got it r squared

play04:15

let's do a quick review I'm also going

play04:18

to introduce some additional terminology

play04:20

so it's worth watching this part of the

play04:22

video even if you've seen the original

play04:24

stat Quest on r squared

play04:27

first calculate the average Mouse size

play04:31

okay I've just shifted all the data

play04:34

points to the y-axis to emphasize that

play04:37

at this point we are only interested in

play04:40

Mouse size

play04:42

here I've drawn a black line to show the

play04:45

average Mouse size

play04:48

bam

play04:51

sum the squared residuals

play04:53

just like in least squares we measure

play04:56

the distance from the mean to the data

play04:59

point and square it and then add those

play05:01

squares together

play05:03

terminology alert we'll call this SS

play05:07

mean for sum of squares around the mean

play05:12

note the sum of squares around the mean

play05:15

equals the data minus the mean squared

play05:20

the variation around the mean equals the

play05:23

data minus the mean squared divided by n

play05:27

n is the sample size in this case n

play05:30

equals 9.

play05:32

the shorthand notation is the variation

play05:35

around the mean equals the sum of

play05:37

squares around the mean divided by n the

play05:41

sample size

play05:42

another way to think about variance is

play05:45

as the average sum of squares per Mouse

play05:49

now go back to the original plot and sum

play05:52

up the squared residuals around our

play05:55

least squares fit

play05:57

we'll call this SS fit for the sum of

play06:01

squares around the least squares fit

play06:04

the sum of squares around the least

play06:06

squares fit is the sum of the distances

play06:10

between the data and the line squared

play06:14

just like with the mean the variance

play06:17

around the fit

play06:18

is the distance between the line and the

play06:21

data squared divided by n the sample

play06:24

size

play06:25

the shorthand is the variation around

play06:28

the fitted line equals the sum of

play06:30

squares around the fitted line divided

play06:33

by n the sample size

play06:36

again we can think of the variation

play06:38

around the fit as the average of the sum

play06:41

of squares around the fit for each Mouse

play06:44

in general the variance of something

play06:47

equals the sum of squares divided by the

play06:50

number of those things

play06:52

in other words it's an average of sum of

play06:55

squares

play06:56

I mentioned this because it's going to

play06:58

come in handy in a little bit so keep it

play07:00

in the back of your mind

play07:02

okay let's step back a little bit this

play07:06

is the raw variation in Mouse size

play07:10

and this is the variation around the

play07:12

least squares line

play07:14

there's less variation around the line

play07:16

that we fit by least squares that is to

play07:19

say the residuals are smaller

play07:22

as a result we say that some of the

play07:25

variation in Mouse size is explained by

play07:29

taking mouse weight into account

play07:32

in other words heavier mice are bigger

play07:34

lighter mice are smaller

play07:37

r squared tells us how much of the

play07:40

variation in Mouse size can be explained

play07:43

by taking mouse weight into account

play07:47

this is the formula for r squared it's

play07:50

the variation around the mean minus the

play07:52

variation around the fit divided by the

play07:55

variation around the mean

play07:58

let's look at an example

play08:00

in this example the variation around the

play08:03

mean equals 11.1 and the variation

play08:07

around the fit equals 4.4

play08:10

so we plug those numbers into the

play08:12

equation

play08:13

the result is that r squared equals 0.6

play08:18

which is the same thing as saying 60

play08:21

percent

play08:22

this means there is a sixty percent

play08:24

reduction in the variance when we take

play08:27

the mouse weight into account

play08:30

alternatively we can say that mouse

play08:33

weight explains 60 percent of the

play08:36

variation in Mouse size

play08:39

we can also use the sum of squares to

play08:41

make the same calculation

play08:43

this is because when we're talking about

play08:45

variation everything's divided by n the

play08:48

sample size since everything's scaled by

play08:51

n we can pull that term out and just use

play08:54

the raw sum of squares

play08:56

in this case the sum of squares around

play08:59

the mean equals one hundred

play09:01

and the sum of squares around the fit

play09:03

equals 40. plugging those numbers into

play09:06

the equation gives us the same value we

play09:09

had before r squared equals 0.6 which

play09:13

equals 60 percent

play09:15

60 percent of the sums of squares of the

play09:19

mouse size can be explained by mouse

play09:21

weight

play09:23

here's another example we're also going

play09:25

to go back to using variation in the

play09:27

calculation since that's more common

play09:30

in this case knowing mouse weight means

play09:34

you can make a perfect prediction of

play09:36

mouse size

play09:38

the variation around the mean is the

play09:41

same as it was before 11.1

play09:44

but now the variation around the fitted

play09:46

line equals zero because there are no

play09:48

residuals

play09:50

plugging the numbers in gives us an r

play09:52

squared equal to one which equals one

play09:55

hundred percent

play09:57

in this case mouse weight explains 100

play10:01

percent of the variation in Mouse size

play10:05

okay one last example

play10:09

in this case knowing mouse weight

play10:11

doesn't help us predict Mouse size

play10:15

if someone tells us they have a heavy

play10:17

Mouse well that Mouse could either be

play10:19

small or large with equal probability

play10:23

similarly if someone said they had a

play10:26

light Mouse well again we wouldn't know

play10:28

if it was a big mouse or a small Mouse

play10:30

because each of those options is equally

play10:33

likely

play10:34

just like the other two examples the

play10:37

variation around the mean is equal to 11.1

play10:41

however in this case the variation

play10:43

around the fit is also equal to 11.1

play10:47

so we plug those numbers in and we get r

play10:50

squared equals 0 which equals zero

play10:52

percent

play10:54

in this case mouse weight doesn't

play10:56

explain any of the variation around the

play10:59

mean

play11:00

when calculating the sum of squares

play11:03

around the mean we collapse the points

play11:05

onto the y-axis just to emphasize the

play11:08

fact that we were ignoring mouse weight

play11:12

but we could just as easily draw a line

play11:15

y equals the mean Mouse size and

play11:19

calculate the sum of squares around the

play11:21

mean around that

play11:24

in this example we applied r squared to

play11:27

a simple equation for a line Y equals

play11:31

0.1 plus 0.78 times x

play11:36

this gave us an r squared of 60 percent

play11:39

meaning 60 percent of the variation in

play11:42

Mouse size could be explained by mouse

play11:45

weight

play11:46

but the concept applies to any equation

play11:49

no matter how complicated

play11:51

first you measure square and sum the

play11:55

distance from the data to the mean

play11:57

then measure square and sum the distance

play12:00

from the data to the complicated

play12:02

equation

play12:04

once you've got those two sums of

play12:05

squares just plug them in and you've got

play12:08

r squared

play12:10

let's look at a slightly more

play12:12

complicated example

play12:14

imagine we wanted to know if mouse

play12:17

weight and tail length did a good job

play12:19

predicting the length of the mouse's

play12:21

body

play12:23

so we measure a bunch of mice

play12:26

to plot this data we need a

play12:28

three-dimensional graph

play12:31

we want to know how well weight and tail

play12:34

length predict body length

play12:37

the first Mouse we measured had weight

play12:40

equals 2.1

play12:42

tail length equals 1.3 and body length

play12:46

equals 2.5

play12:49

so that's how we plot this data on this

play12:51

3D graph

play12:53

here's all the data in the graph the

play12:56

larger circles are points that are

play12:58

closer to us and represent mice that

play13:01

have shorter tails

play13:03

the smaller circles are points that are

play13:05

further from us and represent mice with

play13:08

longer tails

play13:10

now we do a least squares fit

play13:13

since we have the extra term in the

play13:15

equation representing an extra Dimension

play13:18

we fit a plane instead of a line

play13:22

here's the equation for the plane

play13:25

the Y value represents body length

play13:29

least squares estimates three different

play13:32

parameters

play13:34

the first is the y-intercept that's when

play13:37

both tail length and mouse weight are

play13:39

equal to zero

play13:41

the second parameter 0.7 is for the

play13:45

mouse weight

play13:46

the last term

play13:48

0.5 is for the tail length

play13:52

if we know a mouse's weight and tail

play13:55

length we can use the equation to guess

play13:58

the body length

play14:00

for example given the weight and tail

play14:03

length for this mouse

play14:05

the equation predicts this body length

play14:08

just like before we can measure the

play14:11

residuals Square them and then add them

play14:14

up to calculate r squared

play14:17

now if the tail length or the z-axis is

play14:21

useless and doesn't make the sum of

play14:23

squares fit any smaller, then least

play14:26

squares will ignore it by making that

play14:29

parameter equal to zero in this case

play14:32

plugging the tail length into the

play14:34

equation would have no effect on

play14:36

predicting the mouse size

play14:39

this means equations with more

play14:41

parameters will never make the sum of

play14:44

squares around the fit worse than

play14:46

equations with fewer parameters

play14:49

in other words this equation Mouse size

play14:53

equals 0.3 plus mouse weight plus flip

play14:58

of a coin plus favorite color plus

play15:00

astrological sign plus extra stuff will

play15:04

never perform worse than this equation

play15:06

Mouse size equals 0.3 plus mouse weight

play15:11

this is because least squares will cause

play15:13

any term that makes sum of squares

play15:15

around the fit worse to be multiplied by

play15:18

zero and in a sense no longer exist

play15:22

now due to random chance there is a

play15:26

small probability that the small mice in

play15:29

the data set might get heads more

play15:31

frequently than large mice

play15:34

if this happened then we'd get a smaller

play15:37

sum of squares fit and a better r

play15:39

squared

play15:42

here's the frowny face of sad times

play15:47

the more silly parameters we add to the

play15:50

equation the more opportunities we have

play15:52

for random events to reduce sum of

play15:55

squares fit and result in a better r

play15:57

squared

play15:59

thus people report an adjusted r squared

play16:02

value that in essence scales r squared

play16:06

by the number of parameters

play16:08

r squared is awesome

play16:11

but it's missing something

play16:13

what if all we had were two measurements

play16:17

we'd calculate the sum of squares around

play16:19

the mean in this case that would be 10

play16:23

then we'd calculate the sum of squares

play16:25

around the fit which equals zero

play16:29

the sum of squares around the fit equals

play16:31

zero because you can always draw a

play16:34

straight line to connect any two points

play16:37

what this means is when we calculate r

play16:40

squared by plugging the numbers in we're

play16:43

going to get 100 percent

play16:45

100 percent is a great number we've

play16:48

explained all the variation but any two

play16:51

random points will give us the exact

play16:53

same thing it doesn't actually mean

play16:55

anything

play16:57

we need a way to determine if the r

play17:00

squared value is statistically

play17:02

significant

play17:04

we need a p-value

play17:06

before we calculate the p-value let's

play17:09

review the main Concepts behind r

play17:11

squared one last time

play17:14

the general equation for r squared is

play17:17

the variance around the mean minus the

play17:19

variance around the fit divided by the

play17:22

variance around the mean

play17:24

in our example this means the variation

play17:27

in the mouse size minus the variation

play17:30

after taking weight into account divided

play17:33

by the variation in Mouse size

play17:36

in other words r squared equals the

play17:39

variation in Mouse size explained by

play17:41

weight divided by the variation in Mouse

play17:44

size without taking weight into account

play17:48

in this particular example r squared

play17:51

equals 0.6 meaning we saw a 60 percent reduction

play17:55

in variation once we took mouse weight

play17:58

into account

play18:00

now that we have a thorough

play18:01

understanding of the ideas behind r

play18:04

squared let's talk about the main ideas

play18:06

behind calculating a p-value for it

play18:10

the p-value for r squared comes from

play18:13

something called f

play18:16

f is equal to the variation in Mouse

play18:19

size explained by weight divided by the

play18:23

variation in Mouse size not explained by

play18:26

weight

play18:27

the numerators for r squared and for f

play18:30

are the same

play18:33

that is to say it's the reduction in

play18:35

variance when we take the weight into

play18:38

account

play18:39

the denominator is a little different

play18:42

these dotted lines the residuals

play18:45

represent the variation that remains

play18:48

after fitting the line

play18:50

this is the variation that is not

play18:52

explained by weight

play18:54

so together we have the variation in

play18:57

Mouse size explained by weight divided

play19:00

by the variation in Mouse size not

play19:03

explained by weight

play19:05

now let's look at the underlying

play19:07

mathematics

play19:08

just as a reminder here's the equation

play19:11

for r squared

play19:13

this is the general equation that will

play19:16

tell us if r squared is significant

play19:19

the meat of these two equations are very

play19:22

similar and rely on the same sums of

play19:25

squares

play19:26

like we said before the numerators are

play19:29

the same

play19:30

in our Mouse size and weight example the

play19:34

numerator is the variation in Mouse size

play19:36

explained by weight

play19:39

and the sum of squares around the fit is

play19:42

just the residuals squared and summed up

play19:44

around the fitted line so that's the

play19:47

variation that the fit does not explain

play19:51

these numbers over here are the degrees

play19:53

of freedom

play19:55

they turn the sums of squares into

play19:57

variances

play19:59

I'm going to dedicate a whole stat quest

play20:02

to degrees of freedom but for now let's

play20:05

see if we can get an intuitive feel for

play20:07

what they're doing here

play20:10

let's start with these

play20:13

P fit is the number of parameters in the

play20:16

fit line

play20:18

here's the equation for the fit line in

play20:20

a general format we just have the

play20:22

y-intercept plus the slope times x

play20:26

the y-intercept and the slope are two

play20:29

separate parameters

play20:31

that means P fit equals two

play20:35

p mean is the number of parameters in

play20:38

the mean line

play20:40

in general that equation is y equals the

play20:44

y-intercept

play20:45

that's what gives us a horizontal line

play20:47

that cuts through the data

play20:49

in this case the y-intercept is the mean

play20:53

value

play20:54

this equation just has one parameter

play20:58

thus p mean equals one

play21:02

both equations have a parameter for the

play21:04

y-intercept

play21:07

however the fit line has one extra

play21:10

parameter the slope in our example this

play21:14

slope is the relationship between weight

play21:16

and size

play21:19

in this example P fit minus p mean

play21:22

equals 2 minus 1 which equals one

play21:27

the fit has one extra parameter mouse

play21:30

weight

play21:32

thus the numerator is the variance

play21:35

explained by the extra parameter in our

play21:39

example that's the variance in Mouse

play21:41

size explained by mouse weight

play21:44

if we had used mouse weight and tail

play21:47

length to explain variation in size

play21:50

then we would end up with an equation

play21:52

that had three parameters and P fit

play21:55

would equal three

play21:57

thus P fit minus p mean would equal

play22:01

three minus 1 which equals two

play22:04

now the fit has two extra parameters

play22:07

mouse weight and tail length

play22:11

with the fancier equation for the fit

play22:13

the numerator is the variance and mouse

play22:16

size explained by mouse weight and tail

play22:19

length

play22:21

now let's talk about the denominator for

play22:23

our equation for f

play22:26

denominator is the variation in Mouse

play22:29

size not explained by the fit

play22:33

that is to say it's the sum of squares

play22:36

of the residuals that remain after we

play22:39

fit our new line to the data

play22:42

why divide sum of squares fit by n minus P

play22:47

fit instead of just n

play22:50

intuitively the more parameters you have

play22:53

in your equation the more data you need

play22:55

to estimate them for example you only

play22:59

need two points to estimate a line but

play23:02

you need three points to estimate a

play23:04

plane

play23:05

if the fit is good

play23:07

then the variation explained by the

play23:09

extra parameters in the fit will be a

play23:12

large number and the variation not

play23:14

explained by the extra parameters in the

play23:16

fit will be a small number

play23:18

that makes f a really large number

play23:22

now that question we've all been dying

play23:24

to know the answer to how do we turn

play23:27

this number into a p-value

play23:30

conceptually

play23:31

generate a set of random data

play23:34

calculate the mean and the sum of

play23:37

squares around the mean

play23:39

calculate the fit in the sum of squares

play23:41

around the fit

play23:43

now plug all those values into our

play23:46

equation for f

play23:48

and that will give us a number in this

play23:50

case that number is 2.

play23:52

now plot that number in a histogram

play23:55

now generate another set of random data

play23:59

calculate the mean and the sum of

play24:01

squares around the mean

play24:03

then calculate the fit and the sum of

play24:06

squares around the fit

play24:08

plug those values into our equation for

play24:10

f

play24:12

and in this case we get f equals three

play24:15

so we then plug that value into our

play24:17

histogram

play24:19

and then we repeat with yet another set

play24:22

of random data in this case we got f

play24:25

equals one that's plotted on our

play24:27

histogram

play24:28

and we just keep generating more and

play24:31

more random data sets calculating the

play24:33

sums of squares plugging them into our

play24:36

equation for f and plotting the results

play24:38

on our histogram

play24:40

now imagine we did that hundreds if not

play24:43

millions of times

play24:45

when we're all done with our random data

play24:48

sets we return to our original data set

play24:51

we then plug the numbers into our

play24:54

equation for f in this case we got f

play24:57

equals 6.

play24:59

the p-value is the number of more

play25:02

extreme values divided by all of the

play25:05

values

play25:06

so in this case we have the value at f

play25:09

equals 6 and the value at f equals 7

play25:12

divided by all the other randomizations

play25:14

that we created originally

play25:17

if this concept is confusing to you I

play25:20

have a stat Quest that explains p-values

play25:22

so check that one out

play25:25

bam

play25:26

you can approximate the histogram with a

play25:29

line in practice rather than generating

play25:32

tons of random data sets people use the

play25:35

line to calculate the p-value

play25:38

here's an example of one standard F

play25:41

distribution that people use to

play25:43

calculate p-values the degrees of

play25:46

freedom determine the shape

play25:49

the red line represents another standard

play25:52

F distribution that people use to

play25:54

calculate p-values

play25:56

in this case the sample size used to

play25:59

draw the red line is smaller than the

play26:02

sample size used to draw the blue line

play26:05

notice that when n minus P fit equals 10

play26:09

the distribution tapers off faster

play26:13

this means that the p-value will be

play26:16

smaller when there are more samples

play26:18

relative to the number of parameters in

play26:20

the fit equation

play26:22

triple bam

play26:24

hooray we finally got our p-value now

play26:28

let's review the main ideas

play26:30

given some data that you think are

play26:33

related

play26:34

linear regression quantifies the

play26:37

relationship in the data

play26:39

this is r squared this needs to be large

play26:43

it also determines how reliable that

play26:46

relationship is

play26:48

this is the p-value that we calculated

play26:50

with f

play26:51

this needs to be small

play26:53

you need both to have an interesting

play26:55

result

play26:57

hooray we've made it to the end of

play27:00

another exciting stat Quest wow this was

play27:03

a long one I hope you had a good time

play27:06

if you like this and want to see more

play27:08

stat quests like it why not subscribe to

play27:11

my channel it's real easy just click the

play27:13

red button

play27:14

and if you have any ideas of stat quests

play27:17

that you'd like me to create just put

play27:19

them in the comments below that's all

play27:21

there is to it all right tune in next

play27:23

time for another really exciting stat

play27:25

Quest

Related Tags
Linear Regression, R-squared, P-value, StatQuest, Data Science, Genetics, Education, UNC Chapel Hill, Statistics, Mathematics