Correlation Doesn't Equal Causation: Crash Course Statistics #8

CrashCourse
14 Mar 2018, 12:17

Summary

TL;DR: In this Crash Course Statistics episode, Adriene Hill explores data relationships, focusing on how one variable can predict another. She introduces scatter plots as a tool to visualize these relationships, highlighting their versatility in identifying both linear and nonlinear connections. Hill discusses the significance of regression lines in describing relationships and introduces the concept of correlation, explaining how it measures the direction and strength of the relationship between two variables. The episode emphasizes that correlation does not imply causation, warning against the common mistake of equating the two. It concludes by stressing the importance of understanding data relationships for prediction and reflection.

Takeaways

  • 📊 Scatter plots are essential tools for visualizing relationships between two continuous variables, allowing us to observe patterns and clusters in data.
  • 🔍 The regression line, which Karl Pearson fit to data on the heights of fathers and sons, describes the relationship between two variables as the line closest to all data points.
  • ↗️ The slope (m) of the regression line indicates the change in the dependent variable (y) for each unit increase in the independent variable (x), providing a predictive measure.
  • 🔢 The correlation coefficient (r) measures the strength and direction of the linear relationship between two variables, ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation).
  • 🔄 Positive correlation implies that as one variable increases, the other also tends to increase, while negative correlation suggests an inverse relationship.
  • 📉 Correlation does not imply causation; just because two variables are correlated does not mean one causes the other to occur.
  • 🌡️ The R-squared (r^2) value represents the proportion of the variance in the dependent variable that is predictable from the independent variable, with values ranging from 0 to 1.
  • 🚫 It's crucial to be cautious of spurious correlations, which can occur by chance or due to a third, unobserved variable influencing both variables in question.
  • 👀 Always examine scatter plots to understand the nature of the relationship between variables, as different datasets can have the same correlation coefficient but very different relationships.
  • 🌟 Understanding correlations helps in making predictions and understanding past events, but it's important to consider the context and not jump to conclusions based solely on correlation.

Q & A

  • What is the main topic of discussion in this Crash Course Statistics video?

    -The main topic of discussion is data relationships, specifically how one variable can be used to predict another, and the use of scatter plots to visualize these relationships.

  • What is a scatter plot and why is it useful in statistics?

    -A scatter plot is a type of plot that displays data points on horizontal and vertical axes, typically with one variable on the x-axis and another on the y-axis. It is useful in statistics for visualizing the relationship between two continuous variables and identifying patterns, clusters, or trends.

  • What does the term 'bivariate data' refer to in the context of this video?

    -Bivariate data refers to data that involves relationships between two continuous variables, which is the simplest form of data relationship discussed in the video.

  • How does the video illustrate the concept of linear relationships using the example of father and son heights?

    -The video uses the example of father and son heights to illustrate linear relationships by fitting a regression line through the data points, which represents the average change in son's height for each unit increase in father's height.

  • What is the significance of the regression line in the context of this video?

    -The regression line is significant because it is the line that is as close as possible to all the data points, representing the best fit for the data. It helps in predicting one variable based on the value of another and understanding the strength and direction of the relationship between variables.

  • What is the formula for a line in the context of a regression line, and what do the variables represent?

    -The formula for a line in the context of a regression line is y = mx + b, where 'm' represents the slope (the change in y for each unit change in x), 'x' is the independent variable, 'y' is the dependent variable, and 'b' is the y-intercept (the value of y when x is 0).

  • How does the video explain the concept of correlation and its importance?

    -The video explains that correlation measures the way two variables move together, indicating both the direction and the strength of their relationship. It is important because it provides insight into the nature of the relationship between variables, whether they move in the same direction (positive correlation) or opposite directions (negative correlation).

  • What is the correlation coefficient 'r' and what does it represent?

    -The correlation coefficient 'r' is a measure that quantifies the strength and direction of the linear relationship between two variables. It ranges from -1 to 1, where 1 indicates a perfect positive correlation, -1 indicates a perfect negative correlation, and 0 indicates no linear correlation.

  • What is the difference between correlation and causation as discussed in the video?

    -The video emphasizes that correlation does not equal causation. While correlation indicates a relationship between two variables, causation implies that one variable causes the other to change. The video warns against interpreting correlations as causations without considering other factors.

  • How does the video use the 'Cool-Cage Act' as an example to illustrate the misunderstanding of correlation?

    -The 'Cool-Cage Act' is a satirical example used in the video to show how correlations can be misinterpreted. The act proposes to reduce drownings by prohibiting air conditioner sales and preventing Nicolas Cage from starring in movies, based on the observed correlations with drownings. The video points out that the air conditioning correlation is probably due to a third variable (heat), that the Nicolas Cage correlation is likely just chance, and that neither implies causation.

  • What is the significance of R^2 in the context of this video, and how does it relate to prediction?

    -R^2, or the coefficient of determination, represents the proportion of the variance in one variable that is predictable from the other variable. It is always between 0 and 1, and a higher R^2 value indicates a better fit of the model and a more accurate prediction of one variable based on the other.

Outlines

00:00

📊 Introduction to Data Relationships

Adriene Hill introduces the topic of data relationships in statistics, focusing on how one variable can be used to predict another. The discussion begins with bivariate data, where two continuous variables are analyzed. The use of scatter plots is emphasized as a versatile tool for visualizing data relationships, allowing for the identification of both linear and nonlinear relationships. The example of Old Faithful's eruptions is used to illustrate how scatter plots can reveal patterns, such as the potential existence of two types of eruptions with different durations and latencies.

05:01

🔍 Exploring Linear Relationships and Regression

The paragraph delves into linear relationships, using the example of the heights of fathers and their sons to demonstrate how data can be analyzed beyond mere observation. Statistician Karl Pearson's 1903 paper is highlighted, in which he fit a regression line to the data points rather than relying on visual judgment alone, allowing a son's height to be predicted from his father's. The regression line's formula, y = mx + b, is explained, with 'm' representing the slope that indicates the change in the son's height for each inch increase in the father's height. The importance of considering the units of measurement when interpreting the slope is also discussed.

10:02

🔗 Understanding Correlation and Its Limitations

This section explains the concept of correlation, which measures the direction and strength of the relationship between two variables. Positive and negative correlations are illustrated with examples, and the significance of the correlation coefficient 'r' is discussed, which ranges from -1 to 1. The paragraph also introduces 'r^2', which represents the proportion of variance in one variable that can be predicted by the other. The Mayor's misguided attempt to reduce drownings by correlating them with air conditioning usage and Nicolas Cage movies humorously illustrates the critical distinction between correlation and causation, emphasizing the importance of not equating correlation with causation.

🧐 The Importance of Visualizing Data with Scatter Plots

The final paragraph warns against relying solely on correlation coefficients like 'r' and 'R^2' without visualizing data through scatter plots. It presents the 'Datasaurus Dozen,' a set of scatter plots with the same correlation but different relationships, underscoring the need for visual analysis. The paragraph concludes by emphasizing the importance of understanding data relationships for predicting future events and reflecting on past occurrences, suggesting that correlation can be a tool for understanding various aspects of life, including personal relationships.

Keywords

💡Relationships

In the context of the video, 'relationships' refers to the statistical associations between different sets of data. The video discusses how one variable can be used to predict another, such as predicting loan defaults based on writing style or behavior changes after watching certain movies. This concept is central to understanding data patterns and making informed predictions.

💡Bivariate Data

Bivariate data is a type of data that involves relationships between two continuous variables. The video uses the example of Old Faithful's eruption duration and latency to illustrate how bivariate data can be visualized and analyzed. Understanding bivariate data is crucial for identifying patterns and making predictions in various fields, such as in the study of natural phenomena or social behaviors.

💡Scatter Plot

A scatter plot is a graphical representation used to display the values of two variables for a set of data. The video emphasizes its versatility and usefulness in visualizing data relationships. It is used to plot Old Faithful's eruption data, showing clusters that suggest different types of eruptions. Scatter plots are fundamental in statistics for exploring and understanding the nature of data relationships.

💡Regression Line

The regression line, as mentioned in the video, is a straight line that fits as closely as possible to all the data points in a scatter plot. It is used to describe the relationship between variables, such as the heights of fathers and their sons. The video explains that this line helps in making predictions about one variable based on the value of another, which is a key application in statistical analysis.

💡Slope (m)

The slope (m) in the context of the video refers to the rate of change in the dependent variable (y) for each unit increase in the independent variable (x). It is a crucial component of the linear equation y = mx + b. The video uses the example of father and son heights to illustrate how the slope can indicate the expected increase in a son's height for each inch increase in the father's height.

💡Correlation

Correlation is a statistical measure that expresses the extent to which two variables move in relation to each other. The video explains how correlation can indicate both the direction and the strength of the relationship between variables. It is used to understand whether variables are positively or negatively related and to what degree they are associated.

💡Correlation Coefficient (r)

The correlation coefficient (r) is a numerical value that ranges from -1 to 1, indicating the strength and direction of the linear relationship between two variables. The video clarifies that a positive r value suggests a positive relationship, while a negative r value indicates a negative relationship. It is used to quantify the degree of correlation, which is essential for statistical analysis and prediction.
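
As a rough illustration of how r can be computed, the Python sketch below divides the covariance of two variables by the product of their standard deviations, which is the scaling the video describes. The function name `pearson_r` and the small noisy dataset are made up for this example; the asleep/awake pair follows the video's 24-hour example.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient: covariance scaled by both standard deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (sd_x * sd_y)

# Hours asleep and hours awake always add up to 24, so r is (very nearly) -1.
hours_asleep = [6, 7, 8, 9]
hours_awake = [24 - h for h in hours_asleep]
r_sleep = pearson_r(hours_asleep, hours_awake)

# Hypothetical noisy data: r is about 0.8.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r_xy = pearson_r(x, y)

# Changing the units of y (multiplying by 100) leaves r unchanged.
r_xy_rescaled = pearson_r(x, [100 * value for value in y])

print(r_sleep, r_xy, r_xy_rescaled)
```

Because the standard deviations carry the same units as the data, dividing by them cancels the units out, which is why r stays the same after rescaling.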

💡R-squared (R^2)

R-squared (R^2) is a statistical measure that represents the proportion of the variance for a dependent variable that's explained by an independent variable in a regression model. The video uses R^2 to explain how well one variable can predict another, with a value of 1 indicating perfect prediction. It is a key metric for evaluating the goodness of fit in regression analysis.
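
A minimal Python sketch of this idea, assuming a single predictor and a least-squares line: R^2 is computed as 1 minus the ratio of leftover (residual) variation to total variation, which in this simple case equals r squared. The function name `r_squared` and the noisy dataset are invented for illustration; the Celsius to Fahrenheit conversion is the video's own example of perfect prediction.

```python
def r_squared(xs, ys):
    """R^2 of the least-squares line predicting ys from xs: 1 - SS_residual / SS_total."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - m * mean_x
    ss_residual = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_residual / ss_total

# Fahrenheit is an exact linear function of Celsius, so R^2 is (very nearly) 1.
celsius = [-10, 0, 10, 20, 30]
fahrenheit = [c * 9 / 5 + 32 for c in celsius]
r2_temperature = r_squared(celsius, fahrenheit)

# Hypothetical noisy data with r of about 0.8, so R^2 is about 0.64.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
r2_noisy = r_squared(x, y)

print(r2_temperature, r2_noisy)
```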

💡Causation

Causation refers to a cause-and-effect relationship between events or variables. The video warns against equating correlation with causation, emphasizing that just because two variables are correlated does not mean one causes the other. This is a critical distinction in statistical analysis, as it prevents incorrect assumptions and conclusions about the nature of relationships.

💡Spurious Correlations

Spurious correlations are correlations that appear to exist between two variables but are actually the result of coincidence or of a third variable that influences both. The video uses the example of a correlation between air conditioning sales and drownings to illustrate how such correlations can mislead if not interpreted carefully. Understanding and identifying spurious correlations is important for accurate data analysis.

Highlights

Introduction to data relationships and their predictive power.

Explanation of bivariate data and its visualization through scatter plots.

Scatter plots described as a versatile tool in statistical graphics.

Example of using a scatter plot to analyze Old Faithful eruption data.

Identification of clusters in scatter plots indicating potential relationships.

Discussion on linear and nonlinear relationships in data.

Historical context: Karl Pearson's paper on the relationship between fathers' and sons' heights.

Introduction of the regression line and its formula y = mx + b.

Explanation of the slope (m) in a regression line and its significance.

The utility of regression lines in predicting one variable from another.

Caution about the dependency of the slope on the units of measurement.

Introduction to correlation as a measure of the strength and direction of relationships.

Description of positive and negative correlations using scatter plot examples.

Detailed look at how positive and negative correlations manifest in scatter plots.

Explanation of the correlation coefficient (r) and its interpretation.

The concept of r^2 as a measure of predictive accuracy.

Humorous example of spurious correlations with air conditioning and Nicolas Cage movies.

Emphasis on the difference between correlation and causation.

Discussion on the potential reasons behind observed correlations.

Warning against relying solely on r and R^2 without visual inspection of data.

Conclusion on the importance of understanding data relationships for prediction and reflection.

Transcripts

play00:02

Hi, I’m Adriene Hill, and welcome back to Crash Course Statistics. Today we’re talking

play00:07

about relationships. No, not why you and your bestie are platonic soulmates, or why your

play00:08

cat just doesn’t seem to like you, we’re talking about data relationships like how

play00:12

you can use one variable to predict another.

play00:15

Like if you can predict whether people who write in all capital letters are more likely

play00:19

to default on loans. Whether people drive faster after they watch Fast & Furious movies.

play00:24

Or whether people blink more often when they're lying.

play00:27

INTRO

play00:37

We’ll start with the simplest data relationship, one between two continuous variables, also

play00:41

called bivariate data. But first, we’re going to need to visualize our data using

play00:45

a scatter plot. The scatter plot has been called “the most

play00:48

versatile, polymorphic, and generally useful invention in the history of statistical graphics.”

play00:54

Impressive....And...as such...they are pretty much everywhere.

play00:56

Including on your favorite news site...News outlets now have data journalists on staff

play01:00

to visualize and make sense of data.

play01:02

To make a scatterplot of Old Faithful eruption duration and latency--which is the time between

play01:07

eruptions-- we put one variable on the x-axis and the other on the y-axis. Then each data

play01:13

point is placed so that it’s in line with both its eruption duration and its

play01:17

latency.

play01:18

Now we can see a relationship. There are clusters, two blob-y looking groups of points, which

play01:23

supports our guess that there are likely two kinds of eruptions, one with a longer build

play01:27

up and longer duration, and one with a shorter build up and shorter duration.

play01:32

Just like the histogram and dot plot, a scatter plot allows us to see the shape and spread

play01:36

of data--but now in two dimensions! This data is clustered, but scatter plots are useful

play01:42

for identifying all kinds of relationships, both linear and nonlinear.

play01:46

For now, let’s focus on linear relationships with a classic example--the relationship between

play01:51

the heights of fathers and sons. It makes sense that a tall father would produce a tall

play01:55

son, but we can do better than just a hand wave-y statement.

play01:59

In 1903, the statistician Karl Pearson published an influential paper--in his own journal,

play02:04

Biometrika. One section of the paper describes the relationship between the heights of dads

play02:09

and their male children.

play02:10

In this paper, Pearson fit a line through the data to describe the relationship, rather

play02:14

than just relying on his eyes to see a pattern. The line--called a regression line--is the

play02:20

line as close as possible to all the points at the same time.

play02:23

And note here Pearson used feet and inches in his paper so we will too.

play02:28

Lines are a great way to describe a relationship because they have a nice formula, y = mx + b

play02:34

just like you learned in algebra. The m --or slope--tells you a lot about your data.

play02:39

It tells you that an increase in 1 inch of a father’s height, leads to an increase

play02:43

of m in the son’s height (about half an inch in Pearson’s paper).

play02:47

So on average dads who are 6’1” tall have sons that are about half an inch taller than

play02:51

the sons of fathers who are 6 feet tall. That allowed Pearson to make a prediction about

play02:56

the height of the son from the height of the father.

play02:58

And this is why these lines are so useful; they allow us to pretty accurately predict

play03:03

one variable based on the value of another. The relationship between car weight and gas

play03:07

efficiency allows us to be pretty sure a SMART car gets better mileage than a Hummer.

play03:13

One note of caution: the slope relies heavily on the units of x and y since it’s a measure

play03:18

of how many units y increases with each increase of 1 unit in x. If I decided to measure the

play03:25

Son’s height in meters, the m...or slope... will change, even though the relationship

play03:30

didn’t.
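
A minimal Python sketch of the points above, using made-up heights rather than Pearson's data. It fits y = mx + b by ordinary least squares, which is assumed here as the usual way of making a line "as close as possible" to the points, and then refits with the sons' heights converted to meters. The function name `fit_line` and all numbers are illustrative.

```python
def fit_line(xs, ys):
    """Return slope m and intercept b of the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

# Hypothetical heights in inches (illustrative only, not Pearson's data).
fathers_in = [64, 66, 68, 70, 72, 74]
sons_in = [66, 67, 68, 69, 70, 71]

m_in, b_in = fit_line(fathers_in, sons_in)

# Same sons, now measured in meters (1 inch = 0.0254 m).
sons_m = [height * 0.0254 for height in sons_in]
m_m, b_m = fit_line(fathers_in, sons_m)

print(f"slope, inches of son per inch of father: {m_in:.3f}")
print(f"slope, meters of son per inch of father: {m_m:.4f}")
```

With these numbers the first slope is about half an inch per inch, and the second is the same slope multiplied by 0.0254: the data and the relationship are identical, only the units of y changed.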

play03:31

When we see a non-zero slope--also called a regression coefficient--it’s a sign that

play03:35

there’s some kind of relationship between our two variables, but that’s pretty much

play03:39

all it tells us. We don’t know how strong that relationship is. For more information,

play03:44

we need to look at correlation.

play03:45

Correlation measures the way two variables move together, both the direction and closeness

play03:50

of their movement. You may have read articles that claim there’s

play03:53

a positive correlation between exercise and heart health. That just means if you exercise

play03:58

more, your heart tends to be healthier. A positive correlation looks something like

play04:03

this on a scatter plot:

play04:04

While a negative one, like the correlation between number of cigarettes smoked each day

play04:08

and lung health, might look like this. Higher values of cigarettes smoked tend to have lower

play04:14

values for lung health:

play04:15

We now know what correlations look like in general, but to understand them more deeply,

play04:19

we’re going to take a closer look.

play04:21

If two variables have a positive correlation, they move in the same direction. We can see

play04:25

this in our scatter plot if we draw two lines across the graph--one at the mean of each

play04:30

of our variables--to divide the plot into four quadrants.

play04:34

When two values are positively correlated--like how many miles you run and the number of calories

play04:38

you burn--most of the points will be in the upper right and lower left quadrants. In these

play04:44

quadrants, the values for miles and calories burned are either both large, or both small.

play04:48

The more miles you run, the more calories you burn.
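
The quadrant picture can be sketched in a few lines of Python. The miles and calories figures below are hypothetical, and `quadrant_counts` is just an illustrative helper that splits the plot at the mean of each variable and counts the points in each quadrant.

```python
def quadrant_counts(xs, ys):
    """Count points in the four quadrants formed by lines at the mean of x and the mean of y."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    counts = {"upper_right": 0, "upper_left": 0, "lower_left": 0, "lower_right": 0}
    for x, y in zip(xs, ys):
        # Points exactly on a dividing line are counted on the upper / right side.
        if x >= mean_x and y >= mean_y:
            counts["upper_right"] += 1
        elif x < mean_x and y >= mean_y:
            counts["upper_left"] += 1
        elif x < mean_x and y < mean_y:
            counts["lower_left"] += 1
        else:
            counts["lower_right"] += 1
    return counts

# Hypothetical runs: miles run and calories burned.
miles = [1, 2, 3, 4, 5, 6]
calories = [110, 190, 320, 390, 520, 600]

counts = quadrant_counts(miles, calories)
print(counts)
```

For a positively correlated pair like this one, the points pile up in the upper right and lower left quadrants; for a negative correlation the other two quadrants would fill up instead.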

play04:51

The opposite happens when the correlation is negative, like the relationship between

play04:55

vaccination rates and the rates of preventable illnesses. Instead of moving together, the

play05:00

variables move in the opposite direction.

play05:02

So, the points are mostly in the upper left and lower right quadrants where either vaccination

play05:06

rate is small and rate of illness is large, or vice versa. Since vaccination rate and

play05:12

rate of preventable illness have a negative correlation, as vaccination rates increase,

play05:18

rates of preventable illness decrease.

play05:20

The more closely two variables move together... the stronger the relationship will be, positive

play05:25

or negative.

play05:26

If the points are in all of the quadrants pretty evenly, you just have a blob or a cloud.

play05:30

You don’t have a strong relationship.

play05:32

As I mentioned before, the units of your variables can affect the regression coefficient, and

play05:36

can also affect the calculation of our correlation. To get around that, we use the standard deviations

play05:41

to scale our correlation so that it is always between -1 and 1. This is our correlation

play05:47

coefficient, r.

play05:49

Interpreting r involves two things: the sign of the number...that is whether it’s positive

play05:52

or negative and how big the number is. The sign will tell you whether your two variables

play05:57

move together (positive r), or in opposite directions (negative r).

play06:01

A correlation of 1 or -1 would be a perfectly straight line, meaning you can exactly predict

play06:06

one value from the other. Say we looked at correlation of the number of hours you’re

play06:10

asleep vs. awake. If I know one of those values I can tell you exactly what the other one

play06:16

is. We all have only 24 hours a day, even Beyoncé.

play06:19

As you get closer and closer to a correlation of 0, the points are more and more spread

play06:24

out around our regression line, and eventually at 0, there’s no linear relationship at

play06:30

all...it’s just dots.

play06:31

When you look at a scatterplot, remember that you can’t deduce a correlation just by the

play06:35

steepness of the regression line. In our earlier father/son heights example we changed the

play06:40

units to meters and our line didn’t look as steep, even though it’s the same data.

play06:45

Data with steep lines can have low or high correlations.

play06:49

We also use the squared correlation coefficient, r^2. R^2 is always between 0 and 1, and

play06:54

tells us--in decimal form--how much of the variance in one variable is predicted by the

play07:01

other. In other words, it tells us how well we can predict one variable if we know the other.

play07:05

While they won’t usually give R^2 an explicit mention, you’ll see articles claim things

play07:09

like “the ounces of soda a person drinks is highly predictive of weight”, which means

play07:13

there’s a large R^2. You can think of R^2 as a measure of how accurate your guesses

play07:18

would be if you used your linear equation to predict one variable from another.

play07:23

If you have an R^2 of 0.7 for the cigarettes and lung health data that would mean cigarette

play07:28

usage predicts 70% of the variation in how healthy our lungs are. You could pretty accurately

play07:34

predict someone’s lung health if you knew how many cigarettes they smoked.

play07:37

An R^2 of 1 means you can perfectly predict one variable from the other since 100% of

play07:42

the variation in one variable is predicted by the other. This can seem pretty obvious when you think about conversion.

play07:47

Like temperature in Fahrenheit can be predicted by temperature in Celsius. In this case we’re

play07:52

not actually measuring the temperature in Fahrenheit, but it is perfectly predicted by

play07:56

Celsius. So in general, the higher the R^2, the better the fit.

play08:00

Crash Course World News

play08:03

Breaking news from city hall today! The mayor has announced a plan to cut down on the number

play08:09

of people who drown every year.

play08:11

Sources close to the Mayor tell us that he’s seen some very interesting correlations between

play08:16

drownings and air conditioning usage and drownings and Nicolas Cage movies. Or as I like to call

play08:23

it...air cons and Con Airs.

play08:27

Both are highly correlated with drownings. Here’s evidence.

play08:30

If we look at AC sales data over the past 10 years. And even more proof if we look at

play08:34

Nicolas Cage movies over the same time period. The Nic Cage data was provided to the city

play08:40

by Tyler Vigen.

play08:41

So as of today, our mayor has enacted the Cool-Cage act which will prohibit sale of

play08:46

air conditioners and create a Nicolas Cage task force who will do everything-- to prevent

play08:52

Nicolas Cage from starring in any movies. The Mayor assures us that because of the strong

play08:57

correlations she saw, as well as the strong will of our city, we will surely have next

play09:03

to no drownings this coming year.

play09:06

The Cool Cage act may seem silly, but we’re constantly flooded with messages--that equate

play09:10

correlation with causation. And as you’ve heard before: CORRELATION DOESN’T EQUAL

play09:15

CAUSATION.

play09:16

Just because two variables are related doesn’t mean that one variable causes the other. The

play09:21

examples the mayor uses are perfect examples of things that can go wrong when interpreting correlations.

play09:28

When one thing (A) is correlated with another (B), there’s a few possible reasons:

play09:32

A causes B. B causes A.

play09:34

There’s a third variable, C, that causes both A and B, even though A and B aren’t related.

play09:40

Or there’s no relationship at all; it’s just a coincidence.

play09:43

The correlation the Mayor saw between air conditioning and drownings is probably caused

play09:47

by a third, unmentioned variable: heat! When it’s hot people buy more air conditioners

play09:52

and go for a swim leading to a correlation even though there’s no direct link between

play09:56

the two.
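
The third-variable case can be sketched with a small simulation in Python. Everything here is synthetic and assumed for illustration: heat drives both an air conditioner sales figure and a drowning index, and the two never influence each other. The helper `residuals` removes the part of a variable that a least-squares line on heat can account for, which is one simple way to hold heat fixed.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def residuals(xs, ys):
    """What is left of ys after subtracting the least-squares line fitted on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    m = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    b = mean_y - m * mean_x
    return [y - (m * x + b) for x, y in zip(xs, ys)]

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Synthetic data: heat (C) causes both A and B; A and B do not affect each other.
heat = [rng.uniform(60, 100) for _ in range(2000)]
ac_sales = [2.0 * h + rng.gauss(0, 10) for h in heat]
drownings = [0.5 * h + rng.gauss(0, 3) for h in heat]

# Raw correlation between A and B is strong...
r_raw = pearson_r(ac_sales, drownings)

# ...but it largely disappears once the effect of heat is removed from both.
r_given_heat = pearson_r(residuals(heat, ac_sales), residuals(heat, drownings))

print(f"r(AC sales, drownings): {r_raw:.2f}")
print(f"r after removing heat:  {r_given_heat:.2f}")
```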

play09:57

And as for Nicolas Cage, he probably shouldn’t feel too guilty about causing world-wide drownings.

play10:01

Sometimes two completely unrelated things are correlated just by random chance, with

play10:06

no causal link at all.

play10:08

These correlations get called spurious correlations, and they can be hard to catch.

play10:12

But when the correlation is between two VERY specific things, like Nicolas Cage movies

play10:16

and all drownings in 3 feet of water when a dog was present you should be suspicious

play10:21

that someone tried every weird subset of data until they found a relationship.
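
To see why trying many subsets is risky, here is a small Python sketch with purely random, unrelated numbers (nothing here is real drowning or movie data). It compares one 10-point series against 1,000 other random series and reports the strongest correlation found by chance alone.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(1)  # fixed seed so the sketch is repeatable
years = 10

# One random "target" series, standing in for yearly drownings.
target = [rng.gauss(0, 1) for _ in range(years)]

# 1,000 random series that have nothing to do with the target.
candidates = [[rng.gauss(0, 1) for _ in range(years)] for _ in range(1000)]

# Search through all of them and keep the strongest correlation found.
best_abs_r = max(abs(pearson_r(series, target)) for series in candidates)

print(f"strongest |r| among 1000 unrelated series: {best_abs_r:.2f}")
```

With only ten data points per series, searching through enough candidates will almost always turn up at least one that looks strongly correlated, even though every series is pure noise.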

play10:24

Before we finish with correlation, I just want to warn you: r and R^2 aren’t everything;

play10:29

It’s important to look at a scatter plot of data when you can.

play10:32

These are the “Datasaurus Dozen”... these very different plots all have the same correlation,

play10:36

but we can see that the relationships are completely different.
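
A tiny Python sketch of the same idea, using two hand-made datasets rather than the actual Datasaurus Dozen data: one is a perfect curve and the other is four scattered corner points, yet both have a correlation coefficient of zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Dataset 1: y is completely determined by x (a parabola), but not linearly.
x_curve = [-2, -1, 0, 1, 2]
y_curve = [x * x for x in x_curve]

# Dataset 2: the four corners of a square, with no pattern at all.
x_square = [-1, -1, 1, 1]
y_square = [-1, 1, -1, 1]

r_curve = pearson_r(x_curve, y_curve)
r_square = pearson_r(x_square, y_square)

print(r_curve, r_square)
```

The number r only summarizes the linear part of a relationship, so a scatter plot is needed to tell these two situations apart.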

play10:40

Correlation is an important piece of the puzzle when you’re looking for a linear relationship

play10:44

between two variables. It goes above and beyond the Y=mX + b and gives us information about

play10:50

how well that line explains the data.

play10:53

Understanding the relationships between variables and events helps us predict what things are

play10:57

going to happen in the future, and also reflect on why things occurred in the past.

play11:01

A correlation could help you predict how much money you’ll make after years of working

play11:05

your way up as a lemonade salesperson. Or if watching that next Fast and Furious movie

play11:10

in the theater...might encourage people to speed. According to an analysis by a Harvard

play11:15

Medical School professor, Anupam Jena, those two things do look related.

play11:19

Relationships are important--the human kind and the data kind.

play11:22

Correlation allows us to better understand relationships between data.

play11:26

And maybe also the data of our relationships.

play11:28

Maybe you can find correlations between the amount of time you spend at work or school

play11:32

and how much affection your cat shows you. Mr. Fluffy misses you.

play11:37

Thanks for watching. I’ll see you next time.
