Correlation Doesn't Equal Causation: Crash Course Statistics #8
Summary
TLDR: In this Crash Course Statistics episode, Adriene Hill explores data relationships, focusing on how one variable can predict another. She introduces scatter plots as a tool to visualize these relationships, highlighting their versatility in identifying both linear and nonlinear connections. Hill discusses the significance of regression lines in describing relationships and introduces the concept of correlation, explaining how it measures the direction and strength of the relationship between two variables. The episode emphasizes that correlation does not imply causation, warning against the common mistake of equating the two. It concludes by stressing the importance of understanding data relationships for prediction and reflection.
Takeaways
- 📊 Scatter plots are essential tools for visualizing relationships between two continuous variables, allowing us to observe patterns and clusters in data.
- 🔍 The regression line, which Karl Pearson fit to father and son height data in his 1903 paper, describes the relationship between two variables as the line closest to all the data points.
- ↗️ The slope (m) of the regression line indicates the change in the dependent variable (y) for each unit increase in the independent variable (x), providing a predictive measure.
- 🔢 The correlation coefficient (r) measures the strength and direction of the linear relationship between two variables, ranging from -1 (perfect negative correlation) to 1 (perfect positive correlation).
- 🔄 Positive correlation implies that as one variable increases, the other also tends to increase, while negative correlation suggests an inverse relationship.
- 📉 Correlation does not imply causation; just because two variables are correlated does not mean one causes the other to occur.
- 🌡️ The R-squared (r^2) value represents the proportion of the variance in the dependent variable that is predictable from the independent variable, with values ranging from 0 to 1.
- 🚫 It's crucial to be cautious of spurious correlations, which can occur by chance or due to a third, unobserved variable influencing both variables in question.
- 👀 Always examine scatter plots to understand the nature of the relationship between variables, as different datasets can have the same correlation coefficient but very different relationships.
- 🌟 Understanding correlations helps in making predictions and understanding past events, but it's important to consider the context and not jump to conclusions based solely on correlation.
Q & A
What is the main topic of discussion in this Crash Course Statistics video?
-The main topic of discussion is data relationships, specifically how one variable can be used to predict another, and the use of scatter plots to visualize these relationships.
What is a scatter plot and why is it useful in statistics?
-A scatter plot is a type of plot that displays data points on horizontal and vertical axes, typically with one variable on the x-axis and another on the y-axis. It is useful in statistics for visualizing the relationship between two continuous variables and identifying patterns, clusters, or trends.
What does the term 'bivariate data' refer to in the context of this video?
-Bivariate data refers to data that involves relationships between two continuous variables, which is the simplest form of data relationship discussed in the video.
How does the video illustrate the concept of linear relationships using the example of father and son heights?
-The video uses the example of father and son heights to illustrate linear relationships by fitting a regression line through the data points, which represents the average change in son's height for each unit increase in father's height.
What is the significance of the regression line in the context of this video?
-The regression line is significant because it is the line that is as close as possible to all the data points, representing the best fit for the data. It helps in predicting one variable based on the value of another and understanding the strength and direction of the relationship between variables.
What is the formula for a line in the context of a regression line, and what do the variables represent?
-The formula for a line in the context of a regression line is y = mx + b, where 'm' represents the slope (the change in y for each unit change in x), 'x' is the independent variable, 'y' is the dependent variable, and 'b' is the y-intercept (the value of y when x is 0).
How does the video explain the concept of correlation and its importance?
-The video explains that correlation measures the way two variables move together, indicating both the direction and the strength of their relationship. It is important because it provides insight into the nature of the relationship between variables, whether they move in the same direction (positive correlation) or opposite directions (negative correlation).
What is the correlation coefficient 'r' and what does it represent?
-The correlation coefficient 'r' is a measure that quantifies the strength and direction of the linear relationship between two variables. It ranges from -1 to 1, where 1 indicates a perfect positive correlation, -1 indicates a perfect negative correlation, and 0 indicates no linear correlation.
What is the difference between correlation and causation as discussed in the video?
-The video emphasizes that correlation does not equal causation. While correlation indicates a relationship between two variables, causation implies that one variable causes the other to change. The video warns against interpreting correlations as causations without considering other factors.
How does the video use the 'Cool-Cage Act' as an example to illustrate the misunderstanding of correlation?
-The 'Cool-Cage Act' is a satirical example used in the video to show how correlations can be misinterpreted. The act proposes to reduce drownings by banning air conditioner sales and preventing Nicolas Cage from starring in movies, based on the observed correlations with drownings. The video points out that the air conditioning correlation is probably due to a third variable (heat), while the Nicolas Cage correlation is most likely a coincidence, so neither implies causation.
What is the significance of R^2 in the context of this video, and how does it relate to prediction?
-R^2, or the coefficient of determination, represents the proportion of the variance in one variable that is predictable from the other variable. It is always between 0 and 1, and a higher R^2 value indicates a better fit of the model and a more accurate prediction of one variable based on the other.
Outlines
📊 Introduction to Data Relationships
Adriene Hill introduces the topic of data relationships in statistics, focusing on how one variable can be used to predict another. The discussion begins with bivariate data, where two continuous variables are analyzed. The use of scatter plots is emphasized as a versatile tool for visualizing data relationships, allowing for the identification of both linear and nonlinear relationships. The example of Old Faithful's eruptions is used to illustrate how scatter plots can reveal patterns, such as the potential existence of two types of eruptions with different durations and latencies.
🔍 Exploring Linear Relationships and Regression
The paragraph delves into linear relationships, using the example of the heights of fathers and their sons to demonstrate how data can be analyzed beyond mere observation. Statistician Karl Pearson's 1903 paper is highlighted, which introduced the concept of fitting a regression line to data points, allowing for more accurate predictions than subjective assessments. The regression line's formula, y = mx + b, is explained, with 'm' representing the slope that indicates the change in the son's height for each inch increase in the father's height. The importance of considering the units of measurement when interpreting the slope is also discussed.
🔗 Understanding Correlation and Its Limitations
This section explains the concept of correlation, which measures the direction and strength of the relationship between two variables. Positive and negative correlations are illustrated with examples, and the significance of the correlation coefficient 'r' is discussed, which ranges from -1 to 1. The paragraph also introduces 'r^2', which represents the proportion of variance in one variable that can be predicted by the other. The Mayor's misguided attempt to reduce drownings by correlating them with air conditioning usage and Nicolas Cage movies humorously illustrates the critical distinction between correlation and causation, emphasizing the importance of not equating correlation with causation.
🧐 The Importance of Visualizing Data with Scatter Plots
The final paragraph warns against relying solely on correlation coefficients like 'r' and 'R^2' without visualizing data through scatter plots. It presents the 'Datasaurus Dozen,' a set of scatter plots with the same correlation but different relationships, underscoring the need for visual analysis. The paragraph concludes by emphasizing the importance of understanding data relationships for predicting future events and reflecting on past occurrences, suggesting that correlation can be a tool for understanding various aspects of life, including personal relationships.
Keywords
💡Relationships
💡Bivariate Data
💡Scatter Plot
💡Regression Line
💡Slope (m)
💡Correlation
💡Correlation Coefficient (r)
💡R-squared (R^2)
💡Causation
💡Spurious Correlations
Highlights
Introduction to data relationships and their predictive power.
Explanation of bivariate data and its visualization through scatter plots.
Scatter plots described as a versatile tool in statistical graphics.
Example of using a scatter plot to analyze Old Faithful eruption data.
Identification of clusters in scatter plots indicating potential relationships.
Discussion on linear and nonlinear relationships in data.
Historical context: Karl Pearson's paper on the relationship between fathers' and sons' heights.
Introduction of the regression line and its formula y = mx + b.
Explanation of the slope (m) in a regression line and its significance.
The utility of regression lines in predicting one variable from another.
Caution about the dependency of the slope on the units of measurement.
Introduction to correlation as a measure of the strength and direction of relationships.
Description of positive and negative correlations using scatter plot examples.
Detailed look at how positive and negative correlations manifest in scatter plots.
Explanation of the correlation coefficient (r) and its interpretation.
The concept of r^2 as a measure of predictive accuracy.
Humorous example of spurious correlations with air conditioning and Nicolas Cage movies.
Emphasis on the difference between correlation and causation.
Discussion on the potential reasons behind observed correlations.
Warning against relying solely on r and R^2 without visual inspection of data.
Conclusion on the importance of understanding data relationships for prediction and reflection.
Transcripts
Hi, I’m Adriene Hill, and welcome back to Crash Course Statistics. Today we’re talking
about relationships. No, not why you and your bestie are platonic soulmates, or why your
cat just doesn’t seem to like you, we’re talking about data relationships like how
you can use one variable to predict another.
Like if you can predict whether people who write in all capital letters are more likely
to default on loans. Whether people drive faster after they watch Fast & Furious movies.
Or whether people blink more often when they're lying.
INTRO
We’ll start with the simplest data relationship, one between two continuous variables, also
called bivariate data. But first, we’re going to need to visualize our data using
a scatter plot. The scatter plot has been called “the most
versatile, polymorphic, and generally useful invention in the history of statistical graphics.”
Impressive....And...as such...they are pretty much everywhere.
Including on your favorite news site...News outlets now have data journalists on staff
to visualize and make sense of data.
To make a scatterplot of Old Faithful eruption duration and latency--which is the time between
eruptions-- we put one variable on the x-axis and the other on the y-axis. Then each data
point is placed so that it’s in line with both its eruption duration and its
latency.
Now we can see a relationship. There are clusters, two blob-y looking groups of points, which
supports our guess that there are likely two kinds of eruptions, one with a longer build
up and longer duration, and one with a shorter build up and shorter duration.
Just like the histogram and dot plot, a scatter plot allows us to see the shape and spread
of data--but now in two dimensions! This data is clustered, but scatter plots are useful
for identifying all kinds of relationships, both linear and nonlinear.
For now, let’s focus on linear relationships with a classic example--the relationship between
the heights of fathers and sons. It makes sense that a tall father would produce a tall
son, but we can do better than just a hand wave-y statement.
In 1903, the statistician Karl Pearson published an influential paper--in his own journal,
Biometrika. One section of the paper describes the relationship between the heights of dads
and their male children.
In this paper, Pearson fit a line through the data to describe the relationship, rather
than just relying on his eyes to see a pattern. The line--called a regression line--is the
line as close as possible to all the points at the same time.
And note here Pearson used feet and inches in his paper so we will too.
Lines are a great way to describe a relationship because they have a nice formula, y = mx + b
just like you learned in algebra. The m --or slope--tells you a lot about your data.
It tells you that an increase of 1 inch in a father’s height leads to an average increase
of m in the son’s height (about half an inch in Pearson’s paper).
So on average dads who are 6’1 tall have sons that are about half an inch taller than
the sons of fathers who are 6 feet tall. That allowed Pearson to make a prediction about
the height of the son from the height of the father.
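To make the idea of a line that is "as close as possible to all the points" concrete, here is a minimal Python sketch using ordinary least squares. The transcript does not say which criterion Pearson used, so least squares is an assumption here; the heights are made up for illustration, and `fit_line` is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Return (m, b) for the least-squares line y = m*x + b."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sum of squared x-deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b


# Made-up heights in inches (illustrative only, not Pearson's data).
fathers = [65, 67, 68, 70, 72, 74]
sons = [67.5, 67.5, 69, 69, 71, 71.5]

m, b = fit_line(fathers, sons)

# Predicted son's height for a 6'0" (72 in) and a 6'1" (73 in) father.
pred_72 = m * 72 + b
pred_73 = m * 73 + b
print(f"slope = {m:.3f}, intercept = {b:.2f}")
print(f"predicted difference for one extra inch = {pred_73 - pred_72:.3f}")
```

With these made-up numbers the slope comes out a little under half an inch per inch, and the predicted gap between the two sons is exactly that slope.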
And this is why these lines are so useful; they allow us to pretty accurately predict
one variable based on the value of another. The relationship between car weight and gas
efficiency allows us to be pretty sure a SMART car gets better mileage than a Hummer.
One note of caution: the slope relies heavily on the units of x and y since it’s a measure
of how many units y increases with each increase of 1 unit in x. If I decided to measure the
Son’s height in meters, the m...or slope... will change, even though the relationship
didn’t.
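A quick sketch of that caution, assuming a least-squares slope and made-up heights: converting the sons' heights from inches to meters rescales the slope by the same factor, even though the data are the same. The helper `slope` is a hypothetical name.

```python
def slope(xs, ys):
    """Least-squares slope of ys on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


INCHES_TO_METERS = 0.0254

fathers_in = [65, 67, 68, 70, 72, 74]      # made-up heights, inches
sons_in = [67.5, 67.5, 69, 69, 71, 71.5]   # made-up heights, inches
sons_m = [h * INCHES_TO_METERS for h in sons_in]

slope_in = slope(fathers_in, sons_in)  # inches of son per inch of father
slope_m = slope(fathers_in, sons_m)    # meters of son per inch of father

print(f"slope with sons in inches: {slope_in:.4f}")
print(f"slope with sons in meters: {slope_m:.4f}")
```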
When we see a non-zero slope--also called a regression coefficient--it’s a sign that
there’s some kind of relationship between our two variables, but that’s pretty much
all it tells us. We don’t know how strong that relationship is. For more information,
we need to look at correlation.
Correlation measures the way two variables move together, both the direction and closeness
of their movement. You may have read articles claiming that there’s
a positive correlation between exercise and heart health. That just means if you exercise
more, your heart tends to be healthier. A positive correlation looks something like
this on a scatter plot:
While a negative one, like the correlation between number of cigarettes smoked each day
and lung health, might look like this. Higher values of cigarettes smoked tend to have lower
values for lung health:
We now know what correlations look like in general, but to understand them more deeply,
we’re going to take a closer look.
If two variables have a positive correlation, they move in the same direction. We can see
this in our scatter plot if we draw two lines across the graph--one at the mean of each
of our variables--to divide the plot into four quadrants.
When two values are positively correlated--like how many miles you run and the number of calories
you burn--most of the points will be in the upper right and lower left quadrants. In these
quadrants, the values for miles and calories burned are either both large, or both small.
The more miles you run, the more calories you burn.
The opposite happens when the correlation is negative, like the relationship between
vaccination rates and the rates of preventable illnesses. Instead of moving together, the
variables move in the opposite direction.
So, the points are mostly in the upper left and lower right quadrants where either vaccination
rate is small and rate of illness is large, or vice versa. Since vaccination rate and
rate of preventable illness have a negative correlation, as vaccination rates increase,
rates of preventable illness decrease.
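The quadrant picture can be sketched in a few lines of Python. The numbers below are invented for illustration, and `quadrant_counts` is a hypothetical helper; points that fall exactly on a mean line are counted as upper or right here, which is an arbitrary choice.

```python
def quadrant_counts(xs, ys):
    """Count points in each quadrant formed by lines at the two means."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    counts = {"upper_right": 0, "upper_left": 0, "lower_left": 0, "lower_right": 0}
    for x, y in zip(xs, ys):
        vertical = "upper" if y >= mean_y else "lower"
        horizontal = "right" if x >= mean_x else "left"
        counts[f"{vertical}_{horizontal}"] += 1
    return counts


# Made-up positive relationship: miles run vs. calories burned.
miles = [1, 2, 3, 4, 5, 6]
calories = [110, 190, 320, 390, 520, 600]
positive = quadrant_counts(miles, calories)

# Made-up negative relationship: vaccination rate vs. illness rate.
vaccination = [50, 60, 70, 80, 90, 95]
illness = [40, 35, 22, 15, 8, 5]
negative = quadrant_counts(vaccination, illness)

print(positive)
print(negative)
```

In the first dataset every point lands in the upper right or lower left; in the second, every point lands in the upper left or lower right.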
The more closely two variables move together... the stronger the relationship will be, positive
or negative.
If the points are in all of the quadrants pretty evenly, you just have a blob or a cloud.
You don’t have a strong relationship.
As I mentioned before, the units of your variables can affect the regression coefficient, and
can also affect the calculation of our correlation. To get around that, we use the standard deviations
to scale our correlation so that it is always between -1 and 1. This is our correlation
coefficient, r.
Interpreting r involves two things: the sign of the number...that is whether it’s positive
or negative and how big the number is. The sign will tell you whether your two variables
move together (positive r), or in opposite directions (negative r).
A correlation of 1 or -1 would be a perfectly straight line, meaning you can exactly predict
one value from the other. Say we looked at correlation of the number of hours you’re
asleep vs. awake. If I know one of those values I can tell you exactly what the other one
is. We all have only 24 hours a day, even Beyoncé.
As you get closer and closer to a correlation of 0, the points are more and more spread
out around our regression line, and eventually at 0, there’s no linear relationship at
all...it’s just dots.
When you look at a scatterplot, remember that you can’t deduce a correlation just by the
steepness of the regression line. In our earlier father/son heights example we changed the
units to meters and our line didn’t look as steep, even though it’s the same data.
Data with steep lines can have low or high correlations.
We also use the squared correlation coefficient r^2 .... R^2 is always between 0 and 1, and
tells us--in decimal form--how much of the variance in one variable is predicted by the
other. In other words, it tells us how well we can predict one variable if we know the other.
While they won’t usually give R^2 an explicit mention, you’ll see articles claim things
like “ the ounces of soda a person drinks is highly predictive of weight”, which means
there’s a large R^2. You can think of R^2 as a measure of how accurate your guesses
would be if you used your linear equation to predict one variable from another.
If you have an R^2 of 0.7 for the cigarettes and lung health data that would mean cigarette
usage predicts 70% of the variation in how healthy our lungs are. You could pretty accurately
predict someone’s lung health if you knew how many cigarettes they smoked.
An R^2 of 1 means you can perfectly predict one variable from the other, since 100% of
the variation in one variable is predicted by the other. This can seem pretty obvious when you think about conversion.
Like temperature in Fahrenheit can be predicted by temperature in Celsius. In this case we’re
not actually measuring the temperature in Fahrenheit, but it is perfectly predicted by
Celsius. So in general, the higher the R^2, the better the fit.
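Here is a small sketch of R^2 as "share of variation predicted", computed as one minus the leftover (residual) variation around a least-squares line divided by the total variation. The Celsius readings and the cigarette and lung-health numbers are invented, and `r_squared` is a hypothetical helper.

```python
def r_squared(xs, ys):
    """R^2 of the least-squares line predicting ys from xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    # Variation left over after using the line, vs. total variation in ys.
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Unit conversion: Fahrenheit is an exact linear function of Celsius.
celsius = [-10, 0, 10, 20, 30, 40]
fahrenheit = [c * 9 / 5 + 32 for c in celsius]
r2_temperature = r_squared(celsius, fahrenheit)

# Made-up noisy data: cigarettes per day vs. a lung health score.
cigarettes = [0, 5, 10, 15, 20, 25]
lung_health = [95, 90, 70, 75, 50, 55]
r2_lungs = r_squared(cigarettes, lung_health)

print(f"R^2 for Celsius -> Fahrenheit: {r2_temperature:.3f}")
print(f"R^2 for cigarettes -> lung health: {r2_lungs:.3f}")
```

The conversion case gives an R^2 of 1 (up to floating-point rounding), while the noisy made-up data give something below 1.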
Crash Course World News
Breaking news from city hall today! The mayor has announced a plan to cut down on the number
of people who drown every year.
Sources close to the Mayor tell us that he’s seen some very interesting correlations between
drownings and air conditioning usage and drownings and Nicolas Cage movies. Or as I like to call
it...air cons and Con Airs.
Both are highly correlated with drownings. Here’s evidence.
If we look at AC sales data over the past 10 years. And even more proof if we look at
Nicolas Cage movies over the same time period. The Nic Cage data was provided to the city
by Tyler Vigen.
So as of today, our mayor has enacted the Cool-Cage act which will prohibit sale of
air conditioners and create a Nicolas Cage task force who will do everything-- to prevent
Nicolas Cage from starring in any movies. The Mayor assures us that because of the strong
correlations she saw, as well as the strong will of our city, we will surely have next
to no drownings this coming year.
The Cool Cage act may seem silly, but we’re constantly flooded with messages--that equate
correlation with causation. And as you’ve heard before: CORRELATION DOESN’T EQUAL
CAUSATION.
Just because two variables are related doesn’t mean that one variable causes the other. The
examples the mayor uses are perfect examples of things that can go wrong when interpreting correlations.
When one thing (A) is correlated with another (B), there are a few possible reasons:
A causes B.
B causes A.
There’s a third variable, C, that causes both A and B, even though A and B aren’t related.
Or there’s no relationship at all. It’s just a coincidence.
The correlation the Mayor saw between air conditioning and drownings is probably caused
by a third, unmentioned variable: heat! When it’s hot people buy more air conditioners
and go for a swim leading to a correlation even though there’s no direct link between
the two.
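A small simulation can show how a third variable produces this kind of correlation. Everything below is invented: heat drives both simulated series, and neither affects the other. Removing the part of each series that heat explains (the residuals from a least-squares line on heat) is one simple way to check, assumed here for illustration.

```python
import math
import random


def correlation(xs, ys):
    """Correlation coefficient r."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def residuals(xs, ys):
    """What is left of ys after subtracting the least-squares line on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return [y - (m * x + b) for x, y in zip(xs, ys)]


rng = random.Random(0)  # fixed seed so the run is repeatable

# Simulated data: heat causes both series; they never influence each other.
heat = [rng.uniform(15, 40) for _ in range(500)]
ac_sales = [2.0 * t + rng.gauss(0, 3) for t in heat]
drownings = [0.5 * t + rng.gauss(0, 1) for t in heat]

raw_r = correlation(ac_sales, drownings)
adjusted_r = correlation(residuals(heat, ac_sales), residuals(heat, drownings))

print(f"r between AC sales and drownings:      {raw_r:.2f}")
print(f"r after removing what heat explains:   {adjusted_r:.2f}")
```

The raw correlation is strong, but once the shared dependence on heat is taken out, what remains is close to zero.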
And as for Nicolas Cage, he probably shouldn’t feel too guilty about causing world-wide drownings.
Sometimes two completely unrelated things are correlated just by random chance, with
no causal link at all.
These correlations get called spurious correlations, and they can be hard to catch.
But when the correlation is between two VERY specific things, like Nicolas Cage movies
and all drownings in 3 feet of water when a dog was present you should be suspicious
that someone tried every weird subset of data until they found a relationship.
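To see why trying every weird subset is risky, here is a rough simulation with purely random numbers: one 10-point "target" series is compared against 1,000 unrelated random series, and the strongest correlation found is reported. The series length and number of tries are arbitrary assumptions.

```python
import math
import random


def correlation(xs, ys):
    """Correlation coefficient r."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


rng = random.Random(1)  # fixed seed so the run is repeatable
YEARS = 10              # ten yearly values, like a 10-year comparison
TRIES = 1000            # number of unrelated series we "search" through

# Stand-in for yearly drownings: pure noise.
target = [rng.gauss(0, 1) for _ in range(YEARS)]

best = 0.0
for _ in range(TRIES):
    candidate = [rng.gauss(0, 1) for _ in range(YEARS)]  # unrelated noise
    best = max(best, abs(correlation(target, candidate)))

print(f"strongest |r| found among {TRIES} unrelated series: {best:.2f}")
```

None of these series has anything to do with the target, yet with only ten data points and enough tries, some of them line up well by chance.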
Before we finish with correlation, I just want to warn you: r and R^2 aren’t everything.
It’s important to look at a scatter plot of data when you can.
These are the “Datasaurus Dozen”... these very different plots all have the same correlation,
but we can see that the relationships are completely different.
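The same point can be checked numerically with Anscombe's quartet, an older and smaller set of four datasets built on the same idea. It is used here as a stand-in, not the Datasaurus Dozen data itself: all four have a correlation of about 0.82, yet a scatter plot of each looks completely different (a noisy line, a curve, a line with one outlier, and a vertical stack with one far-off point).

```python
import math


def correlation(xs, ys):
    """Correlation coefficient r."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Anscombe's quartet: datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

quartet = {"I": (x123, y1), "II": (x123, y2), "III": (x123, y3), "IV": (x4, y4)}
rs = {name: correlation(xs, ys) for name, (xs, ys) in quartet.items()}

for name, r in rs.items():
    print(f"dataset {name}: r = {r:.3f}")
```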
Correlation is an important piece of the puzzle when you’re looking for a linear relationship
between two variables. It goes above and beyond the Y=mX + b and gives us information about
how well that line explains the data.
Understanding the relationships between variables and events helps us predict what things are
going to happen in the future, and also reflect on why things occurred in the past.
A correlation could help you predict how much money you’ll make after years of working
your way up as a lemonade salesperson. Or if watching that next Fast and Furious movie
in the theater...might encourage people to speed. According to an analysis by Harvard
Medical School professor Anupam Jena, those two things do look related.
Relationships are important--the human kind and the data kind.
Correlation allows us to better understand relationships between data.
And maybe also the data of our relationships.
Maybe you can find correlations between the amount of time you spend at work or school
and with how much affection your cat shows you. Mr. Fluffy misses you.
Thanks for watching. I’ll see you next time.