MINI-LESSON 6: Fooled by Metrics (Correlation)
Summary
TL;DR: The video delves into the pitfalls of relying on metrics, highlighting two main issues: the randomness of metrics and the tendency to select the most favorable data. It uses the example of Pearson correlation to illustrate how, even with independent variables, correlation coefficients can deviate from zero, emphasizing the stochastic nature of such measures. The script further discusses how researchers might manipulate data to find the 'best' correlation, leading to misleading conclusions. It also touches on the problem of dimensionality: the number of possible correlations grows quadratically with the number of variables, exacerbating the risk of spurious correlations. The speaker critiques certain fields of research for their misuse of statistical methods, advocating for a more cautious interpretation of data.
Takeaways
- Metrics are random variables: they aren't fixed values and can vary widely from sample to sample.
- Correlations aren't always reliable: observing a correlation doesn't guarantee a true relationship.
- Metrics can be gamed: researchers can pick the highest correlation among many to fit their narrative.
- Example of randomness: correlating independent random variables yields a different coefficient each run, often deviating significantly from zero.
- Distribution matters: with more data points, the distribution of the sample correlation compresses, lowering its standard deviation.
- Misleading correlations: some studies show correlations between unrelated variables, like US spending on science and suicides.
- Choosing the best correlation: researchers can compute many correlations and report only the highest one.
- Importance of large sample sizes: more data points reduce the randomness and improve the reliability of correlations.
- Dimensionality problem: the number of pairwise correlations grows quadratically with the number of variables, increasing the chance of spurious findings.
- Spurious correlations: the sheer number of correlations in large datasets can lead to false interpretations if not carefully managed.
Q & A
What are the two main points discussed in the transcript about metrics?
-The two main points discussed are: 1) Metrics are random variables, and 2) Metrics can be gamed because they are random variables. People often take the upper bound of these metrics, leading to misleading conclusions.
Why does the speaker compare metrics to random variables?
-The speaker compares metrics to random variables to highlight that metrics are not fixed values but can vary based on different observations. This variability can lead to different outcomes when metrics like correlation are calculated, even if the underlying variables are independent.
How does the speaker illustrate the randomness of correlation?
-The speaker uses an example with two independent random variables, X and Y, and shows that calculating Pearson correlation multiple times results in different values each time. This illustrates that correlation is a stochastic variable and not a fixed measure.
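A minimal sketch of this experiment in Python (using NumPy rather than the Mathematica shown in the video; n = 18 matches the demo, and the seed is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_correlation(n, rng):
    """Pearson correlation of two independent standard-normal samples.

    The true correlation is exactly zero; the sample estimate is not.
    """
    x = rng.standard_normal(n)
    y = rng.standard_normal(n)
    return np.corrcoef(x, y)[0, 1]

# Each run gives a different number, often far from zero: the
# correlation coefficient is itself a random variable.
estimates = [sample_correlation(18, rng) for _ in range(5)]
```

Running this repeatedly reproduces the talk's point: values on the order of -0.36 or +0.36 appear even though x and y are independent.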
What does the speaker mean by 'metrics will be gamed'?
-The speaker means that researchers or analysts might selectively choose metrics or correlations that show the most favorable or extreme results, thereby misleadingly representing the data. This is possible because metrics, being random variables, can produce varying results, allowing for cherry-picking.
What is the significance of the 'law of large numbers' in the context of metrics?
-The 'law of large numbers' is mentioned to explain that as the sample size increases, the distribution of the correlation becomes more compressed, leading to a more stable and less variable estimate of the correlation. This helps in reducing the randomness but doesn't eliminate it completely.
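A quick Monte Carlo check of this compression effect (a sketch: 18 versus 100 observations as in the video; the trial count and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)

def corr_std(n, trials, rng):
    """Empirical standard deviation of the sample correlation
    between two independent standard-normal samples of size n."""
    rs = [np.corrcoef(rng.standard_normal(n),
                      rng.standard_normal(n))[0, 1] for _ in range(trials)]
    return float(np.std(rs))

sd_18 = corr_std(18, 5000, rng)    # roughly 1/sqrt(17), about 0.24
sd_100 = corr_std(100, 5000, rng)  # roughly 1/sqrt(99), about 0.10
# The spread shrinks like 1/sqrt(n): going from 18 to 100 observations
# (a factor of ~5) cuts the standard deviation by about sqrt(5).
```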
What example does the speaker give to show how researchers can misuse correlations?
-The speaker gives examples of absurd correlations, such as the relationship between U.S. spending on science and technology and the number of suicides by hanging, to illustrate how researchers can misuse correlations by selecting extreme or coincidental relationships that appear significant but are actually meaningless.
What is the speaker's criticism of certain fields like psychology and political science?
-The speaker criticizes these fields for relying on metrics and correlations without sufficient scrutiny, leading to findings that are often spurious or meaningless. The speaker argues that these fields produce data that are frequently unreliable because they do not account for the random nature of the metrics they use.
How does the speaker suggest testing the randomness of metrics?
-The speaker suggests using Monte Carlo simulations to replicate random datasets and observe the correlations or slopes that emerge. This method helps demonstrate how entirely random data can still produce seemingly significant correlations, underscoring the importance of skepticism in interpreting such metrics.
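The cherry-picking itself is easy to simulate (a sketch: 20 observations and 200 candidate variables are illustrative choices, not the video's exact setup):

```python
import numpy as np

rng = np.random.default_rng(1)

# One outcome y, many unrelated candidate predictors: report only the
# strongest correlation found. Everything here is pure noise.
n, candidates = 20, 200
y = rng.standard_normal(n)
best = max(abs(np.corrcoef(rng.standard_normal(n), y)[0, 1])
           for _ in range(candidates))
# `best` is typically sizable even though every true correlation is zero.
```

Reporting `best` without mentioning the other 199 attempts is exactly the "take the upper bound" game described above.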
What problem does the speaker highlight regarding the number of random variables in a dataset?
-The speaker highlights that as the number of random variables (p) increases, the number of possible correlations (which grows at a rate of p squared) increases significantly. This leads to a higher chance of finding spurious correlations, making it challenging to identify genuine relationships without a large enough sample size (n) to compensate for this complexity.
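The pair count is simple arithmetic; a sketch (the 30,000-security figure is the one quoted in the talk's finance example):

```python
# Number of distinct pairwise correlations among p variables:
# a p-by-p matrix, minus the p diagonal entries, halved for symmetry.
def n_correlations(p: int) -> int:
    return p * (p - 1) // 2

pairs = n_correlations(30_000)  # 449,985,000: the "half a billion" in finance
# Adding one more variable adds p new correlations, one per existing variable.
added = n_correlations(101) - n_correlations(100)
```

So the number of chances for a spurious hit grows quadratically in p, while spuriousness per correlation shrinks only like the square root of n.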
What advice does the speaker offer for interpreting correlations or slopes in research?
-The speaker advises that correlations and slopes should be interpreted with scrutiny. A small correlation or slope is much closer to zero than it might appear, and researchers should be cautious about drawing strong conclusions from these metrics without considering the potential for randomness and spuriousness.
Outlines
Metrics as Random Variables and the Pitfalls of Correlation
The first paragraph discusses how metrics are inherently random variables, making correlations observed between them similarly subject to randomness. The speaker emphasizes that metrics should not be perceived as fixed entities like 'tomatoes' but rather as stochastic elements whose results can vary. By using examples, the speaker shows how correlation values fluctuate due to their random nature. Even though the expected correlation between two independent variables is zero, actual samples often yield varying results due to inherent randomness. The law of large numbers is mentioned as a factor that reduces this variability with larger sample sizes, though the focus remains on the pitfalls of small sample correlations and the potential for these correlations to be misleading or gamed.
The Danger of Selecting the Best Correlation
The second paragraph highlights the issue of researchers gaming the system by selectively choosing the highest correlation from a set of correlations, a process that can be misleading. The speaker critiques the tendency of researchers to exploit the stochastic nature of correlations, emphasizing that this practice can lead to spurious or false conclusions. The example of unrelated data points, such as U.S. spending on science and technology and unrelated events like suicides or movie appearances, demonstrates how misleading correlations can arise. The speaker warns that, due to the random nature of correlation, the correlations researchers present are often not representative of true relationships.
Spurious Correlations and Dimensionality Problems in Research
The third paragraph delves into the problem of spurious correlations arising from high-dimensional data sets. The speaker explains that in fields with many variables, such as finance, the sheer number of potential correlations increases the likelihood of observing strong but meaningless correlations. This issue is exacerbated when adding more variables, which compounds the number of possible correlations and thus increases the risk of spurious findings. The speaker also criticizes the state of observational research, arguing that much of it is stale and meaningless due to the problems introduced by high dimensionality and insufficient data per variable. The discussion concludes with a caution about the biases present in research that stem from these statistical issues.
Keywords
Metrics
Random Variables
Correlation
Law of Large Numbers
Spurious Correlation
Dimensionality
Pearson Correlation
Gaming Metrics
Stochastic Variable
Monte Carlo Simulation
Highlights
Metrics are random variables and should be treated as such, not as fixed values.
Correlation between two variables is also a random variable, not a fixed number.
Even with independent, uncorrelated variables, running a Pearson correlation can yield non-zero results due to randomness.
The distribution of correlations around zero has significant variance, with about a 32% chance of seeing a correlation above 0.25 in absolute value.
Increasing the sample size compresses the distribution of correlation values, reducing the standard deviation.
Researchers can exploit randomness by selecting the highest correlation among many tests, leading to misleading results.
The relationship between unrelated variables, such as U.S. spending on science and the number of suicides, can appear significant due to random correlation.
The slope (beta) in linear regression, like correlation, should be interpreted with caution, as small slopes are often much closer to zero than they appear.
Many researchers, particularly in fields like psychology and political science, may not fully understand the limitations of using these metrics, leading to unreliable conclusions.
The problem of dimensionality: as the number of variables (p) increases, the number of correlations grows quadratically, which increases the likelihood of spurious correlations.
In observational research, adding more variables without sufficient data leads to spurious correlations, making results less reliable.
Spuriousness in correlation decreases with the square root of sample size (n), but increases dramatically with the number of variables (p).
The challenge in research is balancing the number of variables with sufficient data to avoid spurious correlations.
The presentation highlights the importance of understanding the stochastic nature of metrics and correlations in research to avoid false conclusions.
The discussion points out the need for more rigorous methods to tame spurious correlations in research, especially in fields prone to misuse of statistics.
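The "32% chance" highlight can be checked directly by simulation (a sketch; the trial count and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)

# With n = 18 observations of two independent normals, how often does
# the sample correlation exceed 0.25 in absolute value?
n, trials = 18, 20_000
hits = sum(abs(np.corrcoef(rng.standard_normal(n),
                           rng.standard_normal(n))[0, 1]) > 0.25
           for _ in range(trials))
frac = hits / trials  # comes out close to the ~32% quoted above
```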
Transcripts
Friends, hello again. This time we're going to discuss, very quickly, how people are fooled by metrics, based on two points. The first one is that metrics are random variables. When you observe a correlation, you think you're observing a tomato. No, it's a random variable; you have to look at that fact. The second point is that metrics will be gamed, precisely because they are random variables: people take the upper bound of the metric.

So let's see how. I take the simplest example: correlation. Let's say that I have x and y, two independent random variables, and uncorrelated as well (the difference, as we saw, matters for variables outside the Gaussian). I take x1, ..., xn and y1, ..., yn, so n samples of each, and I run the Pearson correlation between them. Am I going to get zero? Let's see.
Let's look at the behavior of the correlation. I pick a normal distribution: x, 18 normally distributed points (this uses Mathematica), and y, also randomly distributed. You run it, look at x, look at y, and the correlation is -0.36. It is supposed to be zero. Do it again: correlation -0.40. Or, look, +0.36. So every time you run a correlation you get a different number. But on average, what do we get? Look at this: I have a distribution around zero, with an expected mean of zero, and a variance I don't compute, but eyeballing it, maybe a standard deviation of a quarter. So it's quite significant. Notice that you have about 16 percent of correlations above a quarter and 16 percent below minus a quarter, so you have a 32% chance of getting a correlation higher than 0.25 in absolute value. That's significant, for a true correlation of zero.
OK, let's resume. Now, what happens if I increase from 18 points to about 100? Look at the distribution: it's more compressed, so it has a lower standard deviation. We're going from 18 to 100, a factor of roughly five, so one standard deviation is about 1 over the square root of 5 times the other, because the distribution compresses at the square root of n. As we saw, it's the workings of the law of large numbers, for variables that are Gaussian or similar to the Gaussian, falling in that thin-tailed, finite-variance, finite-all-moments domain, without any big tail effects.
This may be trivial. Now, what is less trivial is the fact that a researcher can go pick the best correlation. Let's look at what this fellow Tyler Vigen figured out. Look at the relationship here between US spending on science, space and technology and suicides by hanging, strangulation and suffocation (I don't know the difference between the last two; I'm sure it's significant for the matter here). Or the number of people who drowned by falling into a pool, and films in which Nicolas Cage has appeared.

So what is the story here? The story is that the researcher can choose between many correlations and take the upper bound. We can very simply calculate that upper bound; I mean, there is a way to figure out the distribution of that upper bound and derive it, and I'm going to do that on the board. But let me show you a few examples of aberrations that have happened in science based on the misunderstanding of that point.

There are two things; let me repeat. The first one, which we covered, is that correlation is a stochastic variable, drawn from an ensemble of correlations: the mean may be zero, but what you observe is different from zero. And two, someone is going to game it by picking the highest correlation among many. And do not say, "correlation, we know, is not causation." No, the point is not that correlation is not causation; the point is that very often correlation is not correlation. In other words, the correlation you see is not the true correlation.
so let's see some of the
games and i'll skip the the the
the usual iq studies to just mention the
national
iq here you can eyeball it and know it's
bogus
one because all you have to do is change
a point and
the the the slope will flip okay
it's done by fellow uh who's visibly
obsessed with
trying to further a unionist agenda
usually the people introducingism are
not very smart
um another one here again playing the
same game nicholas walmart
i showed in the previous video how
his associate tried to associate facial
features with some attributes uh
whatever the attributes is it doesn't
show from this
okay uh look at the uh
the the what you have here the
slopes that you is getting uh are quite
pitiful
Let's see how we can replicate it. The best way to figure out how something random can be gamed is to game it yourself using Monte Carlo. So let's see if I can replicate it: we have 20 points, and we look for the best slope. And look, here you have a 0.24 slope, a 0.39 slope, quite impressive, a -0.35, a 0.48 slope. A big association, and we're talking about variables that are entirely random. I generate it again, totally random, and it gives me the same story again. Here the best I got is a -0.54 slope. Totally random.
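The game described here is easy to reproduce (a sketch: n = 20 matches the demo, while the 100 repetitions and the seed are my own choices, not the exact setup on screen):

```python
import numpy as np

rng = np.random.default_rng(3)

def random_slope(n, rng):
    """Least-squares slope of y on x for two independent noise samples."""
    x = rng.standard_normal(n)
    y = rng.standard_normal(n)
    return float(np.polyfit(x, y, 1)[0])

# Fit 100 purely random 20-point datasets and keep the steepest slope.
best_slope = max((random_slope(20, rng) for _ in range(100)), key=abs)
# A "big association" appears despite zero true relationship.
```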
Another element you need to consider here is that, just as with correlation, the slope (in finance we call it beta, the relationship between x and y) must be interpreted with some scrutiny. In other words, just as a 10% correlation is much, much closer to zero than it is to 20%, based on entropy and informational differences between random variables, the same thing holds for the slope: a slope of 0.1 is much, much closer to zero than a slope of 0.2, and so on. And really, many researchers don't know it. As a matter of fact, I think most researchers in psychology, or in some fields that shouldn't exist, like political science, are using these metrics; these fields should not exist, and of course whatever data come out of them are patently garbage, as we are seeing.

Let's rapidly discuss the n versus p (sometimes people call it d, for dimensionality) problem: the dimensionality problem, and the aberrations of whatever metric you're using. The problem is as follows. In the real world we have a lot of variables. In finance we have 30,000 securities, which gives you a lot of correlations: approximately half of 30,000 squared, something like half a billion correlations. So you realize the odds are you're going to see something that is very high, pick it up, and believe in an association, randomly, even if each security, or each random variable, has a lot of observations, a lot of n. Because n, as we saw, helps you in a very, very slow way: as we said, the standard deviation decreases at the square root of n. But the dimensionality is a severe problem. Let's see how.

Let's say I have x1, x2, ..., xp; I'm doing a correlation matrix of p random variables. How many correlations am I likely to have? It's got to be about one half p(p - 1), that is, (p^2 - p)/2. Think about it: you have a p-by-p table with p^2 entries; you remove the diagonal and take half of the rest, for redundancy. It's a lot.
And consider that if I add one random variable, how many more correlations am I going to have? As many as all the preceding variables; every time you add a random variable you get a lot more correlations. This is why more and more research is stale: the observational research you're getting is meaningless, because people start getting great numbers for free. Of course there are supposedly some methods to tame it (Bonferroni, upper-bound stuff like that, more technical), but researchers will escape it.

So what's happening here? This is spuriousness; this is n. Spuriousness decreases as the square root of n. However, the number of correlations I would have in the matrix increases at a rate of p^2. I'm simplifying, of course; there are some subtleties with the covariance matrix, but think about it: it grows in a very complex way. Which means that, as the world has many, many variables, every time you add one you must have a lot of n to compensate for the spuriousness. That's the problem of having not enough data per variable, and too many random variables.
Hopefully this gives you an idea of the biases we see in research; there are of course many more. This is quite technical. I've done some work on the distribution of spurious correlations, spurious metrics, spurious variances, stuff like that, though that will be presented later. Have an excellent weekend, or an excellent day, or an excellent holiday, whatever it is, and we'll see you in the next session.