MINI-LESSON 6: Fooled by Metrics (Correlation)

N N Taleb's Probability Moocs
14 May 2021 13:46

Summary

TL;DR: The video script delves into the pitfalls of relying on metrics, highlighting two main issues: the randomness of metrics and the tendency to select the most favorable data. It uses the example of Pearson correlation to illustrate how, even with independent variables, sample correlation coefficients deviate from zero, emphasizing the stochastic nature of such measures. The script further discusses how researchers might pick the 'best' correlation among many, leading to misleading conclusions. It also touches on the problem of dimensionality, where the number of possible correlations grows quadratically with the number of variables, exacerbating the risk of spurious correlations. The speaker critiques certain fields of research for their misuse of statistical methods, advocating for a more cautious interpretation of data.

Takeaways

  • Metrics are random variables: They aren't fixed and can vary widely.
  • Correlations aren't always reliable: Observing a correlation doesn't guarantee a true relationship.
  • Metrics can be manipulated: Researchers can pick the highest correlation among many to fit their narrative.
  • Example of randomness: Using random variables can yield different correlations each time, often deviating significantly from zero.
  • Distribution matters: With more data points, the distribution of correlation compresses, lowering its standard deviation.
  • Misleading correlations: Some studies show correlations between unrelated variables, like US spending on science and suicides.
  • Choosing the best correlation: Researchers can exploit multiple correlations to find and report the highest one.
  • Importance of large sample sizes: More data points reduce the randomness and improve reliability of correlations.
  • Dimensionality problem: Increasing the number of variables leads to quadratically more correlations (p(p-1)/2 pairs for p variables), increasing the chance of spurious findings.
  • Spurious correlations: High numbers of correlations in large datasets can lead to false interpretations if not carefully managed.

Q & A

  • What are the two main points discussed in the transcript about metrics?

    -The two main points discussed are: 1) Metrics are random variables, and 2) Metrics can be gamed because they are random variables. People often take the upper bound of these metrics, leading to misleading conclusions.

  • Why does the speaker compare metrics to random variables?

    -The speaker compares metrics to random variables to highlight that metrics are not fixed values but can vary based on different observations. This variability can lead to different outcomes when metrics like correlation are calculated, even if the underlying variables are independent.

  • How does the speaker illustrate the randomness of correlation?

    -The speaker uses an example with two independent random variables, X and Y, and shows that calculating Pearson correlation multiple times results in different values each time. This illustrates that correlation is a stochastic variable and not a fixed measure.

  • What does the speaker mean by 'metrics will be gamed'?

    -The speaker means that researchers or analysts might selectively choose metrics or correlations that show the most favorable or extreme results, thereby misleadingly representing the data. This is possible because metrics, being random variables, can produce varying results, allowing for cherry-picking.

  • What is the significance of the 'law of large numbers' in the context of metrics?

    -The 'law of large numbers' is mentioned to explain that as the sample size increases, the distribution of the correlation becomes more compressed, leading to a more stable and less variable estimate of the correlation. This helps in reducing the randomness but doesn't eliminate it completely.

  • What example does the speaker give to show how researchers can misuse correlations?

    -The speaker gives examples of absurd correlations, such as the relationship between U.S. spending on science and technology and the number of suicides by hanging, to illustrate how researchers can misuse correlations by selecting extreme or coincidental relationships that appear significant but are actually meaningless.

  • What is the speaker's criticism of certain fields like psychology and political science?

    -The speaker criticizes these fields for relying on metrics and correlations without sufficient scrutiny, leading to findings that are often spurious or meaningless. The speaker argues that these fields produce data that are frequently unreliable because they do not account for the random nature of the metrics they use.

  • How does the speaker suggest testing the randomness of metrics?

    -The speaker suggests using Monte Carlo simulations to replicate random datasets and observe the correlations or slopes that emerge. This method helps demonstrate how entirely random data can still produce seemingly significant correlations, underscoring the importance of skepticism in interpreting such metrics.

  • What problem does the speaker highlight regarding the number of random variables in a dataset?

    -The speaker highlights that as the number of random variables (p) increases, the number of possible correlations (which grows at a rate of p squared) increases significantly. This leads to a higher chance of finding spurious correlations, making it challenging to identify genuine relationships without a large enough sample size (n) to compensate for this complexity.

  • What advice does the speaker offer for interpreting correlations or slopes in research?

    -The speaker advises that correlations and slopes should be interpreted with scrutiny. A small correlation or slope is much closer to zero than it might appear, and researchers should be cautious about drawing strong conclusions from these metrics without considering the potential for randomness and spuriousness.

Outlines

00:00

Metrics as Random Variables and the Pitfalls of Correlation

The first paragraph discusses how metrics are inherently random variables, making correlations observed between them similarly subject to randomness. The speaker emphasizes that metrics should not be perceived as fixed entities like 'tomatoes' but rather as stochastic elements whose results can vary. By using examples, the speaker shows how correlation values fluctuate due to their random nature. Even though the expected correlation between two independent variables is zero, actual samples often yield varying results due to inherent randomness. The law of large numbers is mentioned as a factor that reduces this variability with larger sample sizes, though the focus remains on the pitfalls of small sample correlations and the potential for these correlations to be misleading or gamed.
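As a worked illustration of how wide this randomness is, the following sketch uses a standard result for independent Gaussian variables (an assumption added here, not derived in the video): the sample correlation r from n paired observations has mean zero and variance 1/(n-1). The tail probability is a normal approximation, so the figures are rough.

```latex
\[
\mathbb{E}[r] = 0, \qquad \operatorname{Var}(r) = \frac{1}{n-1}
\]
\[
n = 18:\quad \sigma_r = \frac{1}{\sqrt{17}} \approx 0.243,
\qquad
P\bigl(|r| > 0.25\bigr) \approx 2\Bigl(1 - \Phi\bigl(\tfrac{0.25}{0.243}\bigr)\Bigr) \approx 0.30
\]
\[
n = 100:\quad \sigma_r = \frac{1}{\sqrt{99}} \approx 0.10
\]
```

These numbers sit close to what the speaker eyeballs from the histogram: a standard deviation of about a quarter with 18 points, and roughly a one-in-three chance of an absolute correlation above 0.25 even though the true correlation is zero.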

05:00

The Danger of Selecting the Best Correlation

The second paragraph highlights the issue of researchers gaming the system by selectively choosing the highest correlation from a set of correlations, a process that can be misleading. The speaker critiques the tendency of researchers to exploit the stochastic nature of correlations, emphasizing that this practice can lead to spurious or false conclusions. The example of unrelated data points, such as U.S. spending on science and technology and unrelated events like suicides or movie appearances, demonstrates how misleading correlations can arise. The speaker warns that, due to the random nature of correlation, the correlations researchers present are often not representative of true relationships.

10:01

Spurious Correlations and Dimensionality Problems in Research

The third paragraph delves into the problem of spurious correlations arising from high-dimensional data sets. The speaker explains that in fields with many variables, such as finance, the sheer number of potential correlations increases the likelihood of observing strong but meaningless correlations. This issue is exacerbated when adding more variables, which compounds the number of possible correlations and thus increases the risk of spurious findings. The speaker also criticizes the state of observational research, arguing that much of it is stale and meaningless due to the problems introduced by high dimensionality and insufficient data per variable. The discussion concludes with a caution about the biases present in research that stem from these statistical issues.
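The effect of adding variables can be sketched with a small simulation. This is a minimal illustration, not the speaker's code: it assumes independent standard-normal columns, 50 observations per variable, and hypothetical helper names (`pearson`, `max_abs_correlation`). It reports the largest absolute correlation found among the first p columns as p grows.

```python
import math
import random
from itertools import combinations


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def max_abs_correlation(columns):
    """Largest |r| over every pair of columns: what a cherry-picker would report."""
    return max(abs(pearson(a, b)) for a, b in combinations(columns, 2))


rng = random.Random(2)  # arbitrary seed, only for reproducibility
n_obs = 50              # observations per variable (assumed for illustration)
columns = [[rng.gauss(0.0, 1.0) for _ in range(n_obs)] for _ in range(40)]

# All columns are independent, so every true correlation is zero.
results = {}
for p in (5, 10, 20, 40):
    results[p] = max_abs_correlation(columns[:p])
    print(p, math.comb(p, 2), round(results[p], 2))
```

Because each larger set of columns contains the smaller one, the reported maximum can only stay the same or rise as p increases, which is the point about getting "great numbers for free".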

Keywords

Metrics

Metrics refer to quantitative measures used to assess, compare, and track performance or production. In the video, the speaker emphasizes that metrics are not static figures but random variables that can fluctuate and be manipulated. This concept is crucial as it underpins the video's main argument about the unreliability and potential misuse of metrics in research and data analysis.

Random Variables

A random variable is a variable whose values depend on outcomes of a random phenomenon. In the video, the speaker stresses that metrics are random variables, meaning they are subject to variability and randomness. This randomness is often misunderstood or ignored, leading to incorrect conclusions when analyzing metrics like correlation.

Correlation

Correlation is a statistical measure that describes the extent to which two variables change together. The video illustrates that correlation is a stochastic variable, and because of its random nature, it can yield misleading results if not properly interpreted. The speaker also critiques the common misuse of correlation in research, where correlations are often mistaken for causation or given more significance than they deserve.

Law of Large Numbers

The Law of Large Numbers is a principle of probability stating that as a sample size grows, the sample mean gets closer to the mean of the whole population. The video discusses how increasing the sample size reduces the variance of the sample correlation, making it more reliable. The flip side is that small sample sizes can easily produce spurious correlations.
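The compression can be checked empirically. The sketch below is an illustration under assumptions (independent standard-normal data, 2000 repetitions, Python instead of the Mathematica used in the video, hypothetical function names): it estimates the standard deviation of the sample correlation for 18 and for 100 observations.

```python
import math
import random
import statistics


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def correlation_spread(n, trials, rng):
    """Standard deviation of the sample correlation over repeated independent draws."""
    rs = []
    for _ in range(trials):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
        rs.append(pearson(xs, ys))
    return statistics.pstdev(rs)


rng = random.Random(0)  # arbitrary seed, only for reproducibility
spread_18 = correlation_spread(18, 2000, rng)
spread_100 = correlation_spread(100, 2000, rng)
print(round(spread_18, 3), round(spread_100, 3), round(spread_18 / spread_100, 2))
```

The ratio of the two spreads should come out a little above 2, in line with the speaker's rough "square root of five" for going from 18 to about 100 points.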

Spurious Correlation

A spurious correlation occurs when two variables appear to be related due to a third variable or random chance, rather than any true association. The video provides examples of nonsensical correlations (like the correlation between Nicolas Cage films and drowning accidents) to demonstrate how researchers can cherry-pick correlations that appear significant but are actually meaningless.

Dimensionality

Dimensionality refers to the number of variables or features in a dataset. The video warns about the dangers of high dimensionality, where the number of potential correlations increases rapidly with the number of variables. This can lead to the discovery of spurious correlations simply due to the large number of comparisons being made, rather than any real underlying relationship.
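The count of distinct correlations is simple arithmetic: p variables give p(p-1)/2 pairs once the diagonal and the mirrored half of the correlation matrix are removed. A short sketch (the function name is illustrative) reproduces the "half a billion" figure the speaker quotes for 30,000 securities:

```python
import math


def n_pairwise_correlations(p):
    """Number of distinct off-diagonal entries in a p-by-p correlation matrix."""
    return math.comb(p, 2)  # equals p * (p - 1) // 2


print(n_pairwise_correlations(30_000))  # 449985000, roughly half a billion

# Adding one more variable to an existing set of p adds p new correlations.
p = 1_000
print(n_pairwise_correlations(p + 1) - n_pairwise_correlations(p))  # 1000
```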

Pearson Correlation

Pearson Correlation is a measure of the linear relationship between two variables, ranging from -1 to 1. The video uses Pearson Correlation to demonstrate how even uncorrelated variables can show significant correlation due to randomness, especially in small sample sizes. This underlines the video's argument about the unreliability of using correlation as a sole metric for association.
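The demonstration is easy to reproduce. The sketch below is a minimal stand-in for the speaker's Mathematica session, written in plain Python with illustrative function names: it draws two independent samples of 18 standard-normal values several times and prints the sample correlation each time.

```python
import math
import random


def pearson(xs, ys):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def independent_correlation(n, rng):
    """Correlation of two independent standard-normal samples of size n."""
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return pearson(xs, ys)


rng = random.Random(6)  # arbitrary seed, only for reproducibility
draws = [independent_correlation(18, rng) for _ in range(5)]
print([round(r, 2) for r in draws])  # five estimates of a true correlation of zero
```

Each printed value is an estimate of the same underlying quantity, zero, which is what the speaker means by treating the observed correlation as a random variable.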

Gaming Metrics

Gaming metrics refers to the practice of manipulating or selectively presenting metrics to achieve a desired outcome. The speaker discusses how researchers can 'game' metrics by choosing the highest correlations from a set of data, thus creating a misleading impression of a significant relationship where there is none. This concept is central to the critique of research practices that rely too heavily on statistical correlations.

Stochastic Variable

A stochastic variable is one that is randomly determined and can vary each time it is observed. The video emphasizes that correlation is a stochastic variable, meaning that its value can fluctuate depending on the sample of data used. This variability is often overlooked, leading to overconfidence in the significance of observed correlations.

Monte Carlo Simulation

Monte Carlo Simulation is a computational technique that uses repeated random sampling to obtain numerical results, often used to understand the impact of risk and uncertainty in prediction and modeling. The speaker mentions Monte Carlo simulations as a method to demonstrate how random data can produce misleadingly strong correlations, reinforcing the argument that observed correlations need to be scrutinized carefully.
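A hedged sketch of the kind of experiment described: generate several scatter plots of 20 purely random points, fit a least-squares slope to each, and keep the most impressive one. The dataset count of 12, the seed, and the function names are assumptions made for illustration; the video does not specify them.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of y regressed on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def random_slopes(n_points, n_datasets, rng):
    """Slopes fitted to datasets in which x and y are independent by construction."""
    slopes = []
    for _ in range(n_datasets):
        xs = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
        ys = [rng.gauss(0.0, 1.0) for _ in range(n_points)]
        slopes.append(ols_slope(xs, ys))
    return slopes


rng = random.Random(1)  # arbitrary seed, only for reproducibility
slopes = random_slopes(n_points=20, n_datasets=12, rng=rng)
best = max(slopes, key=abs)  # the slope a cherry-picker would report
print([round(s, 2) for s in slopes])
print("largest slope in absolute value:", round(best, 2))
```

Every true slope here is zero, so whatever the last line prints is pure selection from noise.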

Highlights

Metrics are random variables and should be treated as such, not as fixed values.

Correlation between two variables is also a random variable, not a fixed number.

Even with independent, uncorrelated variables, running a Pearson correlation can yield non-zero results due to randomness.

The distribution of correlations around zero has significant variance: with 18 observations, there is about a 32% chance of seeing a correlation above 0.25 in absolute value.

Increasing the sample size compresses the distribution of correlation values, reducing the standard deviation.

Researchers can exploit randomness by selecting the highest correlation among many tests, leading to misleading results.

The relationship between unrelated variables, such as U.S. spending on science and the number of suicides, can appear significant due to random correlation.

The slope (beta) in linear regression, like correlation, should be interpreted with caution, as small slopes are often much closer to zero than they appear.

Many researchers, particularly in fields like psychology and political science, may not fully understand the limitations of using these metrics, leading to unreliable conclusions.

The problem of dimensionality: as the number of variables (p) increases, the number of correlations grows quadratically, which increases the likelihood of spurious correlations.

In observational research, adding more variables without sufficient data leads to spurious correlations, making results less reliable.

Spuriousness in correlation decreases only with the square root of sample size (n), while the number of correlations available to pick from grows roughly as the square of the number of variables (p).

The challenge in research is balancing the number of variables with sufficient data to avoid spurious correlations.

The presentation highlights the importance of understanding the stochastic nature of metrics and correlations in research to avoid false conclusions.

The discussion points out the need for more rigorous methods to tame spurious correlations in research, especially in fields prone to misuse of statistics.

Transcripts

play00:00

friends hello again

play00:03

this time we're gonna discuss very

play00:05

quickly how people are fooled by metrics

play00:07

based on two points the first one is

play00:10

that metrics are random variables

play00:13

one is metrics

play00:19

are random variables okay when you

play00:23

observe a correlation you think like

play00:24

you're observing a tomato no it's a

play00:26

random variable

play00:29

uh i have to look at that fact the

play00:31

second point

play00:34

is that

play00:37

metrics will be gamed because they are

play00:39

random variables people take the upper

play00:40

bound

play00:41

of metrics metrics

play00:45

like random variables will

play00:49

be gamed

play00:54

so let's see how

play00:58

i take the simplest example

play01:02

simplest example correlation

play01:06

let's say that i have x and y

play01:10

two independent random variables

play01:13

simple and uncorrelated as well

play01:16

independent and uncorrelated the difference

play01:19

of course we saw exists for variables

play01:22

outside the gaussian

play01:24

i take x1 xn

play01:28

and i have y1 yn

play01:31

i have so n samples of each

play01:35

and i run the pearson correlation

play01:38

rho xy between them

play01:42

am i gonna get zero let's see

play01:49

let's look at this uh the behavior of

play01:51

the correlation so

play01:52

i have x i picked a normal distribution

play01:57

x 18 18 randomly distributed

play02:01

variables this uses mathematica y

play02:04

randomly distributed

play02:08

you run it look at x look at y and

play02:10

correlation is negative 36

play02:13

okay negative 36 that is supposed to be

play02:16

zero

play02:18

do it again correlation negative 40

play02:22

oh look plus 36

play02:29

so uh every time you run a correlation

play02:32

getting a different number but on

play02:34

average

play02:35

on average let's say on average

play02:38

we do it what do we get

play02:43

okay look at this i have a distribution

play02:46

around zero with a mean of expected mean

play02:49

of zero

play02:50

and a variance uh i don't compute it

play02:53

but should be around eyeballing maybe a

play02:57

standard deviation of a quarter so

play02:59

it's quite significant

play03:03

and notice that you have about 16

play03:05

percent of correlation above a quarter

play03:08

and 16 percent correlation below [minus] a quarter so

play03:11

you have 32 percent

play03:12

chance of getting a correlation in

play03:14

absolute value higher

play03:15

than 0.25 it's significant

play03:19

talking about correlation of zero

play03:23

okay resume now what happens if i

play03:26

increase

play03:27

from 18 random variables to about 100

play03:32

yeah look at a distribution that's more

play03:34

compressed so it's going to have a lower

play03:37

uh standard deviation and pretty much

play03:39

like a

play03:40

we're going from 18 to 100 by the fifth

play03:42

square root of five

play03:45

one is one over square root of five of

play03:47

the other because the distribution

play03:48

compresses

play03:49

at root n as we saw it's the workings of

play03:52

law of large numbers

play03:53

for variables that are gaussian or

play03:57

similar to the gaussian

play04:00

those falling in that thin tailed finite

play04:03

variance

play04:04

finite all moments domain with

play04:08

with without any big tail effects

play04:12

so

play04:15

this may be trivial now what is less

play04:18

trivial

play04:19

is the fact that a researcher

play04:22

can go pick the best correlation okay so

play04:26

let's look at

play04:27

this what this fellow tyler vigen

play04:30

figured out look at the relationship

play04:33

here you see between

play04:35

u.s spending on science space and

play04:36

technology and suicide by

play04:38

hanging strangulation and suffocation

play04:42

i don't know the difference between the

play04:44

two the last two

play04:46

i'm sure it's significant for the matter

play04:49

here

play04:50

or a number of people who drown by

play04:51

falling into a pool and films

play04:54

in which nicolas cage has appeared

play04:58

so what is the story here the story is

play05:00

that the researcher

play05:02

can choose between many correlation and

play05:06

take the upper bound

play05:07

we can very simply calculate that upper

play05:10

bound

play05:11

i mean there's a way to distribute to to

play05:12

figure out the the distribution of that

play05:14

upper bound

play05:16

and derive it and i'm gonna do that

play05:19

forget

play05:20

what we see here i'm gonna do that on

play05:22

the board

play05:23

but let me show you a few examples of

play05:24

aberrations that have happened

play05:26

uh in in science uh based on

play05:30

the misunderstanding of that point

play05:32

there's two things

play05:33

let me repeat the one first one we

play05:36

covered is that correlation

play05:38

is a stochastic variable

play05:42

okay drawn from an ensemble of

play05:44

correlation

play05:46

the mean may be zero but what you

play05:48

observe is different from zero

play05:50

and two someone's gonna game it by

play05:53

picking the highest correlation among

play05:55

many

play05:56

and do not say correlation we know it's

play06:00

not

play06:00

cor causation no the point is not that

play06:03

correlation is not causation

play06:05

is that very often correlation is not

play06:07

correlation

play06:08

so note very often correlation is not

play06:12

correlation another way to say it is the

play06:14

correlation you see is not the true

play06:15

correlation

play06:17

so let's see some of the

play06:20

games and i'll skip the the the

play06:23

the usual iq studies to just mention the

play06:26

national

play06:26

iq here you can eyeball it and know it's

play06:29

bogus

play06:30

one because all you have to do is change

play06:31

a point and

play06:34

the the the slope will flip okay

play06:38

it's done by fellow uh who's visibly

play06:41

obsessed with

play06:42

trying to further a eugenicist agenda

play06:46

usually the people into racism are

play06:48

not very smart

play06:50

um another one here again playing the

play06:52

same game nicholas walmart

play06:54

i showed in the previous video how

play06:58

his associate tried to associate facial

play07:00

features with some attributes uh

play07:02

whatever the attributes is it doesn't

play07:04

show from this

play07:05

okay uh look at the uh

play07:09

the the what you have here the

play07:14

slopes that he is getting uh are quite

play07:16

pitiful

play07:18

and let's see how we can replicate it

play07:19

well the best way to figure out

play07:22

how something random can be gamed is to

play07:24

game it yourself using monte carlo

play07:27

so let's see if i can replicate it we have

play07:30

20 points

play07:31

and we try to look for for the best

play07:33

slope and look here you have 24 slope

play07:36

uh 39 slope quite impressive

play07:40

negative 35 point 48 slope

play07:43

okay big association we're talking about

play07:45

variables are entirely random

play07:46

okay i'll be generated again

play07:50

and i get very very totally random

play07:52

giving me again the same story

play07:55

uh here the best i got is negative

play07:58

54 slope totally random okay

play08:02

uh another element you need to consider

play08:04

here is that

play08:07

a just as with correlation the

play08:10

the slope uh what in finance we call beta

play08:13

the relationship between x and y the

play08:16

slope

play08:17

uh must be interpreted

play08:21

uh with some scrutiny in other words uh

play08:24

just as in correlation a

play08:25

ten percent correlation is much much

play08:28

closer to zero than it is to twenty

play08:31

based on entropy and informational uh

play08:34

differences between random variables the

play08:36

same thing with the slope a slope of

play08:38

point one is much much closer to zero

play08:41

than the slope

play08:42

of 0.2 and so on so this is and

play08:46

really many many researchers don't know

play08:48

it as a matter of fact i think most

play08:50

researchers

play08:51

in psychology or in some fields that

play08:53

shouldn't exist

play08:55

like political science uh you know using

play08:57

uh

play09:00

these metrics uh these these fields

play09:04

should not exist

play09:04

and and of course whatever data coming

play09:07

out of them is

play09:08

patently garbage as we are seeing let's

play09:10

rapidly discuss

play09:13

the n versus p

play09:16

or sometimes people call it d the d for

play09:18

dimensionality

play09:20

which is a dimensionality problem and

play09:21

spuriousness of

play09:25

whatever metric uh you're using

play09:28

the problem is as follows

play09:33

in a real world we have a lot of classes

play09:34

of random variable you have like in

play09:36

finance we got

play09:37

30 000 securities it gives you a lot of

play09:41

correlation

play09:42

uh approximately half of 30 000

play09:45

square

play09:45

like something half a billion

play09:48

correlations

play09:49

so you realize odds are you're going to

play09:52

see something

play09:53

that is very high pick it up

play09:57

and believe in an association randomly

play10:00

even if each security or each random

play10:04

variable

play10:05

has a lot of observations a lot of n

play10:07

so

play10:08

because n

play10:12

as we see helps you

play10:15

in a very very slow way as we said

play10:19

the mean deviation or standard deviation

play10:21

decrease at square root of n

play10:24

but the dimensionality is a severe

play10:27

problem let's see how

play10:29

let's say i have x1

play10:33

i'm doing a correlation matrix x2

play10:39

xp i have p random variables

play10:44

how many correlations am i likely to

play10:46

have or

play10:48

or should i have it's got to be about

play10:50

one half

play10:51

p p minus one

play10:58

so p p squared minus p divided by two

play11:01

because think about it this would be

play11:02

p square we have p square uh

play11:06

[Music]

play11:08

correlation on this table you remove the

play11:10

diagonal and take half of the

play11:13

lower one for redundancies it's a lot

play11:17

and consider that if i add one

play11:19

observation

play11:22

i add one random variable i'm going to

play11:24

have

play11:26

how many correlations we're going to

play11:27

have all the preceding ones

play11:29

you see so every time you add a random

play11:32

variable you get a lot more correlation

play11:35

which is why

play11:39

more and more research is stale because

play11:41

observational research that you're

play11:43

getting

play11:44

is meaningless because people start

play11:46

getting great numbers

play11:48

for free of course there's supposedly

play11:50

some methods to tame it

play11:52

bonferroni upper bound stuff like that

play11:54

more technical

play11:56

but research will escape it

play11:59

so

play12:03

what's happening here

play12:07

this is spuriousness

play12:10

this is n

play12:14

sorry spuriousness decreases as square

play12:17

root of n

play12:19

okay however

play12:26

this is p this is paris let's make say

play12:28

the number of spurious correlation i would

play12:30

have in the matrix

play12:34

increases something like p square

play12:38

at a rate of p square

play12:42

uh i'm simplifying of course

play12:46

there's some miles on on a covariance

play12:48

matrix

play12:49

just think about it it grows in a very

play12:52

complex way

play12:54

which means that as the world has many

play12:55

many many variables

play12:58

each one time you add one you must have

play13:01

a lot of n's to compensate for the

play13:03

spuriousness

play13:04

that's the problem of having not enough

play13:07

data per variable and having too many

play13:09

random variables

play13:12

hopefully this will uh

play13:16

give you an idea of of the biases we see

play13:19

in

play13:19

research there are many of course many

play13:22

more biases this

play13:23

uh it's quite technical i've done some

play13:26

work on

play13:27

uh on on the distribution of spurious

play13:29

correlation or spurious metrics

play13:32

spurious variances stuff like that

play13:34

though that would be presented later

play13:37

have an excellent weekend or an

play13:39

excellent day or whatever it is

play13:41

or an excellent holiday and

play13:44

we'll see you in the next session


Related Tags
Metrics Analysis, Correlation Misinterpretation, Research Bias, Statistical Flaw, Random Variables, Data Spuriousness, Monte Carlo, Gaussian Distribution, Law of Large Numbers, Correlation Gaming