Entropy (for data science) Clearly Explained!!!

StatQuest with Josh Starmer
24 Aug 2021 · 16:34

Summary

TL;DR: This StatQuest episode, hosted by Josh Starmer, explains entropy for data science and its applications in classification trees, mutual information, and algorithms like t-SNE and UMAP. The video shows how entropy quantifies surprise and similarity, using chickens penned into different areas to illustrate the inverse relationship between probability and surprise. It then walks through the calculation of surprise and entropy, framing entropy as the expected surprise per event, such as a coin flip, and explaining why that quantity matters in data science.

Takeaways

  • πŸ“š Entropy is a fundamental concept in data science used for building classification trees, mutual information, and in algorithms like t-SNE and UMAP.
  • πŸ” Entropy helps quantify similarities and differences in data, which is crucial for various machine learning applications.
  • πŸ€” The concept of entropy is rooted in the idea of 'surprise', which is inversely related to the probability of an event.
  • πŸ” The video uses the analogy of chickens of different colors in various areas to illustrate the relationship between probability and surprise.
  • βš–οΈ The level of surprise is not directly proportional to the inverse of probability due to the undefined log of zero when probability is zero.
  • πŸ“‰ To calculate surprise, the logarithm of the inverse of the probability is used, which aligns with the concept of 'information gain'.
  • 🎲 When flipping a biased coin, the surprise for getting heads or tails can be calculated using the log of the inverse of their respective probabilities.
  • πŸ”’ Entropy is the expected value of surprise, calculated as the average surprise per event over many occurrences.
  • βˆ‘ The mathematical formula for entropy involves summing the product of the probability of each outcome and the surprise (logarithm of the inverse probability) of that outcome.
  • πŸ“ˆ Entropy can be represented in sigma notation, emphasizing its role as an expected value derived from the sum of individual probabilities and their associated surprises.
  • 🌐 Entropy values can be used to compare the distribution of different categories within a dataset, with higher entropy indicating greater disorder or diversity.
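
To make those two formulas concrete, here is a minimal Python sketch (not from the video; the 90/10 probabilities mirror the biased-coin example used later) that computes surprise as log2(1/p) and entropy as the probability-weighted sum of surprises:

```python
import math

def surprise(p):
    """Surprise of an event with probability p, in bits: log2(1/p)."""
    return math.log2(1 / p)

def entropy(probs):
    """Entropy = expected surprise = sum over outcomes of p * log2(1/p)."""
    return sum(p * surprise(p) for p in probs if p > 0)  # skip p = 0: log(0) is undefined

# Biased coin from the video: heads 90% of the time, tails 10%
print(round(surprise(0.9), 2))        # 0.15 bits of surprise for heads
print(round(surprise(0.1), 2))        # 3.32 bits of surprise for tails
print(round(entropy([0.9, 0.1]), 2))  # 0.47 bits: the expected surprise per flip
```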

Q & A

  • What is the main topic of the StatQuest video?

    -The main topic of the video is entropy in the context of data science, explaining how it is used to build classification trees, mutual information, and in algorithms like t-SNE and UMAP.

  • Why is understanding surprise important in the context of entropy?

    -Understanding surprise is important because it is inversely related to probability, which helps in quantifying how surprising an event is based on its likelihood of occurrence.

  • How does the video use chickens to illustrate the concept of surprise?

    -The video uses a scenario with two types of chickens, orange and blue, organized into different areas with varying probabilities of being picked, to demonstrate how the level of surprise correlates with the probability of an event.

  • Why can't we use the inverse of probability alone to calculate surprise?

    -Using the inverse of probability alone doesn't work because it gives the wrong answer for certain events: a coin that always lands on heads has probability 1, and the inverse gives a surprise of 1 when the surprise should be 0.

  • What mathematical function is used to calculate surprise instead of just the inverse of probability?

    -The logarithm of the inverse of probability is used to calculate surprise, which gives a more accurate representation of the relationship between probability and surprise.

  • Why is the log base 2 used when calculating surprise for two outcomes?

    -The log base 2 is used for two outcomes because it is customary and it aligns with information theory principles, where entropy measures information in bits.

  • How does the video explain the concept of entropy in terms of flipping a coin?

    -The video explains entropy by calculating the average surprise per coin toss over many flips, which represents the expected surprise, or entropy, of the coin-flipping process (the simulation sketched after this Q&A section illustrates the same idea).

  • What is the formula for entropy in terms of surprise and probability?

    -The formula for entropy is the sum of the product of each outcome's surprise and its probability, which can be represented using summation notation as the expected value of surprise.

  • How does the entropy value change with the distribution of chickens in different areas?

    -The entropy value changes based on the probability distribution of the chickens. Higher entropy indicates a more even distribution of chicken types, leading to a higher expected surprise per pick.

  • What is the significance of entropy in data science applications?

    -In data science, entropy is significant as it quantifies the uncertainty or surprise in a dataset, which is useful for building classification models, measuring mutual information, and in dimension reduction algorithms.

  • How does the video conclude the explanation of entropy?

    -The video concludes by demonstrating how entropy can be used to quantify the similarity or difference in the distribution of items, like chickens, and by providing a humorous note on surprising someone with the 'log of the inverse of the probability'.
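
The answers above describe entropy as the average surprise per coin toss over many flips. A quick simulation (my own illustration, not from the video, using the same 90/10 coin) shows the empirical average surprise settling near the entropy of roughly 0.47 bits:

```python
import math
import random

def surprise(p):
    return math.log2(1 / p)

p_heads, p_tails = 0.9, 0.1  # the biased coin from the video

random.seed(0)
flips = random.choices(["H", "T"], weights=[p_heads, p_tails], k=100_000)

# Total surprise of the sequence is the sum of each flip's surprise
total = sum(surprise(p_heads) if f == "H" else surprise(p_tails) for f in flips)

print(round(total / len(flips), 3))  # empirical average surprise per flip, close to 0.47
print(round(p_heads * surprise(p_heads) + p_tails * surprise(p_tails), 3))  # 0.469, the entropy
```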

Outlines

00:00

πŸ“Š Introduction to Entropy in Data Science

This paragraph introduces the concept of entropy in the context of data science, explaining its various applications such as in building classification trees and in algorithms like t-SNE and UMAP. It emphasizes that entropy is a measure of surprise, inversely related to probability, and is foundational in quantifying similarities and differences in data. The paragraph sets the stage for a deeper exploration of entropy by using the analogy of picking chickens of different colors from separate areas, illustrating how the level of surprise correlates with the probability of an event occurring.

05:01

🧐 Calculating Surprise and the Role of Probability

The second paragraph delves into the calculation of surprise, which is a precursor to understanding entropy. It discusses the relationship between probability and surprise, noting that surprise is highest when an event is least expected. The paragraph highlights the limitations of using the inverse of probability to calculate surprise, particularly when the probability is zero or one, and introduces the logarithmic approach to accurately represent the concept of surprise. It also explains how the surprise for a sequence of events is the sum of the individual surprises, providing a foundation for calculating entropy.
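
As a small illustration of the last point (my own sketch; the 90/10 coin and the heads-heads-tails sequence come from the video), the surprise of a sequence can be computed either from the probability of the whole sequence or by summing the per-flip surprises, and the two agree. The video's 3.62 is the same figure obtained by adding the rounded per-flip surprises 0.15, 0.15, and 3.32.

```python
import math

def surprise(p):
    return math.log2(1 / p)

p = {"H": 0.9, "T": 0.1}    # biased coin: 90% heads, 10% tails
sequence = ["H", "H", "T"]  # the heads, heads, tails example

# Option 1: surprise of the whole sequence, from the sequence's probability
p_sequence = math.prod(p[f] for f in sequence)  # 0.9 * 0.9 * 0.1 = 0.081
print(surprise(p_sequence))                     # ~3.626

# Option 2: sum of the individual surprises -- the same number
print(sum(surprise(p[f]) for f in sequence))    # 0.152 + 0.152 + 3.322 ~ 3.626
```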

10:03

πŸ“‰ Understanding Entropy as Average Surprise

This paragraph explains how entropy is derived from the concept of average surprise per event. It provides a step-by-step calculation of entropy using the example of a biased coin with varying probabilities of landing heads or tails. The paragraph illustrates how to calculate the total surprise for multiple events and then derives the entropy by averaging the surprise over the number of events. It also introduces the statistical notation for entropy as the expected value of surprise, emphasizing the cancellation of event counts in the calculation process.
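
Spelled out with the rounded numbers from the video (a worked restatement, not a new result), the 100-flip estimate and the division that turns it into entropy look like this:

```latex
\begin{aligned}
\underbrace{(0.9 \times 100)}_{\text{expected heads}} \times 0.15
  \;+\; \underbrace{(0.1 \times 100)}_{\text{expected tails}} \times 3.32
  &\approx 46.7 \\
\text{entropy} \;=\; \frac{46.7}{100}
  \;=\; 0.9 \times 0.15 + 0.1 \times 3.32 &\approx 0.47 \text{ bits per flip}
\end{aligned}
```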

15:04

πŸ”„ Entropy as a Measure of Similarity and Difference

The final paragraph applies the concept of entropy to the original chicken analogy, calculating the entropy for different areas with varying ratios of orange and blue chickens. It demonstrates how entropy can be used to quantify the similarity or difference in the distribution of chickens, with higher entropy indicating a more even distribution and thus a higher expected surprise. The paragraph concludes with a humorous note on surprising someone with the 'log of the inverse of the probability' and a call to action for supporting the StatQuest channel through various means.
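
A short sketch of that comparison (my own code; the chicken counts for areas A and B are taken from the video, and area C is assumed to be an equal split, which gives entropy 1 regardless of the exact count):

```python
import math

def entropy(probs):
    """Expected surprise in bits: sum of p * log2(1/p), skipping impossible outcomes."""
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

areas = {
    "A": (6, 1),   # 6 orange chickens, 1 blue chicken
    "B": (1, 10),  # 1 orange, 10 blue
    "C": (7, 7),   # equal numbers of orange and blue (any equal split gives entropy 1)
}

for name, (orange, blue) in areas.items():
    total = orange + blue
    print(name, round(entropy([orange / total, blue / total]), 2))
    # A 0.59, B 0.44, C 1.0 -- entropy is highest for the evenly mixed area
```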

Keywords

πŸ’‘Entropy

Entropy, in the context of this video, refers to a measure from information theory that quantifies the expected surprise in a random variable or a dataset. It is central to the theme as it helps explain how data can be classified and compared in terms of its disorder or unpredictability. For example, the video uses the concept of entropy to discuss its application in building classification trees and in algorithms like t-SNE and UMAP.

πŸ’‘Data Science

Data Science is an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from data. In the video, entropy is discussed as a fundamental concept in data science, particularly for tasks like building classification trees and quantifying relationships between data points.

πŸ’‘Expected Values

Expected values are a fundamental concept in probability theory and statistics, representing the average or mean result of an experiment repeated over a large number of trials. The video assumes the viewer's familiarity with expected values as a prerequisite for understanding how entropy is calculated and used in data science.
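
Since expected values are treated as a prerequisite, here is a quick reminder (my own example; the die and the 90/10 coin are just illustrations) of how entropy is literally an expected value, with surprise as the quantity being averaged:

```python
import math

def expected_value(values, probs):
    """E[X] = sum of each possible value times its probability."""
    return sum(v * p for v, p in zip(values, probs))

# Ordinary expected value: the average roll of a fair six-sided die
print(round(expected_value([1, 2, 3, 4, 5, 6], [1 / 6] * 6), 2))  # 3.5

# Entropy is the same recipe with surprise, log2(1/p), as the value being averaged
surprises = [math.log2(1 / 0.9), math.log2(1 / 0.1)]  # 90/10 coin from the video
print(round(expected_value(surprises, [0.9, 0.1]), 2))  # 0.47
```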

πŸ’‘Classification Trees

Classification Trees are a non-parametric supervised learning method used for classification and regression. They are mentioned in the video as one application of entropy, where entropy helps in deciding the best splits for the tree by measuring the impurity of the data at each node.
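
The video doesn't show this step, but as a hedged sketch of how entropy drives tree splits, "information gain" can be computed as the drop in entropy produced by a candidate split (the class names and counts below are made up for illustration):

```python
import math

def entropy(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    probs = [labels.count(c) / n for c in set(labels)]
    return sum(p * math.log2(1 / p) for p in probs if p > 0)

# Hypothetical node with a 50/50 mix of two classes (maximum impurity)
parent = ["orange"] * 6 + ["blue"] * 6

# A candidate split that sends most of each class to its own side
left = ["orange"] * 5 + ["blue"] * 1
right = ["orange"] * 1 + ["blue"] * 5

weighted_child_entropy = (
    len(left) / len(parent) * entropy(left)
    + len(right) / len(parent) * entropy(right)
)

information_gain = entropy(parent) - weighted_child_entropy
print(round(information_gain, 2))  # ~0.35 bits: how much this split reduces entropy
```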

πŸ’‘Mutual Information

Mutual Information is a measure used in statistics to quantify the amount of information obtained about one random variable through observing another random variable. The video explains that mutual information is based on entropy, and it helps in understanding the relationship between two variables.

πŸ’‘Relative Entropy

Relative Entropy, also known as the Kullback-Leibler divergence, is a measure of how one probability distribution diverges from a second, expected probability distribution. The video mentions it as a basis for comparing two probability distributions, which is derived from the concept of entropy.

πŸ’‘Cross Entropy

Cross Entropy is the expected surprise when outcomes drawn from one probability distribution are scored using the probabilities of another distribution. In the video, it is mentioned as a concept that appears in various algorithms, including dimension reduction methods like t-SNE and UMAP, where it helps measure the difference between the distributions of data in high-dimensional and low-dimensional spaces.
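
The video only name-drops relative entropy and cross entropy, but a minimal sketch of the standard definitions (my own illustration; p and q below are arbitrary example distributions) shows how both reduce to "surprise weighted by probability":

```python
import math

def cross_entropy(p, q):
    """Average surprise when outcomes drawn from p are scored with q's probabilities."""
    return sum(pi * math.log2(1 / qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return cross_entropy(p, p)

def kl_divergence(p, q):
    """Relative entropy: the extra surprise paid for modeling p with q."""
    return cross_entropy(p, q) - entropy(p)

p = [0.9, 0.1]  # "true" distribution, e.g. the biased coin
q = [0.5, 0.5]  # model distribution, e.g. assuming a fair coin

print(round(entropy(p), 2))           # 0.47
print(round(cross_entropy(p, q), 2))  # 1.0
print(round(kl_divergence(p, q), 2))  # 0.53
```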

πŸ’‘Surprise

Surprise, in the context of this video, is a concept related to the unexpectedness of an outcome given its probability. It is inversely related to probability, where lower probabilities result in higher surprise. The video uses the concept of surprise to explain the calculation of entropy, emphasizing how it quantifies the unpredictability of data.

πŸ’‘Logarithm

A logarithm is the inverse operation to exponentiation, used in mathematics and computer science to perform calculations with exponents. In the video, the logarithm of the inverse of probability is used to calculate surprise, which is then used to derive the concept of entropy.

πŸ’‘Coin Flip

The coin flip is used as a simple example in the video to illustrate the concepts of probability, surprise, and entropy. It demonstrates how the expected surprise (entropy) can be calculated when a coin has a biased outcome, with heads appearing more frequently than tails.

πŸ’‘Dimension Reduction

Dimension Reduction is a process in data analysis where the number of random variables under consideration is reduced, while still retaining most of the relevant information. The video mentions algorithms like t-SNE and UMAP, which use concepts derived from entropy, such as cross entropy, to perform dimension reduction effectively.

Highlights

Entropy is a concept used in data science for building classification trees and quantifying relationships.

Mutual information, relative entropy, and cross entropy are based on entropy and used in various algorithms including t-SNE and UMAP.

Entropy helps quantify similarities and differences in data.

Understanding surprise is fundamental to grasping entropy, as it is inversely related to probability.

The concept of surprise is illustrated through the example of picking chickens of different colors from various areas.

Calculating surprise uses the log of the inverse of the probability rather than the raw inverse.

The log maps a certain event (probability 1) to a surprise of zero, which the raw inverse does not.

Entropy is the average surprise per event, calculated by summing individual surprises and dividing by the number of events.

The entropy formula can be derived from the concept of expected surprise.

Entropy is represented mathematically as the expected value of surprise using sigma notation.

The standard form of the entropy equation, originally published by Claude Shannon, involves logarithms and probabilities; the algebra connecting it to the expected-surprise form is sketched after this list.

Entropy can be used to measure the unpredictability or information content in a set of outcomes.

In the context of the chicken example, entropy quantifies the mix of orange and blue chickens in different areas.

Higher entropy indicates a more even distribution of outcomes, leading to greater surprise.

Entropy decreases as the counts of the two types of outcomes become more lopsided, indicating less expected surprise.

The StatQuest channel offers study guides for offline review of statistics and machine learning.

Support for StatQuest can come in various forms including Patreon contributions, channel memberships, merchandise, or donations.
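
For reference, the rearrangement mentioned in the highlight about Claude Shannon's standard form is just log algebra (log base 2 of 1 is 0), so the expected-surprise form and the familiar negative-sum form are the same equation:

```latex
H = \sum_{x} p(x)\,\underbrace{\log_2\!\frac{1}{p(x)}}_{\text{surprise}}
  = \sum_{x} p(x)\,\bigl(\log_2 1 - \log_2 p(x)\bigr)
  = -\sum_{x} p(x)\,\log_2 p(x)
```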

Transcripts

00:00

Yes, you can understand entropy! Hooray! StatQuest! Hello, I'm Josh Starmer, and welcome to StatQuest. Today we're going to talk about entropy for data science, and it's going to be clearly explained. Note: this StatQuest assumes that you are already familiar with the main ideas of expected values. If not, check out the Quest.

00:29

Entropy is used for a lot of things in data science. For example, entropy can be used to build classification trees, which are used to classify things. Entropy is also the basis of something called mutual information, which quantifies the relationship between two things. And entropy is the basis of relative entropy, aka the Kullback-Leibler distance, and cross entropy, which show up all over the place, including fancy dimension reduction algorithms like t-SNE and UMAP. What these three things have in common is that they all use entropy, or something derived from it, to quantify similarities and differences. So let's learn how entropy quantifies similarities and differences.

01:20

However, in order to talk about entropy, first we have to understand surprise. So let's talk about chickens. Imagine we had two types of chickens, orange and blue, and instead of just letting them randomly roam all over the screen, our friend Statsquatch chased them around until they were organized into three separate areas: A, B, and C. Now, if Statsquatch just randomly picked up a chicken in area A, then, because there are six orange chickens and only one blue chicken, there is a higher probability that they will pick up an orange chicken. And since there is a higher probability of picking up an orange chicken, it would not be very surprising if they did. In contrast, if Statsquatch picked up the blue chicken from area A, we would be relatively surprised.

02:16

Area B has a lot more blue chickens than orange, and because there is now a higher probability of picking up a blue chicken, we would not be very surprised if it happened. And because there is a relatively low probability of picking the orange chicken, that would be relatively surprising. Lastly, area C has an equal number of orange and blue chickens, so regardless of what color chicken we pick up, we would be equally surprised. Combined, these areas tell us that surprise is, in some way, inversely related to probability. In other words, when the probability of picking up a blue chicken is low, the surprise is high, and when the probability of picking up a blue chicken is high, the surprise is low. Bam.

03:09

Now we have a general intuition of how probability is related to surprise, so let's talk about how to calculate surprise. Because we know there is a type of inverse relationship between probability and surprise, it's tempting to just use the inverse of probability to calculate surprise, because when we plot the inverse, we see that the closer the probability is to zero, the larger the y-axis value. However, there's at least one problem with just using the inverse of the probability to calculate surprise.

03:46

To get a better sense of this problem, let's talk about the surprise associated with flipping a coin. Imagine we had a terrible coin, and every time we flipped it we got heads. Blah, blah, blah, blah. Ugh, flipping this coin is super boring. Hey Statsquatch, how surprised would you be if the next flip gave us heads? "I would not be surprised at all." So, when the probability of getting heads is one, then we want the surprise for getting heads to be zero. However, when we take the inverse of the probability of getting heads, we get one, instead of what we want, zero. And this is one reason why we can't just use the inverse of the probability to calculate surprise.

04:37

So, instead of just using the inverse of the probability to calculate surprise, we use the log of the inverse of the probability. Now, since the probability of getting heads is 1, and thus we will always get heads and it will never surprise us, the surprise for heads is zero. In contrast, since the probability of getting tails is zero, and thus we will never get tails, it doesn't make sense to quantify the surprise of something that will never happen. So when we plug in 0 for the probability and use the properties of logs to turn the division into subtraction, the second term is the log of 0, and because the log of 0 is undefined, the whole thing is undefined. And this result is OK, because we're talking about the surprise associated with something that never happens. Like the inverse of the probability, the log of the inverse of the probability gives us a nice curve, and the closer the probability gets to zero, the more surprise we get. But now the curve says there is no surprise when the probability is one. So surprise is the log of the inverse of the probability. Bam. Note: when calculating surprise for two outputs, in this case heads and tails, it is customary to use log base 2 for the calculations.

06:10

Now that we know what surprise is, let's imagine that our coin gets heads 90 percent of the time and tails 10 percent of the time. Now let's calculate the surprise for getting heads, and tails. As expected, because getting tails is much rarer than getting heads, the surprise for tails is much larger. Now let's flip this coin three times, and we get heads, heads, and tails. The probability of getting two heads and one tail is 0.9 times 0.9 for the heads, times 0.1 for the tails. And if we want to know exactly how surprising it is to get two heads and one tail, then we can plug this probability into the equation for surprise, use the properties of logs to convert the division into subtraction, use the properties of logs to convert the multiplication into addition, and then plug and chug, and we get 3.62. But, more importantly, we see that the total surprise for a sequence of coin tosses is just the sum of the surprises for each individual toss. In other words, the surprise for getting one heads is 0.15, and since we got two heads, we add 0.15 two times, plus 3.32 for the one tail, to get the total surprise for getting two heads and one tail. Medium bam.

07:50

Now, because this diagram takes up a lot of space, let's summarize the information in a table. The first row in the table tells us the probability of getting heads or tails, and the second row tells us the associated surprise. Now, if we wanted to estimate the total surprise after flipping the coin 100 times, we approximate how many times we will get heads by multiplying the probability we will get heads, 0.9, by 100, and we estimate the total surprise from getting heads by multiplying by 0.15. So this term represents how much surprise we expect from getting heads in 100 coin flips. Likewise, we can approximate how many times we will get tails by multiplying the probability we will get tails, 0.1, by 100, and we estimate the total surprise from getting tails by multiplying by 3.32. So the second term represents how much surprise we expect from getting tails in 100 coin flips. Now we can add the two terms together to find the total surprise, and we get 46.7.

09:06

Hey, Statsquatch is back. "OK, I see that we just estimated the surprise for 100 coin flips, but aren't we supposed to be talking about entropy?" Funny you should ask. If we divide everything by the number of coin tosses, 100, then we get the average amount of surprise per coin toss, 0.47. So, on average, we expect the surprise to be 0.47 every time we flip the coin, and that is the entropy of the coin: the expected surprise every time we flip the coin. Double bam.

09:51

In fancy statistics notation, we say that entropy is the expected value of the surprise. Anyway, since we are multiplying each probability by the number of coin tosses, 100, and also dividing by the number of coin tosses, 100, all of the values that represent the number of coin tosses cancel out, and we are left with the probability that a surprise for heads will occur times its surprise, plus the probability that a surprise for tails will occur times its surprise. Thus, the entropy, 0.47, represents the surprise we would expect per coin toss if we flipped this coin a bunch of times. And yes, expecting surprise sounds silly, but it's not the silliest thing I've heard.

10:44

Note: we can rewrite entropy just like an expected value, using fancy sigma notation. The x represents a specific value for surprise, times the probability of observing that specific value for surprise. So, for the first term, getting heads, the specific value for surprise is 0.15, and the probability of observing that surprise is 0.9, so we multiply those values together. Then the sigma tells us to add that term to the term for tails. Either way we do the math, we get 0.47.

11:29

Now, personally, once I saw that entropy was just the average surprise that we could expect, entropy went from something that I had to memorize to something I could derive, because now we can plug the equation for surprise in for x, the specific value, and we can plug in the probability, and we end up with the equation for entropy. Bam.

11:59

Unfortunately, even though this equation is made from two relatively easy to interpret terms, the surprise times the probability of the surprise, this isn't the standard form of the equation for entropy that you'll see out in the wild. First we have to swap the order of the two terms, then we use the properties of logs to convert the fraction into subtraction, and the log of one is zero. Then we multiply both terms in the difference by the probability, and lastly we pull the minus sign out of the summation, and we end up with the equation for entropy that Claude Shannon first published in 1948. Small bam. That said, even though this is the original version, and the one you'll usually see, I prefer this version, since it is easily derived from surprise and it is easier to see what is going on.

13:02

Now, going back to the original example, we can calculate the entropy of the chickens. So let's calculate the entropy for area A. Because six of the seven chickens are orange, we plug in six divided by seven for the probability. Then we add a term for the one blue chicken by plugging in one divided by seven for the probability. Now we just do the math and get 0.59. Note: even though the surprise associated with picking up an orange chicken is much smaller than picking up a blue chicken, there's a much higher probability that we will pick up an orange chicken than a blue chicken. Thus, the total entropy, 0.59, is much closer to the surprise associated with orange chickens than blue chickens.

13:57

Likewise, we can calculate the entropy for area B, only this time the probability of randomly picking up an orange chicken is 1 divided by 11, and the probability of picking up a blue chicken is 10 divided by 11, and the entropy is 0.44. In this case, the surprise for picking up an orange chicken is relatively high, but the probability of it happening is so low that the total entropy is much closer to the surprise associated with picking up a blue chicken. We also see that the entropy value, the expected surprise, is less for area B than area A. This makes sense, because area B has a higher probability of picking a chicken with a lower surprise.

14:48

Lastly, the entropy for area C is one, and that makes the entropy for area C the highest we have calculated so far. In this case, even though the surprise for orange and blue chickens is relatively moderate, one, we always get the same relatively moderate surprise every time we pick up a chicken, and it is never outweighed by a smaller value for surprise, like we saw earlier for areas A and B. As a result, we can use entropy to quantify the similarity or difference in the number of orange and blue chickens in each area. Entropy is highest when we have the same number of both types of chickens, and as we increase the difference in the number of orange and blue chickens, we lower the entropy. Triple bam.

15:46

P.S. The next time you want to surprise someone, just whisper "the log of the inverse of the probability." Bam.

15:55

Now it's time for some shameless self-promotion. If you want to review statistics and machine learning offline, check out the StatQuest study guides at statquest.org. There's something for everyone. Hooray, we've made it to the end of another exciting StatQuest. If you like this StatQuest and want to see more, please subscribe. And if you want to support StatQuest, consider contributing to my Patreon campaign, becoming a channel member, buying one or two of my original songs, or a t-shirt or a hoodie, or just donate. The links are in the description below. Alright, until next time, Quest on!

Related Tags
Data Science, Entropy, Machine Learning, StatQuest, Expected Value, Surprise, Probability, Classification Trees, Mutual Information, Dimension Reduction