Lecture 6: Noisy Channel Coding (I): Inference and Information Measures for Noisy Channels

Jakob Foerster
26 Apr 2014 · 54:42

Summary

TLDR: This lecture recaps data compression and moves on to the central topic of the information theory course: noisy channel coding. It opens with a question about a symbol code for the ensemble ABCD — if you pluck one bit at random from the encoded stream, what is the probability that it is a 1? The lecturer then reviews Shannon information content, optimal symbol-code lengths, the Huffman algorithm, and arithmetic coding, and uses a fragment of Jane Austen's Emma to show how the PPM compression algorithm works. Finally, he shows how arithmetic coding can generate random samples efficiently and stresses the value of random bits. Throughout, rich examples and detailed explanations introduce the key concepts and techniques of information theory.

Takeaways

  • 📚 This lecture recaps the data compression material and moves on to the core topic of the information theory course: noisy channel coding.
  • 🔍 A symbol code for the four-symbol ensemble ABCD is introduced, along with the question: if you pluck one bit at random from the encoded stream, what is the probability that it is a 1?
  • 🤔 Voting and discussion with neighbours explore different ways of computing this probability and show why code-word lengths matter in the calculation.
  • ✅ The source coding theorem implies that if the code is perfectly compressed and cannot be compressed further, the probability of a 1 must be one half.
  • 📉 The principles and advantages of arithmetic coding are discussed, in particular how it compresses more effectively than symbol codes.
  • 🎲 Prediction models such as PPM are introduced as the engine behind practical compression, with worked examples.
  • 🔄 Arithmetic coding is applied to generating random samples, and its bit efficiency is compared with the standard method.
  • 🔑 The value of random bits is emphasised, with examples of why it pays to economise on them.
  • 🧩 A series of probability puzzles (the three-card problem and the three-door problem) reinforce conditional probability and inference.
  • 📏 Finally, joint entropy, conditional entropy and mutual information are defined, providing the theoretical basis for discussing noisy channels and information transmission.

Q & A

  • What is data compression?

    - Data compression is the process of reducing the size of electronic data so that it needs less storage space or bandwidth. It is a major topic of the information theory course.

  • What is the central topic of the information theory course?

    - The central topic is noisy channel coding: how to transmit information effectively over a channel that introduces errors.

  • What is a symbol code?

    - A symbol code is a data compression technique that maps each source symbol (such as a character) to a bit string, reducing the number of bits needed to represent the data.

  • What does Huffman coding do?

    - Huffman coding is an algorithm for constructing optimal symbol codes; it assigns code words of different lengths according to symbol probabilities so that the expected encoded length is as short as possible.

  • What is arithmetic coding?

    - Arithmetic coding is a compression technique that represents an entire message as a single number in the interval [0,1), rather than encoding each symbol separately, which improves compression efficiency.

  • Why is arithmetic coding more efficient than symbol codes?

    - Arithmetic coding treats the whole message as a single unit, so it avoids the per-character overhead that symbol codes must carry and gets to within about two bits of the information content of the entire file, essentially the entropy limit.

  • What is a binary symmetric channel?

    - The binary symmetric channel is a simple communication model in which each transmitted bit is flipped with some probability and passed through unchanged otherwise.

  • How is entropy defined in information theory?

    - Entropy measures the uncertainty of a random variable; it is the sum, over all possible outcomes, of the probability of each outcome times the logarithm of one over that probability.

  • What is conditional entropy?

    - Conditional entropy measures the uncertainty remaining in one random variable given another; it reflects how much uncertainty about one variable is left once the other is known.

  • What is mutual information?

    - Mutual information measures the information shared by two random variables: how much learning one of them tells you about the other.

Outlines

00:00

📊 Data compression recap and noisy channel coding overview

This segment recaps the data compression material and shifts attention to the central topic of the information theory course: noisy channel coding. The lecturer poses a question about a particular symbol code — what is the probability that a bit plucked at random from the encoded stream is a 1? — and, through neighbour discussion and a vote, guides the audience through candidate answers and how the make-up of the code words bears on the calculation.

05:02

🔍 The probability of a 1 in the encoded stream

This segment digs into how to compute, correctly, the probability that a randomly plucked bit from the encoded stream is a 1. Taking code-word lengths into account exposes the flaw in the naive method and yields the right calculation. The relationship between information content and code length is also discussed: if the code is perfectly compressed, its output bits must look random, so the correct answer is one half.

10:02

📝 Symbol codes and arithmetic coding

This segment introduces symbol codes and arithmetic coding and their use in data compression. A comparison shows the efficiency advantage of arithmetic coding over symbol codes, especially for sources with low per-character entropy. A fragment of Jane Austen's Emma illustrates how arithmetic coding works in practice and highlights the role of prediction models such as PPM in improving compression.
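
The Huffman algorithm mentioned here can be made concrete with a short sketch. The Python below is my own illustration (not code from the lecture); applied to the lecture's ensemble ABCD with probabilities 1/2, 1/4, 1/8, 1/8 it recovers code lengths 1, 2, 3, 3, matching the Shannon information contents.

```python
import heapq
from itertools import count

def huffman(probs):
    """Return a prefix code (symbol -> bit string) built by Huffman's algorithm."""
    tiebreak = count()  # avoids comparing dicts when probabilities tie
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)   # two least probable nodes
        p1, _, code1 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in code0.items()}
        merged.update({s: "1" + c for s, c in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

code = huffman({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125})
print(code)  # {'a': '0', 'b': '10', 'c': '110', 'd': '111'} -> lengths 1, 2, 3, 3
```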

15:04

🚀 Advanced applications of arithmetic coding

This segment explores applications of arithmetic coding to generating random samples and to efficient writing systems. The Dasher writing system shows how arithmetic coding can turn gestures into text efficiently. Arithmetic coding is then applied to simulating tosses of a bent coin, where it is far more bit-efficient than the standard method, and a CD of random bits distributed to researchers underlines that random bits are a valuable resource.
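
To see the bit-efficiency argument numerically, here is a small Python comparison — an illustrative sketch of my own, not the lecturer's code — between the standard method (32 random bits per uniform real number) and an arithmetic-coding-style generator, which needs on average only about H2(0.01) ≈ 0.08 bits per simulated toss.

```python
from math import log2

def H2(p):
    """Binary entropy function in bits."""
    return p * log2(1 / p) + (1 - p) * log2(1 / (1 - p))

p_a = 0.01
n = 10_000_000  # ten million tosses, as in the lecture

naive_bits = 32 * n            # 32 random bits per uniform real number
arithmetic_bits = H2(p_a) * n  # ~0.0808 bits per toss on average

print(f"naive:      {naive_bits:.0f} bits")
print(f"arithmetic: {arithmetic_bits:.0f} bits")
print(f"saving factor: {naive_bits / arithmetic_bits:.0f}x")
```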

20:06

🎓 An introduction to inference and probability

This segment introduces inference and probability theory through a simple puzzle about three cards. Discussing the possible configurations of the cards and their implications shows why probability theory is the right tool for inference, and an analogy with the better-known three-door problem stresses understanding the underlying logic rather than memorising answers.

25:08

🤔 Inference and probability, continued

This segment works through the probability calculation for the three-card problem in detail, showing how to solve an inference problem systematically: write down the joint probability of everything, condition on what has been observed, and read off the answer — here, a two-thirds probability that the hidden face matches the visible one. The exercise underlines the value and reliability of probability theory for inference and judgement problems.
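
The "write down the probability of everything" recipe used in this segment can be checked mechanically. The Python sketch below (my own illustration) enumerates the equally likely (card, orientation) configurations, conditions on seeing a white face, and recovers the two-thirds answer.

```python
from fractions import Fraction
from collections import Counter

# The three cards, each listed as (front colour, back colour) for one orientation.
cards = [("white", "white"), ("brown", "brown"), ("white", "brown")]

# Joint probability of everything: which card (1/3) times which way up (1/2).
joint = Counter()
for front, back in cards:
    joint[(front, back)] += Fraction(1, 6)
    joint[(back, front)] += Fraction(1, 6)

# Condition on the data: the face we can see is white.
white_up = {faces: p for faces, p in joint.items() if faces[0] == "white"}
norm = sum(white_up.values())
p_back_white = sum(p for (front, back), p in white_up.items() if back == "white") / norm

print(p_back_white)  # 2/3
```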

30:17

📡 Noisy channels and information measures

This segment introduces noisy channels and the ways information is measured in information theory. Starting from conditional probabilities, joint distributions and marginal distributions, it defines entropy, conditional entropy and mutual information. These quantities help us understand and quantify how efficiently and reliably information can be conveyed over a noisy channel, laying the groundwork for the discussion of noisy channel coding.
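
As a small taste of the noisy-channel machinery, here is a Python sketch of the binary symmetric channel — my own illustration, with an assumed flip probability f = 0.1 — which passes a block of input bits through the channel and estimates the empirical flip rate.

```python
import random

def bsc(bits, f, rng=random):
    """Binary symmetric channel: each bit is flipped independently with probability f."""
    return [b ^ (rng.random() < f) for b in bits]

random.seed(0)
f = 0.1
x = [random.randint(0, 1) for _ in range(100_000)]  # channel inputs
y = bsc(x, f)                                       # channel outputs

flips = sum(xi != yi for xi, yi in zip(x, y))
print(f"empirical flip rate: {flips / len(x):.3f}")  # close to 0.1
```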


Keywords

💡 Data compression

Data compression is the process of reducing the size of electronic data so that it occupies less storage space or bandwidth. In the video, the lecturer introduces applications of information theory through compression techniques such as Huffman coding and arithmetic coding, which allow data to be compressed close to its entropy limit, improving storage and transmission efficiency.

💡 Noisy channel coding

Noisy channel coding is a core concept of information theory: how to transmit information effectively when interference or noise is present. The video identifies it as the central theme of the course, addressing the problem of communicating reliably over a noisy channel.

💡 Ensemble

An ensemble in information theory is a set of possible outcomes or symbols together with their probability distribution. The video uses an ensemble of four symbols ABCD to ask for the probability that a randomly plucked encoded bit is a '1', a basic exercise in applying probability to the encoding process.

💡 Shannon entropy

Shannon entropy is a measure of information that reflects the uncertainty of a source. The video computes the entropy of the given ensemble to introduce the concept, and shows how it applies to data compression and channel coding and how it determines the ideal code lengths.

💡 Prefix code

A prefix code is a code in which no code word is a prefix of any other code word. The video notes that this guarantees unique decodability, an important consideration when designing efficient compression schemes.

💡 Arithmetic coding

Arithmetic coding is an efficient data compression method that encodes the whole message as a single unit rather than encoding each symbol separately, producing a compressed length close to the information content of the entire message. The video explains how arithmetic coding achieves better compression efficiency than traditional symbol codes.
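
A minimal sketch of the interval idea, assuming a fixed symbol model (my own illustration, not the lecture's implementation): each successive symbol narrows an interval inside [0,1), and the number of bits needed to name a point in the final interval is about log2(1/width) — the information content of the whole message — plus a couple of bits of overhead.

```python
from math import log2

def interval_for(message, probs):
    """Return the [low, high) interval that arithmetic coding assigns to a message."""
    low, width = 0.0, 1.0
    for symbol in message:
        start = 0.0
        for s, p in probs.items():      # cumulative probability up to this symbol
            if s == symbol:
                low += start * width    # shift into the sub-interval for this symbol
                width *= p              # shrink by its probability
                break
            start += p
    return low, low + width

probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
low, high = interval_for("abac", probs)
width = high - low
print(low, high)
print("information content:", log2(1 / width), "bits")  # 1 + 2 + 1 + 3 = 7 bits
```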

💡 Conditional probability

Conditional probability is the probability of one event given that another event has occurred. When discussing noisy channels and information transmission, the video uses conditional probabilities to describe the probability of an output given an input, the foundation for understanding and designing coding schemes.

💡 Information content

Information content measures how much uncertainty is removed when an outcome is learned. The video introduces it by relating it to probability distributions and code lengths, and explains how it helps us understand data compression and channel coding.

💡 Joint entropy

Joint entropy measures the overall uncertainty of two or more random variables taken together. The video introduces it with an example involving random variables X and Y, and explains how it relates to the marginal and conditional entropies and what role it plays in analysing information transmission.

💡 Mutual information

Mutual information measures the information shared by two random variables: how much learning one variable helps you know about the other. The video computes the mutual information of random variables X and Y to show how this concept captures the ability to convey information over a noisy channel.
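
To make these measures concrete, the Python sketch below (my own illustration; the variable names are mine) computes the marginal, joint and conditional entropies and the mutual information for the 4×4 joint distribution used as the example in the lecture.

```python
from fractions import Fraction as F
from math import log2

# Joint distribution P(x, y) from the lecture's example: rows are y = 1..4, columns x = 1..4.
P = [
    [F(1, 8),  F(1, 16), F(1, 32), F(1, 32)],
    [F(1, 16), F(1, 8),  F(1, 32), F(1, 32)],
    [F(1, 16), F(1, 16), F(1, 16), F(1, 16)],
    [F(1, 4),  F(0),     F(0),     F(0)],
]

def H(probs):
    """Shannon entropy in bits of a list of probabilities."""
    return sum(float(p) * log2(1 / float(p)) for p in probs if p > 0)

px = [sum(row[i] for row in P) for i in range(4)]   # marginal of X (bottom margin)
py = [sum(row) for row in P]                        # marginal of Y (side margin)

H_X = H(px)                              # 7/4 bits
H_Y = H(py)                              # 2 bits
H_XY = H([p for row in P for p in row])  # 27/8 bits
H_X_given_Y = H_XY - H_Y                 # 11/8 bits (chain rule)
H_Y_given_X = H_XY - H_X                 # 13/8 bits
I_XY = H_X - H_X_given_Y                 # mutual information, 3/8 bits

print(H_X, H_Y, H_XY, H_X_given_Y, H_Y_given_X, I_XY)
```

The chain rule used here, H(X,Y) = H(Y) + H(X|Y), is the "word story" told at the end of the lecture, and the printed values match the 7/4, 2, 27/8, 11/8 and 13/8 bits worked out on the board.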

Highlights

Introduction to data compression and recap of information theory, focusing on noisy channel coding.

Exploration of the probability of encoding a specific bit as '1' from a given source ensemble.

Voting process to understand the audience's grasp of probability calculation in information theory.

Explanation of calculating the probability of a bit being '1' through ensemble probabilities and code word fractions.

Discussion on the importance of code word length in determining the probability of encoding a bit as '1'.

Analysis of source coding theorem and its implication on data compression to its entropy limit.

Introduction to Huffman algorithm and arithmetic coding for optimal symbol codes and efficient compression.

Utilizing arithmetic coding to compress data down to within two bits of the entire file's information content.

Innovative approach to data compression using prediction by partial match (PPM) algorithm.

The process of simulating bent coin tosses efficiently using arithmetic coding.

Discussion on the importance of random bits in simulations and the inefficiency of standard methods.

Introduction to noisy channels and the concept of inference in information theory.

Exploration of conditional probabilities and entropies to understand information content in noisy channels.

Illustration of joint distribution and its importance in measuring entropy and mutual information.

Application of mutual information in noisy channels to understand communication reliability over them.

Highlighting the significance of using probability theory to accurately solve inference problems.

Transcripts

play00:03

welcome again today I'll recap what

play00:07

we've done in data compression and then

play00:10

we'll get back to the central topic if

play00:14

you like of the information theory

play00:15

course which is noisy channel coding

play00:17

which we looked at in lecture 1 let me

play00:22

start with a little question I set for

play00:26

you last time the question is about this

play00:29

ensemble ABCD with probability to half a

play00:32

quart at eight and eight and with the

play00:35

symbol code shown there and the question

play00:38

is if we are encoding from this source

play00:41

and then we reach into the encoded

play00:43

stream and pluck out one bit what's the

play00:44

probability that it's a one please talk

play00:47

to your neighbor about your answer to

play00:51

this question

play00:56

okay well let's have a quick vote there

play01:02

are three options there you've worked

play01:03

out p 1 I hope you've talked to your

play01:05

neighbor about it and I'm going to ask

play01:09

you is your p1 less than half

play01:11

half or bigger than a half and you can

play01:13

also vote for Zed which is I don't know

play01:15

votes for Zed I don't know okay votes

play01:19

for a p1 is less than a half okay a few

play01:23

votes there for a vote sir be p1 is

play01:30

equal to half 11 vote for be votes per

play01:37

seat p 1 is big enough everyone else

play01:40

except the people who okay we've got one

play01:43

there and no one voted for Zed so

play01:47

everyone else must be asleep already

play01:49

okay well we've got a range of answers

play01:54

there people from community a who've got

play01:57

p 1 less than a half could a volunteer

play02:00

take us through the argument what is

play02:03

your p1 and what was your method for

play02:07

calculating p1 since you're in the

play02:11

majority okay who voted for a all right

play02:18

what was what was your p1 what was the

play02:23

value anyone unknown

play02:31

okay well can I help out let's talk

play02:38

about the fraction of the code word that

play02:45

is made of ones and see what happens if

play02:50

we look at that so coded a the code word

play02:54

for a which happens half a time has got

play03:01

0 ones in it code word B is half ones

play03:07

code word C is two-thirds ones and D is a

play03:11

hundred percent ones so I defined a thing

play03:14

called fi it's the fraction of the code

play03:16

word which is ones so a lot of people

play03:19

are voting for a but you're being very

play03:21

shy about saying why and I'm going to

play03:24

now give a possible argument that might

play03:26

give an answer that's in the a camp so

play03:29

we pick a code word first of these

play03:32

probabilities then we pick a random bit

play03:34

to see if it's a one or a zero and these

play03:37

are the probabilities of it being a one

play03:38

or a zero so here's a calculation

play03:41

probability of getting a one I'll put a

play03:44

question mark on it cuz i want to check

play03:46

if I've got your reasoning right

play03:48

probability of getting a one is a sum

play03:50

over all the letters in the alphabet

play03:54

probability of picking that letter times

play03:58

the probability of it turning out to be

play04:01

a one when you pick a random bit from

play04:04

the code word yeah is that the sort of

play04:07

reasoning that the A community went

play04:08

through nod if talking is

play04:10

difficult ok we're getting

play04:11

some nods alright so we add that up and

play04:14

we've got a half times zero plus a

play04:17

quarter times a half plus an eighth times two-thirds

play04:25

plus an eighth times 1 alright which

play04:31

gives us an eighth plus an eighth plus an eighth times two-thirds

play04:41

which is an eighth times two and two-thirds two

play04:51

and two-thirds is 8 over 3 which has

play04:56

given us one third okay does anyone who

play05:01

voted for a think that this is roughly

play05:03

why they voted for a hands up if that's

play05:05

so okay we were getting somewhere so

play05:09

there's a candidate answer there of a

play05:11

third but not everyone agreed we've got

play05:13

some B's and C's does anyone want to

play05:15

challenge this and say no it's wrong

play05:18

here's an argument why that's wrong

play05:23

anyone or has everyone gone over to the

play05:27

a side yeah Martin okay so Martin just

play05:38

said if you reach into the stream and

play05:41

some of the code words are long and some

play05:43

are short and you reach in and pick

play05:45

around a bit you're more likely to hit

play05:47

one of the long code words all right so

play05:50

this isn't the correct way to pick a

play05:52

random bit from the stream we need to

play05:54

take into account the lengths so that's

play05:57

a good argument okay so this is probably

play06:00

wrong at least it's the wrong method it

play06:02

may not be the wrong answer so length

play06:07

matter length of code words matter any

play06:14

other argument for an alternative answer

play06:22

okay let me give you one other argument

play06:25

for this and then we'll do the

play06:26

calculation again what's the information

play06:31

content of this probability here answer

play06:36

one bit to bits three bits and three

play06:39

bits what's the ideal encoded lengths

play06:42

for probabilities the ideal encoded

play06:45

lengths are the shannon information

play06:47

contents which are all from now on i

play06:50

will just call them the information

play06:51

content what are the length of the

play06:54

code words length I is 1 2 3 and 3 so

play06:59

the length match the information

play07:01

contents so this is a case where we have

play07:04

a symbol code with perfectly matched

play07:08

code lengths to the probabilities that

play07:12

means you can't compress any better than

play07:14

this because we have our source coding

play07:17

theorem that says you can compress down

play07:18

to the entropy and if the expect if the

play07:21

length of your code words match the

play07:22

information contents it's unbeatable

play07:24

good and this is a valid code it's a

play07:27

prefix code so this this will work now

play07:32

if we use this code to encode stuff from

play07:35

the source and if it were the case that

play07:38

the bits coming out after encoding were

play07:42

not 5050 zeros and ones completely

play07:44

random independent unrelated to each

play07:46

other if they were not 5050 then we

play07:51

could compress it some more because

play07:53

shannon says if the probabilities are

play07:55

5050 we can make codes you know with

play07:57

dust bin bags full of lottery tickets

play07:59

and all that sort of stuff we could do

play08:00

further compression but you can't

play08:02

compress anymore because we've already

play08:04

got it down to the entropy therefore

play08:06

without doing any calculation at all it

play08:08

must be the case that p1 is a half so b

play08:11

is the right answer because it's

play08:14

perfectly compressed and if it's

play08:16

perfectly compressed it'll look random

play08:18

ok so the right answer is B so what went

play08:22

wrong with this calculation well let's

play08:24

just redo it and do the calculation in a

play08:27

way that takes into account lengths legs

play08:30

think about it this way we've got a

play08:32

great big bucket we're going to put bits

play08:36

into the bucket each time you pick a

play08:39

codeword some more bits go in the bucket

play08:42

and some of those are ones and we're

play08:44

interested in the ratio in this bucket

play08:46

what fraction of them are 1s so let's

play08:54

give ourselves an extra column in here

play08:56

which is not the fraction that are 1s

play08:58

but the actual number that are 1s so

play09:01

the number of 1s in this code word is zero

play09:04

and this one is one and this one

play09:05

it's two and in this one it's three so

play09:11

on average when I pick another code word

play09:15

how many new bits go into the code word

play09:19

we'll put that down stairs and the

play09:22

answer is sum over i p_i l_i and on

play09:29

average how many 1s go into the bucket

play09:31

well that's sum over i p_i n_i okay

play09:38

how does that differ from what we did

play09:41

over here well f_i is the ratio of n_i

play09:45

over l_i so you can see we're using a

play09:52

different thing upstairs and to

play09:53

compensate for that we're using

play09:54

something different downstairs okay so

play09:56

this is the average number of 1s that

play09:58

go in the bucket whenever you pick a new

play10:00

code word this is the average number of

play10:02

bits that go in the ratio of those is

play10:04

going to be the fraction of bits that I

play10:07

want if we do that we have the upstairs

play10:12

thing is a half times zero plus a

play10:22

quarter times 1 plus an eighth times 2 plus an eighth times 3 and

play10:32

downstairs it's our old friend the

play10:34

expected length which is the entropy of

play10:37

this probability distribution which is

play10:41

is it not 7 over 4 i'm doing this from

play10:47

memory and the upstairs terms are a

play10:55

quarter and a quarter and three-eighths so in

play11:02

eighths we've got two plus

play11:06

two plus three which is seven over 8

play11:08

over 7 over 4 which is a half
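
Here is a quick numerical check of both calculations — a sketch of my own, not from the lecture: weighting the fraction of 1s in each code word by the symbol probabilities gives the wrong answer 1/3, while dividing the expected number of 1s per code word by the expected code-word length gives the correct answer 1/2.

```python
from fractions import Fraction as F

# The lecture's ensemble and symbol code: lengths match the information contents.
p    = {"a": F(1, 2), "b": F(1, 4), "c": F(1, 8), "d": F(1, 8)}
code = {"a": "0",     "b": "10",    "c": "110",   "d": "111"}

# Wrong method: average the per-code-word fraction of 1s.
wrong = sum(p[s] * F(code[s].count("1"), len(code[s])) for s in p)

# Right method: expected number of 1s per code word over expected code-word length.
right = sum(p[s] * code[s].count("1") for s in p) / sum(p[s] * len(code[s]) for s in p)

print(wrong, right)  # 1/3 and 1/2
```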

play11:13

so that's the long way to get to the

play11:15

answer that a sneaky information theory

play11:17

based answer is just to take it's

play11:18

perfectly compressed case closed is

play11:21

fifty-fifty and it's not only fifty

play11:24

fifty it's got no correlations in it at

play11:27

all that there can be no correlations in

play11:29

the encoded stream because

play11:31

otherwise we could use the information

play11:33

theory argument and say all those

play11:35

correlations could be exploited those

play11:37

dependencies any questions so we love

play11:43

information theory it provides you with

play11:45

very powerful very brief arguments for

play11:48

things that mean that you don't need to

play11:49

do calculations anymore less time let me

play11:53

just recap what we did we introduced

play11:56

symbol codes or rather the time before

play12:00

last we introduced symbol codes and

play12:04

symbol codes are widely used we have a

play12:07

Huffman algorithm that gives you optimal

play12:08

symbol codes and their expected length

play12:10

are within one bit per character of the

play12:14

entropy but we argued in the previous

play12:17

lecture we could do better that doesn't

play12:21

actually wrap up compression we don't

play12:22

like symbol codes because being within

play12:24

one bit per character is not very good

play12:26

if the actual entropy per character is

play12:29

something small like 0.1 bits and we

play12:31

argued that often that will be the case

play12:33

if say we're compressing English we play

play12:38

the game that gave us the idea that you

play12:39

could use identical twins at the sender

play12:41

and the receiver in order to make a

play12:43

compressor and then we looked at

play12:45

arithmetic coding and arithmetic coding

play12:47

is this way of thinking about a binary

play12:49

file as defining a single real number

play12:53

between 0 and 1 to an extremely large

play12:55

precision and any source file can be

play12:59

represented in the same way as in

play13:01

corresponding to an interval on the real

play13:03

line between 0 and 1 and I argued that

play13:10

when you do arithmetic coding you will

play13:13

get compression down to within two bits

play13:15

of the information content of the entire

play13:18

file so this is in terms of its overhead

play13:20

this plus two is potentially about a

play13:22

factor of n smaller than the overhead

play13:25

you

play13:25

get if you use symbol codes so

play13:27

arithmetic coding is in almost all cases

play13:29

far far better than symbol codes and I

play13:37

just wanted to flesh out how you'd use

play13:40

arithmetic coding in practice I I talked

play13:42

about the idea of an Oracle that being a

play13:44

piece of software which uses some

play13:46

procedure to predict the next character

play13:48

given all the characters x1 through xt

play13:51

minus 1 that have been seen and I wanted

play13:54

to just give you an example of how this

play13:56

works in real state-of-the-art

play13:57

compressors so here is a fragment of

play14:00

jane austen's emma this is the last 50

play14:05

or so characters and the context is

play14:08

alarmed at the BR 0 and the question is

play14:10

what comes next so how do real

play14:12

compression algorithms work well a

play14:16

fairly near state-of-the-art compressor

play14:19

that's very widely used is a compressor

play14:21

called ppm and here's how ppm works it's

play14:25

called prediction by partial match and

play14:27

it's slightly ad-hoc but there's a bunch

play14:30

of theoretical justifications and

play14:32

improvements we can make to it but let

play14:34

me just describe the ad hoc method

play14:36

here's the idea we look at the context

play14:38

of the last five characters of the

play14:41

document to define the context for the

play14:43

next character then we look at the

play14:46

entire file of everything that we've

play14:48

seen so far and we say have we seen that

play14:51

context before in the entire file so

play14:55

this is how we're going to then predict

play14:57

what happens next we go and find all of

play14:59

those occurrences and here they are

play15:00

shown in blue so II space PR 0 has

play15:03

happened one two three four five times

play15:06

already in Jane Austen's Emma and this

play15:09

shows what happened next there was a

play15:10

v a v an o a p and an s so a first stab

play15:15

would be to say let's use these six gram

play15:17

statistics the raw 6pm statistics to say

play15:20

all right there's a 2 60 to 1 how many

play15:23

were the one two three four five is 75

play15:26

times so there's a 256 chance of getting

play15:28

a V next there's a one with ants of 0 a

play15:32

1 5th Chancellor p and a wampa the

play15:34

chance of an S but it also haven't seen

play15:37

that much data so we better

play15:39

allow some probability for other things

play15:41

happening how do we allow for other

play15:43

things happening well the PPM approach

play15:45

is to say let's back off to shorter

play15:49

length contexts as well and say what

play15:51

happened in context space PR 0 PR 0 r 0

play15:55

0 and in any context at all and fold in

play16:00

those other one two three four five

play16:03

possibilities in addition to the six

play16:05

Graham statistics and wait them together

play16:08

in a smartish way and that's how ppm

play16:12

works and it gives you extremely good

play16:15

compression state-of-the-art on most

play16:20

files it doesn't do as well as humans do

play16:23

human audiences like yourselves can

play16:25

predict English a lot better and clearly

play16:27

it doesn't understand grammar and

play16:29

dictionaries and and so forth but it

play16:31

does a lot of learning of raw statistics

play16:33

and that that gets you a long way

play16:35

towards good compression so that's ppm
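
A toy Python sketch of the idea — my own simplification, not the real PPM algorithm: to predict the next character, count what followed each earlier occurrence of the current context in the text seen so far, and back off to shorter contexts, blending the counts with geometrically decreasing weight. Real PPM uses cleverer escape probabilities, but the back-off structure is the same.

```python
from collections import Counter

def backoff_prediction(history, max_context=5, decay=0.5):
    """Blend next-character counts from contexts of length max_context down to 0."""
    prediction = Counter()
    weight = 1.0
    for k in range(max_context, -1, -1):
        context = history[len(history) - k:] if k else ""
        counts = Counter()
        # Count what followed each earlier occurrence of this context.
        for i in range(len(history) - k):
            if history[i:i + k] == context:
                counts[history[i + k]] += 1
        total = sum(counts.values())
        if total:
            for ch, c in counts.items():
                prediction[ch] += weight * c / total
        weight *= decay  # shorter contexts get geometrically less weight
    norm = sum(prediction.values())
    return {ch: v / norm for ch, v in prediction.items()}

text = "alarmed at the breakfast table, alarmed at the br"
print(backoff_prediction(text))
```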

play16:48

okay so the question was does it learn

play16:51

the six grams statistics of the document

play16:54

of the entire document and then compress

play16:56

it or does it do it on the fly and the

play16:58

important answer is it's just like you

play17:00

guys it the identical twin audiences it

play17:04

does things on the fly so when we're

play17:06

asked to predict in this context we then

play17:09

say how often has a space PR 0 happened

play17:12

already before now we don't look ahead

play17:14

into the future so the predictions are

play17:16

based only on what we have already seen

play17:18

okay any other questions there's a

play17:24

couple more things I want to mention

play17:25

about arithmetic coding one is as I

play17:28

showed you last time we can make other

play17:30

uses of arithmetic coding we can make an

play17:32

extremely efficient writing system and

play17:35

the one I showed you was called Dasher

play17:37

that's based on the idea that writing

play17:41

involves making gestures of some sort

play17:43

maybe wiggling fingers or scribbling

play17:45

with a stick and we want to turn those

play17:47

gestures into text as efficiently as

play17:49

possible and compression is all about

play17:52

taking text and turning it into a bit

play17:54

string which in arithmetic coding is

play17:56

views viewed as a real number and so we

play17:58

can make a very good writing system by

play18:00

saying okay let's have the gesture just

play18:01

define a real number by steering our

play18:04

spaceship in a two-dimensional world and

play18:06

that will make us if we turn it on its

play18:08

head a very efficient writing system so

play18:10

you make your gesture and outcomes text

play18:12

so that's the dasher concept and I

play18:15

showed you that and it's free software

play18:17

and help is always welcome to make Dasher

play18:20

better and there's another application

play18:23

for arithmetic coding I wanted to

play18:25

mention and this is the idea of

play18:27

efficiently generating random samples

play18:30

let me give you an example let's say I

play18:33

asked you to simulate a bent coin and I

play18:37

said okay let's fix p_a to I don't know

play18:44

0.01 and p_b is the rest please simulate

play18:51

10 million tosses of this bent coin and

play18:54

I will provide you with random bits a

play18:56

random number generator the question is

play18:58

how many bits will you need to use to

play19:01

generate

play19:02

10 million tosses of this coin let's say

play19:08

n from now on instead of 10 million a

play19:11

standard way of solving this problem is

play19:13

to say oh well i'll use the random bits

play19:16

you can provide me with to make real

play19:20

numbers between 0 & 1 so i'll generate

play19:24

things that i'll think of as real

play19:26

numbers uniformly distributed between 0

play19:28

& 1 that will guzzle up 32 bits per real

play19:35

number if it's a standard random number

play19:38

generator so we read in 32 bits

play19:40

interpret them as defining a number

play19:42

between 0 & 1 then we look to see if u

play19:44

is less than PA and if it is then we

play19:49

spit out an A and otherwise we spit out

play19:50

a B this method will cost you 32 bits

play19:54

per character alright can we do better

play19:59

and do we care the answer is yes and

play20:02

maybe so yes we can do much better

play20:06

because we could take an arithmetic

play20:07

coder and think about it this way let's

play20:11

wipe the board imagine sticking a pin

play20:16

into the world of an arithmetic coder

play20:20

what happens if you stick a pin

play20:23

uniformly at random into this line that

play20:27

goes from 0 to 1 or into one

play20:35

of those lines what comes out what's the

play20:38

probability of what comes out if you

play20:43

stick in a pin and use it to select a

play20:45

string of text

play20:52

have a chat to your neighbor

play20:59

okay what I've drawn on the board here

play21:01

is a couple of different arithmetic

play21:02

coding worlds that run from 0 to 1 this

play21:08

is 1 for some source within alphabet

play21:10

with characters ABCD with different

play21:12

probabilities and what we noticed last

play21:15

time was the arithmetic coding method

play21:18

gives you intervals between zero and one

play21:22

whose size is the probability of the

play21:25

string equalling that particular string

play21:27

probability that x is a followed by c is

play21:30

the size of this little interval here

play21:33

here I've drawn the one for the bent

play21:35

coin this is ninety-nine percent and

play21:39

this is one percent and then we recurse

play21:44

and subdivide this in the same way and

play21:47

we get this is the little bit for a be

play21:50

in here and then here's a bit for a a B

play21:54

and and so forth alright if we come

play21:58

along and stick in a pin the probability

play22:00

that the pin will land inside this

play22:02

interval here is this probability so the

play22:05

probability that if you then read out

play22:07

the string we've got the probability it

play22:09

will begin AC is the probability that a

play22:12

string begins with AC so sticking in a

play22:14

pin at random is a way of picking a

play22:18

string from the probability distribution

play22:20

of your language model okay so what we

play22:25

can do with bent coins is make the

play22:28

arithmetic decoder encoder and decoder

play22:31

for the bent coin then bring along

play22:34

random bits random notes and ones so as

play22:38

to define where a pin goes in the the

play22:41

pin you're sticking a pin between 0 and

play22:43

1 you do that by having an infinite

play22:46

source of zeros and ones to define where

play22:49

we are on the line and so you start

play22:53

guzzling up those zeros and ones find

play22:55

out where the pin is to the first bit two

play22:59

bits three bits come over and start

play23:02

reading out bits with your decoder

play23:03

reading out A's and B's with your

play23:06

decoder how many bits is that going to

play23:08

need well on average

play23:10

it's just like compression whatever the

play23:14

binary entropy of naught point naught

play23:15

one is point one or something like that

play23:19

that's on average how many bits you will

play23:22

need so in contrast to needing 32 bits

play23:25

per coin toss you'll need this many bits

play23:31

so the arithmetic coding method for

play23:36

generating random samples for the bent

play23:38

coin is going to need binary entropy of

play23:43

point 0 1 bits per coin toss which is

play23:50

smaller than one quite a lot smaller

play23:52

than 1 so the it's going to be more than

play23:55

a fact of 32 times more efficient with

play23:57

your random bits and the random bits are

play24:00

a valuable thing you may well want to do

play24:03

this it may strike you as elegant and it

play24:05

may actually save you money and I want

play24:07

to close this discussion by just giving

play24:13

you a little bit of evidence that we do

play24:15

actually care about saving random bits

play24:17

because random bits are not free if you

play24:19

want random bits you have to get them

play24:20

from somewhere and if what you're doing

play24:23

has any sort of principal purpose you

play24:25

need to verify a good random bits and so

play24:30

here's a CD that I received in the post

play24:31

a few years ago and the CD came with a

play24:35

message saying that this CD contains

play24:38

random bits and they've been generated

play24:40

thanks to the financial support of a

play24:42

couple of US government grants and

play24:45

they're distributed under that grant

play24:48

from Florida State University to

play24:50

academics all around the world who want

play24:52

random bits so random bits are valuable

play24:54

and this is proof that serious

play24:59

researchers and funders viewed random bits

play25:01

as being important enough to actually

play25:02

spend US government money on so that's

play25:05

an argument for not wanting to waste

play25:07

random bits any questions okay we have

play25:13

now feel

play25:32

yes so the question is are there other

play25:37

ways of thinking why this is a bad idea

play25:38

to do with the precision of the numbers

play25:40

you need so to make this decision about

play25:42

is it an A or a B often you only need

play25:47

the first bit to already tell you okay

play25:49

we don't need to look at the next 31

play25:52

bits the first bit has already made the

play25:53

decision so we're wasting bits there and

play25:55

conversely actually if you look in

play25:58

detail at the probability that this will

play26:01

spit out a 1 it's not going to be

play26:03

exactly one percent is it it'll be some

play26:06

ratio let's call it m over 2 to the

play26:15

32 where m is some integer and so it's

play26:18

not actually going to perfectly nail

play26:20

exactly one percent it'll be within 1 part

play26:22

in 2 to the 32 which is pretty good but

play26:24

it won't give you the right exactly the

play26:27

right answer so yes it is wasteful

play26:32

because it's using loads of bits to make

play26:35

a decision about a single bit and

play26:38

arithmetic coding is better any other

play26:42

questions okay so we're going to now

play26:46

move on to the topic of noisy channels

play26:48

and noisy channels have inputs and

play26:51

outputs and one of the things we want to

play26:52

do with noisy channels is infer what the

play26:55

input might have been given the output

play26:57

of the channel so a theme of noisy

play27:01

channels is going to be we need to

play27:02

understand inference and once we're

play27:04

clear on how to do inference we'll also

play27:06

be talking about how to measure

play27:08

information content in the context of

play27:10

noisy channels so we know how to measure

play27:12

information content for a single random

play27:14

variable like tossing a bent coin or

play27:16

drawing a random letter from English

play27:18

it's all about log 1 over P and that's

play27:21

it but for noisy channels it's all a

play27:23

little bit more interesting because

play27:25

we've got inputs we've got outputs we're

play27:27

interested in questions

play27:28

like how much information does seeing

play27:30

this output convey about the input so we

play27:33

want to learn how to do that and start

play27:36

off with I just want to give you a

play27:38

little puzzle to think about and this it

play27:43

doesn't involve a noisy channel it just

play27:45

involves three cards and here are the

play27:47

three cards and I'll introduce you to

play27:49

them one at a time here's a card which

play27:52

is white and white on the front and the

play27:57

reverse okay here's a card that's brown

play28:01

on the front and the reverse H and

play28:08

here's a card that's brown on one side

play28:10

and white on the other so we've got

play28:16

three cards and now what I'm going to do

play28:18

is I'll shuffle them and turn them over

play28:19

randomly and hide them and draw one card

play28:23

at random and plunk it down in front of

play28:26

you so you can see it and then I'm going

play28:29

to ask you about the other side of the

play28:30

card all right so I shuffle and

play28:33

randomize I'm not doing any magic trick

play28:35

so everything you see is

play28:37

what's happening so we shuffle them up

play28:41

and then I randomly grab a cart and

play28:45

plunk it down like that all right and I

play28:49

show you the front and one side of the

play28:53

card and it's white and the question is

play28:56

what's on the other side the question is

play29:04

given that you're now seeing a white

play29:06

face what's the probability that the

play29:09

other face this car is white please turn

play29:12

to your neighbor and I'll have a chat

play29:13

about that question

play29:20

okay folks for Z which means I don't

play29:24

know okay we've got a few their nose

play29:28

votes for the probability is less than a

play29:32

half anyone no votes it's bigger than a

play29:36

half a few huh seven ish and votes for

play29:42

see it's half its 5050 one two three

play29:44

four five six grand ok so we have some

play29:48

clear uncertainty about this grant I

play29:51

like it when that happens so let's have

play29:55

an argument from the second community

play29:58

why is it 5050 anyone yeah

play30:17

good so the answer I just heard was by

play30:20

logic it's either this card or that card

play30:23

that's sitting down there on the desk

play30:24

and we don't have any additional

play30:26

information so it must be 5050 okay and

play30:30

we've got a load of people who disagree

play30:31

with you so would one of them like to

play30:33

give an argument why that compelling

play30:35

argument of logic was wrong yes okay so

play30:50

what I just heard was the three possible

play30:52

ways of getting a white thing facing up

play30:54

you're either looking at that side or

play30:55

that side or that side and all three of

play30:57

those possibilities are equally likely

play30:59

so either this hypothesis or that one or

play31:02

that one and one we turn it out turn it

play31:06

over we'll find out whether the other

play31:09

side is this or the other side is this

play31:11

or is that so that means it's two-thirds

play31:14

yeah two-thirds chance here's what that

play31:19

argument steered towards good does

play31:24

anyone want to add anything change your

play31:27

mind give another argument look ray yeah

play31:48

okay that's a nice argument so the

play31:51

argument said think about the other side

play31:53

of the card which is either going to end

play31:55

up being white or brown and right at the

play31:58

beginning before i brought out the card I

play31:59

could have asked you please place a bet

play32:01

is the the face that you can't see going

play32:03

to be white or brown and we would say

play32:05

definitely 5050 okay and now the

play32:08

question is we see the front will that

play32:12

make us change our bet at all surely it

play32:15

must make a difference and that's a very

play32:19

credible argument that there is a

play32:21

dependence between the front on the back

play32:23

so surely you shouldn't be betting 5050

play32:25

anymore it doesn't this argument doesn't

play32:27

tell you what the right answer is but it

play32:28

does argue definitely not 5050 because

play32:31

we've just learned something we know

play32:33

something that we didn't know and there

play32:34

is a dependence and that's a good

play32:36

argument to all right so let me give you

play32:40

a suggestion always write down the

play32:44

probability or everything and this is a

play32:56

copyright Steve doll who is a professor

play33:01

over in the building next door write

play33:05

down the probability of Everett of

play33:06

everything all the variables in the

play33:08

problem then you can condition on what

play33:10

has happened and you will be able to

play33:11

deduce by the simple laws of probability

play33:14

what the answer to your question is so

play33:17

in this case everything is the front

play33:21

face and the reverse space and before I

play33:28

did the shuffling and drew the card out

play33:31

the probability of the front face being

play33:35

white or brown and the reverse face

play33:37

being white or brown the probability of

play33:41

this pair of variables being white and

play33:45

white was one-third the probability of

play33:48

it being brown and brown was one-third and

play33:51

the probability of the other two things

play33:53

is a third because you get that from the

play33:56

third card the brown-white card this chap

play33:58

coming along and it's 5050 which way

play34:01

he'll go

play34:02

so a sixth chance there and a sixth

play34:05

chance there that is the joint

play34:07

probability of everything and now with

play34:11

this joint probability we can condition

play34:16

on data and anything else you want to

play34:19

throw in and data is what we're given

play34:25

and what we're given is we're in this

play34:26

world so today the white face came up so

play34:30

we condition on front being white and we

play34:36

can then work out the probability that

play34:38

the reverse is white for example by

play34:46

renormalizing this probability distribution and

play34:48

we get to this okay this may not

play34:54

convince people who still really think

play34:56

it's 5050 dammit it's 5050 how can you

play34:58

say it's two-thirds that's rubbish and

play35:00

so I've got one final argument that I

play35:03

hope will help and then once this

play35:07

argument has been accepted maybe it will

play35:09

compel you to agree that it's a good

play35:11

idea to write down the probability of

play35:12

everything so here's the final argument

play35:16

and it's not that the reverse was indeed

play35:18

white that happened to be true but it

play35:20

didn't have to be the argument goes like

play35:23

this imagine that we play this game lots

play35:25

of times and instead of asking what's

play35:27

the probability the other face is white

play35:29

I always ask you the question what's the

play35:31

probability that the other face is the

play35:33

same as the one you're seeing now okay

play35:35

so this one they're the same this one

play35:40

they're the same this one they're not

play35:42

the same so two-thirds of the time the

play35:44

face will be the same on the back as it

play35:46

is on the front so when you see a white

play35:48

there's a two-thirds chance at the back

play35:50

is white and when you see a brown

play35:52

there's a two-thirds chance to the back

play35:53

is brown okay so that's not based on

play35:57

probability theory and I think it's a

play35:58

fairly compelling argument that the

play36:00

correct answer is two thirds and

play36:02

hopefully that's convinced you that we

play36:04

should use probability theory because if

play36:06

you can't actually reliably solve this

play36:09

exercise involving just one random

play36:12

variable which is which card and another

play36:14

random variable which is which

play36:15

way up it went so it's a two random

play36:17

variable problem if a smart audience of

play36:20

graduates from Cambridge can't reliably

play36:23

solve this problem just say oh yeah I

play36:25

know it's fine I've got the answer that

play36:28

really shows you yeah inference is a

play36:31

little bit tricky inference isn't

play36:33

totally straightforward humans are often

play36:35

very good at inference but if you want

play36:36

to be sure you're getting it right use

play36:38

probability theory because probability

play36:40

theory will always get you the right

play36:41

answer okay all right so let's now talk

play36:49

some more about noisy channels and

play36:53

actually let me just remind myself what

play36:56

comes up let's talk through through this

play36:58

what where we're heading is we're going

play37:01

to talk a bit bit more about inference

play37:02

and we're also going to talk about

play37:05

information measures for noisy channels

play37:06

our classic noisy channel is the binary

play37:09

symmetric Channel up there top right

play37:11

which is obscured by a wireless icon

play37:16

okay go away poof okay the binary

play37:24

symmetric channel with input 0 and 1 and

play37:27

it flips a fraction f of the bits so

play37:29

we'll talk a lot about that channel in

play37:31

due course but we want to understand

play37:32

coding theory for any noisy channel and

play37:36

whenever we're dealing with noisy

play37:37

channels we need to know how to do

play37:39

inference and I used to introduce people

play37:40

to inference with a different puzzle

play37:42

instead of the three cards I would show

play37:44

people the three Doors problem where the

play37:47

game show host says here's the rules of

play37:49

the game I'm going to hide a very

play37:51

desirable prize here a national rail

play37:53

pass behind one of these three doors the

play37:58

game show host always explains the rules

play38:00

first before the game happens and the

play38:01

rules out I will hide the prize behind

play38:03

one of these doors then I will ask you

play38:06

the player to choose a door or you have

play38:08

a free choice and you choose it by

play38:09

naming it and we don't open it then I

play38:13

the game show host guarantee that I will

play38:15

open another of the doors not the one

play38:17

you chose and I guarantee when I do that

play38:19

promise me I promise you trust me that

play38:22

the prize will not be revealed at that

play38:24

stage so then the prize is clearly

play38:27

either behind door

play38:29

one or door two in this example shown

play38:31

here door one being the one you chose

play38:33

door two being the other door that you

play38:35

didn't open and then the player gets the

play38:38

chance to stick or switch if you want

play38:40

and then you'll get what's behind the

play38:44

final door you end up at after either

play38:46

sticking or switching and the options

play38:49

for this puzzle then worked like this

play38:52

that was the rules being explained now

play38:54

now you go ahead and play the game the

play38:57

player chooses door one the host says in

play39:00

accordance with the rules i will now

play39:01

open either door two or three and will

play39:03

not reveal the prize and as promised he

play39:05

opened a door in this case door three it

play39:08

doesn't reveal the prize and the

play39:10

question is should you stick should you

play39:16

switch or does it make no difference and

play39:28

i used to use this as my introductory

play39:31

example on probability theory because

play39:32

people would argue very hotly about in

play39:34

say well it's either behind door one

play39:36

door one or two it's 5050 yeah this opened

play39:41

door doesn't make any difference it's

play39:42

5050 and other people would say oh no no

play39:45

no you should switch and some people

play39:47

would say you should stick and it used

play39:50

to be contentious but unfortunately most

play39:52

people have heard of this puzzle and so

play39:54

it doesn't really work anymore but let's

play39:56

have a quick vote votes for a votes for

play39:59

B you should switch votes for C it

play40:02

makes no difference so you see it's a

play40:03

useless educational device it's been

play40:07

ruined because everyone's gone and

play40:09

talked about it and the really annoying

play40:11

thing is they've talked about it in a

play40:13

way such that no one has actually learnt

play40:15

anything and you are my proof of that

play40:18

because a moment ago you voted for a lot

play40:21

of you you're not all the proof but this

play40:24

lot yep this controversy here between B

play40:27

and C proves that an educational

play40:32

disaster has happened the three Doors

play40:33

puzzle which is a fantastic puzzle it's

play40:36

really educational has been ruined

play40:38

because everyone now knows the answer

play40:39

but they don't understand

play40:40

it and A B C here are exactly

play40:44

equivalent to each other the three cards

play40:46

and the three doors they are the same

play40:48

problem as each other and yet I showed

play40:52

you the same problem and you didn't get

play40:53

it right even though you'd allegedly

play40:54

learnt the answer so learning isn't

play40:58

about memorizing things it's about

play40:59

understanding so that's my little rant

play41:01

on the three Doors so the message people

play41:06

should be getting from the three Doors

play41:08

problem is don't memorize the answer to

play41:09

a stupid puzzle because then you'll be

play41:11

useless at solving future inference

play41:13

problems instead learn the message that

play41:15

you should use probability theory and

play41:17

then you will be equipped to solve any

play41:19

puzzle of this type in the future ok

play41:22

rant over so what we do now let's talk

play41:26

about noisy channels what we're going to

play41:31

do with noisy channels is they're always

play41:33

going to have an input and an output and

play41:35

a channel defines a set of conditional

play41:40

distributions if you conditioned on an

play41:41

input the channel defines what the

play41:43

probability of the output is going to be

play41:45

the channel doesn't define a Joint

play41:47

Distribution it just defines a set of

play41:50

conditional distributions and we can

play41:53

only actually do inference if we've got

play41:55

the probability of everything so we need

play41:57

a joint distribution so I'm going to run

play42:00

through a very simple example where I

play42:02

assume I've got a joint distribution

play42:07

okay where are we yeah

play42:13

I'll write down a joint distribution and

play42:15

I'll define some information measures

play42:17

for that joint distribution then when we

play42:21

move on to channels we'll have to get a

play42:24

joint distribution by adding to the

play42:26

conditional distributions defined by the

play42:28

channel a distribution on the inputs so

play42:31

that's how its we're going to join up

play42:32

I'll start with a joint distribution or

play42:35

a joint ensemble and here's an example

play42:41

of one and i'll define all the different

play42:45

information measures we can make for

play42:49

this on joint ensemble so i'm going to

play42:51

tell you the joint probability of two

play42:53

random variables x and y which are

play42:55

dependent random variables a bit like

play42:57

the front and the back of the three

play42:59

cards so the first variable x can take

play43:02

on values 1 2 3 or 4 y can take on

play43:06

values one two three or four and the

play43:10

joint probabilities are 1/8 1/16 1/32

play43:16

1/32 then 1/16 1/8 1/32 1/32 then

play43:27

1/16 all across and then a quarter & 0

play43:34

and 0 and 0 from a joint distribution

play43:39

you can always define marginals and

play43:42

marginal probabilities are written in

play43:47

the margins and they are obtained by

play43:49

adding up and here we have a quarter a

play43:52

quarter a quarter and a quarter as it

play43:54

happens for the probability of Y and

play43:57

then in the bottom margin we have a half

play44:00

one quarter an eighth and an eighth now we already defined

play44:11

for a single random variable how to

play44:13

measure its entropy and so when we've

play44:17

got a joint ensemble there's a bunch of

play44:18

straightforward things we can do with

play44:20

that entropy definition

play44:29

we can write down what the marginal

play44:34

entropy of X is and we can write down

play44:36

what the marginal entropy of y is for

play44:40

example so I'm defining for youth by

play44:44

example what marginal entropy means

play44:46

muzzle entry of X is the entropy of this

play44:49

distribution here and that is 7 over 4

play44:53

bits the marginal entropy of why is the

play44:57

entropy of this distribution and that's

play44:59

two bits and we can define the entropy

play45:04

of the pair X Y so you could think of

play45:08

the pair X Y as being a single random

play45:10

variable that takes on these 16 possible

play45:13

values with the 16 probabilities written

play45:16

half and you can say what's the entropy

play45:18

of this distribution here so we just do

play45:22

normal sum over all pairs X Y Z of XY

play45:28

log base 2 1 over P of X and Y ok and

play45:36

that comes out to 27 / 8 bits for this

play45:42

joint distribution and those numbers

play45:47

I've just written down these three

play45:48

numbers have the property that if you

play45:50

show them graphically and draw H of X

play45:57

that way and h of x and y this way and

play46:04

then you try and make H of Y nudge up

play46:07

against this right hand side here you

play46:12

find they don't quite match up so H of X

play46:15

and H of Y added together are a bit

play46:17

bigger than h of x and y for this

play46:21

example and this is actually always true

play46:24

that these guys will always overlap or

play46:26

they might just up a butt up against

play46:29

each other and exactly add up to the

play46:31

joint entropy and they only ever do that

play46:33

if x and y are independent random

play46:36

variables

play46:37

so the size of this overlap here is a

play46:41

measure of how much dependence there is

play46:43

between the random variables let me

play46:51

define a few more things we can do with

play46:52

the Joint Distribution we can define

play46:54

conditional probabilities for example p

play47:02

of x given Y is defined to be the joint

play47:06

probability of x and y divided by

play47:08

probability of Y and we can write it out

play47:14

for all 16 values of X&Y so here is p of

play47:27

x given y for y going from 1 2 3 4 and x

play47:33

going 1 2 3 4 so if we conditioned on

play47:37

why being one then we can read out that

play47:40

top row probability and normalize it and

play47:44

we get a half a quarter an eighth an eighth the next

play47:50

row goes a quarter a half an eighth an eighth when we

play47:53

normalize it when we normalize the third

play47:55

row they're all equal so we get a

play47:57

quarter a quarter a quarter a quarter

play48:01

when we normalize the bottom row we get

play48:05

1 0 0 0 so those four probability vectors

play48:09

are the four possible conditional

play48:11

distributions and for every one of those

play48:15

probability distributions we can write

play48:17

down its entropy so we can write down

play48:18

the entropy of x given that Y is equal

play48:23

to 1 we can write the entropy of x given that

play48:26

y equals 2 and so forth and the entropy

play48:30

of this guy here this distribution it's

play48:33

fairly familiar it's seven over four this

play48:36

one is 7 / 4 as well and we have the

play48:40

uniform distribution is two bits and the

play48:44

entropy of one and three zeros is zero

play48:48

bits cuz it's a certain

play48:51

distribution we know conditioned on y being four

play48:53

that X is definitely one so what do we

play48:57

think about that well we notice that

play49:00

it's possible for the conditional

play49:04

entropy of x given a particular value of

play49:08

y to be the same as what we started with

play49:13

7 over 4 or the same or bigger or

play49:18

smaller so you when you get data about

play49:22

why it's possible for your uncertainty

play49:25

about X to either go up stay the same or

play49:28

get less so that's an interesting

play49:31

observation and something we can do with

play49:35

all of these is we could say on

play49:36

average when we learn why how much

play49:39

information content will X then have so

play49:43

what's the average entropy of x given Y

play49:46

so I'm now defining for you a thing

play49:48

called the conditional entropy all of

play49:57

these were conditional entropies as well

play49:59

there were four particular conditional

play50:02

entropies for particular values of Y and

play50:04

then this is the really important one

play50:05

the conditional entropy which is the

play50:07

average averaging over all possible

play50:09

values of Y weighting together the entropy

play50:13

of x given that value of y notice I'm

play50:16

carefully consistently using little

play50:18

y to mean a particular value and

play50:20

capital y is the name of the random

play50:24

variable itself okay and that adds up to

play50:29

11 / 8 bits similarly we can work out

play50:37

what the entropy of Y given X is

play50:42

similarly go through the calculation and

play50:47

you get 13 / 8 bits now

play50:56

in this case I pointed out that the

play50:59

entropy of X when you learn why could

play51:02

stay the same could go up or could go

play51:04

down where we worked out the conditional

play51:07

entropy we got 11 / 8 bits and that is

play51:10

smaller than 7 over 4 and let me now

play51:14

tell you a theorem which is that it is

play51:18

always the case that the conditional

play51:21

entropy with capital letters is less

play51:23

than or equal to the marginal entropy

play51:26

and finally it can all be glued together

play51:29

like this in a single diagram so this is

play51:35

not at all obvious but it's true that H

play51:38

of Y given X fits here and h of x given

play51:48

Y it's here and thus we have rather nice

play51:54

word stories that we can tell which is

play51:56

the total information that you get when

play51:58

you learn x and y is the sum of the

play52:00

information you get when you learn y

play52:02

plus the amount of information you then

play52:04

get when having already learned why you

play52:06

then go on to learn X or the information

play52:10

content of x and y on average is the

play52:12

information content of learning x by

play52:13

itself then already knowing X learning

play52:16

why ok so the sum of these two is this

play52:19

thing here and this overlap here as I

play52:23

said is a measure of dependence between

play52:25

these variables the relationship I just

play52:29

said about this Plus this equaling that

play52:31

some people call it the chain rule for

play52:34

entropy the final thing I want to do is

play52:38

define this creature here this measure

play52:43

of dependence between x and y is going

play52:46

to be a lot of fun when we play with

play52:48

channels it's going to be the most

play52:51

important thing we want to know and it's

play52:53

called the mutual information between

play52:56

the random variables x and y we call it

play53:00

I of X semicolon Y written I(X;Y)

play53:05

and it's what it shows in the picture so

play53:08

you can pick any way of obtaining this

play53:12

from the picture for example it's the

play53:13

difference between H of X and H of x

play53:17

given Y or it's the difference between H

play53:24

of Y and H of Y given X and that's the

play53:35

mutual information what we're going to

play53:39

do in the next lecture is glue this

play53:43

together we're the channel a channel is

play53:47

a set of conditional distributions so

play53:52

for the binary symmetric channel it

play53:55

doesn't specify how often you should use

play53:57

a zero and the one it just says if you

play54:00

send a zero there's a ninety percent

play54:02

chance you'll get a zero out if you send

play54:04

a one there's a ninety percent chance

play54:05

you'll get a one out so the channel

play54:07

defines conditional distributions and

play54:09

then we can turn those conditional

play54:12

distributions if we want to into a joint

play54:14

distribution on input and output by

play54:16

inventing a probability distribution for

play54:19

the input then you can compute all of

play54:21

these information measures for the Joint

play54:23

Distribution you've defined and when

play54:25

we've done that we can start talking

play54:26

about the theory of how to communicate

play54:28

reliably over noisy channels are there

play54:32

any questions okay thanks very much for

play54:36

coming and see you next week
