Lecture 6: Noisy Channel Coding (I): Inference and Information Measures for Noisy Channels
Summary
TLDR: This video covers a review of data compression and the core topic of the information theory course: noisy channel coding. The lecturer opens with a question about encoding from the ensemble ABCD, discussing the probability that a bit plucked at random from the encoded stream is a 1. He then goes through Shannon information content, optimal symbol-code lengths, the Huffman algorithm, and the principles and uses of arithmetic coding. Practical examples, such as a fragment of Jane Austen's Emma, show how the PPM compression algorithm works. Finally, the lecturer discusses using arithmetic coding to generate random samples efficiently and stresses the value of random bits. Throughout, the lecture introduces the key concepts and techniques of information theory through rich examples and detailed explanations.
Takeaways
- 📚 This lecture reviews the data compression material and moves on to the central topic of the information theory course: noisy channel coding.
- 🔍 An example of encoding the four symbols A, B, C, D is introduced, asking for the probability that a bit drawn at random from the encoded stream is a 1.
- 🤔 Through voting and discussion, different ways of computing that probability are explored, and the importance of codeword lengths in the calculation is pointed out.
- ✅ The significance of the source coding theorem is emphasised: if the encoded output is completely random and cannot be compressed any further, the probability of a 1 is 50%.
- 📉 The principles and advantages of arithmetic coding are discussed, in particular how it compresses data more effectively than symbol codes.
- 🎲 Predictive models (such as PPM) and their use in data compression are introduced and illustrated with examples.
- 🔄 Using arithmetic coding to generate random samples is explored, and its efficiency is compared with the standard method.
- 🔑 The value of random bits is emphasised, with examples showing why saving random bits matters in some situations.
- 🧩 A series of probability puzzles (the three-cards problem and the three-doors problem) reinforces the ideas of conditional probability and inference.
- 📏 Finally, joint entropy, conditional entropy and mutual information are defined, providing the theoretical basis for discussing noisy channels and information transmission.
Q & A
What is data compression?
-Data compression is the process of reducing the size of electronic data so that it needs less storage space or bandwidth. It is an important topic in the information theory course.
What is the central topic of the information theory course?
-The central topic of the information theory course is noisy channel coding, an important concept explored in this lecture, concerning how to transmit information effectively over a channel that introduces errors.
What is a symbol code?
-A symbol code is a data compression technique that maps source symbols (such as characters) to bit strings, reducing the number of bits needed to represent the data.
What does Huffman coding do?
-Huffman coding is an optimal symbol coding method that assigns codewords of different lengths according to how frequently characters occur, making the overall encoding as short as possible.
What is arithmetic coding?
-Arithmetic coding is a data compression technique that improves compression efficiency by representing the entire message as a single number in the interval [0,1), rather than encoding each symbol separately as traditional methods do.
Why is arithmetic coding more efficient than symbol codes?
-Arithmetic coding is more efficient than symbol codes because it treats the whole message as a single unit, reducing the per-character overhead, and so gets close to the entropy limit of the information.
What is a binary symmetric channel?
-A binary symmetric channel is a simple communication model in which each transmitted binary bit is passed through unchanged with some probability and flipped with some probability.
How is entropy defined in information theory?
-In information theory, entropy is a measure of the uncertainty of a random variable, defined as the negative sum, over all possible outcomes, of each outcome's probability times the logarithm of that probability.
What is conditional entropy?
-Conditional entropy is a measure of the uncertainty of one random variable given another; it reflects how much uncertainty about one variable remains once the other variable is known.
What is mutual information?
-Mutual information is a measure of the information shared by two random variables; it quantifies how much knowing one variable tells you about the other.
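For reference, here are the standard definitions behind those last three answers, in the notation used later in the lecture (logarithms base 2, so all quantities are in bits):

H(X) = \sum_x P(x) \log_2 \frac{1}{P(x)}

H(X \mid Y) = \sum_y P(y) \sum_x P(x \mid y) \log_2 \frac{1}{P(x \mid y)}

I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X)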
Outlines
📊 Overview of Data Compression and Noisy Channel Coding
This segment reviews the data compression material and shifts the focus to the central topic of the information theory course: noisy channel coding. The lecturer poses a question about a particular symbol code, asking for the probability that a bit in the encoded stream is a 1, and guides the students through it with neighbour discussion and a vote. Analysing the different vote outcomes leads into how to work out that probability from the actual composition of the codewords.
🔍 Exploring the Probability of a 1 in the Encoding
This segment digs further into how to calculate correctly the probability that a bit plucked at random from the encoded stream is a 1. Bringing codeword lengths into the picture exposes the flaw in the earlier method and yields the correct, length-weighted calculation. It also discusses the relationship between information content and codeword length, arguing that if the encoding is perfectly compressed its output bits must look random, which leads to the conclusion that the answer is one half.
📝 Symbol Codes and Arithmetic Coding
This segment introduces symbol codes and arithmetic coding and their use in data compression. A comparison shows the advantage of arithmetic coding over symbol codes in compression efficiency, especially for data with low entropy per character. A concrete example, a fragment of Jane Austen's Emma, illustrates how arithmetic coding works in practice, and the role of predictive models (such as PPM) in improving compression is emphasised.
🚀 Advanced Applications of Arithmetic Coding
This segment explores applications of arithmetic coding to generating random samples and to efficient writing systems. A writing system called Dasher shows how arithmetic coding can turn gestures into text efficiently. It also discusses using arithmetic coding to simulate tosses of a bent coin, showing its bit-efficiency advantage over the standard method for generating random samples, and uses a physical medium of random bits (a CD) to underline that random bits are a valuable resource.
🎓 An Introduction to Inference and Probability Theory
This segment introduces inference and probability theory through a simple puzzle about three cards. Discussing the cards' possible configurations and their effect on the inference highlights the importance of using probability theory when reasoning. An analogy with the better-known three-doors problem emphasises understanding the logic behind a problem rather than merely memorising its answer.
🤔 Further Discussion of Inference and Probability
This segment goes deeper into the probability calculations behind inference problems. Working through the three-cards problem shows how to use probability theory systematically to solve an inference problem. The detailed calculation and discussion reveal the probability for the hidden face of the drawn card under the given condition, underlining how important and effective probability theory is for inference and judgment problems.
📡 Noisy Channels and Information Measures
This segment introduces noisy channels and discusses ways of measuring information in information theory. Using conditional probabilities, joint distributions and marginal distributions, it explains how to compute key quantities such as entropy, conditional entropy and mutual information. These quantities help us understand and quantify the efficiency and reliability of transmitting information over a noisy channel, laying the theoretical groundwork for the later discussion of noisy channel coding.
Keywords
💡Data compression
💡Noisy channel coding
💡Ensemble
💡Shannon entropy
💡Prefix code
💡Arithmetic coding
💡Conditional probability
💡Information content
💡Joint entropy
💡Mutual information
Highlights
Introduction to data compression and recap of information theory, focusing on noisy channel coding.
Exploration of the probability of encoding a specific bit as '1' from a given source ensemble.
Voting process to understand the audience's grasp of probability calculation in information theory.
Explanation of calculating the probability of a bit being '1' through ensemble probabilities and code word fractions.
Discussion on the importance of code word length in determining the probability of encoding a bit as '1'.
Analysis of the source coding theorem and its implication that data cannot be compressed beyond its entropy limit.
Introduction to Huffman algorithm and arithmetic coding for optimal symbol codes and efficient compression.
Utilizing arithmetic coding to compress data down to within two bits of the entire file's information content.
Innovative approach to data compression using prediction by partial match (PPM) algorithm.
The process of simulating bent coin tosses efficiently using arithmetic coding.
Discussion on the importance of random bits in simulations and the inefficiency of standard methods.
Introduction to noisy channels and the concept of inference in information theory.
Exploration of conditional probabilities and entropies to understand information content in noisy channels.
Illustration of joint distribution and its importance in measuring entropy and mutual information.
Application of mutual information in noisy channels to understand communication reliability over them.
Highlighting the significance of using probability theory to accurately solve inference problems.
Transcripts
welcome again today I'll recap what
we've done in data compression and then
we'll get back to the central topic if
you like of the information theory
course which is noisy channel coding
which we looked at in lecture 1 let me
start with a little question I set for
you last time the question is about this
ensemble ABCD with probabilities a half, a quarter, an eighth and an eighth, and with the
symbol code shown there and the question
is if we are encoding from this source
and then we reach into the encoded
stream and pluck out one bit what's the
probability that it's a one please talk
to your neighbor about your answer to
this question
okay well let's have a quick vote there
are three options there you've worked
out p 1 I hope you've talked to your
neighbor about it and I'm going to ask
you: is your P1 less than a half, equal to a half, or bigger than a half? And you can also vote for Zed, which is 'I don't know'. Votes for Zed, 'I don't know'? Okay. Votes for A, P1 is less than a half? Okay, a few votes there for A. Votes for B, P1 is equal to a half? One vote for B. Votes for C, P1 is bigger than a half? Everyone else, except the people who, okay, we've got one
there and no one voted for Zed so
everyone else must be asleep already
okay well we've got a range of answers
there people from community a who've got
p 1 less than a half could a volunteer
take us through the argument what is
your p1 and what was your method for
calculating p1 since you're in the
majority okay who voted for a all right
what was your P1, what was the value? Anyone? No one?
okay well can I help out let's talk
about the fraction of the code word that
is made of ones and see what happens if
we look at that. So the code word for A, which happens half the time, has got 0 ones in it; the code word for B is half ones; the code word for C is two-thirds ones; and D is a hundred percent ones. So I defined a thing called f_i: it's the fraction of the code
word which is ones so a lot of people
are voting for a but you're being very
shy about saying why and I'm going to
now give a possible argument that might
give an answer that's in the a camp so
we pick a code word first of these
probabilities then we pick a random bit
to see if it's a one or a zero and these
are the probabilities of it being a one
or a zero so here's a calculation
probability of getting a one, I'll put a question mark on it because I want to check
if I've got your reasoning right
probability of getting a one is a sum
over all the letters in the alphabet
probability of picking that letter times
the probability of it turning out to be
a one when you pick a random bit from
the code word yeah is that the sort of
reasoning that the A community went through? Nod if talking is difficult. Okay, we're getting some nods. All right, so we add that up and we've got a half times zero, plus a quarter times a half, plus an eighth times two-thirds, plus an eighth times 1, which gives us 1/8 plus 1/8 plus 1/8 times two-thirds, which is 1/8 times two and two-thirds; two and two-thirds is 8 over 3, which has given us one third. Okay, does anyone who
voted for a think that this is roughly
why they voted for a hands up if that's
so okay we were getting somewhere so
there's a candidate answer there of a
third but not everyone agreed we've got
some B's and C's does anyone want to
challenge this and say no it's wrong
here's an argument why that's wrong
anyone or has everyone gone over to the
a side yeah Martin okay so Martin just
said if you reach into the stream and
some of the code words are long and some
are short and you reach in and pick
around a bit you're more likely to hit
one of the long code words all right so
this isn't the correct way to pick a
random bit from the stream we need to
take into account the lengths so that's
a good argument okay so this is probably
wrong at least it's the wrong method it
may not be the wrong answer so length
matter length of code words matter any
other argument for an alternative answer
okay let me give you one other argument
for this and then we'll do the
calculation again what's the information
content of this probability here answer
one bit, two bits, three bits and three
bits what's the ideal encoded lengths
for probabilities the ideal encoded
lengths are the shannon information
contents which are all from now on i
will just call them the information
content. What are the lengths of the code words? The lengths l_i are 1, 2, 3 and 3, so the lengths match the information
contents so this is a case where we have
a symbol code with perfectly matched
code lengths to the probabilities that
means you can't compress any better than
this because we have our source coding
theorem that says you can compress down
to the entropy and if the expect if the
length of your code words match the
information contents it's unbeatable
good and this is a valid code it's a
prefix code so this this will work now
if we use this code to encode stuff from
the source and if it were the case that
the bits coming out after encoding were
not 5050 zeros and ones completely
random independent unrelated to each
other if they were not 5050 then we
could compress it some more because
shannon says if the probabilities are
5050 we can make codes you know with
dust bin bags full of lottery tickets
and all that sort of stuff we could do
further compression but you can't
compress anymore because we've already
got it down to the entropy therefore
without doing any calculation at all it
must be the case that p1 is a half so b
is the right answer because it's
perfectly compressed and if it's
literally compressed it'll look random
ok so the right answer is B so what went
wrong with this calculation well let's
just redo it and do the calculation in a
way that takes into account lengths. Let's
think about it this way we've got a
great big bucket we're going to put bits
into the bucket each time you pick a
codeword some more bits go in the bucket
and some of those are ones and we're
interested in the ratio in this bucket
what fraction of them are 1s so let's
give ourselves an extra column in here
which is not the fraction that are 1s
but the actual number that are 1s so
the number of 1s in this code word is zero
and this one is one and this one
it's two and in this one it's three so
on average when I pick another code word
how many new bits go into the bucket? We'll put that downstairs, and the
answer is sum over i p_i l_i and on
average how many 1s go into the bucket
well that's sum over i p_i n_i okay
how does that differ from what we did
over here well f_i is the ratio of n_i
over l_i so you can see we're using a
different thing upstairs and to
compensate for that we're using
something different downstairs okay so
this is the average number of 1s that
go in the bucket whenever you pick a new
code word this is the average number of
bits that go in the ratio of those is
going to be the fraction of bits that I
want if we do that we have the upstairs
thing is a half times zero, plus a quarter times 1, plus an eighth times 2, plus an eighth times 3, and downstairs it's our old friend the expected length, which is the entropy of this probability distribution, which is, is it not, 7 over 4 (I'm doing this from memory). And upstairs: that's a quarter, and that's a quarter, and that's 3/8, so in eighths we've got two plus two plus three, which is seven over eight. 7/8 divided by 7/4 is a half.
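To make both calculations above concrete, here is a small Python check (added for this writeup, not part of the lecture), assuming the codewords 0, 10, 110, 111, which match the lengths and fractions of ones quoted on the board; it reproduces the flawed answer of one third, the length-weighted answer of one half, and confirms the lengths equal the Shannon information contents.

from math import log2

p = {'a': 1/2, 'b': 1/4, 'c': 1/8, 'd': 1/8}            # source probabilities
code = {'a': '0', 'b': '10', 'c': '110', 'd': '111'}     # assumed codewords, consistent with the board

# Codeword lengths match the Shannon information contents log2(1/p).
for x in p:
    assert len(code[x]) == round(log2(1 / p[x]))

# Flawed method: pick a codeword, then a random bit within it (ignores lengths).
f = {x: code[x].count('1') / len(code[x]) for x in p}
p1_wrong = sum(p[x] * f[x] for x in p)

# Correct method: expected number of 1s per codeword divided by expected codeword length.
p1_right = (sum(p[x] * code[x].count('1') for x in p)
            / sum(p[x] * len(code[x]) for x in p))

print(p1_wrong, p1_right)   # 0.333..., 0.5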
so that's the long way to get to the
answer that a sneaky information theory
based answer is just to take it's
perfectly compressed case closed is
fifty-fifty and it's not only fifty
fifty it's got no correlations in it at
all that there can be no correlations in
the encode eating stream because
otherwise we could use the information
theory argument and say all those
correlations could be exploited those
dependencies any questions so we love
information theory it provides you with
very powerful very brief arguments for
things that mean that you don't need to
do calculations anymore. Last time, let me just recap what we did: we introduced symbol codes, or rather the time before last we introduced symbol codes, and
symbol codes are widely used we have a
Huffman algorithm that gives you optimal
symbol codes and their expected length
are within one bit per character of the
entropy but we argued in the previous
lecture we could go that doesn't
actually wrap up compression we don't
like symbol codes because being within
one bit per character is not very good
if the actual entropy per character is
something small like 0.1 bits and we
argued that often that will be the case
if say we're compressing English we play
the game that gave us the idea that you
could use identical twins at the sender
and the receiver in order to make a
compressor and then we looked at
arithmetic coding and arithmetic coding
is this way of thinking about a binary
file as defining a single real number
between 0 and 1 to an extremely large
precision and any source file can be
represented in the same way as in
corresponding to an interval on the real
line between 0 and 1 and I argued that
when you do arithmetic coding you will
get compression down to within two bits
of the information content of the entire
file so this is in terms of its overhead
this plus two is potentially about a
factor of n smaller than the overhead
you
get if you use symbol codes so
arithmetic coding is in almost all cases
far far better than symbol codes and I
just wanted to flesh out how you'd use
arithmetic coding in practice I I talked
about the idea of an Oracle that being a
piece of software which uses some
procedure to predict the next character
given all the characters x1 through xt
minus 1 that have been seen and I wanted
to just give you an example of how this
works in real state-of-the-art
compressors so here is a fragment of
jane austen's emma this is the last 50
or so characters, and the context is 'alarmed at the pro', and the question is
what comes next so how do real
compression algorithms work well a
fairly near state-of-the-art compressor
that's very widely used is a compressor
called ppm and here's how ppm works it's
called prediction by partial match and
it's slightly ad-hoc but there's a bunch
of theoretical justifications and
improvements we can make to it but let
me just describe the ad hoc method
here's the idea we look at the context
of the last five characters of the
document to define the context for the
next character then we look at the
entire file of everything that we've
seen so far and we say have we seen that
context before in the entire file so
this is how we're going to then predict
what happens next we go and find all of
those occurrences and here they are
shown in blue. So 'e space p r o' has happened one two three four five times already in Jane Austen's Emma, and this shows what happened next: there was a v, a v, an o, a p and an s. So a first stab would be to say let's use these six-gram statistics, the raw six-gram statistics, to say all right, there's a two out of, how many were there, one two three four five, five, so there's a 2/5 chance of getting a v next, there's a 1/5 chance of an o, a 1/5 chance of a p and a 1/5 chance of an s. But we also haven't seen
that much data so we better
allow some probability for other things
happening how do we allow for other
things happening well the PPM approach
is to say let's back off to shorter
length contexts as well and say what
happened in context space PR 0 PR 0 r 0
0 and in any context at all and fold in
those other one two three four five
possibilities in addition to the six
Graham statistics and wait them together
in a smartish way and that's how ppm
works and it gives you extremely good
compression state-of-the-art on most
files it doesn't do as well as humans do
human audiences like yourselves can
predict English a lot better and clearly
it doesn't understand grammar and
dictionaries and and so forth but it
does a lot of learning of raw statistics
and that gets you a long way towards good compression. So that's PPM.
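As a rough illustration of the idea just described (not the real PPM algorithm, which uses a proper escape mechanism; the blending weights and the toy text below are invented for this sketch), here is a small Python function that predicts the next character by counting what followed each suffix of the recent context in the text seen so far, backing off from five characters down to none.

from collections import Counter

def backoff_predict(history, max_context=5):
    """Blend next-character counts from contexts of length max_context down to 0.
    Longer contexts get (arbitrarily) larger weights; real PPM uses escape
    probabilities rather than these made-up weights."""
    alphabet = sorted(set(history))
    scores = Counter()
    for k in range(max_context, -1, -1):
        ctx = history[len(history) - k:] if k else ''
        counts = Counter()
        # Count what followed this context everywhere earlier in the history.
        for i in range(len(history) - k):
            if history[i:i + k] == ctx:
                counts[history[i + k]] += 1
        weight = 2 ** k                       # invented weighting for the sketch
        for ch, c in counts.items():
            scores[ch] += weight * c
    total = sum(scores.values()) + len(alphabet)          # add-one smoothing
    return {ch: (scores[ch] + 1) / total for ch in alphabet}

text = "she was alarmed at the proposal, and alarmed at the pro"   # toy stand-in for the Emma fragment
probs = backoff_predict(text)
print(max(probs, key=probs.get))              # the most probable next character under these counts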
okay so the question was does it learn
the six grams statistics of the document
of the entire document and then compress
it or does it do it on the fly and the
important answer is it's just like you
guys, the identical twin audiences: it does things on the fly. So when we're asked to predict in this context we then say how often has 'e space p r o' happened
already before now we don't look ahead
into the future so the predictions are
based only on what we have already seen
okay any other questions there's a
couple more things I want to mention
about arithmetic coding one is as I
showed you last time we can make other
uses of arithmetic coding we can make an
extremely efficient writing system and
the one I showed you was called Dasher;
that's based on the idea that writing
involves making gestures of some sort
maybe wiggling fingers or scribbling
with a stick and we want to turn those
gestures into text as efficiently as
possible and compression is all about
taking text and turning it into a bit
string which in arithmetic coding is
views viewed as a real number and so we
can make a very good writing system by
saying okay let's have the gesture just
define a real number by steering our
spaceship in a two-dimensional world and
that will make us if we turn it on its
head a very efficient writing system so
you make your gesture and outcomes text
so that's the Dasher concept, and I showed you that, and it's free software and help is always welcome to make Dasher better. And there's another application of arithmetic coding I wanted to
mention and this is the idea of
efficiently generating random samples
let me give you an example let's say I
asked you to simulate a bent coin and I
said okay, let's fix p_a to, I don't know, 0.01, and p_b is the rest; please simulate
10 million tosses of this bent coin and
I will provide you with random bits a
random number generator the question is
how many bits will you need to use to
generate
10 million tosses of this coin let's say
n from now on instead of 10 million a
standard way of solving this problem is
to say oh well i'll use the random bits
you can provide me with to make real
numbers between 0 & 1 so i'll generate
things that i'll think of as real
numbers uniformly distributed between 0
& 1 that will guzzle up 32 bits per real
number if it's a standard random number
generator so we read in 32 bits
interpret them as defining a number
between 0 & 1 then we look to see if u
is less than p_a, and if it is then we spit out an A, and otherwise we spit out a B. This method will cost you 32 bits
per character alright can we do better
and do we care the answer is yes and
maybe so yes we can do much better
because we could take an arithmetic
coder and think about it this way let's
wipe the board imagine sticking a pin
into the world of an arithmetic coder
what happens if you stick a pin
uniformly at random into this line that
goes from 0 to 1, or into one
of those lines what comes out what's the
probability of what comes out if you
stick in a pin and use it to select a
string of text
have a chat to your neighbor
okay what I've drawn on the board here
is a couple of different arithmetic
coding worlds that run from 0 to 1 this
is 1 for some source within alphabet
with characters ABCD with different
probabilities and what we noticed last
time was the arithmetic coding method
gives you intervals between zero and one
whose size is the probability of the
string equalling that particular string
probability that x is a followed by c is
the size of this little interval here
here I've drawn the one for the bent
coin this is ninety-nine percent and
this is one percent and then we recurse
and subdivide this in the same way and
we get the little bit for 'a b' in here, and then here's the bit for 'a a b', and so forth. All right, if we come
along and stick in a pin the probability
that the pin will land inside this
interval here is this probability so the
probability that if you then read out
the string we've got the probability it
will begin AC is the probability that a
string begins with AC so sticking in a
pin at random is a way of picking a
string from the probability distribution
of your language model okay so what we
can do with bent coins is make the
arithmetic decoder encoder and decoder
for the bent coin then bring along
random bits, random noughts and ones, so as to define where a pin goes; you're sticking a pin between 0 and
1 you do that by having an infinite
source of zeros and ones to define where
we are on the line and so you start
guzzling up those zeros and ones find
out where the pin is to the first bit, two bits, three bits, then come over and start
reading out bits with your decoder
reading out A's and B's with your
decoder how many bits is that going to
need well on average
it's just like compression: whatever the binary entropy of nought point nought one is, 0.1 or something like that, that's on average how many bits you will need. So in contrast to needing 32 bits per coin toss, you'll need this many bits.
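Here is a minimal sketch of that pin-sticking idea in Python (my own illustration of the scheme being described, not code from the lecture): the pin's position is refined one random bit at a time, and a symbol is emitted as soon as the pin's interval lies entirely inside one symbol's sub-interval of the current string interval. The random bits consumed per toss come out near the binary entropy of 0.01 rather than 32.

import random
from fractions import Fraction

def bent_coin_samples(p_a, n, rand_bit):
    lo, hi = Fraction(0), Fraction(1)   # interval known to contain the randomly placed pin
    a, b = Fraction(0), Fraction(1)     # interval for the string of symbols emitted so far
    out, bits_used = [], 0
    while len(out) < n:
        split = a + p_a * (b - a)       # boundary between the 'a' part and the 'b' part
        if hi <= split:                 # pin is certainly in the 'a' part
            out.append('a'); b = split
        elif lo >= split:               # pin is certainly in the 'b' part
            out.append('b'); a = split
        else:                           # undecided: read one more random bit to narrow the pin down
            mid = (lo + hi) / 2
            if rand_bit():
                lo = mid
            else:
                hi = mid
            bits_used += 1
    return out, bits_used

random.seed(1)
tosses, used = bent_coin_samples(Fraction(1, 100), 2000, lambda: random.getrandbits(1))
print(tosses.count('a'), used / len(tosses))   # roughly 20 a's, and typically under 0.1 bits per toss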
so the arithmetic coding method for
generating random samples for the bent
coin is going to need the binary entropy of 0.01 bits per coin toss, which is quite a lot smaller than 1, so it's going to be more than a factor of 32 times more efficient with your random bits. And random bits are
a valuable thing you may well want to do
this it may strike you as elegant and it
may actually save you money and I want
to close this discussion by just giving
you a little bit of evidence that we do
actually care about saving random bits
because random bits are not free if you
want random bits you have to get them
from somewhere and if what you're doing
has any sort of principled purpose you need to verify you've got good random bits, and so
here's a CD that I received in the post
a few years ago and the CD came with a
message saying that this CD contains
random bits and they've been generated
thanks to the financial support of a
couple of US government grants and
they're distributed under that grant
from Florida State University to
academics all around the world who want
random bits so random bits are valuable
and this is proof that serious
researchers and funders view random bits
as being important enough to actually
spend US government money on so that's
an argument for not wanting to waste
random bits. Any questions? Okay, we have one here.
yes so the question is are there other
ways of thinking why this is a bad idea
to do with the precision of the numbers
you need so to make this decision about
is it an A or a B often you only need
the first bit to already tell you okay
we don't need to look at the next 31
bits the first bit has already made the
decision so we're wasting bits there and
conversely actually if you look in
detail at the probability that this will
spit out a 1 it's not going to be
exactly one percent is it it'll be some
ratio, let's call it m over 2 to the 32 where m is some integer, and so it's not actually going to perfectly nail the exact fraction; it'll be within 1 part in 2 to the 32, which is pretty good, but it won't give you exactly the
right answer so yes it is wasteful
because it's using loads of bits to make
a decision about a single bit and
arithmetic coding is better any other
questions okay so we're going to now
move on to the topic of noisy channels
and noisy channels have inputs and
outputs and one of the things we want to
do with noisy channels is infer what the
input might have been given the output
of the channel so a theme of noisy
channels is going to be we need to
understand inference and once we're
clear on how to do inference we'll also
be talking about how to measure
information content in the context of
noisy channels so we know how to measure
information content for a single random
variable like tossing a bent coin or
drawing a random letter from English
it's all about log 1 over P and that's
it but for noisy channels it's all a
little bit more interesting because
we've got inputs we've got outputs we're
interested in questions
like how much information does seeing
this output convey about the input so we
want to learn how to do that and start
off with I just want to give you a
little puzzle to think about and this it
doesn't involve a noisy channel it just
involves three cards and here are the
three cards and I'll introduce you to
them one at a time here's a card which
is white and white on the front and the
reverse okay here's a card that's brown
on the front and the reverse H and
here's a card that's brown on one side
and white on the other so we've got
three cards and now what I'm going to do
is I'll shuffle them and turn them over
randomly and hide them and draw one card
at random and plunk it down in front of
you so you can see it and then I'm going
to ask you about the other side of the
card all right so I shuffle and
randomized are not doing any magic trick
so anything has to everything you see is
what's happening so we shuffle them up
and then I randomly grab a card and
plunk it down like that all right and I
show you the front and one side of the
card and it's white and the question is
what's on the other side the question is
given that you're now seeing a white
face what's the probability that the
other face of this card is white please turn
to your neighbor and I'll have a chat
about that question
okay, votes for Z, which means 'I don't know'? Okay, we've got a few there. Votes for 'the probability is less than a half'? Anyone? No. Votes for 'it's bigger than a half'? A few, seven-ish. And votes for C, 'it's a half, it's 50:50'? One, two, three, four, five, six. Okay, so we have some clear uncertainty about this. Great, I
like it when that happens so let's have
an argument from the second community
why is it 5050 anyone yeah
good so the answer I just heard was by
logic it's either this card or that card
that's sitting down there on the desk
and we don't have any additional
information so it must be 5050 okay and
we've got a load of people who disagree
with you so would one of them like to
give an argument why that compelling
argument of logic was wrong yes okay so
what I just heard was the three possible
ways of getting a white thing facing up
you're either looking at that side or
that side or that side and all three of
those possibilities are equally likely
so either this hypothesis or that one or
that one and one we turn it out turn it
over we'll find out whether the other
side is this or the other side is this
or is that so that means it's two-thirds
yeah two-thirds chance here's what that
argument steered towards good does
anyone want to add anything change your
mind, give another argument? Yeah?
okay that's a nice argument so the
argument said think about the other side
of the card which is either going to end
up being white or brown and right at the
beginning before I brought out the card I
could have asked you please place a bet
is the face that you can't see going
to be white or brown and we would say
definitely 5050 okay and now the
question is we see the front will that
make us change our bet at all surely it
must make a difference and that's a very
credible argument that there is a
dependence between the front on the back
so surely you shouldn't be betting 5050
anymore it doesn't this argument doesn't
tell you what the right answer is but it
does argue definitely not 5050 because
we've just learned something we know
something that we didn't know and there
is a dependence and that's a good
argument too. All right, so let me give you a suggestion: always write down the probability of everything. This is copyright Steve Gull, who is a professor over in the building next door: write down the probability of everything, all the variables in the
problem then you can condition on what
has happened and you will be able to
deduce by the simple laws of probability
what the answer to your question is so
in this case everything is the front
face and the reverse face, and before I did the shuffling and drew the card out, the probability of the front face being white or brown and the reverse face being white or brown was as follows: the probability of this pair of variables being white and white was one third, the probability of it being brown and brown was one third, and the probability of the other two combinations is a third in total, because you get those via the third card, the brown-and-white card: this chap comes along and it's 50:50 which way up he'll go, so there's a sixth chance there and a sixth chance there. That is the joint probability of everything, and now with
this joint probability we can condition
on data and anything else you want to
throw in and data is what we're given
and what we're given is we're in this
world so today the white face came up so
we condition on front being white and we
can then work out the probability that
the reverse is white, for example by renormalizing this probability distribution, and we get this. Okay, this may not
convince people who still really think
it's 5050 dammit it's 5050 how can you
say it's two-thirds that's rubbish and
so I've got one final argument that I
hope will help and then once this
argument has been accepted maybe it will
compel you to agree that it's a good
idea to write down the probability of
everything so here's the final argument
and it's not that the reverse was indeed
white that happened to be true but it
didn't have to be the argument goes like
this imagine that we play this game lots
of times and instead of asking what's
the probability the other face is white
I always ask you the question what's the
probability that the other face is the
same as the one you're seeing now okay
so, mm-hmm: this one, they're the same; this one, they're the same; this one, they're not
the same so two-thirds of the time the
face will be the same on the back as it
is on the front so when you see a white
there's a two-thirds chance that the back is white, and when you see a brown there's a two-thirds chance that the back is brown. Okay, so that argument is not based on probability theory, and I think it's a fairly compelling argument that the correct answer is two-thirds.
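For anyone who wants to check this by brute force, here is a tiny Python enumeration (my own sketch, not something from the lecture) of the 'probability of everything' table: three cards, an equally likely choice of card and of which way up it lands, then conditioning on seeing a white face.

from fractions import Fraction

cards = [('W', 'W'), ('B', 'B'), ('W', 'B')]   # the three cards as (side1, side2)

# Every equally likely (front, back) outcome: 3 cards times 2 ways up.
outcomes = [(front, back) for s1, s2 in cards for front, back in [(s1, s2), (s2, s1)]]
prob = Fraction(1, len(outcomes))              # each outcome has probability 1/6

p_front_white = sum(prob for front, back in outcomes if front == 'W')
p_both_white = sum(prob for front, back in outcomes if front == 'W' and back == 'W')

# P(back is white | front is white), by conditioning on the data we saw.
print(p_both_white / p_front_white)            # 2/3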
Hopefully that's convinced you that we should use probability theory, because if you can't actually reliably solve this exercise, involving just one random variable (which card it is) and another random variable (which way up it went), so it's a two-random-variable problem; if a smart audience of
graduates from Cambridge can't reliably
solve this problem just say oh yeah I
know it's fine I've got the answer that
really shows you yeah inference is a
little bit tricky inference isn't
totally straightforward humans are often
very good at inference but if you want
to be sure you're getting it right use
probability theory because probability
theory will always get you the right
answer okay all right so let's now talk
some more about noisy channels and
actually let me just remind myself what
comes up let's talk through through this
what where we're heading is we're going
to talk a bit bit more about inference
and we're also going to talk about
information measures for noisy channels
our classic noisy channel is the binary
symmetric Channel up there top right
which is obscured by a wireless icon
okay go away poof okay the binary
symmetric channel with input 0 and 1 and
it flips a fraction f of the bits so
we'll talk a lot about that channel in
due course but we want to understand
coding theory for any noisy channel and
whenever we're dealing with noisy
channels we need to know how to do
inference and I used to introduce people
to inference with a different puzzle
instead of the three cards I would show
people the three Doors problem where the
game show host says here's the rules of
the game I'm going to hide a very
desirable prize here a national rail
pass behind one of these three doors the
game show host always explains the rules
first before the game happens and the
rules out I will hide the prize behind
one of these doors then I will ask you
the player to choose a door or you have
a free choice and you choose it by
naming it and we don't open it then I
the game show host guarantee that I will
open another of the doors not the one
you chose and I guarantee when I do that
promise me I promise you trust me that
the prize will not be revealed at that
stage so then the prize is clearly
either behind door one or door two, in this example shown here, door one being the one you chose, door two being the other door that didn't get opened, and then the player gets the
chance to stick or switch if you want
and then you'll get what's behind the
final door you end up at after either
sticking or switching and the options
for this puzzle then worked like this
that was the rules being explained now
now you go ahead and play the game the
player chooses door one the host says in
accordance with the rules i will now
open either door two or three and will
not reveal the prize, and as promised he opens the door, it's door three, and it doesn't reveal the prize, and the
question is should you stick should you
switch or does it make no difference and
i used to use this as my introductory
example on probability theory because
people would argue very hotly about in
say well it's either behind door one
door one or two it's 5050 yeah this md
door doesn't make any difference it's
5050 and other people would say oh no no
no you should switch and some people
would say you should stick and it used
to be contentious but unfortunately most
people have heard of this puzzle and so
it doesn't really work anymore but let's
have a quick vote: votes for A, you should stick; votes for B, you should switch; votes for C, it makes no difference. So you see it's a
useless educational device it's been
ruined because everyone's gone and
talked about it and the really annoying
thing is they've talked about it in a
way such that no one has actually learnt
anything and you are my proof of that
because a moment ago you voted for a lot
of you you're not all the proof but this
lot yep this controversy here between B
and C proves that an educational
disaster has happened the three Doors
puzzle which is a fantastic puzzle it's
really educational has been ruined
because everyone now knows the answer
but they don't understand
and A, B and C here, these are exactly
equivalent to each other the three cards
and the three doors they are the same
problem as each other and yet I showed
you the same problem and you didn't get
it right even though you'd allegedly
learnt the answer so learning isn't
about memorizing things it's about
understanding so that's my little rant
on the three Doors so the message people
should be getting from the three Doors
problem is don't memorize the answer to
a stupid puzzle because then you'll be
useless at solving future inference
problems instead learn the message that
you should use probability theory and
then you will be equipped to solve any
puzzle of this type in the future ok
rant over so what we do now let's talk
about noisy channels what we're going to
do with noisy channels is they're always
going to have an input and an output and
a channel defines a set of conditional
distributions if you conditioned on an
input the channel defines what the
probability of the output is going to be
the channel doesn't define a Joint
Distribution it just defines a set of
conditional distributions and we can
only actually do inference if we've got
the probability of everything so we need
a joint distribution so I'm going to run
through a very simple example where I
assume I've got a joint distribution
okay where are we yeah
I'll write down a joint distribution and
I'll define some information measures
for that joint distribution then when we
move on to channels we'll have to get a
joint distribution by adding to the
conditional distributions defined by the
channel a distribution on the inputs so
that's how it's all going to join up
I'll start with a joint distribution or
a joint ensemble and here's an example
of one and i'll define all the different
information measures we can make for
this on joint ensemble so i'm going to
tell you the joint probability of two
random variables x and y which are
dependent random variables a bit like
the front and the back of the three
cards so the first variable x can take
on values 1 2 3 or 4 y can take on
values one two three or four and the
joint probabilities are: 1/8, 1/16, 1/32, 1/32 in the first row; 1/16, 1/8, 1/32, 1/32 in the second row; 1/16 all across the third row; and then a quarter and 0 and 0 and 0. From a joint distribution
you can always define marginals and
marginal probabilities are written in
the margins and they are obtained by
adding up and here we have a quarter a
quarter a quarter and a quarter as it
happens for the probability of Y and
then in the bottom margin we have a half, one quarter, an eighth, an eighth. Now we already defined
for a single random variable how to
measure its entropy and so when we've
got a joint ensemble there's a bunch of
straightforward things we can do with
that entropy definition
we can write down what the marginal
entropy of X is and we can write down
what the marginal entropy of y is for
example. So I'm defining for you by example what marginal entropy means: the marginal entropy of X is the entropy of this
distribution here and that is 7 over 4
bits the marginal entropy of why is the
entropy of this distribution and that's
two bits and we can define the entropy
of the pair X Y so you could think of
the pair X Y as being a single random
variable that takes on these 16 possible
values with the 16 probabilities written
here, and you can say what's the entropy of this distribution here. So we just do the normal sum over all pairs x, y of P(x, y) log base 2 of 1 over P(x, y), okay, and
that comes out to 27 / 8 bits for this
joint distribution and those numbers
I've just written down these three
numbers have the property that if you
show them graphically and draw H of X
that way and h of x and y this way and
then you try and make H of Y nudge up
against this right hand side here you
find they don't quite match up so H of X
and H of Y added together are a bit
bigger than h of x and y for this
example and this is actually always true
that these guys will always overlap or
they might just butt up against
each other and exactly add up to the
joint entropy and they only ever do that
if x and y are independent random
variables
so the size of this overlap here is a
measure of how much dependence there is
between the random variables let me
define a few more things we can do with
the Joint Distribution we can define
conditional probabilities for example p
of x given Y is defined to be the joint
probability of x and y divided by
probability of Y and we can write it out
for all 16 values of X&Y so here is p of
x given y, for y going from 1, 2, 3, 4 and x going 1, 2, 3, 4. So if we condition on y being one, then we can read out that top row probability and normalize it, and we get a half, a quarter, an eighth, an eighth; the next row goes a quarter, a half, an eighth, an eighth when we normalize it; when we normalize the third row they're all equal, so we get a quarter, a quarter, a quarter, a quarter; and when we normalize the bottom row we get 1, 0, 0, 0. So those four probability vectors
are the four possible conditional
distributions and for every one of those
probability distributions we can write
down its entropy so we can write down
the entropy of x given that Y is equal
to 1, we can write the entropy of x given that y equals 2, and so forth. And the entropy of this guy here, this distribution, is fairly familiar: it's seven over four; this one is 7/4 as well; this one, the uniform distribution, is two bits; and the entropy of one and three zeros is zero bits, because it's a certain distribution: we know, conditioned on y being 4, that X is definitely 1. So what do we
think about that well we notice that
it's possible for the conditional
entropy of x given a particular value of
y to be the same as what we started with
7 over 4 or the same or bigger or
smaller so you when you get data about
why it's possible for your uncertainty
about X to either go up stay the same or
get less so that's an interesting
observation and something we can do with
all of these who is we could say on
average when we learn why how much
information content will X then have so
what's the average entropy of x given Y
so I'm now defining for you a thing
called the conditional entropy all of
these were conditional entropies as well; those were four particular conditional entropies for particular values of y, and
then this is the really important one
the conditional entropy which is the
average averaging over all possible
values of Y waiting together the entropy
of x given that value of y notice I'm
carefully consistently using little
y to mean a particular value, and
capital y is the name of the random
variable itself okay and that adds up to
11 / 8 bits similarly we can work out
what the entropy of Y given X is
similarly go through the calculation and
you get 13 / 8 bits now
in this case I pointed out that the
entropy of X when you learn why could
stay the same could go up or could go
down where we worked out the conditional
entropy we got 11 / 8 bits and that is
smaller than 7 over 4 and let me now
tell you a theorem which is that it is
always the case that the conditional
entropy with capital letters is less
than or equal to the marginal entropy
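As a numerical sanity check on the measures just quoted (my own computation, with the joint table as read out in the lecture), this short Python snippet reproduces H(X) = 7/4, H(Y) = 2, H(X,Y) = 27/8, H(X|Y) = 11/8, H(Y|X) = 13/8 and, jumping ahead slightly, the mutual information 3/8 bits.

from fractions import Fraction as F
from math import log2

# Joint probabilities P(x, y): rows are y = 1..4, columns are x = 1..4.
P = [[F(1, 8),  F(1, 16), F(1, 32), F(1, 32)],
     [F(1, 16), F(1, 8),  F(1, 32), F(1, 32)],
     [F(1, 16), F(1, 16), F(1, 16), F(1, 16)],
     [F(1, 4),  F(0),     F(0),     F(0)]]

def H(ps):
    """Entropy in bits of a probability vector (0 log 0 taken as 0)."""
    return sum(float(q) * log2(1 / q) for q in ps if q > 0)

px = [sum(row[i] for row in P) for i in range(4)]    # marginal of X: 1/2, 1/4, 1/8, 1/8
py = [sum(row) for row in P]                         # marginal of Y: 1/4 each
H_X, H_Y = H(px), H(py)                              # 1.75 and 2.0
H_XY = H([q for row in P for q in row])              # 3.375 = 27/8
H_X_given_Y = sum(float(py[j]) * H([q / py[j] for q in P[j]]) for j in range(4))   # 1.375 = 11/8
H_Y_given_X = H_XY - H_X                             # chain rule: 1.625 = 13/8
I_XY = H_X - H_X_given_Y                             # mutual information: 0.375 = 3/8
print(H_X, H_Y, H_XY, H_X_given_Y, H_Y_given_X, I_XY)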
and finally it can all be glued together
like this in a single diagram so this is
not at all obvious but it's true that H
of Y given X fits here and h of x given
Y fits here, and thus we have rather nice
word stories that we can tell which is
the total information that you get when
you learn x and y is the sum of the
information you get when you learn y
plus the amount of information you then
get when having already learned why you
then go on to learn X or the information
content of x and y on average is the
information content of learning x by
itself then already knowing X learning
why ok so the sum of these two is this
thing here and this overlap here as I
said is a measure of dependence between
these variables the relationship I just
said about this Plus this equaling that
some people call it the chain rule for
entropy the final thing I want to do is
define this creature half this measure
of dependence between x and y is going
to be a lot of fun when we play with
channels it's going to be the most
important thing we want to know and it's
called the mutual information between
the random variables x and y we call it
I of X semicolon Y, written I(X;Y),
and it's what it shows in the picture so
you can pick any way of obtaining this
from the picture for example it's the
difference between H of X and H of x
given Y or it's the difference between H
of Y and H of Y given X and that's the
mutual information what we're going to
do in the next lecture is glue this
together we're the channel a channel is
a set of conditional distributions so
for the binary symmetric channel it
doesn't specify how often you should use
a zero and a one; it just says if you send a zero there's a ninety percent chance you'll get a zero out, and if you send a one there's a ninety percent chance you'll get a one out. So the channel
defines conditional distributions, and then we can turn those conditional distributions, if we want to, into a joint distribution on input and output by inventing a probability distribution for the input; then you can compute all of these information measures for the joint distribution you've defined.
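As a preview of that recipe, here is a minimal Python sketch (mine, not the lecturer's; the 50:50 input distribution is an arbitrary choice for illustration): take the binary symmetric channel's conditional probabilities with flip probability f = 0.1, invent an input distribution, form the joint distribution, and compute the mutual information between input and output.

from math import log2

f = 0.1                                   # flip probability of the binary symmetric channel
p_x = {0: 0.5, 1: 0.5}                    # invented input distribution (not fixed by the channel)
p_y_given_x = {x: {y: (1 - f if y == x else f) for y in (0, 1)} for x in (0, 1)}

# Joint distribution P(x, y) = P(x) P(y | x), and the output marginal P(y).
p_xy = {(x, y): p_x[x] * p_y_given_x[x][y] for x in (0, 1) for y in (0, 1)}
p_y = {y: sum(p_xy[(x, y)] for x in (0, 1)) for y in (0, 1)}

def H(ps):
    return sum(q * log2(1 / q) for q in ps if q > 0)

I = H(p_x.values()) + H(p_y.values()) - H(p_xy.values())   # I(X;Y) = H(X) + H(Y) - H(X,Y)
print(I)   # about 0.53 bits per channel use for this input distribution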
When we've done that, we can start talking about the theory of how to communicate reliably over noisy channels. Are there any questions? Okay, thanks very much for coming and see you next week.