Perceptron Training
Summary
TLDR: This script covers how machine learning systems determine weights from data, focusing on the Perceptron Rule and gradient descent. It simplifies the learning process by treating the threshold as just another weight, allowing for easier weight adjustments. The Perceptron Rule is particularly highlighted for its guarantee to find a solution for linearly separable data sets in a finite number of iterations. The discussion touches on the algorithm's simplicity and effectiveness, and the difficulty of determining linear separability in higher dimensions.
Takeaways
- 😀 The script discusses the need for machine learning systems to automatically find weights that map inputs to outputs, rather than setting them by hand.
- 🔍 Two rules are introduced for determining weights from training examples: the Perceptron Rule and the Delta Rule (gradient descent).
- 🧠 The Perceptron Rule uses thresholded outputs, while the Delta Rule uses unthresholded values, indicating different approaches to learning.
- 🔄 The script explains the Perceptron Rule for setting weights of a single unit to match a training set, emphasizing iterative weight modification.
- 📉 A learning rate is introduced for adjusting weights, with a special mention of learning the threshold (Theta) by treating it as another weight.
- 🔄 The concept of a 'bias unit' is introduced to simplify the handling of the threshold in weight updates.
- 📊 The script outlines the process of updating weights based on the difference between the target output and the network's current output.
- 🚫 It is highlighted that if the output is correct, there will be no change to the weights, but if incorrect, the weights will be adjusted in the direction needed to reduce error.
- 📉 The Perceptron Rule is particularly effective for linearly separable data sets, where it can find a separating hyperplane in a finite number of iterations.
- ⏱ The script touches on the challenge of determining when to stop the algorithm if the data set is not linearly separable, hinting at the complexity of this decision.
Q & A
What are the two rules mentioned for setting the weights in machine learning?
-The two rules mentioned for setting the weights in machine learning are the Perceptron Rule and gradient descent or the Delta Rule.
How does the Perceptron Rule differ from gradient descent?
-The Perceptron Rule uses thresholded outputs, while gradient descent uses unthresholded values.
What is the purpose of the bias unit in the context of the Perceptron Rule?
-The bias unit simplifies the learning process by allowing the threshold to be treated as just another weight: a constant input of 1 is appended, its learned weight plays the role of negative theta, and every comparison is made against zero instead of a specific threshold value.
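As a concrete illustration of this trick, here is a minimal Python sketch (the function names are illustrative, not from the lecture): the input vector is augmented with a constant 1, and the weight on that slot plays the role of negative theta, so the activation test becomes a comparison against zero.

```python
import numpy as np

def add_bias(x):
    # Prepend a constant input of 1 (the "bias unit"); the weight learned
    # for this slot ends up acting as negative theta.
    return np.concatenate(([1.0], x))

def activate(w, x):
    # With the bias folded in, we threshold against zero instead of theta.
    return 1 if np.dot(w, add_bias(x)) >= 0 else 0
```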
How does the weight update work in the Perceptron Rule?
-The weight update in the Perceptron Rule is based on the difference between the target output and the network's current output. If the output is incorrect, the weights are adjusted in the direction that would reduce the error.
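A sketch of that update in Python, assuming the bias unit has already been folded into x as above (eta is the learning rate; names are illustrative):

```python
import numpy as np

def perceptron_update(w, x, y, eta=0.1):
    # y and y_hat are both 0 or 1, so (y - y_hat) is -1, 0, or +1:
    # zero when the output is already correct (no change to the weights),
    # +1 or -1 when the weighted sum needs to grow or shrink.
    y_hat = 1 if np.dot(w, x) >= 0 else 0
    return w + eta * (y - y_hat) * x
```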
What is the role of the learning rate in the Perceptron Rule?
-The learning rate in the Perceptron Rule controls the size of the step taken in the direction of reducing the error, preventing overshoot and ensuring gradual convergence.
What does it mean for a dataset to be linearly separable?
-A dataset is considered linearly separable if there exists a hyperplane that can perfectly separate the positive and negative examples.
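Written out, the definition says there exist weights and a threshold classifying every training example correctly (using the lecture's convention that the unit outputs 1 when the weighted sum reaches the threshold):

```latex
\exists\, w, \theta \quad \text{such that} \quad
\begin{cases}
w \cdot x \ge \theta & \text{for every positive example } (x,\, y = 1), \\
w \cdot x < \theta   & \text{for every negative example } (x,\, y = 0).
\end{cases}
```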
How does the Perceptron Rule handle linearly separable data?
-If the data is linearly separable, the Perceptron Rule will find a set of weights that correctly classify all examples in a finite number of iterations.
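Putting the pieces together, a compact end-to-end sketch of that loop (all names are illustrative; max_epochs is a practical safety cap, not part of the rule itself). On linearly separable data the inner pass eventually makes no updates and the loop halts:

```python
import numpy as np

def train_perceptron(X, y, eta=0.1, max_epochs=1000):
    # X: (n_examples, n_features) inputs; y: 0/1 targets.
    # Prepend a column of ones so the threshold is learned as w[0] (= -theta).
    Xb = np.hstack([np.ones((len(X), 1)), X])
    w = np.zeros(Xb.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(Xb, y):
            y_hat = 1 if np.dot(w, xi) >= 0 else 0
            if y_hat != yi:
                w += eta * (yi - y_hat) * xi  # nudge the sum toward the target
                mistakes += 1
        if mistakes == 0:   # a separating set of weights has been found
            return w
    return w  # cap reached; the data may not be linearly separable
```

For example, the AND function on 0/1 inputs is linearly separable, so train_perceptron(np.array([[0,0],[0,1],[1,0],[1,1]]), np.array([0,0,0,1])) halts after a handful of epochs.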
What happens if the data is not linearly separable?
-If the data is not linearly separable, the Perceptron Rule will not converge to a solution, and the algorithm will continue to iterate without finding a set of weights that can separate the data.
How can you determine if the data is linearly separable using the Perceptron Rule?
-In principle, you can run the Perceptron Rule and check whether it ever stops: if it stops, the data is linearly separable. The catch, raised in the transcript, is that this test only works in one direction, because a run that has not stopped yet might still stop later, so you can never observe "it does not stop" and conclusively declare the data not linearly separable.
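For context, a standard result not stated in the lecture (Novikoff's perceptron convergence theorem) quantifies "finite": if every input satisfies ||x|| <= R and the data are separable with margin gamma, the number of mistakes is at most

```latex
\left(\frac{R}{\gamma}\right)^{2}.
```

Since the margin gamma is unknown before a separator has been found, the bound still cannot be turned into a stopping criterion, which is exactly the difficulty the next answer raises.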
What is the significance of the halting problem in the context of the Perceptron Rule?
-The halting problem is the challenge of determining whether a program will ever finish running. If we could decide whether the Perceptron Rule halts on a given data set, we would know when to stop and declare the data not linearly separable. The transcript notes the relationship runs one way: solving the halting problem would let us solve this, but this particular problem might be solvable without solving the halting problem in general.
Outlines
🤖 Introduction to Machine Learning Weight Setting
The paragraph introduces the concept of setting machine learning model weights automatically from training examples, rather than manually. It contrasts two methods for determining weights: the Perceptron Rule and the Delta Rule (gradient descent). The Perceptron Rule uses thresholded outputs, while the Delta Rule uses unthresholded values. The focus then shifts to the Perceptron Rule for setting the weights of a single unit to match a training set. The training set consists of input vectors (x) and desired output values (y). The process involves iteratively modifying the weights to capture the training data. A learning rule is given for the weights, and a trick for learning the threshold (Theta) is discussed: treat it as just another weight by adding a bias unit, which eliminates the need to compare against a separate threshold value.
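The threshold trick summarized here can be written in one line: the unit fires when

```latex
\sum_{i} w_i x_i \;\ge\; \theta
\;\Longleftrightarrow\;
\sum_{i} w_i x_i - \theta \;\ge\; 0
\;\Longleftrightarrow\;
(w_1, \dots, w_n, -\theta) \cdot (x_1, \dots, x_n, 1) \;\ge\; 0,
```

so appending a constant input of 1 (the bias unit) whose weight is negative theta reduces every comparison to a comparison against zero.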
📊 The Perceptron Learning Rule and Linear Separability
This paragraph delves into the Perceptron Learning Rule, explaining how it updates weights based on the difference between the target output and the network's current output. The weight update is scaled by the learning rate and the input value, ensuring that the model doesn't overshoot the correct solution. The discussion then moves to linear separability: if a dataset can be split into positive and negative examples by a line, the Perceptron Rule will find such a line in a finite number of iterations. The paragraph also touches on the difficulty of determining linear separability in higher dimensions and the practical approach of running the Perceptron algorithm to see whether it converges, which would indicate linear separability. The conversation highlights the theoretical connection to the halting problem: solving the halting problem would let one decide linear separability, but deciding linear separability would not necessarily solve the halting problem.
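The update described in this paragraph, written out explicitly (eta is the learning rate):

```latex
\hat{y} \;=\;
\begin{cases}
1 & \text{if } \sum_i w_i x_i \ge 0 \\
0 & \text{otherwise}
\end{cases},
\qquad
\Delta w_i \;=\; \eta \,(y - \hat{y})\, x_i,
\qquad
w_i \;\leftarrow\; w_i + \Delta w_i.
```

Because y and y hat are each 0 or 1, the factor (y - y hat) is 0, +1, or -1, so correct outputs leave the weights untouched and wrong outputs push the weighted sum in the correcting direction.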
Keywords
💡Perceptron Rule
💡Gradient Descent
💡Weights
💡Threshold
💡Bias Unit
💡Learning Rate
💡Training Set
💡Linearly Separable
💡Approximate Output (y hat)
💡Error
Highlights
The Perceptron Rule and gradient descent are two methods for setting weights in machine learning.
The Perceptron Rule uses thresholded outputs, while gradient descent uses unthresholded values.
The Perceptron Rule is used to set the weights of a single unit to match a training set.
A training set consists of input vectors (x) and desired output values (y).
Weights are modified over time to capture the training data set.
A learning rule is given for the weights W, but not directly for the threshold Theta.
Theta is treated as another weight by adding a bias unit to the inputs.
The threshold can be treated the same as the weights by this method.
The weight change is defined as the difference between the target output and the network's current output.
The weight update is scaled by the learning rate and the input to the unit.
If the output is correct, there will be no change to the weights.
If the output is incorrect, the weights are adjusted to correct the error.
The learning rate controls the magnitude of the weight adjustments to avoid overshooting.
The Perceptron Rule can find a separating line for linearly separable data sets.
The algorithm will stop if a separating line is found, indicating linear separability.
If the algorithm does not stop, it suggests the data may not be linearly separable.
The algorithm's ability to stop can be used as a test for linear separability.
The simplicity of the Perceptron Rule makes it a powerful tool for machine learning.
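To tie these together, a toy run on a two-dimensional, linearly separable set like the one drawn in the transcript below (the points are invented for illustration, and train_perceptron is the sketch from the Q & A section above):

```python
import numpy as np

# Invented 2-D data: positives sit above the line x1 + x2 = 1, negatives below.
X = np.array([[0.9, 0.8], [0.7, 0.9], [1.0, 0.6],   # positive examples
              [0.1, 0.2], [0.3, 0.1], [0.2, 0.4]])  # negative examples
y = np.array([1, 1, 1, 0, 0, 0])

w = train_perceptron(X, y)
print(w)  # weights (including the bias) defining a separating half plane
```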
Transcripts
Alright. So in the examples up to this point, we've been
setting the weights by hand to make various functions happen. And that's
not really that useful in the context of machine learning. We'd really
like a system that, given examples, finds weights that map the inputs
to the outputs. And we're going to actually look at two different
rules that have been developed for doing exactly that, for figuring out
what the weights ought to be from training examples. One is called
the Perceptron Rule, and the other is called gradient descent or
the Delta Rule. And the difference between them is the
Perceptron Rule is going to make use of the thresholded
outputs, and the other mechanism is going to use
unthresholded values. Alright, so what we need to talk about
now is the Perceptron Rule, which is how to
set the weights of a single unit so that it
matches some training set. So we've got a training set,
which is a bunch of examples of x, these are vectors,
and we have y's, which are zeros and ones, which are the
output that we want to hit. And what we want to do is
set the weights so that we capture this same data
set. And we're going to do that by modifying the weights over time.
>> Oh, Michael, what's the series of dashes over on the left?
>> Oh, sorry, right. I should mention that. So
one of the things that we're going to do here is
we're going to give a learning rule for the weights W,
and not give a learning rule for Theta. But we do
need to learn the theta. So there's a very
convenient trick for actually learning it, by just treating it as
another kind of weight. So if you think about
the way that the thresholding function works: we're taking a
linear combination of the W's and X's, then we're comparing it
to theta. But if you think about just subtracting theta from both
sides, then in some sense theta just becomes another
one of the weights, and we're just comparing to
zero. So what I did here was take
the actual data, the x's, and I added what is
sometimes called a bias unit to it. So
basically, the input to that is always one. And the
weight corresponding to it is going to correspond to
negative theta, ultimately. So again, this just simplifies
things so that the threshold can be treated the same
as the weights. So from now on, we don't have
to worry about the threshold. It just gets folded into
the weights, and all our comparisons are going to be just
to zero instead of theta.
It certainly makes the math shorter. So okay, so this
is what we're going to do. We're going to iterate over this
training set, grabbing an x, which includes the bias piece,
and the y, where y is our target and x is our
input. And what we're going to do is change weight i,
the weight corresponding to the ith
unit, by the amount that we're changing the weight by. So
this is sort of a tautology, right: this is truly just
saying the amount we've changed the weight by is exactly delta
w, in other words, the amount we've changed the weight
by. So we need to define what that weight change is.
The weight change is going to be defined as
follows. We're going to take the target, the thing that
we want the output to be, and compare it
to what the network with the current weights actually spits
out. So we compute this y hat, this
approximate output y, by again summing up the inputs according
to the weights and comparing it to zero. That gets
us a zero-one value. So we're now comparing that to
what the actual value is. So what's going to happen here?
Let's look at this. Each of
y and y hat can only be zero or one. If they
are both zeros, then this y minus y hat is zero. And
what does that mean? It means the output
should have been zero and the output of our current network really
was zero, so that's kind of good. If they are both ones,
it means the output was supposed to be one and our network outputted
one, and the difference between them is going to be zero. But in
this other case, y minus y hat: if the output was supposed to
be zero but we said one, our network says one, then we
get a negative one. If the output was supposed to be one and
we said zero, then we get a positive one. Okay, so those
are the four cases for what's happening here. We're going to take that value,
multiply it by the current input to that unit i, scale it
down by this little thing that's going to be called the learning
rate, and use that as the weight update
change. So essentially what we are saying is, if the
output is already correct, either both on or both
off, then there's going to be no change to the
weights. But if our output is wrong, let's say
that we are giving a one when we should have
been giving a zero, that means our total
here is too large, and so we need to make
it smaller. How are we going to make it
smaller? Whichever inputs x_i correspond to very large
values, we're going to move those weights very far in
a negative direction. We're taking this negative one times that
value times this little learning rate. Alright, the
other case is if the output was supposed to be one
but we're outputting a zero; that means our total
is too small. And what this rule says is increase
the weights, essentially, to try to make the sum bigger. Now, we
don't want to kind of overdo it, and that's what this learning rate
is about. The learning rate basically says we'll figure out the direction that
we want to move things and just take a little step in that
direction. We'll keep repeating over all of the input-output pairs,
so we'll have a chance to really build things up,
but we're going to do it a little bit at a time so
we don't overshoot. And that's the
rule. It's actually extremely simple.
Actually writing this in code is quite trivial, and
yet it does some remarkable things. So let's imagine for a
second that we have a training set that looks like this.
It's in two dimensions, again, so that it's easy to visualize.
That we've got a bunch of positive examples, these green x's,
and we've got a bunch of negative examples, these red x's,
and we're trying to learn basically a half plane, right? We're
trying to learn a half plane that separates the positive from the
negative examples. So Charles, do you see a half plane
that we could put in here that would do the trick?
>> I do.
>> What would it look like?
>> It's that one.
>> By that one do you mean, this one?
>> Yeah. That's exactly what I was thinking, Michael.
>> That's awesome! Yeah, there isn't a whole lot
of flexibility in what the answer is in this case, if
we really want to get all the greens on one side and all
the reds on the other. If there is such a half
plane that separates the positive from the negative examples, then
we say that the data set is linearly separable, right? That
there is a way of separating the positives and negatives with
a line. And what's cool about the Perceptron Rule is that
if we have data that is linearly separable, the Perceptron Rule
will find it. It only needs a finite number of iterations
to find it, which I guess is really the
same as saying that it will actually find it. It won't
just eventually get around to getting something close to it;
it will actually find a line, and it will stop,
saying, okay, I now have a set of weights that
do the trick. So that's what happens if the data set
is in fact linearly separable, and that's pretty cool. It's
pretty amazing that it can do that. It's a very
simple rule, and it just goes through and iterates and
solves the problem. So, Charles, do you see any problems with this?
>> I can think of one.
What if it is not linearly separable?
>> Hmm, I see. So, if the data is
linearly separable, then the algorithm works, so the algorithm
simply needs to only be run when the data
is linearly separable. It's generally not that easy to tell, actually,
when your data is linearly separable. Here we
have it in two dimensions; if it's in 50 dimensions,
knowing whether or not there is a setting of
those parameters that makes it linearly separable is not so clear.
>> Well, there is one way you could do it.
>> What's that?
>> You could run this algorithm, and see if it ever stops.
>> I see, yes, of course. There's a problem with that particular scheme, right, which is,
well, for one thing, this algorithm never stops. So wait, we
need to address that. Really we should be running this loop here
while there's some error, so I neglected to say that
before. But what you'll notice is, if you continue to run
this after the point where it's getting all the answers right,
where it found a set of weights that linearly separate the positive
and negative instances, what will happen is, when it gets
to this delta w line, that y minus y hat will
always be zero. The weights will never change; we'll go back
and update them by adding zero to them repeatedly, over and
over again. So if it ever does reach zero
error, if it ever does separate the data set,
then we can just put a little condition in
there and tell it to stop iterating. So what you
are suggesting is that we could run this algorithm,
and if it stops, then we know that it is
linearly separable, and if it doesn't stop, then we
know that it's not linearly separable, right? By this guarantee.
>> Sure.
>> The problem is we don't know when finite is done, right?
If this were like 1,000 iterations, we could run it for 1,000, and if it wasn't
done, it's not done. But all we know at this point is that it's a finite number
of iterations, and so that could be a
thousand, ten thousand, a million, ten million. We
don't know, so we never know when to
stop and declare the data set not linearly separable.
>> Hmm, so if we could do that,
then we would have solved the halting problem, and
we would all have Nobel Prizes.
>> Well, that's not necessarily the case. But certainly the other
direction is true: if we could solve
the halting problem, then we could solve this.
>> Hm.
>> But it could be that this problem
might be solvable even without solving the halting problem.
>> Fair enough. Okay.