Week 1 Lecture 2 - Supervised Learning
Summary
TL;DR: This module delves into supervised learning, focusing on classification and regression tasks. It discusses using labeled data to predict outcomes such as whether a customer will buy a computer, employing models ranging from straight lines to curves. The importance of generalization, avoiding overfitting, and selecting the right complexity for a classifier is highlighted. The script also touches on inductive biases, the training process, and applications of regression in time series prediction and trend analysis.
Takeaways
- 📚 The module focuses on supervised learning, which involves using labeled data to build a classifier or model for prediction.
- 🛍️ An example given is using a customer database with attributes like age and income to predict whether a customer will buy a computer or not.
- 📈 The script discusses the idea of creating a function or mapping that takes inputs (age and income) and predicts an output (buy/not buy).
- 📊 It highlights geometric interpretations of data, such as using lines or curves to classify data points into different categories based on their attributes.
- 🔍 The importance of considering the complexity of the classifier versus its accuracy is emphasized, noting the trade-off between the two.
- ⚖️ The concept of inductive bias is introduced, which includes language bias (type of lines or curves used) and search bias (order of examining possible lines/curves).
- 🔢 The process of training a model involves using a training set, evaluating it with a validation set, and iterating if necessary to improve the model.
- 🔄 The iterative process of a learning agent involves producing an output, comparing it with the actual target, calculating the error, and adjusting the agent to minimize future errors.
- 🌐 Applications of supervised learning are vast, including fraud detection, sentiment analysis, churn prediction, medical diagnosis, and more.
- 📉 The script also covers regression, a type of supervised learning where the output is a continuous value, using examples like predicting temperatures based on time of day.
- 🔧 Linear regression is mentioned as a method to fit a line that minimizes prediction error, often using the least squares approach to handle continuous outputs effectively.
Q & A
What is the primary goal of supervised learning as described in the script?
-The primary goal of supervised learning, as described in the script, is to predict a specific output based on labeled input data. In this case, the goal is to predict whether a customer will buy a computer or not based on their age and income attributes.
What are the two main attributes used to describe the customers in the given example?
-The two main attributes used to describe the customers are age and income.
What is the difference between classification and regression in the context of supervised learning?
-In classification, the output is a discrete value, such as yes or no in the example of predicting computer purchase. In regression, the output is a continuous value, like temperature at different times of the day.
How does the script describe the process of creating a function for classification?
-The script describes creating a function for classification by drawing lines or curves in the input space to separate the classes. Initially, a simple line based on income is used, but later a more complex function considering both age and income is introduced for better accuracy.
What is the term used for the assumption made about the distribution of input data and class labels in the script?
-The term used for the assumption made about the distribution of input data and class labels is 'inductive bias'.
What are the two categories of inductive bias mentioned in the script?
-The two categories of inductive bias mentioned are 'language bias', which refers to the type of lines or curves to be drawn, and 'search bias', which refers to the order in which the possible lines or curves are examined.
How does the script explain the concept of overfitting in the context of regression?
-The script explains overfitting as trying to fit the noise in the data, where the solution attempts to predict the noise in the training data correctly, rather than capturing the underlying trend or pattern.
What is the method described in the script to avoid overfitting in regression?
-The approach described is to prefer the simpler model: rather than fitting a complex curve that passes through every noisy point, fit a straight line by linear regression, which minimizes the sum of the squared prediction errors instead of chasing the noise in the training data.
What is the purpose of a validation set in the training process of a classifier?
-The purpose of a validation set is to evaluate the performance of the training algorithm without showing the labels to the algorithm. It helps to assess whether the classifier is accurate and to make adjustments if necessary.
How does the script illustrate the concept of generalization in supervised learning?
-The script illustrates generalization by discussing the need to make assumptions about the lines or curves that segregate different classes, allowing the classifier to predict outcomes for new, unseen data points based on the training data.
What are some of the applications of supervised learning mentioned in the script?
-Some applications mentioned in the script include fraud detection, sentiment analysis, churn prediction, medical diagnosis, time series prediction, trend analysis, and risk factor analysis.
Outlines
📊 Supervised Learning and Data Classification
This paragraph introduces the concept of supervised learning, focusing on the use of a customer database to predict computer purchases based on age and income attributes. The goal is to build a classifier that maps inputs to a discrete output (yes or no). The speaker discusses the geometric interpretation of data and the idea of creating functions, such as lines or curves, to separate different classes in the input space. The example provided illustrates a simple linear classifier that uses income as the sole attribute to predict computer purchases, highlighting the trade-off between simplicity and accuracy.
📈 Refining Classifiers with Age and Income Considerations
The speaker refines the classification model by incorporating both age and income as factors for predicting computer purchases. The explanation details how the model's performance improves by adjusting thresholds based on age, requiring higher income for older customers to be classified as likely to purchase a computer. The paragraph emphasizes the balance between model complexity and performance, noting the potential for overfitting when a model becomes too tailored to the training data.
🔍 The Evolution of Classifier Complexity
This section discusses the progression from simple linear classifiers to more complex models, such as quadratic functions, to improve prediction accuracy. The speaker points out the challenges of defining highly complex classifiers that may not necessarily yield better results and the importance of considering noise in the data. The concept of inductive bias is introduced, explaining how assumptions about the form of classification boundaries (language bias) and the search strategy for finding these boundaries (search bias) influence the generalization from training data to the entire input space.
🔧 Training Algorithms and Model Evaluation
The paragraph delves into the process of training algorithms, where a set of inputs and corresponding outputs (X and Y) are used to train a model. The speaker explains the importance of normalization to manage numerical instabilities and the iterative process of model training, validation, and refinement. The role of a test or validation set in evaluating the model's performance is highlighted, along with the potential need to adjust parameters or algorithms based on performance outcomes.
🌐 Applications of Supervised Learning
The speaker provides various applications of supervised learning, such as fraud detection, sentiment analysis, churn prediction, medical diagnosis, and risk analysis. The paragraph also introduces the concept of regression as a type of supervised learning where the output is a continuous value, contrasting it with classification tasks. Examples of fitting lines or curves to data points to predict outcomes, such as temperature over time, are given, and the issue of overfitting in complex models is discussed.
📉 Regression Analysis and Its Applications
This paragraph focuses on linear regression as a method for predicting continuous outcomes, explaining the process of minimizing the sum of squared errors to find the best-fit line. The speaker clarifies that linear regression can be applied to higher-order functions by transforming input variables, thus solving complex problems. The paragraph concludes by discussing the wide range of applications for regression, including time series prediction, classification, data reduction, trend analysis, and risk factor analysis.
Keywords
💡Supervised Learning
💡Classifier
💡Attributes
💡Geometric Interpretation
💡Inductive Bias
💡Training Set
💡Validation
💡Overfitting
💡Linear Regression
💡Regression
💡Feature Transformation
Highlights
Introduction to supervised learning and the concept of using experience with labeled data to predict outcomes such as whether a customer will buy a computer.
Description of the customer database using two attributes, age and income, for the purpose of classification.
The goal of creating a function or mapping to predict binary outcomes (yes/no) based on input attributes.
Geometric interpretation of data points and the natural approach of defining functions through lines or curves in the input space.
Simple function definition based on income threshold to classify computer buyers, ignoring the age variable initially.
Improvement of the classification function by considering both age and income, leading to a more complex but accurate model.
The trade-off between model complexity and performance, with the example of a quadratic function providing better accuracy but at a higher complexity.
The issue of overfitting when a model becomes too complex and tries to predict noise in the training data.
The concept of inductive biases in machine learning, including language bias and search bias, which influence the type of model and the search process for the best model.
The process of training a model with a training set and evaluating it with a test or validation set to ensure generalization.
The iterative process of adjusting the model based on error comparison between predicted and actual outputs.
Applications of supervised learning in various fields such as fraud detection, sentiment analysis, churn prediction, and medical diagnosis.
The difference between supervised learning for classification and regression, with regression predicting continuous values like temperature.
The challenge of avoiding overfitting in regression by finding the right balance between model complexity and prediction error.
The versatility of linear regression in fitting higher-order functions by transforming input variables, thus solving complex problems.
Practical applications of regression in time series prediction, classification using regression curves, data reduction, trend analysis, and risk factor analysis.
Transcripts
So in this module we will look at supervised learning. If you remember, in supervised learning we talked about experience, where you have some kind of a description of the data. So in this case let us assume that I have a customer database, and I am describing it by two attributes here: age and income. So for each customer that comes to my shop, I know the age of the customer and the income level of the customer.
And my goal is to predict whether the customer will buy a computer or not. So I have this kind of labeled data that is given to me for building a classifier. Remember, we talked about classification, where the output is a discrete value; in this case it is yes or no: yes, the person will buy a computer; no, the person will not buy a computer. And the way I describe the input is through a set of attributes; in this case we are looking at age and income as the attributes that describe the customer.
And so now the goal is to come up with a function, a mapping, that will take the age and income as input and give you an output that says whether the person will buy the computer or not. There are many different ways in which you can create this function.
And given that we are actually looking at a geometric interpretation of the data, looking at the data as points in space, one of the most natural ways of thinking about defining this function is by drawing lines or curves on the input space. So here is one possible example: I have drawn a line, and the points to the left of the line are predominantly red, so everything to the left of the line would be classified as will not buy a computer, and everything to the right of the line, where the data points are predominantly blue, will be classified as will buy a computer.
So what would the function look like? Remember that the x-axis is income and the y-axis is age. In this case it basically says that if the income of the person is less than some value X, then the person will not buy a computer; if the income is greater than X, the person will buy a computer. So that is the kind of simple function we will define.
Just notice that this way we completely ignore one of the variables here, which is the age. We are just going by income: if the income is less than X the person will not buy a computer, and if the income is greater than X the person will buy a computer. So is this a good rule? More or less; I mean, we get most of the points correct except a few. So it looks like, yeah, we can survive with this rule. This is not too bad, but then you can do slightly better.
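To make this rule concrete, here is a minimal sketch of the income-threshold classifier in Python. The threshold of 50,000 and the sample customers are hypothetical values chosen for illustration; the lecture does not state specific numbers.

```python
# A minimal sketch of the income-threshold rule described above. The
# threshold (50,000) and the sample customers are hypothetical values;
# the lecture does not state specific numbers.

INCOME_THRESHOLD = 50_000

def will_buy_computer(income, age):
    """Classify a customer, ignoring age entirely, as in the first rule."""
    return income >= INCOME_THRESHOLD   # True -> buys, False -> does not buy

for income, age in [(30_000, 25), (80_000, 45), (55_000, 30)]:
    label = "buys" if will_buy_computer(income, age) else "does not buy"
    print(f"income={income}, age={age}: {label}")
```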
All right, so now those two red points that were on the wrong side of the line earlier now seem to be on the right side. So everything to the left of this line will not buy a computer, and everyone to the right will buy a computer. If you think about what has happened here, we have improved our performance measure, but at the cost of something; so what is the cost here? Earlier we were only paying attention to the income, but now we have to pay attention to the age as well. The older you are, the higher the income threshold at which you will buy a computer; the younger you are (younger means lower on the y-axis), the lower the income threshold at which you will buy a computer.
So is that clear? The older you are, the more the income threshold is shifted to the right here. So the older you are, you need to have a higher income before you buy a computer, and the younger you are, your income threshold is lower, so you do not mind buying a computer even if your income is slightly lower. So now we have to start paying attention to the age, but the advantage is that you get much better performance.
Can you do better than this? Yes. Now almost everything is correct except that one pesky red point; everything else is correct. So what has happened here: we get much better performance, but at the cost of having a more complex classifier.
So earlier, if you thought about it in geometric terms, first you had a line that was parallel to the y-axis; therefore I just needed to define an intercept on the x-axis. If X was less than some value it was one class, and if it was greater than that value it was another class. The second function was actually a slanting line, so I needed to define both the intercept and the slope.
And now here it is a quadratic, so I have to define three parameters: something like ax² + bx + c. I have to define a, b, c, the three parameters, in order to define the quadratic, and I am getting better performance.
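As a rough sketch, the three classifiers can be written as decision rules of increasing parameter count. All coefficients below are hypothetical placeholders; the lecture draws the boundaries rather than stating fitted values.

```python
# Three decision rules of increasing complexity, with income on the x-axis
# and age on the y-axis as in the lecture's plots. All coefficients are
# hypothetical placeholders, not values from the lecture.

def rule_vertical_line(income, age, x0=50_000):
    # 1 parameter: an intercept on the income axis; age is ignored.
    return income >= x0

def rule_slanted_line(income, age, slope=1_000, intercept=25_000):
    # 2 parameters: the income threshold now grows with age.
    return income >= slope * age + intercept

def rule_quadratic(income, age, a=2.0, b=500.0, c=20_000.0):
    # 3 parameters (a, b, c): a quadratic boundary income >= a*age**2 + b*age + c.
    return income >= a * age**2 + b * age + c
```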
So can you do better than this? Okay, that somehow does not seem right; that seems to be too complex a function just for getting that one point. I am not even sure how many parameters you need for drawing that, because PowerPoint uses some kind of spline interpolation to draw this curve, and I am pretty sure it has got a lot more parameters than it is worth. Another thing to note here is that the particular red point you see is actually surrounded by a sea of blue.
So it is quite likely that there was some glitch there: either the person actually bought a computer and we have not recorded it, or there was some extraneous reason, say the person came into the shop sure that he was going to buy a computer, but then got a phone call saying there was some emergency, please come immediately, and therefore left without buying a computer. There could be a variety of reasons for why that noise occurred, and this would probably be the more appropriate classifier.
So these are the kinds of issues I would like to think about: what is the complexity of the classifier that I would like to have, versus the accuracy of the classifier, that is, how good is the classifier at actually recovering the right input-output map? And is there noise in the data, in the experience that I am getting: is it clean or is there noise in it, and if so, how do I handle that noise? These are the kinds of issues that we have to look at, okay.
So these kinds of lines that we drew are kind of hiding one assumption that we are making. The thing is, the data that comes to me comes as discrete points in the space, and from these discrete points I need to generalize and be able to say something about the entire state space. I do not care where the data point is on the x- and y-axes; I should be able to give a label to it.
If I do not have some kind of assumption about these lines, the only thing I can do is this: if the same customer comes again, or somebody who has the exact same age and income as that customer comes again, I can tell you whether the person is going to buy a computer or not, but I will not be able to tell you about anything outside of that experience.
So the assumption we made is that everything to the left of the line does one thing and everything to the right does the other: everyone to the left of the line will not buy a computer, everyone to the right will buy a computer. The assumption was that the lines, or the curves, are able to segregate people who will buy from people who will not buy. That is the kind of assumption I made about the distribution of the input data and the class labels.
These kinds of assumptions that we make about the lines are known as inductive biases. In general, inductive bias has two different categories. One is called language bias, which is essentially the type of lines I am going to draw: am I going to draw straight lines or am I going to draw curves, what order of polynomials am I going to look at, and so on; these form my language bias. Search bias is the other form of inductive bias, which tells me in what order I am going to examine all these possible lines.
So putting these things together, we are able to generalize from a few training points to the entire space of inputs. I will make this more formal as we go on, in the next set of modules.
And so here is one way of looking at the whole process. I am going to be given a set of data which we will call the training set. The training set will consist of an input, which we will call X, and an output, which we will call Y; so I am going to have a set of inputs X1, X2, X3, X4, and likewise I will have Y1, Y2, Y3, Y4, and this data is fed into a training algorithm. And the data is going to look like this in our case.
Remember, the X's are the inputs, so in this case each should have the income and the age: x1 is like (30,000, 25), x2 is like (80,000, 45), and so on. The Y's, or the labels, correspond to the colors in the previous picture: y1 does not buy a computer, y2 buys a computer, and so on. This essentially gives me the color coding, so y1 is essentially red and y2 is blue. If I am going to use something numeric, which is what we will be doing later on, I really cannot use these values as they are. First of all, the Y's are not numeric, and the X's vary too much.
The first coordinate in the X's is like 30,000 and 80,000 and so on, while the second coordinate is like 25 and 45, which is a lot smaller in magnitude, and this will lead to some kind of numerical instabilities. So what we typically end up doing is normalizing these so that they fall approximately in the same range; you can see that I have tried to normalize these X values between 0 and 1.
So I have chosen an income level of, say, 2 lakhs (200,000) as the maximum and an age of 100, and you can see the normalized values. Likewise, for buys and does not buy, I have taken 'does not buy' as -1 and 'buys a computer' as +1. These are arbitrary choices for now, but later on you will see that there are specific reasons for wanting to choose this encoding. Then the training algorithm chugs over this data and produces a classifier. Now, I do not know whether this classifier is good or bad; we had a straight line in the first case, an axis-parallel line, and we did not know whether it was good or bad, so we need to have some mechanism by which we evaluate it.
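Here is a minimal sketch of the normalization and encoding step, assuming the caps mentioned in the lecture: an income ceiling of 2 lakhs (200,000) and an age ceiling of 100, with labels encoded as -1 and +1.

```python
# Min-max style normalization using the caps mentioned in the lecture:
# income scaled by a maximum of 2 lakhs (200,000), age by a maximum of
# 100, and labels encoded as -1 (does not buy) / +1 (buys).

MAX_INCOME, MAX_AGE = 200_000, 100

def normalize(income, age):
    return income / MAX_INCOME, age / MAX_AGE

def encode_label(buys):
    return 1 if buys else -1

X_raw = [(30_000, 25), (80_000, 45)]
y_raw = [False, True]                 # does not buy, buys

X = [normalize(income, age) for income, age in X_raw]
y = [encode_label(b) for b in y_raw]
print(X)   # [(0.15, 0.25), (0.4, 0.45)] -- both features now lie in [0, 1]
print(y)   # [-1, 1]
```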
So how do we do the evaluation? Typically you have what is called a test set or a validation set. This is another set of x and y pairs, like we had in the training set; again, in the test set we know what the labels are, it is just that we are not showing them to the training algorithm. We know the labels because we need the correct labels to evaluate whether the training algorithm is doing well or badly. This process by which the evaluation happens is called validation. Then, at the end of the validation, if you are happy with the quality of the classifier, you can keep it. If you are not happy, you go back to the training algorithm and say: hey, I am not happy with what you produced, give me something different. So we either iterate over the algorithm again, going over the data again and trying to refine the parameter estimates, or we could even think of changing some parameter values and redoing the training all over again. But this is the general process, and we will see that many of the different algorithms we look at during the course of these lectures actually follow this kind of a process.
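The train-validate-iterate loop might be sketched as follows; `train_once`, the target accuracy, and the retry limit are hypothetical placeholders, since the lecture describes the loop only abstractly.

```python
# A schematic of the train -> validate -> iterate loop described above.
# `train_once` is a hypothetical placeholder for whatever learning
# algorithm is plugged in; the lecture does not prescribe one.

def accuracy(classifier, X, y):
    """Fraction of held-out points the classifier labels correctly."""
    return sum(classifier(x) == t for x, t in zip(X, y)) / len(y)

def fit_with_validation(train_once, X_train, y_train, X_val, y_val,
                        target=0.9, max_rounds=10):
    classifier = train_once(X_train, y_train)
    for _ in range(max_rounds):
        if accuracy(classifier, X_val, y_val) >= target:
            break  # happy with the quality of the classifier: keep it
        # not happy: retrain (in practice, with refined parameters
        # or different settings) and evaluate again
        classifier = train_once(X_train, y_train)
    return classifier
```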
So what happens inside that green box? Inside the training algorithm there is this learning agent, which takes an input and produces an output Ŷ, which it thinks is the correct output. But it compares that against the actual target Y it was given in the training data; in the training you actually have a target Y. It compares its output against the target Y, figures out what the error is, and uses the error to change the agent, so that it can produce the right output the next time around. This is essentially an iterative process: you see the input, produce an output Ŷ, take the target Y, compare it to Ŷ, figure out the error, and use the error to change the agent again. This is by and large the way most learning algorithms operate; most classification algorithms, and even regression algorithms, operate this way, and we will see how each of these works as we go on.
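One concrete instance of this error-driven loop, under the +1/-1 label encoding used earlier, is a perceptron-style update. The lecture does not name a specific algorithm inside the green box, so this is just an illustrative choice:

```python
# One concrete instance of the error-driven loop: a perceptron-style
# update on +1/-1 encoded labels. The update rule is an illustrative
# choice, not the specific algorithm used in the lecture.

def train_linear_agent(X, y, lr=0.1, epochs=100):
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for x, target in zip(X, y):
            # produce the agent's output Y-hat for this input
            y_hat = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1
            error = target - y_hat          # compare against the target Y
            if error != 0:                  # adjust the agent to reduce error
                w = [wi + lr * error * xi for wi, xi in zip(w, x)]
                b += lr * error
    return w, b

# Hypothetical normalized (income, age) pairs with -1/+1 labels:
w, b = train_linear_agent([(0.15, 0.25), (0.40, 0.45), (0.70, 0.30)], [-1, 1, 1])
```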
There are many, many applications; too numerous to list. Here are a few examples. You could look at fraud detection: we have some data where the input is a set of transactions made by a user, and you flag each transaction as a valid transaction or not. You could look at sentiment analysis, variously called opinion mining or buzz analysis, where I give you a piece of text or a review written about a product or a movie, and you tell me whether the review is positive or negative, and what negative points people are mentioning, and so on; this again is a classification task. Or you could use it for churn prediction, where you predict whether a customer who is in the system is likely to leave, or is going to continue using your product or service for a longer period of time. When a person leaves your service you call the person a churner, and you can label whether a person is a churner or not. And I have been giving you examples from medical diagnosis all through; apart from actually diagnosing whether a person has a disease or not, you could also use it for risk analysis in a slightly indirect way. I will talk about that when we do the algorithms for classification.
So we talked about how we are interested in learning different lines or curves that can separate different classes in supervised learning. These curves can be represented using different structures, and throughout the course we will be looking at different kinds of learning mechanisms: artificial neural networks, support vector machines, decision trees, nearest neighbors and Bayesian networks. These are some of the popular ones, and we will look at them in more detail as the course progresses. So another supervised learning problem is the one of prediction, or regression, where the output you are going to predict is no longer a discrete value. It is not like 'will buy a computer' or 'will not buy a computer'; it is more of a continuous value. So here is an example: at different times of day you have recorded the temperature, so the input to the system is going to be the time of day, and the output from the system is going to be the temperature that was measured at that time. Your experience, or your training data, is going to take this form: the blue points would be your inputs and the red points would be the outputs that you are expected to predict.
Note here that the outputs are continuous, or real-valued, and you could think of this toy example as points to the left being day and points to the right being night. Just as in the previous case of classification, we could try the simplest possible fit, which would be to draw a straight line that is as close as possible to these points. Now, you do see that, as in the classification case, when we choose a simple solution there are certain points at which we are making large errors, so we could try to fix that and do something fancier. But you can see that while the daytime temperatures are more or less fine, with the night times we seem to be doing something really off, because we are going off too much on the right-hand side. Or you could do something even more complex, just as in the classification case where we wanted to get that one point right, and try to fit all the temperatures that were given to us by using a sufficiently complex curve.
And again, as we discussed earlier, this is probably not the right answer, and you are in this case, perhaps surprisingly, better off fitting the straight line. These kinds of solutions, where we try to fit the noise in the data, where the solution tries to predict the noise in the training data correctly, are known as overfitting, or overfit solutions, and one of the things we look to avoid in machine learning is overfitting to the training data. We will talk about this again in due course.
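A quick way to see the effect is to fit the same noisy data with a straight line and with a high-degree polynomial; the synthetic data below is purely for illustration.

```python
# Fit the same noisy data with a line (2 parameters) and a degree-9
# polynomial (10 parameters). The synthetic data is for illustration only.
import numpy as np

rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 20)                   # time of day, rescaled to [0, 1]
temp = 20 + 5 * t + rng.normal(0, 0.5, t.size)  # trend plus measurement noise

line = np.polyfit(t, temp, deg=1)    # captures the underlying trend
wiggly = np.polyfit(t, temp, deg=9)  # starts chasing the noise

t_query = 0.52                        # a time not in the training data
print(np.polyval(line, t_query))      # close to the true trend
print(np.polyval(wiggly, t_query))    # can swing between the data points
```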
So what we typically would like to do is what is called linear regression; some of you might have come across this under different circumstances. The typical aim in linear regression is to take the error that your line is making: say I take an example point somewhere here. This is the actual training data that is given to you, and this is the prediction that your line makes at that point, so that quantity is essentially the prediction error the line is making. What you do is try to find the line that has the least prediction error: you take the square of the errors that your prediction is making and then try to minimize the sum of the squares of the errors. Why do we take the squares? Because errors could be both positive and negative, and we want to make sure we are minimizing them regardless of the sign of the error, okay.
So with sufficient data, linear regression is simple enough that you could just solve it using matrix inversions, as we will see later; but with many dimensions the challenge is to avoid overfitting, like we talked about earlier, and there are many ways of avoiding that.
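A minimal sketch of that matrix-inversion view, on synthetic data: the least-squares weights satisfy the normal equations (XᵀX)w = Xᵀy.

```python
# Least squares "solved with matrix inversions": the best-fit weights
# satisfy the normal equations (X^T X) w = X^T y. Synthetic data below.
import numpy as np

X = np.array([[1.0, 0.1],
              [1.0, 0.4],
              [1.0, 0.7],
              [1.0, 0.9]])             # first column is a constant bias term
y = np.array([1.2, 1.9, 2.8, 3.1])     # continuous targets

w = np.linalg.solve(X.T @ X, X.T @ y)  # minimizes the sum of squared errors
print(w, ((y - X @ w) ** 2).sum())     # fitted coefficients, total squared error
```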
And I will again talk about this in detail when we look at linear regression. One point that I want to make is that linear regression is not as simple as it sounds. Here is an example: I have two input variables, x1 and x2, and if I try to fit a straight line with x1 and x2, I will probably end up with something like a1·x1 + a2·x2, which looks like a plane in two dimensions.
But then, if I take these two dimensions and transform the input, so that instead of just x1 and x2 my input is going to look like x1², x2², x1·x2, and then x1 and x2 as they were in the beginning, then instead of looking at a two-dimensional input I am going to look at a five-dimensional input. Now I am going to fit a line, a linear plane, in this five-dimensional input, which will look like a1·x1² + a2·x2² + a3·x1·x2 + a4·x1 + a5·x2. That is no longer the equation of a line in two dimensions; it is the equation of a second-order polynomial in two dimensions. But I can still think of this as doing linear regression, because I am only fitting a function that is linear in the transformed input variables.
So by choosing an appropriate transformation of the inputs, I can fit any higher-order function, and I could solve very complex problems using linear regression; so it is not really as weak a method as you would think at first glance. Again, we will look at this in slightly more detail in later lectures.
more detail in the later lectures right and regression our prediction can be applied in
a variety of places – one popular place is in time series prediction you could think about predicting
rainfall in a certain region or how much you are going to spend on your telephone calls you could
think of doing even classification using this, if you think of; you remember our encoding of +1 and
-1 for the class labels. So you could think of +1 and -1 as the outputs right and then you can fit a
regression line regression curve to that and if the output is greater than 0 you would say this
class is +1 its output is less than 0 you see the class is -1 so it could use the regression
ideas to solve the classification problem and you could also do data reduction. So I really do not
want to give you all the millions of data points that I have in my data set but what I would do is
essentially fit the curve to that and then give you just the coefficients of the curve right.
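A minimal sketch of classification via regression, using the -1/+1 encoding from earlier (the data points are hypothetical normalized income/age pairs):

```python
# Classification via regression: fit a linear function to -1/+1 targets,
# then threshold the continuous output at zero.
import numpy as np

X = np.array([[0.15, 0.25], [0.40, 0.45], [0.10, 0.60], [0.70, 0.30]])
y = np.array([-1, 1, -1, 1])                # -1 = does not buy, +1 = buys

Xb = np.hstack([X, np.ones((len(X), 1))])   # append a constant bias column
w, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # least-squares regression fit

def classify(x):
    score = np.dot(np.append(x, 1.0), w)
    return 1 if score > 0 else -1           # threshold the output at 0

print([classify(x) for x in X])             # [-1, 1, -1, 1]
```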
And more often than not, giving just the coefficients is sufficient for us to get a sense of the data, which brings us to the next application I have listed there, which is trend analysis. Quite often I am not really interested in the actual values of the data but more in the trends. For example, suppose I have a solution whose running times I am trying to measure. I am not really interested in the actual running time, because whether it is 37 seconds or 38 seconds is not going to tell me much, but I would really like to know whether the running time scales linearly or exponentially with the size of the input. Those kinds of analyses, again, can be done using regression. And the last one here is again risk factor analysis, like we had in classification: you can look at which factors contribute most to the output. So that brings us to the end of this module on supervised learning.