Week 1 Lecture 2 - Supervised Learning

Machine Learning - Balaraman Ravindran

4 Aug 2021 · 24:36

Summary

TL;DR: This module delves into supervised learning, focusing on classification and regression tasks. It discusses using labeled data to predict outcomes, such as whether a customer will buy a computer, employing models ranging from simple lines to more complex curves. The importance of generalization, avoiding overfitting, and selecting the right complexity for a classifier is highlighted. The script also touches on inductive biases, the training process, and applications of regression in time series prediction and trend analysis.

Takeaways

  • 📚 The module focuses on supervised learning, which involves using labeled data to build a classifier or model for prediction.
  • 🛍️ An example given is using a customer database with attributes like age and income to predict whether a customer will buy a computer or not.
  • 📈 The script discusses the idea of creating a function or mapping that takes inputs (age and income) and predicts an output (buy/not buy).
  • 📊 It highlights geometric interpretations of data, such as using lines or curves to classify data points into different categories based on their attributes.
  • 🔍 The importance of considering the complexity of the classifier versus its accuracy is emphasized, noting the trade-off between the two.
  • ⚖️ The concept of inductive bias is introduced, which includes language bias (type of lines or curves used) and search bias (order of examining possible lines/curves).
  • 🔢 The process of training a model involves using a training set, evaluating it with a validation set, and iterating if necessary to improve the model.
  • 🔄 The iterative process of a learning agent involves producing an output, comparing it with the actual target, calculating the error, and adjusting the agent to minimize future errors.
  • 🌐 Applications of supervised learning are vast, including fraud detection, sentiment analysis, churn prediction, medical diagnosis, and more.
  • 📉 The script also covers regression, a type of supervised learning where the output is a continuous value, using examples like predicting temperatures based on time of day.
  • 🔧 Linear regression is mentioned as a method to fit a line that minimizes prediction error, often using the least squares approach to handle continuous outputs effectively.

Q & A

  • What is the primary goal of supervised learning as described in the script?

    -The primary goal of supervised learning, as described in the script, is to predict a specific output based on labeled input data. In this case, the goal is to predict whether a customer will buy a computer or not based on their age and income attributes.

  • What are the two main attributes used to describe the customers in the given example?

    -The two main attributes used to describe the customers are age and income.

  • What is the difference between classification and regression in the context of supervised learning?

    -In classification, the output is a discrete value, such as yes or no in the example of predicting computer purchase. In regression, the output is a continuous value, like temperature at different times of the day.

  • How does the script describe the process of creating a function for classification?

    -The script describes creating a function for classification by drawing lines or curves in the input space to separate the classes. Initially, a simple line based on income is used, but later a more complex function considering both age and income is introduced for better accuracy.

  • What is the term used for the assumption made about the distribution of input data and class labels in the script?

    -The term used for the assumption made about the distribution of input data and class labels is 'inductive bias'.

  • What are the two categories of inductive bias mentioned in the script?

    -The two categories of inductive bias mentioned are 'language bias', which refers to the type of lines or curves to be drawn, and 'search bias', which refers to the order in which the possible lines or curves are examined.

  • How does the script explain the concept of overfitting in the context of regression?

    -The script explains overfitting as trying to fit the noise in the data, where the solution attempts to predict the noise in the training data correctly, rather than capturing the underlying trend or pattern.

  • What is the method described in the script to avoid overfitting in regression?

    -The approach described is to prefer a simpler model rather than a complex curve that chases the noise: in the lecture's example, a straight line fit by linear regression, which minimizes the sum of the squares of the prediction errors, generalizes better than a curve that passes through every training point.

  • What is the purpose of a validation set in the training process of a classifier?

    -The purpose of a validation set is to evaluate the performance of the training algorithm without showing the labels to the algorithm. It helps to assess whether the classifier is accurate and to make adjustments if necessary.

  • How does the script illustrate the concept of generalization in supervised learning?

    -The script illustrates generalization by discussing the need to make assumptions about the lines or curves that segregate different classes, allowing the classifier to predict outcomes for new, unseen data points based on the training data.

  • What are some of the applications of supervised learning mentioned in the script?

    -Some applications mentioned in the script include fraud detection, sentiment analysis, churn prediction, medical diagnosis, time series prediction, trend analysis, and risk factor analysis.

Outlines

00:00

📊 Supervised Learning and Data Classification

This paragraph introduces the concept of supervised learning, focusing on the use of a customer database to predict computer purchases based on age and income attributes. The goal is to build a classifier that maps inputs to a discrete output (yes or no). The speaker discusses the geometric interpretation of data and the idea of creating functions, such as lines or curves, to separate different classes in the input space. The example provided illustrates a simple linear classifier that uses income as the sole attribute to predict computer purchases, highlighting the trade-off between simplicity and accuracy.

05:04

📈 Refining Classifiers with Age and Income Considerations

The speaker refines the classification model by incorporating both age and income as factors for predicting computer purchases. The explanation details how the model's performance improves by adjusting thresholds based on age, requiring higher income for older customers to be classified as likely to purchase a computer. The paragraph emphasizes the balance between model complexity and performance, noting the potential for overfitting when a model becomes too tailored to the training data.

10:08

🔍 The Evolution of Classifier Complexity

This section discusses the progression from simple linear classifiers to more complex models, such as quadratic functions, to improve prediction accuracy. The speaker points out the challenges of defining highly complex classifiers that may not necessarily yield better results and the importance of considering noise in the data. The concept of inductive bias is introduced, explaining how assumptions about the form of classification boundaries (language bias) and the search strategy for finding these boundaries (search bias) influence the generalization from training data to the entire input space.

15:11

🔧 Training Algorithms and Model Evaluation

The paragraph delves into the process of training algorithms, where a set of inputs and corresponding outputs (X and Y) are used to train a model. The speaker explains the importance of normalization to manage numerical instabilities and the iterative process of model training, validation, and refinement. The role of a test or validation set in evaluating the model's performance is highlighted, along with the potential need to adjust parameters or algorithms based on performance outcomes.

20:13

🌐 Applications of Supervised Learning

The speaker provides various applications of supervised learning, such as fraud detection, sentiment analysis, churn prediction, medical diagnosis, and risk analysis. The paragraph also introduces the concept of regression as a type of supervised learning where the output is a continuous value, contrasting it with classification tasks. Examples of fitting lines or curves to data points to predict outcomes, such as temperature over time, are given, and the issue of overfitting in complex models is discussed.

📉 Regression Analysis and Its Applications

This paragraph focuses on linear regression as a method for predicting continuous outcomes, explaining the process of minimizing the sum of squared errors to find the best-fit line. The speaker clarifies that linear regression can be applied to higher-order functions by transforming input variables, thus solving complex problems. The paragraph concludes by discussing the wide range of applications for regression, including time series prediction, classification, data reduction, trend analysis, and risk factor analysis.

Keywords

💡Supervised Learning

Supervised learning is a type of machine learning where the algorithm is trained on labeled data, meaning that the input data comes with the correct output or result. In the context of the video, supervised learning is used to build a classifier to predict whether a customer will buy a computer or not based on attributes like age and income. The process involves creating a function that maps inputs to the correct output labels.

💡Classifier

A classifier is a function or a model in machine learning used to classify input data into predefined categories or classes. In the video, the goal is to create a classifier that can predict a binary outcome—whether a customer will buy a computer (yes/no). The classifier is trained using labeled data and then used to make predictions on new, unseen data.
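
As a toy illustration, here is a minimal sketch of such a classifier in Python; the slanting-line rule and the specific numbers are hypothetical, chosen only to echo the lecture's "the older you are, the higher the income threshold" idea:

```python
# Hypothetical rule-based classifier in the spirit of the lecture's
# slanting-line example. The thresholds are made up for illustration.

def will_buy_computer(age: float, income: float) -> str:
    """Predict 'buy' or 'not buy' from age and income."""
    # The income threshold grows with age: older customers need a
    # higher income before they are predicted to buy.
    threshold = 30_000 + 1_000 * age
    return "buy" if income > threshold else "not buy"

print(will_buy_computer(age=25, income=80_000))  # buy
print(will_buy_computer(age=45, income=60_000))  # not buy
```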

💡Attributes

In the context of machine learning, attributes refer to the features or characteristics of the data that are used to describe the instances within the dataset. In the video, age and income are the attributes used to describe customers in the database, which the classifier will use to make predictions about computer purchases.

💡Geometric Interpretation

The geometric interpretation in machine learning refers to visualizing data points in a multi-dimensional space where each dimension represents an attribute. The video discusses using this interpretation to define functions, such as lines or curves, that can separate different classes of data points, like distinguishing between customers who will and will not buy a computer.

💡Inductive Bias

Inductive bias is the set of assumptions that are built into a learning algorithm to help it make generalizations from the training data to new, unseen data. The video mentions two types of inductive biases: language bias, which refers to the type of model (e.g., linear, polynomial), and search bias, which refers to the strategy used to find the best model. These biases are crucial for generalizing from a few training points to the entire input space.

💡Training Set

A training set is a subset of the data used to train a machine learning model. It consists of input-output pairs that the model learns from. In the video, the training set is used to teach the classifier how to predict whether a customer will buy a computer based on their age and income.

💡Validation

Validation is the process of evaluating a machine learning model's performance on a separate set of data that was not used during the training phase. The video describes using a test set or validation set to assess the classifier's accuracy and make adjustments if necessary to improve its performance.
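
A minimal sketch of this train/validate workflow on synthetic data (the threshold "training algorithm" and all numbers here are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy labeled data: rows are (income, age) pairs, labels are +1 / -1.
X = rng.random((100, 2))
y = np.where(X[:, 0] > 0.5, 1, -1)  # hypothetical ground-truth rule

# Hold out 20% as a validation set the training step never sees.
idx = rng.permutation(len(X))
train_idx, val_idx = idx[:80], idx[80:]
X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]

# "Train" a trivial threshold classifier on the training set only.
threshold = X_train[y_train == 1, 0].min()
y_pred = np.where(X_val[:, 0] >= threshold, 1, -1)

# Validation: compare predictions against the held-out labels.
accuracy = (y_pred == y_val).mean()
print(f"validation accuracy: {accuracy:.2f}")
```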

💡Overfitting

Overfitting occurs when a model is too complex and learns the training data too well, including its noise and outliers, which can negatively impact its performance on new, unseen data. The video warns against overfitting, illustrating it with examples of overly complex curves that fit the training data but do not generalize well.
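
A small sketch of the effect on synthetic data (the degrees, noise level, and grids are arbitrary choices): a degree-9 polynomial fit to noisy linear data usually does worse than a straight line at points it never saw:

```python
import numpy as np

rng = np.random.default_rng(1)

# Noisy samples from an underlying linear trend.
x = np.linspace(0, 1, 12)
y = 2.0 * x + 0.5 + rng.normal(scale=0.2, size=x.size)

for degree in (1, 9):
    coeffs = np.polyfit(x, y, degree)   # fit polynomial of given degree
    x_new = np.linspace(0, 1, 5)        # points not on the training grid
    y_true = 2.0 * x_new + 0.5          # noiseless underlying trend
    y_hat = np.polyval(coeffs, x_new)
    err = np.mean((y_hat - y_true) ** 2)
    print(f"degree {degree}: mean squared error off the training data = {err:.4f}")
```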

💡Linear Regression

Linear regression is a statistical method for modeling the relationship between a dependent variable and one or more independent variables by fitting a linear equation to the observed data. In the video, linear regression is discussed in the context of predicting continuous values, such as temperature at different times of the day, by finding the best-fit line that minimizes the sum of the squares of the errors.
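
A minimal least-squares fit in the spirit of the video's temperature example; the readings below are made up for illustration:

```python
import numpy as np

# Hypothetical time-of-day / temperature readings (hours, degrees C).
hours = np.array([0.0, 4.0, 8.0, 12.0, 16.0, 20.0])
temps = np.array([22.0, 20.0, 25.0, 31.0, 29.0, 24.0])

# Design matrix with a bias column, so the model is temp = a*hour + b.
A = np.column_stack([hours, np.ones_like(hours)])

# Least squares: minimizes the sum of squared prediction errors.
(a, b), *_ = np.linalg.lstsq(A, temps, rcond=None)
print(f"fit: temp ~ {a:.2f} * hour + {b:.2f}")
```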

💡Regression

Regression, in the context of the video, refers to a broader set of techniques used for predicting continuous outcomes based on input variables. It includes linear regression but can also involve more complex models. The video uses regression to illustrate how to predict values like temperature and how it can be adapted for classification by using the sign of the predicted value.

💡Feature Transformation

Feature transformation involves changing the original features or attributes of the data into new forms that may be more suitable for the model to learn from. In the video, the concept is illustrated by squaring the input variables or combining them in products, which allows linear regression to fit more complex relationships within the data.
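
A sketch of this trick using the five transformed features named in the video (the generating function and noise are invented for the demo):

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data from a quadratic surface: y = x1^2 + 2*x1*x2 - x2 + noise.
X = rng.uniform(-1, 1, size=(200, 2))
x1, x2 = X[:, 0], X[:, 1]
y = x1**2 + 2 * x1 * x2 - x2 + rng.normal(scale=0.05, size=200)

# Transform the 2-D input into the 5-D feature space from the lecture:
# [x1^2, x2^2, x1*x2, x1, x2]. The model stays linear in these features.
Phi = np.column_stack([x1**2, x2**2, x1 * x2, x1, x2])

coeffs, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print("recovered coefficients:", np.round(coeffs, 2))
# Expected roughly [1, 0, 2, 0, -1], matching the generating function.
```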

Highlights

Introduction to supervised learning and the concept of using experience with labeled data to predict outcomes such as whether a customer will buy a computer.

Description of the customer database using two attributes, age and income, for the purpose of classification.

The goal of creating a function or mapping to predict binary outcomes (yes/no) based on input attributes.

Geometric interpretation of data points and the natural approach of defining functions through lines or curves in the input space.

Simple function definition based on income threshold to classify computer buyers, ignoring the age variable initially.

Improvement of the classification function by considering both age and income, leading to a more complex but accurate model.

The trade-off between model complexity and performance, with the example of a quadratic function providing better accuracy but at a higher complexity.

The issue of overfitting when a model becomes too complex and tries to predict noise in the training data.

The concept of inductive biases in machine learning, including language bias and search bias, which influence the type of model and the search process for the best model.

The process of training a model with a training set and evaluating it with a test or validation set to ensure generalization.

The iterative process of adjusting the model based on error comparison between predicted and actual outputs.

Applications of supervised learning in various fields such as fraud detection, sentiment analysis, churn prediction, and medical diagnosis.

The difference between supervised learning for classification and regression, with regression predicting continuous values like temperature.

The challenge of avoiding overfitting in regression by finding the right balance between model complexity and prediction error.

The versatility of linear regression in fitting higher-order functions by transforming input variables, thus solving complex problems.

Practical applications of regression in time series prediction, classification using regression curves, data reduction, trend analysis, and risk factor analysis.

Transcripts

00:13

So in this module we will look at supervised learning. If you remember, in supervised learning we talked about experience, where you have some kind of a description of the data. In this case, let us assume that I have a customer database, and I am describing it by two attributes: age and income. For each customer that comes to my shop, I know the age and the income level of the customer. And my goal is to predict whether the customer will buy a computer or not. So I have this kind of labeled data given to me for building a classifier; remember, we talked about classification, where the output is a discrete value, in this case yes or no: yes, the person will buy a computer; no, the person will not buy a computer. And the way I describe the input is through a set of attributes, in this case age and income, the attributes that describe the customer.

01:22

So now the goal is to come up with a function, a mapping, that takes the age and income as the input and gives you an output that says whether the person will buy the computer or not. There are many different ways in which you can create this function.

01:43

And given that we are actually looking at a geometric interpretation of the data, viewing data as points in space, one of the most natural ways of defining this function is by drawing lines or curves in the input space. Here is one possible example: I have drawn a line, and everything to the left of the line, where the points are red, would be classified as will not buy a computer; everything to the right of the line, where the data points are predominantly blue, will be classified as will buy a computer.

02:26

So how would the function look? Remember that the x-axis is income and the y-axis is age. It basically says that if the income of the person is less than some value X, the person will not buy a computer; if the income is greater than X, the person will buy a computer. That is the kind of simple function we will define. Just notice that this way we completely ignore one of the variables, the age; we are just going by income. So is this a good rule? More or less: we get most of the points correct, except a few. So it looks like we can survive with this rule; it is not too bad, but you can do slightly better.

03:28

All right, so now those two red points that were on the wrong side of the line earlier seem to be on the right side. Everything to the left of this line will not buy a computer; everyone to the right will buy a computer.

03:50

If you think about what has happened here, we have improved our performance measure, but at the cost of something. What is the cost? Earlier we were only paying attention to the income, but now we have to pay attention to the age as well. The older you are, the higher the income threshold at which you will buy a computer; the younger you are (younger means lower on the y-axis), the lower the income threshold at which you will buy a computer. So the older you are, the further the income threshold is shifted to the right: you need to have a higher income before you buy a computer. The younger you are, the lower your income threshold, so you do not mind buying a computer even if your income is slightly lower. So now we have to start paying attention to the age, but the advantage is that you get much better performance.

04:51

Can you do better than this? Yes. Now almost everything is correct except that one pesky red point, but everything else is correct. And what has happened here is that we get much better performance, but at the cost of having a more complex classifier.

05:11

If you think about it in geometric terms, first you had a line that was parallel to the y-axis, so I just needed to define an intercept on the x-axis: if X was less than some value it was one class, greater than that value it was the other class. The second function was a slanting line, so I needed to define both the intercept and the slope. And now it is a quadratic, so I have to define three parameters, something like ax² + bx + c; I have to define a, b, c, the three parameters, in order to fit the quadratic, and I am getting better performance.

05:52

So can you do better than this? Okay, that somehow does not seem right; it seems to be too complex a function just to be getting that one point. And I am not even sure how many parameters you need for drawing that curve; PowerPoint uses some kind of spline interpolation to draw it, and I am pretty sure it has a lot more parameters than it is worth.

06:30

Another thing to note here is that that particular red point is actually surrounded by a sea of blue. So it is quite likely that there was some glitch there: either the person actually bought a computer and we have not recorded it, or for some extraneous reason the person came into the shop sure that he was going to buy a computer, but then got a phone call about some emergency and left without buying one. There could be a variety of reasons for why that noise occurred, and the simpler curve would probably be the more appropriate classifier.

07:01

So these are the kinds of issues I would like to think about: the complexity of the classifier I would like to have versus the accuracy of the classifier, how good the classifier is at actually recovering the right input-output map, and whether there is noise in the data, in the experience that I am getting. Is it clean, or is there noise in it, and if so, how do I handle that noise? These are the kinds of issues that we have to look at.

07:30

These kinds of lines that we drew are hiding one assumption that we are making. The data that comes to me comes as discrete points in the space, and from these discrete points I need to generalize and be able to say something about the entire state space: wherever a data point falls on the x and y axes, I should be able to give it a label. If I do not have some kind of assumption about these lines, the only thing I can do is this: if the same customer comes again, or somebody with the exact same age and income comes again, I can tell you whether that person is going to buy a computer or not, but I will not be able to tell you anything outside of the experience.

08:31

So the assumption we made is that everything to the left of the line does one thing and everything to the right does the other: everyone to the left of the line will not buy the computer, everyone to the right will buy a computer. The assumption was that the lines or curves are able to segregate people who will buy from people who will not buy. That is the kind of assumption I made about the distribution of the input data and the class labels.

09:07

These kinds of assumptions that we make about these lines are known as inductive biases. In general, inductive bias comes in two different categories. One is called language bias, which is essentially the type of lines I am going to draw: am I going to draw straight lines or curves, what order of polynomials am I going to look at, and so on. Search bias is the other form of inductive bias; it tells me in what order I am going to examine all these possible lines.

09:40

Putting these things together, we are able to generalize from a few training points to the entire space of inputs. I will make this more formal as we go on, in the next set of modules.

10:00

So here is one way of looking at the whole process. I am going to give you a set of data which we will call the training set. The training set will consist of inputs, which we will call X, and outputs, which we will call Y: I will have X1, X2, X3, X4 and likewise Y1, Y2, Y3, Y4, and this data is fed into a training algorithm.

10:40

In our case the data will look like this. Remember, the X's are the inputs, so each should have the income and the age: x1 is like (30,000, 25) and x2 is like (80,000, 45), and so on. The y's, the labels, correspond to the colors in the previous picture: y1 does not buy a computer, y2 buys a computer, and so on; this essentially gives me the color coding, so y1 is red and y2 is blue. And if I am going to use something numeric, which is what we will be doing later on, I really cannot use these values: first of all, the Y's are not numeric, and the X's vary too much.

11:22

The first coordinate of X is like 30,000 or 80,000, while the second coordinate is like 25 or 45, a lot smaller in magnitude, so this will lead to some kind of numerical instabilities. What we will typically end up doing is normalizing these so that they fall in approximately the same range; you can see that I have tried to normalize these X values between 0 and 1.

11:50

I have chosen an income level of, say, 2 lakhs as the maximum and an age of 100, and you can see the normalized values. Likewise, for buys and does not buy, I have taken 'does not buy' as -1 and 'buys a computer' as +1. These are arbitrary choices for now, but later on you will see that there are specific reasons for wanting to choose this encoding. Then the training algorithm chugs over this data and produces a classifier. Now, I do not know whether this classifier is good or bad; we had a straight line in the first case, an axis-parallel line, and we did not know whether it was good or bad, so we need some mechanism by which we evaluate it.
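
A quick sketch of the normalization and encoding just described, using the lecture's example numbers (dividing by a fixed maximum is one simple choice of scaling):

```python
import numpy as np

# Raw training data as in the lecture: (income, age) pairs and labels.
X = np.array([[30_000, 25],
              [80_000, 45]], dtype=float)
labels = ["no", "yes"]

# Normalize each attribute into [0, 1] using the lecture's chosen maxima:
# income capped at 2 lakhs (200,000), age capped at 100.
maxima = np.array([200_000, 100], dtype=float)
X_norm = X / maxima
print(X_norm)  # [[0.15 0.25], [0.4 0.45]]

# Encode the labels numerically: 'does not buy' -> -1, 'buys' -> +1.
y = np.array([-1 if lab == "no" else +1 for lab in labels])
print(y)       # [-1  1]
```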

12:34

So how do we do the evaluation? Typically you have what is called a test set or a validation set: another set of x and y pairs, like we had in the training set. In the test set we know what the labels are; we just do not show them to the training algorithm. We know the labels because we need the correct labels to evaluate whether the training algorithm is doing well or badly.

13:03

The process by which this evaluation happens is called validation. At the end of the validation, if you are happy with the quality of the classifier, you can keep it. If you are not happy, you go back to the training algorithm and say: I am not happy with what you produced, give me something different. So we either iterate over the algorithm again, going over the data to refine the parameter estimates, or we could even change some parameter values and redo the training all over again. But this is the general process, and we will see that many of the different algorithms we look at during the course of these lectures actually follow this kind of process.

13:44

So what happens inside that green box? Inside the training algorithm there is a learning agent which takes an input and produces an output Ŷ, which it thinks is the correct output. It then compares this against the actual target Y given in the training data (in training you actually have a target Y), figures out what the error is, and uses the error to change the agent, so that it can produce the right output next time around.

14:17

This is essentially an iterative process: you see the input, produce an output Ŷ, take the target Y, compare it to Ŷ, figure out the error, and use the error to change the agent again. This is by and large the way most learning algorithms operate, most classification algorithms and even regression algorithms, and we will see how each of these works as we go on.
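
One classic concrete instance of this predict-compare-adjust loop is the perceptron update. The sketch below (synthetic data, arbitrary pass count) illustrates the general scheme; it is not the specific algorithm used in the lecture:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy linearly separable data with +1 / -1 labels.
X = rng.random((100, 2))
y = np.where(X @ np.array([2.0, 1.0]) > 1.5, 1, -1)

# Perceptron-style agent: produce an output, compare with the target,
# and use the error to adjust the weights.
w = np.zeros(2)
b = 0.0
for _ in range(20):                            # a few passes over the data
    for x_i, y_i in zip(X, y):
        y_hat = 1 if x_i @ w + b > 0 else -1   # produce an output
        if y_hat != y_i:                       # compare with the target
            w += y_i * x_i                     # use the error to change
            b += y_i                           # the agent

accuracy = np.mean(np.where(X @ w + b > 0, 1, -1) == y)
print(f"training accuracy after updates: {accuracy:.2f}")
```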

14:46

There are many, many applications, too numerous to list; here are a few examples. You could look at fraud detection: we have data where the input is a set of transactions made by a user, and you flag each transaction as a valid transaction or not. You could look at sentiment analysis, variously called opinion mining, buzz analysis, and so on, where I give you a piece of text, a review written about a product or a movie, and you tell me whether the review is positive or negative, and what negative points people are mentioning; this is again a classification task.

15:30

Or you could use it for churn prediction, where you say whether a customer who is in the system is likely to leave or will continue using your product or service for a longer period of time. When a person leaves your service you call that person a churner, and you can label whether a person is a churner or not. And I have been giving you examples from medical diagnosis all through; apart from actually diagnosing whether a person has a disease or not, you could also use it for risk analysis in a slightly indirect way. I will talk about that when we do the algorithms for classification.

16:05

So we talked about how we are interested in learning different lines or curves that can separate different classes in supervised learning. These curves can be represented using different structures, and throughout the course we will look at different kinds of learning mechanisms: artificial neural networks, support vector machines, decision trees, nearest neighbors, and Bayesian networks. These are some of the popular ones, and we will look at them in more detail as the course progresses.

16:47

Another supervised learning problem is prediction, or regression, where the output you are going to predict is no longer a discrete value. It is not 'will buy a computer' or 'will not buy a computer'; it is more of a continuous value. Here is an example: at different times of day you have recorded the temperature, so the input to the system is the time of day, and the output is the temperature measured at that time. Your experience, your training data, is going to take this form: the blue points would be your inputs and the red points the outputs you are expected to predict. Note that the outputs are continuous, real-valued.

17:42

You could think of this toy example as points to the left being day and points to the right being night. And just as in the previous case of classification, we could try the simplest possible fit, which here would be to draw a straight line that is as close as possible to these points. You do see that, as in the classification case, when we choose a simple solution there are certain points at which we are making large errors, so we could try to fix that.

18:05

We could try to do something more fancy. But you can see that while the daytime temperatures are more or less fine, with the night-time ones we seem to be doing something really off, because we are going off too much on the right-hand side. Or you could do something more complex, just as in the classification case where we wanted to get that one point right: we could try to fit all the temperatures given to us with a sufficiently complex curve.

18:37

And again, as we discussed earlier, this is probably not the right answer; in this case you are probably, perhaps surprisingly, better off fitting the straight line. These kinds of solutions, where we try to fit the noise in the data, where we try to make the solution predict the noise in the training data correctly, are known as overfit solutions, and one of the things we look to avoid in machine learning is overfitting to the training data. We will talk about this again in due course.

19:18

So what we typically do is what is called linear regression; some of you might have come across this under different circumstances. The typical aim in linear regression is to look at the error that your line is making.

19:48

Take an example point somewhere here: this is the actual training datum given to you, and this is the prediction that your line makes at that point, so the gap between them is essentially the prediction error the line is making there. What you do is try to find the line that has the least prediction error: you take the squares of the errors your predictions are making and try to minimize the sum of the squares of the errors. Why do we take the squares? Because errors can be both positive and negative, and we want to make sure we are minimizing them regardless of the sign of the error.

20:27

With sufficient data, linear regression is simple enough that you can solve it using matrix inversions, as we will see later. With many dimensions, though, the challenge is to avoid overfitting, as we talked about earlier, and there are many ways of avoiding it.

20:49

I will talk about this in detail when we look at linear regression. One point I want to make is that linear regression is not as simple as it sounds. Here is an example: I have two input variables x1 and x2, and if I try to fit a straight line with x1 and x2, I will end up with something like a1x1 + a2x2, which looks like a plane in two dimensions.
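
The matrix-inversion solution alluded to here is the normal equations; a quick sketch on synthetic data (the plane and its coefficients are made up):

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic data from the plane y = 3*x1 - 2*x2 + 1 + noise.
X = rng.random((50, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + 1 + rng.normal(scale=0.01, size=50)

# Add a bias column, then solve the normal equations (A^T A) w = A^T y.
A = np.column_stack([X, np.ones(50)])
w = np.linalg.solve(A.T @ A, A.T @ y)   # the "matrix inversion" solution
print(np.round(w, 2))                   # roughly [ 3. -2.  1.]
```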

21:13

But if I take these two dimensions and transform the input, so that instead of just x1 and x2 my input looks like x1², x2², x1x2, and then x1 and x2 as they were in the beginning, then instead of a two-dimensional input I am looking at a five-dimensional input, and I am going to fit a line, a linear plane, in this five-dimensional input. That will be a1x1² + a2x2² + a3x1x2 + a4x1 + a5x2. Now that is no longer the equation of a line in two dimensions; it is the equation of a second-order polynomial in two dimensions. But I can still think of this as doing linear regression, because I am only fitting a function that is linear in the (transformed) input variables.

22:16

So by choosing an appropriate transformation of the inputs, I can fit any higher-order function, which means I can solve very complex problems using linear regression; it is not really as weak a method as you might think at first glance. Again, we will look at this in slightly more detail in the later lectures.

22:38

Regression, or prediction, can be applied in a variety of places. One popular application is time series prediction: you could think about predicting rainfall in a certain region, or how much you are going to spend on your telephone calls. You could even do classification using this: remember our encoding of +1 and -1 for the class labels. You could treat +1 and -1 as the outputs, fit a regression line or curve to them, and if the output is greater than 0 you say the class is +1; if the output is less than 0 you say the class is -1. So you can use the regression machinery to solve the classification problem.

23:13

You could also do data reduction. I do not really want to give you all the millions of data points in my data set, so what I would do is essentially fit a curve to them and give you just the coefficients of the curve.
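
A sketch of classification via regression as just described: fit least squares to the ±1 targets on synthetic data, then threshold the output at zero:

```python
import numpy as np

rng = np.random.default_rng(5)

# Labeled data with the lecture's +1 / -1 encoding.
X = rng.random((100, 2))
y = np.where(X[:, 0] + X[:, 1] > 1.0, 1.0, -1.0)

# Fit a regression line to the +/-1 targets, then classify by sign.
A = np.column_stack([X, np.ones(100)])
w, *_ = np.linalg.lstsq(A, y, rcond=None)

scores = A @ w
predictions = np.where(scores > 0, 1, -1)   # output > 0 -> class +1
print(f"accuracy: {(predictions == y).mean():.2f}")
```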

23:34

And more often than not, that is sufficient for us to get a sense of the data, which brings us to the next application I have listed there, trend analysis. Quite often I am not interested in the actual values of the data but more in the trends. For example, suppose I have a solution whose running times I am trying to measure. I am not really interested in the actual running time, because whether it is 37 seconds or 38 seconds is not going to tell me much.

24:05

But I would really like to know whether the running time scales linearly or exponentially with the size of the input. Those kinds of analyses can again be done using regression. And the last one here is risk factor analysis, like we had in classification: you can look at which factors contribute most to the output. That brings us to the end of this module on supervised learning.
