Probabilistic view of linear regression

IIT Madras - B.S. Degree Programme
23 Sept 2022 · 15:13

Summary

TL;DR: This video explores the probabilistic perspective of linear regression, treating it as an estimation problem. It explains how the labels are generated through a probabilistic model involving feature vectors, weights, and Gaussian noise. The video then applies the maximum likelihood approach to estimate the weights, which leads to the same solution as linear regression with squared error. It emphasizes the equivalence between choosing a noise distribution and choosing an error function, highlighting how the Gaussian noise assumption justifies the squared error commonly used in regression analysis.
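
As a concrete illustration of this claim, here is a minimal simulation sketch (our own example, assuming NumPy; the variable names and data sizes are illustrative, not from the video) that generates labels according to the assumed model y = w^T x + epsilon and recovers the weights with the least-squares / pseudo-inverse formula:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, sigma = 200, 3, 0.5

    w_true = rng.normal(size=d)              # unknown but fixed parameter w
    X = rng.normal(size=(n, d))              # feature vectors, one data point per row
    eps = rng.normal(0.0, sigma, size=n)     # Gaussian noise with mean 0, variance sigma^2
    y = X @ w_true + eps                     # labels generated by the assumed model

    # Maximum likelihood / least-squares estimate: minimizes sum_i (w^T x_i - y_i)^2
    w_ml = np.linalg.pinv(X.T @ X) @ X.T @ y
    print(w_true)
    print(w_ml)   # close to w_true when n is large relative to the noise level

Here the data points are stacked as rows of X, so the closed form reads (X^T X)^+ X^T y; the lecture writes the same estimator as (X X^T)^+ X y, with data points as columns.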

Takeaways

  • The script introduces a probabilistic perspective on linear regression, suggesting that labels are generated through a probabilistic model involving data points and noise.
  • It explains that in linear regression we are not modeling the generation of the features themselves, only the probabilistic relationship between features and labels.
  • The model assumes that each label y_i is generated as w^T x_i + epsilon, where epsilon is noise and w is an unknown but fixed parameter vector.
  • The noise epsilon is assumed to follow a Gaussian distribution with mean 0 and a known variance sigma^2.
  • The script connects maximum likelihood estimation to linear regression, showing that the maximum likelihood approach leads to the same solution as minimizing squared error.
  • The likelihood function is constructed under the assumption of independent and identically distributed (i.i.d.) data points, each label following a Gaussian distribution centered at the model's prediction.
  • The log-likelihood is used for simplification, turning the product of probabilities into a sum, which is easier to maximize.
  • By maximizing the log-likelihood, the script demonstrates that the solution for w is equivalent to the solution obtained from traditional linear regression with squared error loss.
  • The conclusion emphasizes that the maximum likelihood estimator under zero-mean Gaussian noise is the same as the solution of linear regression with squared error, highlighting the importance of the noise assumption.
  • The script points out that the choice of error function in regression is implicitly tied to the assumed noise distribution, and different noise assumptions lead to different loss functions.
  • Viewing linear regression from a probabilistic viewpoint allows the theory of statistical estimators to be applied to analyze the properties of the estimator, such as the maximum likelihood estimator w hat ML.
  • The script suggests that by adopting a probabilistic approach, we gain insight into the properties of the estimator and into the connection between noise statistics and the choice of loss function in regression.

Q & A

  • What is the main focus of the script provided?

    -The script focuses on explaining the probabilistic view of linear regression, where the relationship between features and labels is modeled probabilistically with the assumption of a Gaussian noise model.

  • What is the probabilistic model assumed for the labels in this context?

    -The labels are assumed to be generated as the result of the dot product of the weight vector 'w' and the feature vector 'x', plus some Gaussian noise 'Īµ' with a mean of 0 and known variance ĻƒĀ².

  • Why is the probabilistic model not concerned with how the features themselves are generated?

    -The model is only concerned with the relationship between features and labels, not the generation of the features themselves, as it is focused on estimating the unknown parameters 'w' that affect the labels.

  • What is the significance of the noise term 'Īµ' in the model?

    -The noise term 'Īµ' represents the random error or deviation in the label 'y' from the true linear relationship 'w transpose x', and it is assumed to follow a Gaussian distribution with 0 mean and variance ĻƒĀ².

  • How does the assumption of Gaussian noise relate to the choice of squared error in linear regression?

    -The assumption of zero-mean Gaussian noise justifies the use of squared error as the loss function in linear regression: under that assumption, the likelihood of the observed labels given the features is maximized by exactly the 'w' that minimizes the sum of squared errors.

  • What is the maximum likelihood approach used for in this script?

    -The maximum likelihood approach is used to estimate the parameter 'w' by finding the value that maximizes the likelihood of observing the given dataset, under the assumption of the probabilistic model described.

  • What is the form of the likelihood function used in the script?

    -The likelihood function is the product of the Gaussian probability densities for each data point, with each density having a mean of 'w transpose xi' and variance ĻƒĀ².

  • Why is the log likelihood used instead of the likelihood function itself?

    -The log likelihood is used because it simplifies the optimization process by converting the product of likelihoods into a sum, which is easier to maximize.

  • What is the solution to the maximization of the log likelihood?

    -The solution is the 'w' that minimizes the sum of squared differences between the predicted labels 'w transpose xi' and the actual labels 'yi', which is the same as the solution to the linear regression problem with squared error (a worked version of this derivation is given just after this Q & A list).

  • How does the script connect the choice of noise distribution to the loss function used in regression?

    -The script explains that the choice of noise distribution (e.g., Gaussian) implicitly defines the loss function (e.g., squared error), and vice versa, because the loss function reflects the assumed statistical properties of the noise.

  • What additional insights does the probabilistic viewpoint provide beyond the traditional linear regression approach?

    -The probabilistic viewpoint lets us treat 'w hat ML' as a statistical estimator and study its properties, such as its bias, variance, and robustness under the assumed noise model, which can lead to a deeper understanding of the regression problem.
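
For reference, here is a worked version of the likelihood argument summarized in the Q & A above (a standard derivation sketch in the video's notation, with the constant terms written out explicitly):

    L(w; x_1, \dots, x_n, y_1, \dots, y_n)
      = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}
        \exp\!\left( -\frac{(w^\top x_i - y_i)^2}{2\sigma^2} \right)

    \log L(w)
      = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (w^\top x_i - y_i)^2
        + n \log \frac{1}{\sqrt{2\pi}\,\sigma}

    \hat{w}_{ML}
      = \arg\max_w \log L(w)
      = \arg\min_w \sum_{i=1}^{n} (w^\top x_i - y_i)^2

Since sigma is assumed known, the second term of the log-likelihood is a constant in w, which is why maximizing the log-likelihood reduces to minimizing the sum of squared errors.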

Outlines

00:00

Probabilistic View of Linear Regression

This paragraph introduces the concept of viewing linear regression through a probabilistic lens, where labels are generated by a probabilistic model rather than a deterministic one. It explains that the model assumes labels are produced as the result of a linear combination of features plus some noise (epsilon), which is assumed to be Gaussian with a mean of zero and a known variance (sigma squared). The focus is on estimating the unknown parameter 'w', which represents the relationship between features and labels. The paragraph also outlines the maximum likelihood approach as a method to estimate 'w', setting the stage for further exploration into the probabilistic underpinnings of linear regression.
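
In symbols, the generative assumption described in this paragraph can be written as follows (standard notation; the conditional form is implicit in the video rather than stated verbatim):

    y_i = w^\top x_i + \varepsilon_i,
    \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2),
    \qquad \text{equivalently} \qquad
    y_i \mid x_i \sim \mathcal{N}(w^\top x_i, \sigma^2)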

05:06

šŸ” Maximum Likelihood Estimation in Linear Regression

The second paragraph delves into the specifics of using the maximum likelihood estimation (MLE) to solve the linear regression problem. It discusses the likelihood function, which is based on the assumption of independent and identically distributed (i.i.d.) Gaussian noise with zero mean and a known variance. The paragraph explains how the likelihood function is constructed and how taking the logarithm of the likelihood simplifies the maximization process. It also highlights that maximizing the likelihood is equivalent to minimizing the sum of squared differences between the predicted and actual labels, a problem already familiar from standard linear regression. The solution to this minimization problem is identified as the MLE estimator, which is the same as the solution obtained from the least squares method in linear regression.
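
A small numerical check of this equivalence (a sketch assuming NumPy and SciPy; the example data and names are our own): maximizing the Gaussian log-likelihood with a generic optimizer lands on the same weights as the closed-form least-squares solution.

    import numpy as np
    from scipy.optimize import minimize

    rng = np.random.default_rng(1)
    n, d, sigma = 300, 4, 0.3
    w_true = rng.normal(size=d)
    X = rng.normal(size=(n, d))
    y = X @ w_true + rng.normal(0.0, sigma, size=n)

    def neg_log_likelihood(w):
        # Negative Gaussian log-likelihood in w; the additive constant
        # n*log(sqrt(2*pi)*sigma) is dropped because it does not depend on w.
        return np.sum((X @ w - y) ** 2) / (2 * sigma ** 2)

    w_mle = minimize(neg_log_likelihood, x0=np.zeros(d)).x   # generic optimizer
    w_lsq = np.linalg.pinv(X.T @ X) @ X.T @ y                # closed-form least squares
    print(np.allclose(w_mle, w_lsq, atol=1e-4))              # True: same estimator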

10:11

Equivalence of MLE and Squared Error in Linear Regression

The final paragraph emphasizes the equivalence between the maximum likelihood estimator assuming zero-mean Gaussian noise and the solution to the linear regression problem with squared error. It points out that choosing squared error as the loss function implicitly assumes Gaussian noise in the model. The paragraph also discusses the implications of this connection, noting that if the noise were not Gaussian, the solutions to the MLE and squared error problems would no longer be equivalent, and a different loss function would be appropriate. Additionally, it suggests that viewing the problem probabilistically allows for a deeper understanding of the estimator's properties and introduces the idea that further exploration into the properties of the MLE estimator will be conducted.
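
To make the noise-loss correspondence concrete, here is the Laplacian case sketched in symbols (b denotes the scale parameter of the Laplace density; the video states the conclusion but does not write this out):

    p(\varepsilon) = \frac{1}{2b} \exp\!\left( -\frac{|\varepsilon|}{b} \right)
    \;\Longrightarrow\;
    \hat{w}_{ML}
      = \arg\max_w \sum_{i=1}^{n} \left( -\frac{|w^\top x_i - y_i|}{b} - \log 2b \right)
      = \arg\min_w \sum_{i=1}^{n} |w^\top x_i - y_i|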

Keywords

Probabilistic View

A probabilistic view in the context of this video refers to the approach of considering linear regression as a process influenced by probabilistic models. This perspective assumes that there is an underlying probability distribution that generates the observed data. The video discusses how this view can offer insights into the estimation problem in linear regression, where the labels are considered to be generated by a probabilistic mechanism involving the features and some noise.

Linear Regression

Linear regression is a statistical method for modeling the relationship between a dependent variable and one or more independent variables by fitting a linear equation to observed data. In the video, linear regression is re-examined from a probabilistic standpoint, where the labels (dependent variable) are assumed to be generated by a linear combination of features plus some random noise, which is a key aspect of the probabilistic model discussed.

Estimation Problem

An estimation problem in statistics involves finding estimates for unknown parameters based on observed data. In the video, the estimation problem specifically refers to estimating the parameters (weights) of a linear regression model. The script discusses how the probabilistic model of linear regression can be framed as an estimation problem in which the goal is to recover the unknown weight vector 'w' from labels that have been corrupted by noise.

Data Points

Data points are individual observations in a dataset, often represented as pairs of feature values and corresponding labels. The script mentions data points in the context of a dataset where each point is a combination of features (x) and a label (y), which is assumed to be generated by a linear relationship with added noise.

Labels

In the context of the video, labels refer to the dependent variable or the output values in a dataset that are associated with each data point. The script explains that labels are generated by the model as a linear combination of the features plus some noise, emphasizing that what we observe is the noise-corrupted value of 'w transpose x' rather than the noiseless linear response itself.

Noise

Noise in this video represents random fluctuations or errors in the observed data that are not explained by the model. It is a critical component of the probabilistic model for linear regression, where the script assumes that the noise is Gaussian with a mean of zero and a known variance, which affects the observed labels.

Gaussian Distribution

A Gaussian distribution, also known as a normal distribution, is a probability distribution that is characterized by a bell-shaped curve. In the script, the noise added to the labels in the linear regression model is assumed to follow a Gaussian distribution with a mean of zero and a known variance, which is a key assumption in the probabilistic model.

Maximum Likelihood

Maximum likelihood is a method used in statistics to estimate the parameters of a model. The script discusses using the maximum likelihood approach to estimate the weights of the linear regression model. By assuming a probabilistic model for the generation of labels, the likelihood function can be maximized to find the best-fitting parameters.

Log Likelihood

Log likelihood is the logarithm of the likelihood function and is often used in maximum likelihood estimation because it simplifies the calculation by converting products into sums. The video script explains that taking the log of the likelihood function makes it easier to work with when finding the parameters that maximize the likelihood of observing the given data.

Error Function

An error function in the context of regression analysis measures the difference between the predicted values and the actual values. The script discusses the squared error function, which is the sum of the squared differences between the predicted labels and the observed labels. This function is central to the video's discussion on how the choice of error function is related to the assumed noise distribution.

Pseudo Inverse

The pseudo inverse (Moore-Penrose inverse), often denoted A^+, is a generalization of the matrix inverse to matrices that are not square or not invertible. In the script, the pseudo inverse appears in the maximum likelihood estimator for the weights, w hat ML = (X X^T)^+ X y, where X is the matrix whose columns are the feature vectors and y is the vector of labels; with data points stacked as rows of X instead, the same estimator reads (X^T X)^+ X^T y.
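
A minimal sketch of this computation (assuming NumPy; the data here is synthetic and purely illustrative), comparing the pseudo-inverse form of the estimator with a library least-squares solver:

    import numpy as np

    rng = np.random.default_rng(2)
    X = rng.normal(size=(50, 3))                      # one data point per row
    y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)

    w_pinv = np.linalg.pinv(X.T @ X) @ X.T @ y        # pseudo-inverse form of the estimator
    w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # library least-squares solver
    print(np.allclose(w_pinv, w_lstsq))               # True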

Highlights

Introduction to a probabilistic view of linear regression, treating it as a probabilistic model for generating labels.

Assumption of a probabilistic mechanism that generates labels based on data points and noise.

Discussion on the probabilistic model where labels are generated as the sum of a feature's dot product with weights and noise.

Clarification that the model does not attempt to model the generation of features themselves.

Introduction of noise as a Gaussian distribution with a mean of zero and known variance.

Explanation of the dataset generation process involving an unknown but fixed parameter w.

The problem is framed as an estimation problem where the goal is to estimate the weights w.

Introduction of the maximum likelihood approach as a method to estimate the weights.

Formulation of the likelihood function for the linear regression model with Gaussian noise.

Log-likelihood is used for ease of computation in the maximization process.

The maximization of the log-likelihood leads to the minimization of the squared error, a familiar problem in linear regression.

Derivation of the maximum likelihood estimator for linear regression, which matches the solution from the squared error approach.

Equivalence of the maximum likelihood estimator with squared error in linear regression under the assumption of Gaussian noise.

Implication that the choice of squared error in linear regression implicitly assumes Gaussian noise.

Discussion on the impact of noise statistics on the choice of loss function in regression models.

Exploration of alternative noise distributions and their corresponding loss functions, such as Laplacian noise leading to absolute error.

Advantage of the probabilistic viewpoint for studying the properties of estimators in linear regression.

Transcripts

00:00

Now, what we are going to see is a probabilistic view of linear regression. What happens when you think of linear regression as if there is some probabilistic model that generates our labels? That is what we are going to look at. We have already looked at estimation in general in an unsupervised setting, where we have seen maximum likelihood, Bayesian methods and so on. But now we are going to think of our linear regression also as, in some sense, an estimation problem, which means that there should be some probabilistic mechanism that we are going to assume generates what we have seen.

01:01

So, what is it that we are going to assume? Well, in the linear regression problem, you have the data points in d dimensions, the labels are real numbers, and of course you have a dataset which I can write as (x1, y1), ..., (xn, yn). Now, the probabilistic model that we are going to assume is as follows: the label, given the data point, is generated as w transpose x plus some noise epsilon. What does this mean? It means that I am not trying to model how the features themselves are generated; I am just trying to model the relationship between the features and the labels in a probabilistic way. And what is the probabilistic mechanism that generates the labels if I give you x? Well, what we are going to posit or hypothesize is the following.

02:09

If I give you a feature, then there is an unknown but fixed w which is not known to us; this is the parameter of the problem. So, this is unknown but fixed, and it is in R^d, and whenever a feature is seen you do a w transpose x, but then your y is not exactly w transpose x. That is the structure part of the problem. Now, we are going to explicitly say there is also a noise part to the problem. We are adding some noise to this w transpose x, and that is this epsilon. This epsilon is noise, and we are going to assume that it has a Gaussian distribution with 0 mean and some known variance sigma squared.

03:06

So, now, what we are saying is that in our dataset every yi was generated according to this process: somebody gave us xi, and then to get the yi there is an unknown but fixed w using which w transpose xi was computed, and then a noise got added, and we are only seeing the noisy version of w transpose xi. We know the statistics of this noise, 0 mean and some known variance sigma squared; all that is known. The only thing that is unknown for us is w; we do not know w. Which means we can now view the whole thing as an estimation problem. What are we trying to estimate? Well, we are trying to estimate the w which, after adding noise, affects our labels. So, once we have put down a model as to how the data is generated, at least how y given x is generated, we have an unknown parameter.

04:16

Now, we already know some methods to come up with estimators, and the simplest method that we have already seen, the solution approach to this problem, as you must have already guessed, is just the maximum likelihood approach. So, now I want to understand the same problem, but in a maximum likelihood context, and see what comes out of it. Which means, for the standard maximum likelihood problem, I am going to write the likelihood; the likelihood function is going to look like this. Now, what is the parameter of interest? Well, the parameter of interest is w, but the likelihood function also depends on the data x1 to xn and y1 to yn, because these are the observed data points. We are observing these, and then we are treating the likelihood as if it is a function of w. Though it is also a function of the data points and the labels, we are going to treat it as a function of w, and then we will try to find that w which maximizes our likelihood of seeing this. But before that, what is this likelihood itself?

05:24

Now, as usual, the i.i.d. assumptions hold in the probabilistic model; that is, (x1, y1) is independently generated, so y1 is generated independently of y2 and so on, and they all come from the same Gaussian distribution. So, basically, this is going to be a product over i equals 1 to n. Now, what is the chance that I see a particular yi for a given xi? Well, we know that every yi given xi is generated according to w transpose xi, which is a fixed quantity, there is nothing random there, and then you add a random noise. But this noise is 0 mean noise with a certain variance. So, if I add a constant fixed quantity w transpose xi to the 0 mean Gaussian, just the mean gets shifted; the variance spread stays fixed, but the mean moves around by adding a constant. Let us say we have a 0 mean Gaussian and I add 5 to it: it becomes a Gaussian with mean 5, and the variance is still the same. It is exactly the same thing here. Now, I have added w transpose xi to this 0 mean Gaussian for the ith data point. So, that would be a Gaussian distribution with mean w transpose xi and variance sigma squared, which we are assuming is known.

06:43

So, the likelihood can be written using this density, which looks like e to the power of minus (w transpose xi minus yi) squared by 2 sigma squared, and of course with 1 by square root 2 pi sigma in front, though the constant does not really matter in our maximization, as we will see. Once we have put down this likelihood, I can now take the log likelihood, log L of w, with respect to the same arguments x1 to xn, y1 to yn. We take the logarithm because it is hard to deal with products and easier to deal with sums. So, this is a sum over i equals 1 to n; the log cancels the exponential, so this is minus (w transpose xi minus yi) squared by 2 sigma squared, plus the log of 1 by square root 2 pi sigma. Now, remember, we want to think of this as a function of w: x is a constant, sigma is a constant, y is a constant, so it is only a function of w.

07:58

And we want to see which w maximizes our likelihood, or log likelihood. Which means, equivalently, to get the best w I could have maximized over w the sum over i equals 1 to n, where I remove the constant scalings; these are known, sigma squared is assumed to be known, so these are constants and I do not care about them; I will just hold on to the other terms. So, this is just minus (w transpose xi minus yi) squared, summed over i. Now, this is equivalent to minimizing over w the sum over i equals 1 to n of (w transpose xi minus yi) squared. This minimization problem is something that we have already encountered: it is exactly the linear regression problem with squared error that we already put out, which means we know the solution to it.

09:07

So, basically, what is the solution to this? Well, the w hat ML as an estimator is exactly the same as our w star, which we already know is (X X transpose) pseudo inverse times X y, from our previous discussion about linear regression. Now, it is great that we started with a completely different way of looking at things, which is by thinking of a probabilistic mechanism for generating y given x, and then did the most natural thing, which is to look at a maximum likelihood approach, and out comes a solution which is exactly the same as the linear regression solution.

09:53

So, what is the conclusion? It merits separate writing here, because some interesting points can be made. The conclusion is that the maximum likelihood estimator, assuming (and this is the most important part) 0 mean Gaussian noise, is exactly the same as linear regression with (and again, this is the important part) squared error. We could have either solved the linear regression problem with squared error, or we could have treated this problem as a maximum likelihood problem with 0 mean Gaussian noise, and these two are exactly equivalent. That both are equivalent is an important thing to understand, because what exactly makes them equivalent is the choice of squared error. When we started by looking at linear regression, we did not really justify squared error that much; we just said that we would start with squared error because it looks like an intuitive thing to do. Now we are saying, well, more than just intuition, it has a very good probabilistic backing as well. So, we are saying that if we chose squared error to solve the linear regression problem, it is as if we are implicitly making the assumption that there is a 0 mean Gaussian noise that gets added to our labels and corrupts our labels.

11:57

These two points are both important to understand. So, if I had changed my noise, if there is reason to believe that my noise was not Gaussian, then it would no longer be the same as solving the linear regression problem with squared error. The noise statistics, the density that you are assuming for the noise, essentially impact the loss function, or the error function, that we are using in our linear regression. That is the connection that I want you to make here. So, for example, if I had assumed a different noise, like Laplacian noise or something like that (I am just giving you an example), then you would no longer end up with linear regression with squared error; these two would not be equivalent anymore; it would be equivalent to a different loss function. In fact, just for completeness' sake, if you use Laplacian noise, then it would be as if you are looking at the absolute difference between w transpose xi and yi and then summing over them, and that would be the problem that you are actually trying to solve. Because the Laplacian PDF has an absolute value sitting at the top of the exponential, e to the power of minus the absolute value of (w transpose xi minus yi), it falls off more sharply than the Gaussian distribution, but that is not the important point. The important point is that choosing a noise means implicitly choosing an error function, and vice versa: choosing an error function means implicitly choosing a noise. That is the first connection that we want to make, which is important; good, so we have made that connection.

13:27

The question is, is this the only thing that we gain, or are we gaining anything else by looking at this from a probabilistic viewpoint? Well, the answer is yes. This is an important conclusion that we are drawing: if you view your w as an estimator, then you can connect the noise and the loss. But what else have we gained by viewing this in a probabilistic way? Well, the most important thing that we have perhaps gained is that now we can study properties of the estimator, especially of w hat ML. This is an important gain that we have when we view learning as a probabilistic mechanism, because the moment you make it probabilistic, learning becomes estimation, and once you have an estimator, you can bring in all the machinery that we know about understanding estimators. We have already seen some understanding of what good estimators are and what perhaps are not. So, now, what we are going to see is: can we somehow use this notion of estimators, can we use some properties of these estimators or some other way of trying to do estimation, to understand this problem of linear regression better? That is what we will do next.
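
As a small follow-up to this closing remark about studying the properties of the estimator, here is a sketch (our own example, assuming NumPy) that repeats the label-generation process many times with fresh noise and looks at how the resulting estimates w hat ML scatter around the true w:

    import numpy as np

    rng = np.random.default_rng(3)
    n, d, sigma, trials = 100, 3, 1.0, 2000
    w_true = np.array([0.5, -1.0, 2.0])
    X = rng.normal(size=(n, d))                       # features held fixed across trials

    estimates = np.empty((trials, d))
    for t in range(trials):
        y = X @ w_true + rng.normal(0.0, sigma, size=n)    # fresh Gaussian noise each trial
        estimates[t] = np.linalg.pinv(X.T @ X) @ X.T @ y   # w hat ML for this dataset

    print(estimates.mean(axis=0))   # close to w_true, consistent with an unbiased estimator
    print(estimates.std(axis=0))    # spread shrinks as n grows or sigma shrinks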


Related Tags
Linear Regression, Probabilistic Model, Maximum Likelihood, Gaussian Noise, Estimation Problem, Squared Error, Data Analysis, Machine Learning, Statistical Methods, Parameter Estimation