Characteristics of Lasso regression
Summary
TLDR: The video script discusses the advantages of Ridge regression over Lasso, despite Lasso's tendency to produce sparse solutions. It highlights the lack of a closed-form solution for Lasso due to its non-differentiability at zero, necessitating iterative methods like subgradient descent. The script also touches on special techniques for solving Lasso, such as Iteratively Reweighted Least Squares (IRLS), and concludes with an overview of linear regression, including least squares, stochastic gradient descent, and various regularizers like Ridge and Lasso, emphasizing their applications in machine learning.
Takeaways
- The script discusses the comparison between Lasso and Ridge regression, highlighting the benefits and drawbacks of each method.
- Lasso does not have a closed-form solution due to its non-differentiability at zero, unlike Ridge regression which has a straightforward closed-form solution.
- To solve Lasso, subgradient methods are used because of the non-differentiability issue; these methods work even when the function is not differentiable at all points.
- The concept of subgradients is introduced as a way to approximate the direction of steepest descent for non-differentiable points in optimization problems.
- The script provides an intuitive explanation of subgradients, showing how they can be used to approximate the gradient in non-differentiable regions.
- The absolute value function's subgradients are demonstrated, explaining how they can take any value between -1 and 1 at zero, representing different slopes that lower bound the function.
- The definition of a subgradient is given, emphasizing its role in linearizing a function at a point and ensuring the function's value is always above this linearization.
- Subgradient descent is presented as a useful algorithm for minimizing convex functions, even when they are not differentiable, by moving in the direction of the negative subgradient.
- The relevance of subgradients to Lasso is highlighted, as Lasso is a convex optimization problem that can benefit from subgradient descent methods.
- The script concludes that while Lasso provides sparse solutions, it lacks a closed-form solution and requires optimization techniques like subgradient descent or specialized methods like Iteratively Reweighted Least Squares (IRLS).
- A summary of the course content on regression is provided, covering least squares, Ridge regression, Lasso, and the use of various regularizers in machine learning models.
Q & A
Why might someone choose Ridge regression over Lasso despite Lasso's ability to push some coefficients to zero?
-Ridge regression has a closed-form solution, which makes it computationally simpler and faster compared to Lasso, which requires iterative methods like subgradient descent due to its non-differentiability at zero.
What is a closed-form solution and why is it beneficial?
-A closed-form solution is an exact analytical expression that can be used to solve an equation or optimization problem. It is beneficial because it allows for direct computation of the solution without iterative methods, which can be more efficient and faster.
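For instance, Ridge's closed-form solution mentioned above can be computed directly. Below is a minimal NumPy sketch; the helper name ridge_closed_form and the toy data are hypothetical, not from the video:

```python
import numpy as np

def ridge_closed_form(X, y, lam):
    """Closed-form Ridge solution w = (X^T X + lam*I)^(-1) X^T y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Hypothetical toy data just to show the call.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)
w_ridge = ridge_closed_form(X, y, lam=0.1)
```

A single matrix solve like this is what Lasso lacks, which is why the iterative methods discussed below are needed.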
Why is Lasso not differentiable at zero and what are the implications for optimization?
-Lasso is not differentiable at zero due to the L1 penalty term, which is an absolute value function that is not smooth at zero. This non-differentiability means that traditional gradient-based optimization methods cannot be directly applied to Lasso, necessitating the use of subgradient methods.
What are subgradient methods and how do they differ from gradient methods?
-Subgradient methods are optimization techniques used for problems where the objective function is not differentiable at all points. They use subgradients, which are generalizations of gradients, to find a direction of descent even in non-differentiable regions. In contrast, gradient methods rely on the derivative of the function, which must exist at all points of interest.
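Concretely, in generic notation (not taken from the video), the subgradient method keeps the same update shape as gradient descent: instead of x_new = x_old - alpha * grad f(x_old), it uses x_new = x_old - alpha * g, where g is any subgradient of f at x_old and alpha > 0 is a step size.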
Can you provide an example of a subgradient?
-A subgradient for a piecewise linear function at a non-differentiable point can be any line that lies below the function and touches it at that point. For instance, in a function with a 'V' shape, multiple lines can serve as subgradients at the vertex, each representing a different direction in which the function is lower-bounded.
What is the definition of a subgradient in the context of convex optimization?
-A subgradient of a function f at a point x is a vector g such that for all z, the function value f(z) is greater than or equal to f(x) + g^T (z - x). This means that the function lies above the linear approximation defined by the subgradient at x.
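For example, take f(x) = |x| at x = 0 and g = 0.5 (a value chosen here just for illustration): for every z, |z| >= 0 + 0.5*(z - 0) holds, since z >= 0.5z when z >= 0 and |z| = -z > 0 > 0.5z when z < 0. So g = 0.5 is a valid subgradient at 0 even though no gradient exists there.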
Why are subgradients useful in optimization, especially for non-differentiable functions?
-Subgradients are useful because they allow for the optimization of convex functions that may not be differentiable. By moving in the direction of the negative subgradient, one can still converge to the minimum of the function, provided it is convex.
What is the relationship between the L1 penalty and subgradients?
-The L1 penalty, which is the absolute value of a variable, is not differentiable at zero but has subgradients. At points other than zero, the subgradient is the sign of the variable, and at zero, any value between -1 and 1 can be a subgradient.
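A minimal sketch of this choice in NumPy (the function name l1_subgradient is hypothetical; at zero any value in [-1, 1] would do, and np.sign happens to return 0 there):

```python
import numpy as np

def l1_subgradient(w):
    """One valid subgradient of ||w||_1: the sign of each nonzero coordinate,
    and 0 (an arbitrary choice from [-1, 1]) at coordinates that are exactly zero."""
    return np.sign(w)
```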
What are some alternative methods to solve the Lasso problem besides subgradient descent?
-Besides subgradient descent, other methods include the Iteratively Reweighted Least Squares (IRLS) method, which leverages the structure of the Lasso problem by solving a series of weighted least squares problems.
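The video does not spell out the IRLS algorithm, so the following is only a rough sketch of one common IRLS-style scheme for the objective (1/2)||Xw - y||^2 + lam*||w||_1; the function name lasso_irls and the ridge initialization are assumptions:

```python
import numpy as np

def lasso_irls(X, y, lam, n_iters=100, eps=1e-8):
    """IRLS-style sketch for the lasso objective: each |w_j| is approximated by
    w_j^2 / (|w_j_old| + eps), turning the L1 penalty into a weighted ridge
    penalty, so every iteration is a closed-form least-squares solve."""
    d = X.shape[1]
    XtX, Xty = X.T @ X, X.T @ y
    w = np.linalg.solve(XtX + lam * np.eye(d), Xty)   # ridge start (avoids all-zero weights)
    for _ in range(n_iters):
        D = np.diag(1.0 / (np.abs(w) + eps))          # reweighting from the current iterate
        w = np.linalg.solve(XtX + lam * D, Xty)       # weighted ridge, solvable in closed form
    return w
```

The design point is exactly the one made in the video: each step reuses the least-squares "black box" that we already know how to solve in closed form.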
What is the significance of the Lasso problem being a convex optimization problem?
-The convexity of the Lasso problem ensures that any local minimum is also a global minimum. This property allows optimization algorithms like subgradient descent to find the optimal solution reliably, even though the problem may not have a closed-form solution.
Can you summarize the main differences between Ridge and Lasso regression in terms of their solutions and properties?
-Ridge regression uses an L2 penalty and has a closed-form solution, making it computationally efficient. It shrinks coefficients towards zero but does not set them exactly to zero. Lasso regression, on the other hand, uses an L1 penalty and does not have a closed-form solution. It can result in sparse solutions with some coefficients exactly at zero, but requires iterative optimization methods.
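To make the contrast concrete: Ridge is a single matrix solve, while Lasso needs an iterative scheme. Below is a minimal subgradient-descent sketch for the Lasso objective under an assumed fixed step size; it is an illustration, not the video's exact recipe:

```python
import numpy as np

def lasso_subgradient_descent(X, y, lam, step=1e-3, n_iters=10000):
    """Minimize (1/2)||Xw - y||^2 + lam*||w||_1 by stepping along the negative
    of a subgradient; np.sign(w) picks one valid subgradient of ||w||_1
    (0 at coordinates that are exactly zero)."""
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        g = X.T @ (X @ w - y) + lam * np.sign(w)  # subgradient of the lasso objective
        w = w - step * g
    return w
```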
Outlines
Lasso vs Ridge Regression and Subgradient Methods
This paragraph discusses the comparison between Lasso and Ridge regression, highlighting that Lasso does not have a closed-form solution due to its non-differentiability at zero caused by the L1 penalty. It explains the use of subgradient methods to solve such optimization problems where traditional gradient descent isn't applicable. The explanation includes an intuitive description of subgradients, using a piecewise linear function as an example to demonstrate how subgradients can provide a direction for optimization even when the gradient isn't defined. The paragraph concludes by emphasizing the iterative nature of solving Lasso regression problems.
Understanding Subgradients in Optimization
The second paragraph delves deeper into the concept of subgradients, providing a formal definition and explaining their role in optimization, particularly for convex functions. It illustrates how subgradients can be used in place of gradients for optimization when the function isn't differentiable. The explanation includes a geometric interpretation of subgradients with respect to a one-dimensional absolute value function, showing multiple subgradients at points of non-differentiability. The paragraph also discusses the convergence properties of subgradient descent, a method applicable to convex functions, and its relevance to solving the Lasso problem, which is a convex optimization task.
Specialized Methods for Solving Lasso Regression
This paragraph summarizes the challenges and methods for solving the Lasso regression problem, emphasizing that there is no closed-form solution due to its L1 penalty. It mentions the iterative re-weighted least squares (IRLS) method as a specialized approach to tackle Lasso problems, leveraging the solvability of least squares as a 'black box'. The paragraph also touches on the possibility of using general optimization techniques like subgradient descent for Lasso, given its convexity, and acknowledges the existence of other specialized methods without going into detail. It concludes by contrasting the ease of solving Ridge regression with the more complex nature of Lasso.
Conclusion on Regression and Transition to Classification
The final paragraph wraps up the discussion on regression, providing a high-level summary of the concepts covered in the course, including least squares, stochastic gradient descent, maximum likelihood estimation, Bayesian linear regression, Ridge regression, and Lasso. It also mentions the existence of various regularizers and the possibility of incorporating domain-specific constraints into regularization. The paragraph concludes by signaling the transition to the next part of the course, which will focus on classification within the realm of supervised learning.
Keywords
Lasso
Ridge Regression
Closed-form Solution
Subgradient Methods
Gradient Descent
Convex Function
L1 Norm
Iterative Re-weighted Least Squares (IRLS)
Sparsity
Regularization
Highlights
Lasso does not have a closed form solution unlike Ridge regression, which complicates solving the optimization problem.
Subgradient methods are used to solve Lasso due to its non-differentiability at zero.
A subgradient defines a linear approximation that lower-bounds the function everywhere, and can be used when the gradient is not available.
Subgradients can be thought of as slopes in the one-dimensional case, providing a way to handle points where the function is not differentiable.
The L1 penalty in Lasso is not differentiable at zero but has subgradients there: any slope between -1 and 1.
Subgradient descent is an algorithm that can be used for convex functions even if they are not differentiable, converging to the minimum.
The Lasso problem is convex, allowing the use of subgradient descent methods for optimization.
Iterative Re-weighted Least Squares (IRLS) is a specialized method for solving the Lasso problem.
The Lasso problem combines a linear regression with an L1 penalty, promoting sparsity in the solution.
There are no closed-form solutions for Lasso, necessitating the use of optimization techniques or special purpose methods.
Ridge regression provides a closed-form solution and is easier to solve compared to Lasso.
Linear regression with squared loss is equivalent to the maximum likelihood estimator when given a probabilistic twist.
Bayesian linear regression introduces a prior distribution for the weights, leading to Ridge regression as a regularized version.
L1 regularization (Lasso) is preferred for promoting sparsity in the weight vector compared to L2 regularization (Ridge).
Elastic net penalty is a mixed regularizer combining Lasso and Ridge, catering to specific domain constraints (its standard form is given after this list).
Domain-specific constraints can be incorporated into regularizers to guide the optimization process towards expected solutions.
The course concludes with an overview of linear regression techniques and introduces the next part focusing on classification.
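The elastic net penalty mentioned above combines the two regularizers; in its standard form (the exact parameterization is not stated in the video) it adds lambda_1 * ||w||_1 + lambda_2 * ||w||_2^2 to the squared loss.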
Transcripts
So now the question is: why don't we just do lasso? If it is giving us benefits in terms of pushing the values to exactly zero, why would anyone even want to do Ridge? Well, there are still some advantages of doing Ridge over lasso.
Let me point out a few things. Point number one is that lasso does not have a closed-form solution. Remember, our Ridge regression had a closed-form solution: the Ridge closed-form solution was w = (X transpose X + lambda I) inverse X transpose y, so it was a closed-form solution. Whereas for lasso you cannot say "well, I can take the derivative and set it to 0", simply because the problem itself is not differentiable at zero. Your penalty is not a differentiable function, so you cannot take the gradient, set it to 0 and solve for it.
So how do we solve this problem? Well, you can use what are called subgradient methods, which are usually used to solve lasso. Remember, our gradient methods were just following the negative gradient direction. Now, for problems where differentiation is an issue, where you cannot take the gradient at all points, you can still solve the problem using an iterative method if, instead of the gradient, something called a subgradient exists. A subgradient is equivalent to a gradient if a gradient is present at a point; if not, it is a direction along which the function is completely lower bounded. I'll just give you some examples of what subgradients are. We won't get into too much detail about subgradients, but just to give you some sense of how subgradients look:
Let's say you have some function like this: a piecewise linear function, so you see there are different pieces of this function at different points. Now at this point of the input, let us say this is x, there are two pieces which are intersecting, and the problem is that the function is not differentiable at this point, so there is no gradient that you can find there. But a subgradient is some line, some way to approximate this function at this point, which completely lower-bounds the function. For example, this is one subgradient, and here is another subgradient; as you can see, there are multiple subgradients for the same point. If you can compute such subgradients, if your function has such subgradients at every point — for example, at this other point the only subgradient you will have is this line itself, no other line can completely lower-bound the function at every point, whereas at these meeting points there might be multiple lines — these are called subgradients. The blue one is also called a subgradient, but because the function is differentiable at that point it is also called a gradient. If there is only one subgradient, typically that's the gradient. It's kind of telling you the slope of the function; you can think of it as a slope in the one-dimensional case.
So why is this relevant for us? Because the L1 penalty is like this: it's an absolute value function. The absolute value function looks like this — this is x, this is the absolute value of x — and you can think of it as two pieces of linear functions attached at this point, so it is not differentiable at this point, but it has subgradients. For example, this line is a subgradient, this line is a subgradient, this line itself is a subgradient. All these lines have slopes, and all of these slopes are subgradients, so it has multiple subgradients at this point. In fact, the subgradient — the slope, for a one-dimensional function — at 0 is any value between minus one and one: minus 1 is this slope, one is this slope, so any line with a slope between minus one and one will completely lower-bound the absolute value function, and it is called a subgradient.
So let me quickly give you the definition of a subgradient and then we'll close this discussion. I just said intuitively what a subgradient is, but you can think of it as a vector: some vector g in d dimensions is a subgradient of some function f, which maps R^d to R, at a point x in R^d if, for every value of z, the function's value f(z) is greater than or equal to f(x) + g transpose (z - x). This just means that if you linearize this function at x using the subgradient g, then the function is completely above this linearization; that is precisely what this means. Again, take the example that we saw earlier: we had a function like this, and at this point it's not differentiable, but at any z — say I draw this function, I take some z, and the function itself takes the value f(z) there — if the slope is g, then this value is just f(x) + g times (z - x). You can see that f(z) is above this; it's always above this. No matter which z I pick, the function is always above the blue line, and if you can find such a g that satisfies this for all z, then that's a subgradient.
Now what's the use of subgradients? Well, here is the reason why subgradients are useful: if the function f to minimize is a convex function, then subgradient descent is an algorithm that one can use instead of gradient descent, and this algorithm also converges. What does this mean? It means that if your function is nice in the sense that it's convex, then even if it is not differentiable you can still converge to the minimum of this function by moving along the negative of the subgradient as opposed to the gradient. Of course, there are multiple subgradients possible at a given point where it is not differentiable; it doesn't really matter which subgradient you pick — you can pick one arbitrary subgradient and move along the negative of that subgradient direction — and if the function is convex, then one can argue that the subgradient descent algorithm will also converge to the optimal solution. Why is this relevant for us? It is relevant because the lasso problem is a convex problem: the loss function is quadratic, the L1 norm is a convex function, the sum of convex functions is convex, and so the lasso objective itself is a convex optimization problem. For finding the minimum of a convex optimization problem, you can now use something like subgradient descent.
By no means is this the only way to solve the lasso problem; there are other specialized methods that one can use. We won't look at those methods, but I will just point out that there are other, special-purpose methods: because this is a special structure — quadratic loss plus L1 penalty — you can develop special-purpose methods to solve this problem. For lasso, one example is what is called IRLS, the iteratively reweighted least squares method. Again, we won't really look at this method in this course, but essentially the idea is that we can solve least squares easily: least squares is the original linear regression problem, which we know how to solve in closed form, and you can use that as some kind of a black box to solve the lasso problem. Essentially this is taking advantage of the structure in the lasso problem — that it is a linear regression problem plus a penalty — and you can do special-purpose things like IRLS to solve it. That's one option; or you can use a general-purpose subgradient descent algorithm by noting that our loss function plus penalty, the lasso objective, is in fact convex — it may not be differentiable, but it's convex — and so you can use subgradient descent methods. All of these approaches are okay.
We just have to be aware that there is no closed-form solution to the lasso problem, and that's the main summary of this part of the discussion. There is no closed-form solution to lasso, so we cannot hope that given the data and labels we can simply get a closed-form value for w; we have to do some work, and that is important to keep in mind. Though you will get a sparse solution and all that, it is not as straightforward as solving a Ridge regression problem, which on the other hand gives you a nice closed-form solution. You can also do stochastic gradient descent if you wish, if you have large data size n and large d, but for small values of n and d you can just use the closed-form solution; whereas in lasso there is no closed-form solution, so you have to rely either on optimization techniques like subgradient descent or on special-purpose methods like IRLS.
Okay, so with this we kind of conclude our discussion about the regression problem. I just wanted to summarize at a high level whatever we have seen so far in regression, and then we'll move on to the next part of the course. In regression we noted that you can solve a least squares objective, which uses the squared loss, and that had a closed-form solution; we looked at its geometric interpretation, we had a discussion about its computational aspects, and we came up with the stochastic gradient descent algorithm. Then we noted that if you give it a probabilistic twist, linear regression with squared loss is the same as the maximum likelihood estimator. Because it's a maximum likelihood estimator, we can ask whether there is a Bayesian counterpart, and that gave us the regularized version of linear regression, which is the Ridge regression problem, where you use an L2 regularizer — equivalent to a Gaussian prior with zero mean on w itself. Then we noted that while the L2 regularizer is good at pushing the w's close to 0, it may not make them exactly zero. To make them exactly zero, we looked at the geometry of the problem and said that maybe an L1 regularizer is better, and that gave us the lasso problem — the least absolute shrinkage and selection operator — which is the squared loss plus an L1 regularizer; and we also had a small discussion about how to solve the lasso problem, involving subgradients and other techniques, perhaps like IRLS. This is the overall summary of linear regression.

I also wanted to point out that people use all sorts of regularizers: there are mixed lasso-plus-Ridge regularizers called the elastic net penalty, and so on. There is a whole cottage industry of regularizers where, if you think your w should satisfy some nice constraint — maybe w has some group structure, in the sense that you collect features from a lot of places but you know that some features should either be selected together or not be selected together — these are domain-specific constraints that might come up, and you can convert them into regularizers and then try to solve the problem in a regularized fashion. All these are things that people have studied. What we have looked at in this course are some basic types of regularizers which are very popular and most commonly used, but as machine learning practitioners you should also know that if, in the problem you try to solve tomorrow, you have some domain knowledge that says something more about the structure of the solution you expect to see, then you can perhaps convert that into a more meaningful regularizer than simply using a Ridge or a lasso penalty. With this we will complete our discussion about regression in this course, and we will move on to the next part of the course, where we will talk about supervised learning from a classification point of view. Thank you.