ADDRESSING OVERFITTING ISSUES IN THE SPARSE IDENTIFICATION OF NONLINEAR DYNAMICAL SYSTEMS
Summary
TLDR: In this video, Leo Alves from UFF discusses his research on mitigating overfitting in SINDy (Sparse Identification of Nonlinear Dynamics), a symbolic regression technique in machine learning. Sponsored by CNPq and the US Air Force, the study was carried out in collaboration with UCLA's Mechanical and Aerospace Engineering department. Alves contrasts model development from first principles with data-based approaches, focusing on the convergence and error-propagation problems that arise when the system's nonlinearity order or state vector size increases. He examines the impact of regularization, sampling rates, and the condition number on model accuracy, suggesting that alternative polynomial bases may improve SINDy's performance.
Takeaways
- 📚 The speaker, Leo Alves from UFF, discusses his work on addressing overfitting issues in SINDy, a project sponsored by CNPq and the US Air Force, in collaboration with UCLA's Mechanical and Aerospace Engineering department.
- 🌟 The script contrasts two approaches to model development: traditional first principles and data-based approaches, with historical examples including Galileo, Newton, and Kepler.
- 🤖 The focus is on machine learning, specifically symbolic regression, which uses regression analysis to find models that fit available data, citing significant papers in the field.
- 🔍 The script explains the process of using SINDy, starting from compressed data, building a system of equations, and making assumptions about the state vector size and the sparsity of dependencies.
- 📈 The importance of defining a sampling rate and period is highlighted, which are crucial for building matrices and evaluating the state vector at different times.
- 🔧 The process involves creating a library of candidate functions using a polynomial representation with a monomial basis, which is a power series (a code sketch of this step follows this list).
- 🧬 The script delves into the use of genetic programming and compressed sensing for identifying nonlinear differential equations that model data.
- 📉 The impact of increasing nonlinearity order on error propagation and coefficient accuracy is discussed, showing how regularization techniques like Lasso can help.
- 📊 The Lorenz equations are used as test cases to illustrate different regimes: chaotic, doubly periodic, and periodic, each with distinct frequency spectra and time series behavior.
- 📌 The script shows that the condition number of the candidate function matrix is a good proxy for relative error, and how increasing sampling rate or period affects this.
- 🔍 The final takeaway is the recognition of the Vandermonde structure in the library of candidate functions, which is known to be ill-conditioned, and the need to explore different bases to overcome this issue and minimize error propagation.
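Below is a minimal sketch, in Python, of the library-construction step the takeaways describe. The helper name `build_library` and its interface are illustrative, not from the talk:

```python
from itertools import combinations_with_replacement
import numpy as np

def build_library(X, p):
    """X: (m, n) array of state snapshots; p: max polynomial order.
    Returns the (m, q) library of all monomials in the states up to order p."""
    m, n = X.shape
    columns = [np.ones(m)]  # order-0 term: the constant 1
    for order in range(1, p + 1):
        # e.g. for n=2, order=2: (0,0)->x1^2, (0,1)->x1*x2, (1,1)->x2^2
        for idx in combinations_with_replacement(range(n), order):
            columns.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(columns)

# Example: 3 states (Lorenz-sized), quadratic library -> 10 columns
Theta = build_library(np.random.rand(100, 3), p=2)
print(Theta.shape)  # (100, 10)
```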
Q & A
What is the main topic of discussion in this video by Leo Alves?
-The main topic is addressing overfitting issues in the context of symbolic regression, particularly focusing on a method called SINDy (Sparse Identification of Nonlinear Dynamics), which is used for model development from data.
What is the role of CNPQ and the US Air Force in Leo Alves' work?
-CNPq and the US Air Force have sponsored Leo Alves' work, indicating financial or strategic support for the research on addressing overfitting in symbolic regression.
What is symbolic regression and why is it significant in machine learning?
-Symbolic regression is a form of machine learning that uses regression analysis to find models that best fit available data. It is significant because it allows for the discovery of underlying equations from data, which can be crucial for understanding complex systems.
Who are some key researchers mentioned in the script that have contributed to symbolic regression?
-Key researchers mentioned include Lipson and co-workers, who used genetic programming for symbolic regression, and Brunton, Proctor, and Kutz, who applied ideas from compressed sensing and sparse regression to the field.
What are some common data compression methods mentioned in the script?
-Some common data compression methods mentioned are projection methods like Galerkin projection, principal component analysis, proper orthogonal decomposition, and dynamic mode decomposition.
What assumptions does the SINDy method make about the state vector and its relationship with the system?
-The SINDy method assumes that the state vector size is arbitrary but small, and that the dependence of the function on the state vector is sparse, meaning that each function may depend on only a few elements of the state vector.
How does the script describe the process of building a library of candidate functions for SINDy?
-The process involves defining a sampling rate and period, building matrices based on the state vector and its time derivatives, and creating a polynomial representation using a monomial basis, which includes all possible combinations of terms up to a certain order.
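As a concrete check on the library size: the number of monomials in n state variables up to total order p is the binomial coefficient C(n+p, p) (standard combinatorics, not stated explicitly in the talk). A quick sketch:

```python
import math

# q = C(n + p, p): number of monomials in n variables up to total order p.
n, p = 3, 5                  # Lorenz-sized state, fifth-order library
q = math.comb(n + p, p)
print(q)                     # 56 candidate terms
```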
What is the role of regularization in the context of the SINDy method?
-Regularization, specifically Lasso in the script, is used to minimize the objective function and prevent overfitting by automatically removing terms that are deemed unnecessary, thus improving the model's generalizability.
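A hedged sketch of this regression step, using scikit-learn's Lasso as a stand-in for whatever solver the talk actually used; `Theta` is the candidate-function matrix and `X_dot` the matrix of time derivatives, both as in the earlier library sketch:

```python
import numpy as np
from sklearn.linear_model import Lasso

def sindy_lasso(Theta, X_dot, alpha=0.1):
    """Solve X_dot ≈ Theta @ Xi, one state equation at a time."""
    Xi = np.zeros((Theta.shape[1], X_dot.shape[1]))
    for k in range(X_dot.shape[1]):          # one regression per state
        model = Lasso(alpha=alpha, fit_intercept=False, max_iter=10000)
        model.fit(Theta, X_dot[:, k])
        Xi[:, k] = model.coef_               # Lasso zeroes out small terms
    return Xi
```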
What are the Lorenz equations mentioned in the script, and what is unique about their parameter involvement?
-The Lorenz equations are a set of differential equations used as test cases in the script. They are unique in that the control parameters appear only in the linear terms of each equation, and the maximum nonlinearity order is quadratic.
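For reference, a sketch of the Lorenz system; sigma = 10 and beta = 8/3 are the classical values (the talk does not state them), and rho selects the regime (28 for the chaotic case, with 165 mentioned for the doubly periodic one):

```python
import numpy as np
from scipy.integrate import solve_ivp

def lorenz(t, s, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    # Control parameters multiply only linear terms; x*z and x*y are quadratic.
    x, y, z = s
    return [sigma * (y - x), x * (rho - z) - y, x * y - beta * z]

t_eval = np.linspace(0, 50, 5001)
sol = solve_ivp(lorenz, (0, 50), [1.0, 1.0, 1.0],
                t_eval=t_eval, rtol=1e-9, atol=1e-9)
X = sol.y.T   # (m, 3) snapshot matrix of the state history
```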
How does the script discuss the impact of increasing nonlinearity order on the performance of the SINDy method?
-The script discusses that increasing the nonlinearity order leads to error propagation and incorrect coefficients, highlighting the need for regularization techniques to mitigate these issues.
What insights does the script provide regarding the relationship between the condition number of the candidate function matrix and the relative error in the model?
-The script suggests that the condition number of the candidate function matrix is a good proxy for the relative error in the model. It increases with the nonlinearity order and the size of the state vector, indicating more error propagation and the limitations of using the SINDy method.
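A sketch of using the condition number this way, reusing the illustrative `build_library` helper and the Lorenz data `X` from the earlier sketches:

```python
import numpy as np

# Condition number of the candidate library as nonlinearity order p grows;
# it is computable without knowing the exact solution.
for p in range(1, 6):
    Theta = build_library(X, p)
    print(p, f"{np.linalg.cond(Theta):.2e}")
```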
What is the proposed next step to overcome the limitations discussed in the script?
-The proposed next step is to use a different basis, such as an orthogonal basis, to represent the unknown system, overcoming the error propagation associated with the Vandermonde structure of the current candidate function matrix.
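The Vandermonde structure is easiest to see with a single state variable, where the monomial library is literally a Vandermonde matrix; a short illustration of the ill-conditioning:

```python
import numpy as np

x = np.linspace(0, 1, 50)
for p in (5, 10, 20):
    V = np.vander(x, p + 1, increasing=True)   # columns 1, x, ..., x^p
    print(p, f"{np.linalg.cond(V):.2e}")       # grows rapidly with order
```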
Outlines
📚 Introduction to Overfitting in Symbolic Regression
The speaker, Leo Alves from UFF, introduces the topic of addressing overfitting issues in SINDy, a project sponsored by CNPq and the US Air Force and conducted in collaboration with UCLA's Mechanical and Aerospace Engineering department. The talk focuses on the use of machine learning, specifically symbolic regression, to develop models from data. Symbolic regression is contrasted with traditional model development from first principles as well as with historical data-based approaches by scientists like Galileo and Johannes Kepler. The speaker also mentions significant papers in the field, including work by Lipson and coworkers and by Brunton, Proctor, and Kutz, and discusses the convergence problems that appear when the system's nonlinearity order or state vector size increases.
🔍 Analyzing Symbolic Regression and Regularization Techniques
This paragraph delves into the technical aspects of symbolic regression, explaining how nonlinear ordinary differential equations are transformed into an algebraic system using matrices and linear regression. The use of an objective function with regularization, specifically Lasso, is highlighted for its ability to automatically remove unnecessary terms (see the sketch after this paragraph). The speaker uses the Lorenz equations as test cases to demonstrate different regimes of behavior: chaotic, doubly periodic, and periodic. The effects of increasing the nonlinearity order on error propagation and coefficient accuracy are discussed, emphasizing the importance of regularization in managing these issues.
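The talk itself uses Lasso; the Brunton, Proctor, and Kutz paper it cites is usually associated with sequentially thresholded least squares (STLSQ). A minimal sketch of that alternative, with illustrative names and defaults:

```python
import numpy as np

def stlsq(Theta, X_dot, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares: alternate full least-squares
    fits with hard-thresholding of small coefficients."""
    Xi = np.linalg.lstsq(Theta, X_dot, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(Xi) < threshold           # terms to knock out
        Xi[small] = 0.0
        for k in range(X_dot.shape[1]):          # refit surviving terms only
            big = ~small[:, k]
            if big.any():
                Xi[big, k] = np.linalg.lstsq(
                    Theta[:, big], X_dot[:, k], rcond=None)[0]
    return Xi
```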
📉 Exploring the Impact of Non-linearity Order and Condition Number
The final paragraph presents an analysis of the impact of the nonlinearity order on the condition number of the candidate function matrix and on the relative error of the model fit. Increasing the sampling period improves the condition number and reduces the error only up to a certain point, beyond which there is no further improvement; increasing the sampling rate likewise stops helping past a threshold. The chaotic condition is found to produce the smallest condition numbers and errors, which is counterintuitive but can be explained by the more random-like distribution of matrix elements in chaotic systems. The paragraph concludes with insights on the Vandermonde structure of the candidate function library and its implications for error propagation, suggesting that a different basis for the polynomial representation could overcome these issues.
Keywords
💡Overfitting
💡Symbolic Regression
💡First Principles
💡Data-Based Approaches
💡Genetic Programming
💡Compressed Sensing
💡Vandermonde Matrix
💡Regularization
💡Condition Number
💡Asymptotic Behavior
💡Eigenvalues
Highlights
Leo Alves from UFF discusses overfitting issues in SINDy, in work sponsored by CNPq and the US Air Force.
Collaboration with the Mechanical and Aerospace Engineering Department at UCLA on using machine learning for model development.
Traditional model development from first principles as proposed by Galileo and Newton, contrasted with data-based approaches.
Introduction of symbolic regression in machine learning to find models that fit available data.
Citation of key papers by Lipson and coworkers, and by Brunton, Proctor, and Kutz, that brought symbolic regression to the forefront.
Addressing convergence problems in SINDy, particularly when increasing system nonlinearity or state vector size.
Assumption in SINDy that the state vector size is arbitrary but small, and that the time history in the data is known.
Explanation of how to build a system of equations in SINDy from the state vector and the state function.
Importance of having the time derivative of the data, either measured directly or approximated numerically.
Building a library of candidate functions using a polynomial representation with a monomial basis.
Transformation of non-linear ordinary differential equations into an algebraic system for solving.
Use of an objective function with regularization, specifically Lasso, to minimize error and remove unnecessary terms.
Test cases using the Lorenz equations to analyze different regimes: chaotic, doubly periodic, and periodic.
Observation of error propagation and coefficient inaccuracies when increasing nonlinearity order without regularization.
Demonstration of regularization's effectiveness in eliminating unnecessary terms, though unphysical terms remain in the recovered models.
Analysis of relative error behavior and condition number of the candidate function matrix with different sampling rates and periods.
Condition number as a proxy for relative error, useful when the exact solution is unknown.
Vandermonde structure of the library of candidate functions and its impact on error propagation and conditioning.
Proposal to use a different basis to represent unknown systems to overcome issues with error propagation.
Invitation for questions and closing remarks, emphasizing the importance of addressing overfitting in machine learning models.
Transcripts
Good morning, afternoon, or night, depending on when you're watching this. My name is Leo Alves, I'm from UFF, and I'm here to talk about my work on addressing overfitting issues in SINDy. This work has been sponsored by CNPq and also by the US Air Force, and it has been done in collaboration with some folks at the Mechanical and Aerospace Engineering Department at UCLA.
So, in general, we're going to talk about the use of SINDy for model development. Model development has traditionally been done in modern science from first principles; this is what was proposed by Galileo from the beginning, and, as everybody knows, it is what Isaac Newton did when he developed his laws of motion from first principles using calculus and his experiments. But it can also be done through data-based approaches, and this is nothing new: Johannes Kepler, a contemporary of the two people I just mentioned, did exactly that when he developed his laws of planetary motion using observational data of the planetary orbits.
What we're going to focus on is the use of machine learning to do that, and specifically on an aspect of machine learning known as symbolic regression, which is nothing but the use of regression analysis to search for the models that best fit the available data. To mention a few papers on symbolic regression that really brought it to the forefront of machine learning: there is the work of Lipson and coworkers, in which they used genetic programming to identify the nonlinear differential equations that model data, and, more recently, the work of Brunton, Proctor, and Kutz, which uses ideas from compressed sensing and sparse regression to do the same, but based on linear regression. There is a lot more work building on what these groups have done, but I'm citing here only the original papers in which the methods were developed.
So we're going to focus in this work on SINDy and on specific issues associated with it, namely convergence problems. These usually happen when you try to increase the nonlinearity order of your system, and they also happen when you increase the state vector size. I'm going to try to address those issues here.
Basically, we assume that you have some data, a compressed data set that represents your problem. Compressing the data can be done in different ways: with projection methods like Galerkin projection, with principal component analysis, or with proper orthogonal decomposition or dynamic mode decomposition; there are a number of ways to do it. You can then build a system of equations in which you have your state vector and your state function. SINDy has a few assumptions behind it. The main one is that your state vector size is arbitrary but small, so n is arbitrary but small, and that you know the time history in your data, that is, how these state variables vary with time. You also assume that whatever dependence f has on your state vector x, although not known, is very sparse: for instance, f1 depends only on x1 and x2 but not on the others, and so on.
The first thing you do in SINDy is define a sampling rate m and a sampling period tau, which depend on the initial and final times between which you extract your data. Then you build your matrices: you take your state vector, evaluate it at the different times for which you have data, and assemble that matrix. But what you really need is the time derivative of that data, which you can either measure directly or approximate numerically from the original data.
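A minimal sketch of this step, assuming the Lorenz snapshot matrix `X` and time grid `t_eval` from the integration sketch shown earlier in the Q&A section:

```python
import numpy as np

# Sample the state over the period and approximate the time derivative
# numerically (second-order finite differences via np.gradient).
dt = t_eval[1] - t_eval[0]           # uniform sampling interval
X_dot = np.gradient(X, dt, axis=0)   # same shape as X: (m, n)
```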
Next, you build your library of candidate functions. The way you usually do that is with a polynomial representation using a monomial basis, which is nothing but a power series: you have 1, x, x squared, and so on, together with all possible combinations; for instance, the quadratic terms are not only x1 squared but also x1 times x2, and so on. You do that up to whatever order you want, whose highest value we call p, and that gives you a pool of q candidate terms that you can combine to fit your data. Of course, each of these terms gets a coefficient in front of it, and those coefficients form your coefficient matrix.
You can then transform the nonlinear ordinary differential equations we talked about before into an algebraic system based on these matrices. You can do that for each line of your coefficient matrix, or rather of your state vector derivative matrix, and solve it in stages: you do linear regression to find out which coefficients you need to put in front of which terms in order to fit the data for x dot. To do that, of course, you need an objective function to minimize, which in this case is essentially the left-hand side minus the right-hand side, and we include some regularization, specifically Lasso, a well-known choice that is nice because it automatically removes the terms it deems unnecessary.
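Not mentioned in the talk, but the open-source PySINDy package implements this same pipeline end to end; a minimal sketch, with the threshold value chosen arbitrarily:

```python
import pysindy as ps

# Fit x_dot = Theta(x) @ Xi with a quadratic library and sparse regression.
model = ps.SINDy(
    feature_library=ps.PolynomialLibrary(degree=2),
    optimizer=ps.STLSQ(threshold=0.1),   # sequentially thresholded least squares
)
model.fit(X, t=dt)   # X, dt as in the earlier sketches
model.print()        # prints the recovered equations term by term
```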
The test cases we're going to use are the Lorenz equations. An interesting thing about them is that the control parameters sigma, rho, and beta appear only in the linear terms of each equation, and the maximum nonlinearity order in them is quadratic: you have x times z and x times y. There are three typical scenarios we can analyze. One is a chaotic regime; the parameter values we used to obtain it are given here, and you can see a broad band in the frequency spectrum, which explains the time-series behavior shown on the left. There is a doubly periodic regime, in which you get two dominant frequencies in your spectrum, with the corresponding time series on the left. And there is a periodic regime, where a single dominant frequency controls the behavior. Of course, I'm talking here about asymptotic behavior, that is, very large times, so you ignore the data from early times, where you can have linear growth of disturbances and whatever transient behavior occurs before you reach the asymptotic trends I just showed.
The first results I'm going to look into are without regularization. We're going to look at the values of the three main parameters as the nonlinearity order is increased; I need to note here that order one does not mean a linear approximation. We use an inverse-problem type of approach, in which I feed my library of candidate functions exactly the same terms that appear in the Lorenz equations, although the parameters in front of each one are not known beforehand. As we increase the nonlinearity order, we see a lot of error propagation: the results in the first line are bang on, matching the coefficients used to generate the data to machine precision, but as the order increases the errors propagate and the coefficients become very, very wrong. Not only that: I'm illustrating here the fifth-order case, normalized by the highest coefficient, which, as you can see, is the one multiplying x in the equation for y dot, which is exactly rho; that's why the maximum value here is one. You can see that a lot of terms appear in your equations that should be zero but are there, and even though the high-order ones are small, you cannot tell whether a term is genuinely small or shouldn't exist at all.
This is the reason people use Lasso regularization: it knocks off those terms. It does a pretty good job of eliminating most of them, but you still end up with unphysical terms in whatever model you get back, independent of the value you use for the regularization parameter. The previous results were for the doubly periodic case, which is why rho is 165, and here rho equals 28, the chaotic case, but a similar trend happens in all three cases.
We then move on to looking at how the relative error behaves, and we also associate it with the condition number of the matrix of the library of candidate functions. We can see that increasing the sampling rate, after some point, does nothing to change the condition number of the matrix, and it doesn't reduce the relative error either. Increasing the sampling period does decrease the condition number and the error, but after a certain point it no longer improves the results. So you can tell that the condition number is a pretty good proxy for the behavior of the relative error, which is a very good thing: in this particular case we generated our data from the Lorenz equations, so we know the exact solution, but in general we do not, and then we can't really calculate the relative error. Knowing that the condition number, which you can calculate for any problem, is a good proxy for this behavior is therefore very useful. Just note that for very small sampling rates the condition number decreases because the matrix size decreases, but then we don't have enough data to create a proper model, which is why the error remains large. If you understand that caveat, you can use the condition number as a proxy for your relative error.
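A sketch of the sampling-period sweep described here, reusing the illustrative `build_library` helper and the Lorenz data from the earlier sketches:

```python
import numpy as np

# Condition number of the library as the sampling period tau grows;
# per the talk, it should drop at first and then plateau.
for tau in (5.0, 10.0, 25.0, 50.0):
    m = int(tau / dt) + 1                # number of samples in the period
    Theta = build_library(X[:m], p=5)
    print(tau, f"{np.linalg.cond(Theta):.2e}")
```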
This last plot summarizes our results. It shows that as we increase the nonlinearity order, the condition number of the matrix associated with the library of candidate functions increases, which means there is more error propagation, which is why the relative error also increases. A less intuitive finding is that the chaotic condition is the one that produces the smallest condition numbers, although they still increase with the nonlinearity order, and also the smallest errors. This is sort of counterintuitive, but the fact that the solution is chaotic probably moves the matrix towards having elements with a random distribution of values, and we know that matrices whose elements are generated randomly have very small condition numbers; you can prove that. So maybe the chaotic behavior, by moving towards that scenario, is why the condition number decreases. Also, a way to approximately calculate the condition number is the ratio between the largest and smallest eigenvalues, and as you go to the periodic and doubly periodic conditions the eigenvalues move away from the unit circle, so that ratio becomes larger, which is probably why the condition number also increases in those cases.
To summarize: it turns out that the library of candidate functions has a Vandermonde structure, and Vandermonde matrices are known to be ill-conditioned; in fact, their condition number increases as the matrix size increases. By increasing the nonlinearity order, and, although we haven't done it here, by increasing the state vector size, we increase the size of that matrix. A larger matrix means more ill-conditioning, because it is a Vandermonde-type matrix, and therefore more error propagation, which limits our ability to use SINDy. The reason this happens is that we have chosen a monomial basis for the polynomial representation; that is why the library has a Vandermonde type of structure, a well-known fact from interpolation theory and numerical analysis. So the next step moving forward, which we are currently pursuing, is to use a different basis, namely orthogonal bases, to represent our unknown system, to try to overcome this issue and minimize error propagation.
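A one-variable illustration of why an orthogonal basis should help, comparing the monomial (Vandermonde) design matrix against a Chebyshev one on the same points; NumPy's polynomial module is assumed here, not anything used in the talk:

```python
import numpy as np
from numpy.polynomial import chebyshev

x = np.linspace(-1, 1, 200)
p = 15
V_mono = np.vander(x, p + 1, increasing=True)  # columns 1, x, ..., x^p
V_cheb = chebyshev.chebvander(x, p)            # Chebyshev T_0, ..., T_p
print(f"monomial:  {np.linalg.cond(V_mono):.2e}")
print(f"chebyshev: {np.linalg.cond(V_cheb):.2e}")  # orders of magnitude smaller
```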
Thank you for your time. If you have any questions, please place them in the chat and I'll get back to you. Thank you so much, and take care.