Liquid Neural Networks
Summary
TLDR: The video script features a CBMM talk with Daniela Rus, director of CSAIL, and Dr. Ramin Hasani, where they introduce the concept of 'liquid neural networks.' These networks, inspired by neuroscience, aim to improve upon traditional deep neural networks by offering more compact, sustainable, and explainable models. Hasani discusses the limitations of current AI systems, which rely heavily on computation and data without fully capturing the causal structure of tasks. He presents a new approach that integrates biological insights into machine learning, resulting in models that are more expressive, robust to perturbations, and capable of extrapolation. The talk also covers the implementation of these models using continuous-time processes and the potential applications in real-world robotics and autonomous driving.
Takeaways
- 📚 Daniela Rus introduced the concept of bridging the natural world with engineering, focusing on intelligence in both biological brains and artificial intelligence (AI).
- 🤖 Ramin Hasani presented Liquid Neural Networks, inspired by neuroscience, aiming to improve upon deep neural networks in terms of compactness, sustainability, and explainability.
- 🧠 The natural brain's interaction with the environment and its ability to understand causality were highlighted as areas where AI could benefit from biological insights.
- 🚗 Attention maps from AI systems were discussed, noting differences in focus when driving decisions are made, with an emphasis on the importance of capturing the true causal structure.
- 🔬 Hasani's research involved looking at neural circuits and dynamics at the cellular level to understand the building blocks of intelligence.
- 🌐 Continuous time neural networks (Neural ODEs) were explored for their ability to model sequential behavior and their potential advantages over discrete representations.
- 🔍 The importance of using numerical ODE solvers for implementing these models and the trade-offs between different solvers in terms of accuracy and memory complexity were discussed.
- 🤝 The integration of biological principles, such as leaky integrators and conductance-based synapse models, into AI networks to improve representation learning and robustness was emphasized.
- 📉 The expressivity of different network types was compared, demonstrating that liquid neural networks could produce more complex trajectories, indicating higher expressivity.
- 🚀 Applications of these networks were shown in real-world scenarios like autonomous driving, where they outperformed traditional deep learning models in terms of parameter efficiency and robustness to perturbations.
- ⚖️ The potential of liquid neural networks to serve as a bridge between statistical and physical models, offering a more causal and interpretable approach to machine learning, was highlighted.
Q & A
Who is the presenter of today's CBMM talk?
-The presenter of today's CBMM talk is Daniela Rus, the director of CSAIL.
What is the main focus of Daniela Rus' research?
-Daniela Rus' research focuses on bridging the gap between the natural world and engineering, specifically by drawing inspiration from the natural world to create more compact, sustainable, and explainable machine learning models.
What is the name of the artificial intelligence algorithm that Ramin Hasani is presenting?
-Ramin Hasani is presenting Liquid Neural Networks, a class of AI algorithms.
How do Liquid Neural Networks differ from traditional deep neural networks?
-Liquid Neural Networks differ from traditional deep neural networks by incorporating principles from neuroscience, such as continuous dynamics, synaptic release mechanisms, and conductance-based synapse models, leading to more expressive and causally structured models.
What are the advantages of using continuous time models in machine learning?
-Continuous time models offer advantages such as a larger space of possible functions, arbitrary computation steps, the ability to model sequential behavior more naturally, and improved expressivity and robustness to perturbations.
How do Liquid Neural Networks capture the causal structure of data?
-Liquid Neural Networks capture the causal structure of data by using dynamical systems that are inspired by biological neural activity, allowing them to understand and predict the outcomes of interventions and to perform better in out-of-distribution scenarios.
What is the significance of the unique solution property in the context of Liquid Neural Networks?
-The unique solution property, derived from the Picard-Lindelöf theorem, ensures that the differential equations describing the network's dynamics have a unique solution under certain conditions, which is crucial for the network's ability to make deterministic predictions and maintain stability.
How do Liquid Neural Networks improve upon the limitations of standard neural networks?
-Liquid Neural Networks improve upon standard neural networks by providing a more expressive representation, better handling of memory and temporal aspects of tasks, enhanced robustness to input noise, and a more interpretable model structure due to their biological inspiration.
What are some potential applications of Liquid Neural Networks?
-Potential applications of Liquid Neural Networks include autonomous driving, robotics, generative modeling, and any task that requires capturing causal relationships, temporal dynamics, or making decisions based on complex data.
What are the challenges or limitations associated with implementing Liquid Neural Networks?
-Challenges or limitations associated with Liquid Neural Networks include potentially longer training and testing times due to the complexity of ODE solvers, the possibility of vanishing gradients for learning long-term dependencies, and the need for careful initialization and parameter tuning.
How does the research presented by Ramin Hasani contribute to the broader field of artificial intelligence?
-The research contributes to the broader field of artificial intelligence by proposing a new class of algorithms that are inspired by neuroscience, which can lead to more efficient, robust, and interpretable AI models. It also opens up new avenues for research in understanding intelligence and developing advanced machine learning frameworks.
Outlines
🎉 Introduction to CBMM Talk and Daniela Rus
The presenter warmly welcomes the audience to a CBMM talk featuring Daniela Rus, a renowned director of CSAIL and a significant contributor to robotics. Daniela is recognized for her innovative ideas in robotics and AI, which are often featured in tech news. She is also known for her interest in the brain's problem, not just AI, and her role as an advisor for the presenter. Daniela introduces Dr. Ramin Hasani, who will lead the presentation on a new idea that aims to bridge the natural and engineering worlds by creating more compact, sustainable, and explainable machine learning models.
🧠 Bridging Neuroscience and Machine Learning
Dr. Ramin Hasani begins by expressing his excitement to present 'liquid neural networks,' a class of AI algorithms that integrate neuroscience principles into machine learning. He contrasts brain activity patterns with those of a trained network controlling an autonomous car, highlighting the similarities and fundamental differences. Ramin emphasizes the importance of understanding the causal structure of tasks, the robustness of natural brains, and the efficiency of neural models. He demonstrates a typical statistical machine learning system and discusses the limitations of convolutional neural networks (CNNs) in capturing the true causality behind driving decisions. The talk then shifts towards improving these models by incorporating insights from neuroscience.
🔬 Liquid Neural Networks and Their Expressiveness
Ramin Hasani explains the concept of liquid neural networks, which are inspired by the nervous systems of small species and operate on continuous dynamics described by differential equations. These networks are shown to be more expressive than traditional deep learning models, capable of handling memory and capturing the true causal structure of data. They are also robust to perturbations and can be used for generative modeling and extrapolation. The presentation includes a detailed discussion on how these networks are created, starting from the interaction of two neurons and the synaptic propagation between them.
🚗 Implementing Neural ODEs for Autonomous Driving
The talk delves into the practical implementation of neural ODEs, particularly in the context of autonomous driving. Ramin outlines the process of using numerical ODE solvers to implement these models, including the use of explicit Euler solvers and adjoint sensitivity methods for backpropagation. He also discusses the challenges of implementing these models in real-world applications and the need for improvement by drawing inspiration from biological processes, such as the leaky integrator model and conductance-based synapse models.
🤖 Liquid Neural Networks in Robotics and Decision Making
Ramin Hasani discusses the application of liquid neural networks in robotics, particularly in decision-making processes. He shows how these networks, with their dynamic causal structures, can outperform statistical models and physical models in tasks requiring temporal data processing. The talk includes an empirical analysis of the networks' performance in various tasks, including physical dynamics modeling and autonomous driving. Ramin also addresses the limitations of these networks, such as complexity tied to ODE solvers and potential issues with vanishing gradients, and suggests solutions like using gating mechanisms to preserve gradients.
🌟 Conclusion and Future Perspectives
In conclusion, Ramin Hasani emphasizes the potential of liquid neural networks, which combine elements of computational neuroscience and machine learning, to perform inference model-free, capture temporal aspects of tasks, and improve decision-making. He suggests that these networks can be composed and connected in various architectures, making them highly versatile. Ramin also highlights the importance of focusing on how brains acquire knowledge to narrow down the vast research space in AI. He encourages further exploration of these networks for complex tasks and mentions the open-source availability of the technology presented.
Keywords
💡Artificial Intelligence
💡Machine Learning
💡Neural Networks
💡Liquid Neural Networks
💡Neuroscience
💡Causal Structure
💡Sustainability
💡Explainability
💡Deep Neural Networks
💡Computational Models
💡Autonomous Driving
Highlights
Daniela Rus introduces a new idea in robotics that aims to bridge the gap between the natural world and engineering.
Dr. Ramin Hasani presents liquid neural networks, inspired by neuroscience, for structured machine learning.
Liquid neural networks demonstrate similarities in activation patterns to natural brain activity.
The research explores the fundamental differences and gaps between intelligence in natural brains and deep learning models.
Natural brains' interaction with the environment and causal understanding is a key area for improving AI.
Brains' robustness and flexibility, especially in perturbations, is highlighted as an aspect to emulate in AI models.
Efficiency in neural models is emphasized, noting that not all parts of a network are always active.
Attention maps from CNNs reveal a learned focus on the sides of the road, not the actual causation for driving decisions.
Adding noise to images significantly impacts the decisions made by conventional CNNs, demonstrating a lack of reliability.
Neuroscience can improve AI by incorporating system-level goals and mechanisms from biological systems.
Liquid neural networks are proposed as more compact, sustainable, and explainable models than deep neural networks.
The expressivity of liquid neural networks is theoretically and empirically evaluated, showing higher trajectory lengths.
LTC networks outperform other models in tasks requiring temporal data processing and have better inference capabilities.
Liquid neural networks are robust to perturbations and can be used for generative modeling and extrapolation.
The research successfully implements liquid neural networks in real-world robotics and autonomous driving.
The attention maps of liquid networks are more focused on the true causal structure of tasks compared to traditional CNNs.
The integration of biological principles into machine learning leads to improved representation learning and model robustness.
The technology and research behind liquid neural networks are open source, available for further exploration and development.
Transcripts
PRESENTER: So welcome to today's CBMM talk.
It's great, really great, to have Daniela Rus coming here.
She's, of course, the director of CSAIL, a great leader.
I think you all know her.
And from time to time, she has these great, wonderful, simple,
beautiful ideas in robotics, which
we read in papers and in the news, in the tech news.
And she's also a great friend of CBMM,
has been a great advisor for me.
And it's somebody who really likes the problem of the brain
and not just artificial intelligence,
although artificial intelligence, of course,
is also a great problem.
DANIELA RUS: Thank you for this kind introduction.
It's really a great pleasure to be here
to share some of our ideas with the CBMM community.
And so today, we will tell you about a new idea
we have been pursuing, together with Dr. Ramin Hasani, who
will present most of the talk.
And the basic idea we want to describe with you
aims to bring the natural world and the engineering world
closer together.
And Ramin and I are going at this problem,
in part because we have a general curiosity and desire
to understand intelligence, in part
because when I look at the state of the art in the field
of artificial intelligence, I see a lot of advancements.
And I see that these advancements are really
using decades-old ideas that are enhanced
by computation and data.
And so a natural question is whether this is intelligence.
Another question is, are there other ideas?
Can we use the natural world to inspire
us to think differently?
Because I believe if we don't come up with new ideas,
then our results are going to become increasingly more
incremental.
Because more and more people will be plowing the same field.
And so the field really desperately needs
some new ideas.
And the idea that Ramin will describe today
aims to build machine learned models
that are much more compact, much more sustainable, and much more
explainable than the models that are
based on deep neural networks.
And so let me just say that much.
And now, it is my great pleasure to introduce more formally
Dr. Ramin Hasani.
Ramin is a postdoc in my group.
Prior to joining my group, he was a PhD student
at the Technical University in Vienna.
And prior to that, he did his master's degree
at Politecnico di Milano.
And so with that, Ramin, please join us and tell us
about your vision and results.
RAMIN HASANI: So hi, everyone.
Thanks, Daniela, for the introduction.
And thanks, Professor Poggio.
All right, I'm very excited to be here,
presenting liquid neural networks, a class
of artificial intelligence algorithms
that tries to bring a little bit of neuroscience
in a structured way to machine learning.
So if you look at neural activity in brains, in general,
on the left side, you see the brain activity of a mouse,
and on the right side, you see one of the networks
that we trained end to end--
a controller for controlling an autonomous car.
We see that, basically, the activation
of the patterns and activations maybe,
superficially, look very similar.
But in principle, there are fundamental differences.
There are huge gaps between intelligence
as we know them in brains compared to deep models,
in particular, representation learning capacities--
how natural brains actually approach the organization
of the world around them to make use of them,
to be able to control them to achieve their goals.
So we know that natural brains interact highly
with their environments in order to understand their world.
So by understanding-- I mean when they can actually interact
with the world and to capture causality, basically,
like the causal structure of the task that they are performing.
And this is one of the reasons where natural brains can
actually go out of distribution, where statistical machine
learning, by definition, will stay in IID, right?
And this is one area that would be extremely beneficial if we
can explore more and maybe bring some of those insights
from natural brains back to artificial intelligence.
And at the same time, we know that brains
are much more robust and much more
flexible in terms of a perturbation
or environments that they are getting into.
And finally, efficiency of the models.
So a network is not always active,
so there is always some part of the network that
is taking care of the computations that is on demand.
So allow me to demonstrate this kind
of a typical, statistical end-to-end machine learning
system, so where you have inputs that are from camera inputs.
And then you have a deep neural network
that takes care of the, let's say, steering angle of a car.
So in this kind of framework, what we are seeing,
we are seeing the activity of the network.
And we see that this network is actually
real-world tested on a real car.
And these are demonstrations from the test
set, where they are actually deployed in the environment.
They have been trained using human data,
and they are now deployed.
So one of the things that we actually
looked into is, basically, the attention of this network,
like what kind of representation has been learned?
What pixels are the most important pixels when a driving
decision is being made?
So this CNN actually learned to attend
to the sides of the road, where we see lighter
regions in this attention map, in order
to take driving decisions.
And that's not the actual causation.
When you're driving, you're not just looking around, right?
You're looking into the road and in front of you.
So you want to actually have your focus on that perspective.
So the causal structure here is missing,
although the task is being completed by the network.
Now, if you add some noise on top of the image,
like a little bit of noise, we see that this attention map
is not even reliable anymore.
Even if this noise is kind of a small Gaussian perturbation,
you can see that it has huge influence on the decisions
and the consistency of the decisions
that the network makes.
So how can we improve this by bringing neuroscience in?
As Marr and Poggio said and set up a framework for us
for actually creating--
let's say, if you want to explain a biological system,
you want to say, at a system level,
you can look at it from a system level
and find out, what are the goals of the system
and what are the kind of mechanisms
that, actually, you get to the goals, that's the system level.
And then you can also have this view
of looking into building blocks of these things,
going down and looking into how intelligence
emerges from cells.
You can go down and basically use
computational models, precise mechanisms that
exist in biology.
So having this kind of framework in mind, what we can do--
and that's what we did, just showing you
an outline of how this research is a summary of what
this research is about.
So we looked into nervous system of a small species.
And we got down into neural circuit level.
And even for understanding neural circuits,
we actually went into the neuron and synapse level
even further to explain, to really fundamentally figure
out, what are the building blocks there.
And you know that you can even go lower
than that and computational model down to atoms.
But there is actually a level that you
have to satisfy yourself that you don't want to go below
that in order to actually get there and then take this model
and see what kind of capabilities
you can have using the engineering,
super-advanced machine learning frameworks that recently got
developed.
So we stopped at a certain level,
which I'm going to explain throughout the talk.
And we saw that these models are much more expressive
than their counterparts in deep learning,
although the kind of abstraction that we did is really simple.
But in terms of how much capacity
these networks can generate, they are much more expressive.
And I'm going to show you the math behind
and also the experimental evidence for that.
These systems can handle memory, and these systems
can handle explicit and implicit memory mechanisms
that I will explain throughout the talk.
More importantly, these systems can capture the true causal
structure of the data.
And that's part of the reason why these systems actually
can be helpful in these kinds of closed-loop, real-world
decision-making processes.
The systems are basically robust to perturbations.
And we can use them for generative modeling.
We can even use them for extrapolation.
You can go out of distribution with these type of networks.
Because if some process can capture the causal structure
of the data and you can prove that that's the case,
then the system is being able to actually go
even out of distribution.
And with that in mind, we actually
try to perform decision making in real-world robotics.
We are distributed robotics lab, and we
want to bring these insights into the brains.
Now, to show you what kind of change we have done,
you can look at this system.
This system has now, on the right-hand side,
what you see is the 19 nodes of the system that
is sparsely connected together.
And this is described by that model
that, actually, we developed.
And then you can actually get into attention maps that
are much more focused on the true causal structure
of the task.
And this is not just on this task.
But we can actually see more throughout the talk.
Well, how do you get started for creating a model?
Let's look into the, let's say, interaction of two neurons
and the synaptic propagation between information propagation
between the two.
So neural dynamics are typically given--
unlike deep learning systems--
they're given with continuous processes.
And they are described by differential equations.
So synaptic release is not just the scalar rate.
So synaptic release can be modeled
with much more sophisticated kind of mechanisms.
So you can really get down to probability
of if a neurotransmitter is actually
going to stick to the receptors of the second neuron.
So you can really get into the process, how much complexity.
You can really add nonlinearity to the system.
And there are also recurrence in the structure, there's memory,
and there is a sparsity all over the place in neural circuits.
So having these principles in mind,
the goal is to actually incorporate
these small principles that I mentioned
into improving representation learning, improving
the robustness of machine learning
model and the statistical models, and, at the same time,
improving their interpretability.
So to get into a common ground between the computational work
of neuroscience and the machine learning systems,
I would like to start exploring where
do we have continuous dynamics.
So let's start with these processes that has been
recently brought up-- continuous time,
or continuous steps models--
in the machine learning community.
So a continuous time neural network
is basically when a neural network f that
has certain number of layers, has certain width,
it has activation function of choice.
And it is a function of its hidden state, its inputs.
And it's parameterized by parameters theta.
So if a neural network f parameterizes the derivatives
of the hidden state, then you would have a continuous time
process.
Now, it's going to be a continuous time neural network.
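The definition just described can be written compactly. The symbols below follow the talk (hidden state x, inputs I, parameters theta); the explicit time argument t is an assumption added for generality, since the slide itself is not part of the transcript.

```latex
\[
\frac{d\mathbf{x}(t)}{dt} = f\big(\mathbf{x}(t), \mathbf{I}(t), t, \theta\big)
\]
```

Here f is the neural network, so the network outputs the rate of change of the hidden state rather than the next hidden state.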
With this representation, you can
go from a discrete computational graph,
like in residual networks that we have.
Like, you would actually take a computation step each layer.
Now, if you define your system like the way we show it here,
the depth dimension of your system becomes continuous.
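A minimal sketch of the link between residual layers and continuous depth, assuming a toy one-unit layer f with hypothetical parameters (w, b) that are not from the talk. A residual update x + f(x) is one explicit Euler step of size 1, and shrinking the step size moves toward the continuous-depth flow.

```python
import math

def f(x, theta):
    """Toy one-unit 'layer': tanh(w * x + b). theta = (w, b) is hypothetical."""
    w, b = theta
    return math.tanh(w * x + b)

def residual_net(x, theta, num_layers):
    """Discrete depth: one update x <- x + f(x) per layer."""
    for _ in range(num_layers):
        x = x + f(x, theta)
    return x

def euler_flow(x, theta, t_end, num_steps):
    """Continuous depth: integrate dx/dt = f(x) from 0 to t_end with explicit Euler."""
    h = t_end / num_steps
    for _ in range(num_steps):
        x = x + h * f(x, theta)
    return x

theta = (-1.5, 0.3)
x0 = 0.8
res = residual_net(x0, theta, num_layers=4)
# Step size 1 reproduces the residual network exactly.
same = euler_flow(x0, theta, t_end=4.0, num_steps=4)
# Much finer steps approximate the underlying continuous-depth flow.
fine = euler_flow(x0, theta, t_end=4.0, num_steps=4000)
```

With four layers and four Euler steps of size 1 the two functions perform the same arithmetic; the fine-step result is what the depth-continuous model would compute over the same interval.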
And when you have a continuous-time system,
then you would have a lot of advantages.
First of all, the space of possible functions
that you could actually explore and generate
is much more than that of the discrete representations.
Second advantage is the arbitrary computation.
So you don't need to perform computation at every time step.
You can have arbitrary step time computation.
So your depth becomes very variable, basically.
So it can be infinite-depth kind
of networks with one process.
And this would naturally, this continuous process,
would be a natural fit for modeling sequential behavior.
So let's say, compared to the normal recurrent neural
networks that you know, the updated state
of a neural network is actually given with this discretization.
If you have a neural ODE and, basically,
a more stable version of that where it has a damping factor,
then you can use this also as a recurrent neural network.
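The three update rules being compared here can be laid side by side. This is a sketch in the notation used earlier in the talk; the time-constant symbol tau for the damping factor is an assumption, following the usual continuous-time RNN convention.

```latex
\begin{align*}
\text{Discrete RNN:} \quad & \mathbf{x}_{t+1} = f(\mathbf{x}_t, \mathbf{I}_t, \theta) \\
\text{Neural ODE:} \quad & \frac{d\mathbf{x}(t)}{dt} = f\big(\mathbf{x}(t), \mathbf{I}(t), t, \theta\big) \\
\text{Continuous-time RNN:} \quad & \frac{d\mathbf{x}(t)}{dt} = -\frac{\mathbf{x}(t)}{\tau} + f\big(\mathbf{x}(t), \mathbf{I}(t), t, \theta\big)
\end{align*}
```

The extra term in the last line is the damping factor: without any drive from f, the state decays back toward zero with time constant tau.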
On the top row, you see the interpolation and extrapolation
capability of a recurrent neural network
on irregularly sampled data that are put around the spiral.
And we see that the red line in between is actually
extrapolation capability of this model,
where it cannot actually capture the dynamics very well.
But on the bottom row, you would actually
see that the dynamic process generated by a continuous time
recurrent neural network actually captures
those dynamics properly and even extrapolates to that.
So this is nice.
Now, how do we implement these things?
I'm just going through the details
of how to implement these type of models.
So you basically, you want to, actually,
because they are ODEs, you want to use numerical ODE solvers.
So you basically unroll this difference.
And then you can use any type of numerical ODE solver.
So let's say we use an explicit Euler solver.
And then, there, you can actually
create the forward pass of your network
based on this unrolled version of your network.
And then, the choice of ODE solver will actually
define the complexity of your map.
You can use more complex adaptive solvers
that have adaptive step sizes to have
a more accurate forward pass.
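The forward pass with an explicit Euler solver can be sketched in a few lines. This is an illustration only: the one-unit network f, its parameters, and the `substeps` argument are hypothetical, and a fixed-step loop stands in for the adaptive solvers mentioned above.

```python
import math

def f(x, u, theta):
    """Hypothetical one-unit network: tanh(w_x * x + w_u * u + b)."""
    w_x, w_u, b = theta
    return math.tanh(w_x * x + w_u * u + b)

def euler_forward(x0, inputs, theta, dt, substeps=1):
    """Unroll dx/dt = f(x, I, theta) with explicit Euler.

    inputs: one input value per observation, held constant between observations.
    dt: time between observations.
    substeps: Euler steps taken between consecutive observations.
    Returns the state after each observation (the forward pass).
    """
    h = dt / substeps
    x = x0
    trajectory = []
    for u in inputs:
        for _ in range(substeps):
            x = x + h * f(x, u, theta)
        trajectory.append(x)
    return trajectory

theta = (-1.0, 2.0, 0.0)
inputs = [math.sin(0.1 * k) for k in range(50)]
coarse = euler_forward(0.0, inputs, theta, dt=0.1, substeps=1)
fine = euler_forward(0.0, inputs, theta, dt=0.1, substeps=20)
# Largest disagreement between the cheap and the more accurate unrolling.
gap = max(abs(a - b) for a, b in zip(coarse, fine))
```

More substeps cost more function evaluations per observation and give a more accurate trajectory, which is the solver trade-off the talk refers to.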
How do you now do the backward pass?
You can use a mathematically known adjoint sensitivity
method, where, let's say you have a loss function,
and your dynamic is given by a neural ODE.
So your loss function, basically,
if you have the dynamic of your system starting from t0, given
by this time, and you have labeled data,
you can compute the output dynamic to compute a loss.
And this loss is getting computed
by running this ODE solver which basically give you
this trajectory.
And then, the adjoint method actually
creates a new state, an auxiliary differential
equation, that connects the dynamics of the loss in respect
to the state of the system.
And then you can run this ODE backward one step
at a time to get the gradients of the loss in respect
to the state of the system.
And at the same time, you would be
able to also get the gradient of the loss
in respect to the parameters of the system.
So this adjoint sensitivity method on the backward pass
would give you a constant memory propagation.
Because it actually forgets the previous states
and it just does one step at a time computation
when it does backpropagation.
You can also train this network with backpropagation
through time, gradient-based.
And what you do, you perform one forward pass, and then you
compute the derivatives of your--
based on the chain rule, you can actually
compute your derivatives.
And you can update your parameters.
This way, you are actually not treating the solver
in a black box manner.
So you are actually going through the solver.
So the dynamics of the solver becomes part of your gradient,
as well.
So you need to be careful about that.
But at the same time, the memory complexity of this method
is really high.
But it is much more accurate than the adjoint method
if you use it in a vanilla sense.
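For reference, the adjoint sensitivity method described above is usually written as follows. This is the standard formulation from the neural ODE literature, not a transcription of the speaker's slide; the symbols a(t), t_0, and t_1 are notation assumed here.

```latex
\begin{align*}
\mathbf{a}(t) &= \frac{\partial L}{\partial \mathbf{x}(t)}
  && \text{(adjoint state)} \\
\frac{d\mathbf{a}(t)}{dt} &= -\mathbf{a}(t)^{\top}
  \frac{\partial f\big(\mathbf{x}(t), \mathbf{I}(t), t, \theta\big)}{\partial \mathbf{x}}
  && \text{(auxiliary ODE, solved backward from } t_1 \text{ to } t_0\text{)} \\
\frac{dL}{d\theta} &= -\int_{t_1}^{t_0} \mathbf{a}(t)^{\top}
  \frac{\partial f\big(\mathbf{x}(t), \mathbf{I}(t), t, \theta\big)}{\partial \theta}\, dt
  && \text{(parameter gradients)}
\end{align*}
```

Because the state is recomputed backward in time instead of being stored, memory stays constant in the number of solver steps. Backpropagation through time instead stores every intermediate solver state, so its memory grows with the number of steps, but its gradients are exact for the discretized forward pass.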
So I told you how these models are getting implemented forward
and backward.
Now, we have this neural ODE.
So we said the continuous-time processes,
and this representation actually can have a spatiotemporal kind
of data processing powers.
And it actually has a really good potential.
But we didn't define any biological process there.
We didn't actually get any inspiration
from the biological insights that I talked before.
And a really funny fact is that when you deploy them
in real world, they're even worse
than a simple long short-term memory network, right?
So basically, what's the point, right?
If you define really fancy equations that cannot even work
in real-world applications very well,
then what are we even doing?
So let's improve.
Now, by this improvement, what we want to do,
we want to get into biology.
I told you that activity of neurons
are described by differential equations.
And you can actually model the dynamics of a cell
or of a membrane as a leaky integrator
and with these simple linear dynamics.
And the more important part is the conductance-based synapse
model, where you can have a nonlinearity included
in the synapse of the system and not
in the neurons of the system.
So basically, the interaction between two nodes
or two differential equations is given by a nonlinearity.
And this is what is inspired by channel modeling
behavior of Hodgkin and Huxley when they did channel
modeling of ion channels.
So you can actually get into this kind of a steady state
behavior from those differential equations of Hodgkin-Huxley.
You can reduce them into this abstract form.
And if you want to bring it, the nonlinearities
look like a sigmoid activation function.
So you actually can, in principle,
bring neural networks, inside artificial neural networks,
in the representation of a synapse.
Now, putting these two systems--
very simple things, has been there for over a century--
together, you will get a dynamical system of such.
And this dynamical system has certain properties
and certain advantages.
It's obviously a neural ODE.
It's an ODE-based neural network.
It has a component neural network
f and nonlinearity that appears in the coefficient of x
of t, or a state of your system, and in the state of the system
itself.
So there is a coupling between the state
and the time constant of your differential equation.
So at the same time that f for that linear--
let's say I don't have recurrent connections.
So x of t in that f is 0.
Then f becomes only a function of I,
or the inputs of the system.
Then the whole system becomes a linear system.
Now, if you have that linear system,
the coefficient of x of t is input-dependent.
So if the inputs of the system is changing,
then the kind of behavior of the differential equations changes.
Because that defines the damping factor
of your very simple neural network that you have
and very simple dynamical system that you have.
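The dynamical system being described is shown on a slide that the transcript does not include. The form below is the liquid time-constant (LTC) equation as published by the speaker and co-authors, given here as the presumed content of that slide. The symbol tau_sys for the effective time constant is notation assumed for this note.

```latex
\begin{align*}
\frac{d\mathbf{x}(t)}{dt} &=
  -\left[\frac{1}{\tau} + f\big(\mathbf{x}(t), \mathbf{I}(t), t, \theta\big)\right] \mathbf{x}(t)
  + f\big(\mathbf{x}(t), \mathbf{I}(t), t, \theta\big)\, A \\[4pt]
\tau_{\text{sys}} &= \frac{\tau}{1 + \tau\, f\big(\mathbf{x}(t), \mathbf{I}(t), t, \theta\big)}
\end{align*}
```

The network f appears twice: inside the coefficient of x(t), where it changes the time constant, and in the drive term, where it scales A. Without recurrent connections f depends only on I(t), so the equation is linear in x(t) with an input-dependent decay rate.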
So just to show you a block diagram, like how
does it look like, in a standard neural network,
the range of possible connections
that you might have is basically you can have--
let's say you have two neurons.
They have activation function.
You might be able to have reciprocal connections.
You might have feedback.
You might have an external input to the system,
and they have their own scalar weights.
Now, in a liquid network, you would
have the same kind of a structure
but, at the same time, you have a nonlinearity
that controls the interaction of two differential equations.
So the difference here is that activations are changed
to differential equations.
And their interactions are given by a nonlinearity
that can be a neural network.
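The block diagram can be turned into a small simulation. This is a sketch under stated assumptions: two neurons with reciprocal connections and one external input, each synapse modeled as a sigmoid gate with hypothetical parameters (w, b, A), integrated with explicit Euler rather than any solver used in the actual work.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ltc_step(x, u, params, h):
    """One explicit Euler step for two mutually connected liquid neurons.

    x: [x0, x1] states; u: external input to neuron 0.
    Each synapse contributes a gate g = sigmoid(w * presynaptic + b) that
    enters both the decay rate of its target and the drive toward its A.
    """
    tau, syn = params["tau"], params["syn"]
    # For each synapse: (index of target neuron, presynaptic signal)
    sources = {"in->0": (0, u), "1->0": (0, x[1]), "0->1": (1, x[0])}
    rate = [1.0 / tau[0], 1.0 / tau[1]]  # coefficient of x_i
    drive = [0.0, 0.0]                   # sum of g * A per neuron
    for name, (target, pre) in sources.items():
        w, b, A = syn[name]
        g = sigmoid(w * pre + b)
        rate[target] += g
        drive[target] += g * A
    return [x[i] + h * (-rate[i] * x[i] + drive[i]) for i in range(2)]

# Hypothetical parameters: per-neuron tau and per-synapse (w, b, A).
params = {
    "tau": [1.0, 2.0],
    "syn": {"in->0": (2.0, 0.0, 1.0), "1->0": (4.0, -1.0, -0.5), "0->1": (3.0, 0.0, 0.8)},
}
x = [0.0, 0.0]
history = []
for k in range(2000):
    u = math.sin(0.01 * k)
    x = ltc_step(x, u, params, h=0.01)
    history.append(x)
```

In contrast to a standard network, where the connection would be a scalar weight applied to an activation, each connection here modulates how fast the target neuron's differential equation relaxes as well as where it relaxes to.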
So in terms of what does it represent,
let's say I trained a neural network for driving,
for autonomous driving, from visual data.
I'm showing the visual data in the middle.
I did that with a standard neural network that
has a constant time constant.
And I did that with a liquid network.
What we are seeing on the x-axis is 1 over tau.
That means 1 over the time constant of the system.
And on the y-axis, what we see is the steering angle
of the car.
And the color shows blue for turning left and yellow
for turning right.
And in the middle, you have the middle part.
So now, we see that a neuron actually
learned to associate its behavior, its timing behavior--
without any prior, just by plugging those very simple building
blocks together--
actually learned to associate the dynamics
of the task to its behavior.
So that's one of the advantages that you receive
from these types of networks.
Another property of these networks
is that the state of these systems is stable.
And their time constant and their behavior are stable.
So if you define the time constant
of the system as that expression that
is the coefficient of x of t, or the hidden states,
then you can actually write that down
by relaxing it to not have a recurrent connection.
Let's say x of t is left out of f.
Then you would be able to bound the time constant of the system.
And these are actually the bounds that you can have.
So the network cannot go unstable.
You can also bound the state of the system.
Let's say a neuron is receiving many synaptic connections.
A, in this representation, is a synaptic parameter,
and it is synapse-specific.
So each synapse has a bias, or has
an A, that actually has a connection to this neuron.
And now, basically, you can say the maximum of the A parameter
would be the maximum amount that your state can actually reach.
And the minimum of that, the one that has the least one,
actually has the least amount of impact on your activity
of your differential equation.
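A small numerical check of the two bounds described above, under stated assumptions: one neuron receiving several synapses, each with its own `A` and a sigmoid activation in (0, 1), written as dx/dt = -x/tau + sum_j f_j (A_j - x). The synapse values, the random input range, and the unit synaptic weights are all hypothetical. Under these assumptions the state should stay between min(0, min A) and max(0, max A), and the effective time constant between tau / (1 + tau * M) and tau for M synapses.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def simulate_neuron(A, tau=1.0, dt=0.01, steps=5000, seed=0):
    """Euler-integrate dx/dt = -x/tau + sum_j f_j * (A_j - x) under random inputs.

    Returns (min state, max state, min effective tau, max effective tau).
    """
    rng = random.Random(seed)
    x = 0.0
    xs, taus = [], []
    for _ in range(steps):
        # Each synapse j gets an activation f_j in (0, 1) from a random input.
        fs = [sigmoid(rng.uniform(-5.0, 5.0)) for _ in A]
        dxdt = -x / tau + sum(fj * (aj - x) for fj, aj in zip(fs, A))
        x += dt * dxdt
        xs.append(x)
        # The coefficient of x is (1/tau + sum_j f_j); its inverse is the time constant.
        taus.append(1.0 / (1.0 / tau + sum(fs)))
    return min(xs), max(xs), min(taus), max(taus)

if __name__ == "__main__":
    A = [-0.5, 0.8, 2.0]  # hypothetical per-synapse parameters
    lo, hi, tau_lo, tau_hi = simulate_neuron(A)
    print(f"state in [{lo:.3f}, {hi:.3f}], bound [{min(0.0, min(A))}, {max(0.0, max(A))}]")
    print(f"tau_sys in [{tau_lo:.3f}, {tau_hi:.3f}]")
```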
We can also show that this biologically inspired system
is actually a universal approximator.
You can actually use function approximation
methods to prove that
this expression can approximate any given
dynamics with arbitrary precision, given
n number of their cells.
But to truly, actually find out how expressive
is a neural network from the theoretical standpoint,
we want to get down to a more fine-tuned expression.
So for example, there are more measures
of expressivity of neural networks
that we can use for measuring expressivity of a network--
for example, the trajectory lengths.
Imagine I have a circular trajectory,
and I input this circular trajectory
to a deep neural network.
I'm just defining what is this trajectory length measure.
You input this to a neural network.
This neural network is parameterized.
And then we can observe that, at every layer of the network,
this trajectory gets deformed, gets
more complex, and the length of the trajectory gets
longer and longer.
And it actually increases exponentially.
You can measure that length of this trajectory
with an arc length measure.
And you can actually find the lower bound
for the expressivity of the neural network.
Given its depth, you can actually
measure the expressivity of a neural network
by its parameterization, properties
of its synaptic parameterization,
the width of the network, and the depth of the network,
basically.
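The trajectory length measure can be sketched in a few lines. This is only an illustration of the idea, not the analysis from the talk: a circle is pushed through randomly initialized tanh layers and its arc length is measured after each one. The layer sizes, the Gaussian initialization, and the `weight_scale` value are assumptions.

```python
import math
import random

def circle(n_points=200, radius=1.0):
    """Points on a closed circle (the first point is repeated at the end)."""
    return [(radius * math.cos(2 * math.pi * k / n_points),
             radius * math.sin(2 * math.pi * k / n_points))
            for k in range(n_points + 1)]

def arc_length(points):
    """Sum of Euclidean distances between consecutive points."""
    return sum(math.dist(p, q) for p, q in zip(points, points[1:]))

def random_layer(in_dim, out_dim, weight_scale, rng):
    """Hypothetical dense layer: Gaussian weights with std = weight_scale / sqrt(in_dim)."""
    std = weight_scale / math.sqrt(in_dim)
    return [[rng.gauss(0.0, std) for _ in range(in_dim)] for _ in range(out_dim)]

def apply_layer(weights, points):
    """Compute tanh(W p) for every point p on the trajectory."""
    return [tuple(math.tanh(sum(w * v for w, v in zip(row, p))) for row in weights)
            for p in points]

def lengths_per_layer(depth=4, width=32, weight_scale=4.0, seed=0):
    """Arc length of the input circle, then of its image after each layer."""
    rng = random.Random(seed)
    points = circle()
    lengths = [arc_length(points)]
    in_dim = 2
    for _ in range(depth):
        points = apply_layer(random_layer(in_dim, width, weight_scale, rng), points)
        lengths.append(arc_length(points))
        in_dim = width
    return lengths

if __name__ == "__main__":
    for layer, length in enumerate(lengths_per_layer()):
        print(f"layer {layer}: trajectory length {length:.2f}")
```

How fast the length grows depends on the weight scale and width chosen here, so the printed numbers are only indicative.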
So we actually did use this expressivity measure.
Because this actually draws a boundary
between shallow networks and deep networks.
The deeper you get, the more expressive
you can get based on this measure.
Now, in our space, we have continuous-time processes,
let's say, liquid time constant networks, or LTCs.
We have continuous-time recurrent neural networks, or CTRNNs.
And we have neural ODE representations.
Now, if we give the same neural networks--
we parameterize this neural network
f for all of these processes, given their representation
of differential equation--
we see that we consistently get longer and more complex
trajectories out of the LTC network.
Now, we systematically analyzed this in an empirical fashion,
where we changed, basically--
like, on the x-axis, you see different types of ODE solvers
for these three types of networks.
Neural ODEs, CTRNNs, and LTCs.
And we see that the yellow line actually
shows the trajectory lengths.
For these LTC networks, we see that, even
if you change the width of a network, on the x-axis,
you see that the trajectory length is always higher.
And we can see that if the initialization of your network
is actually changing, you also have a dependency on that.
Now, we also figured out, theoretically,
a lower bound for the expressivity of, basically,
these types of networks, where the lower bound
is a function of weight scale, bias scale, width
of the network, depth of the network,
and number of discretization steps
that you're taking for your ODE.
And we also implemented that for LTCs.
You cannot compare lower bounds to say that, yeah,
so this network is more expressive than the other one.
But it's just a good measure to just see
where are we standing in terms of this type of behavior.
Now that we have this type of measure and theoretically
evaluated them, let's really put these networks in action,
and let's see how good they are in representation learning.
So one of the things we started with
is modeling physical dynamics.
When I told you that a neural ODE cannot beat an LSTM
network, you see that here.
And you see that we can actually get better performances
while using these networks.
You can compare them across a large series of advanced RNNs.
And this [INAUDIBLE] inspired network is actually
beating them, even on person activity, a real example
with irregularly sampled data.
We also performed some analysis on some real-world examples.
And we saw that, on most of these tasks, LTCs are better.
On one task, the LSTM is better,
and that's the task where we have longer-term dependencies.
And that's one of the issues that you
have to solve: gradient propagation
in continuous-time processes is problematic.
So you always have to take care.
If you actually wrap them inside a kind of well-behaved gradient
propagation, then you would be also getting
a better performance there.
We didn't stop there.
And we actually scaled the applications
to this end-to-end autonomous driving that, at the beginning,
I showed you.
We have human-collected data.
And we trained deep learning models.
Typically, a deep learning pipeline actually
looks like that when you want to have
a set of convolutional heads.
And then you would have fully connected networks.
Basically, the over-parameterized part
of the network is actually there, in the hidden layers.
It takes between 5 and 100 million parameters
to actually perform lane-keeping,
or this type of task, if you have this type of network.
What we did, we said that let's replace
the fully connected networks by continuous-time processes,
and let's see what kind of behavior we get.
So we get four types of variants.
We take a neural circuit policy, which is the first one, NCP.
That has a four-layer architecture-- again,
nature-inspired-- that has interneurons, command neurons,
and motor neurons, all LTC-based neurons
based on the math I showed you before.
You can replace those fully connected layers
with LSTMs and CTRNNs, and you have
the convolutional neural network.
So I'm going to talk about differences of these four
variants.
So the first thing: the number of parameters
that is required to actually perform autonomous driving
is basically significantly reduced
when you're using these types of networks.
Now, remember the representation of the network where
I was showing that the kind of representation
a fully connected convolutional network learns
can get perturbed.
And now, with LTCs, we would be able to have
19 neurons at control.
And then we perform and see that the convolutional part of it--
so what I'm showing in the attention map,
we are not changing the convolutional neural network
structure of these variants, of these network
variants that I showed you.
We see that this architecture imposes an inductive bias
on the convolutional networks that let
them learn a causal structure.
Now, if you add, even, noise, we see that the explanations
are not scattered as much as it was
for convolutional neural networks.
We also took a real-world measure of this,
like how many crashes would you have if you increase
the amount of input noise?
And you will see that these kinds of networks
are basically much more robust to this type of perturbation.
And now let's look at the convolutional neural network
attention of these end-to-end trained networks
when their heads are different-- when
they had a CTRNN, when they had an LSTM,
and when they had our LTC-based model.
And we see that the kind of prior
that the recurrent neural network had
put on convolutional neural networks
makes them learn different types of weights.
So the representations that are learned out of this system
are completely different from each other.
And we see that the only one that has a consistent behavior
is the CNN itself in our solution.
But CNN actually focuses consistently
on the outside of the road, so we don't want that.
LSTM is actually giving you a good-- most of the time--
a good representation.
But it is actually sensitive to lighting conditions.
So if I stop the video in some parts,
you will see that when the shading areas are not good,
the attention of the LSTM actually gets scattered.
And the CTRNN, or the neural ODEs,
basically cannot actually gain a nice representation in this
task.
Now, why is this the case?
Now, let's explore the why of this.
So if you look at the taxonomy of possible modeling
frameworks, at the bottom at one end of this--
I don't want to call it the bottom--
at one end of the spectrum, we have the statistical models
where statistical models are amazing in learning from data
and, at the same time, basically performing inference in IID,
so predicting in IID.
So this is actually what the statistical models can do.
On the other side of the spectrum,
we have physical models.
So physical models are basically described, usually,
by differential equations.
When you have differential equations that
describe the dynamics of your system,
they can actually answer questions.
They can account for interventions in the system.
So if you can actually design a universal approximator that
is closer to the physical kind of models,
then you would actually get into a more causal structure
by nature.
And also, you're able to actually get
insights about the system.
You can learn from data.
You can answer counterfactual questions and predict both IID
and out of distribution.
So as I said, physical dynamics can be modeled by ODEs.
And this set of ODEs can actually
predict the future evolution of your system.
They can describe the results of interventions in the system.
And the coupled-time evolution helps
us define averaging mechanisms for capturing
the statistical dependencies in data.
And it enhances our understanding
of the physical phenomena.
And because of that, they are actually causal structures.
So now, let me get more formal about this.
Let's say we have a differential equation given
by dx over dt equal to g.
And g of x is basically a nonlinearity of the system.
So we have the Picard-Lindelöf theorem
that actually shows that this kind of differential equation
would have a unique solution if the nonlinearity is Lipschitz.
Now, if you unroll this system with Euler,
then the representation, the underlying representation
under this uniqueness condition, would be a causal mapping.
Why?
Because you can actually say what
happens in future events, which is x at t plus dt, based
on the previous events.
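The Euler unrolling described here can be sketched as follows. The right-hand side `g(x) = -x` is a hypothetical stand-in chosen because it is Lipschitz and has a known exact solution; the step size is also an assumption.

```python
import math

def g(x):
    """Hypothetical Lipschitz right-hand side: dx/dt = -x."""
    return -x

def euler_unroll(x0, dt, steps):
    """Unroll x(t + dt) = x(t) + dt * g(x(t)); each state depends only on the previous one."""
    xs = [x0]
    for _ in range(steps):
        xs.append(xs[-1] + dt * g(xs[-1]))
    return xs

if __name__ == "__main__":
    xs = euler_unroll(x0=1.0, dt=0.001, steps=1000)
    # Euler estimate at t = 1 next to the exact solution exp(-1).
    print(xs[-1], math.exp(-1.0))
```

Every entry of the unrolled list is computed from the entry before it and nothing else, which is the sense in which the mapping is causal.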
Now, there is a framework within this spectrum of causal models.
It's called dynamic causal model.
So a dynamic causal model has the nonlinearity
of the shape that you're seeing.
It does take a bilinear approximation,
or a second-order Taylor approximation, of that ODE.
And it gives you these coefficients for the system.
So coefficient A controls the internal coupling
of the system. Coefficient B controls
the coupling sensitivity among network nodes.
So it actually accounts for internal interactions
and interventions.
And coefficient C regulates the external inputs.
This framework is actually a graphical model
that is implemented by ODEs.
So you can put these things together
to actually create this system.
They allow for feedback, as opposed
to the kind of Bayesian network architectures
that you might otherwise use.
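For reference, the bilinear form usually written for a dynamic causal model looks like the following. The symbol I for the external inputs follows the talk's earlier usage, and the index j over input channels is an assumption about how the slide is laid out.

```latex
\frac{dx}{dt} = \Big( A + \sum_{j} I_j \, B^{(j)} \Big)\, x + C\, I
```

Here A couples the states internally, each B^{(j)} changes that coupling in proportion to input I_j, and C feeds the external inputs into the states directly.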
Now, if we look at the liquid neural networks,
or the representation that we gain from that representation,
under two conditions, that f is a C1 mapping--
that means f is Lipschitz-continuous,
basically, and is bounded--
I didn't write the bounded, no? No,
I didn't write that, so it has to be, also, bounded--
and tau is positive.
And if you have a strictly positive tau,
then this network would also have a unique solution.
Now, let's say I assume that this f, the nonlinearity,
is given by a hyperbolic tangent.
It has recurrent connections.
And it has weights like an input mapping.
And then, with this nonlinearity,
I would be able to compute the coefficients.
If you look at the coefficients for causal models,
we can compute the coefficients of this causal behavior.
So that means there are certain parameters of the system that
are responsible for a certain type of intervention
in the system-- internal intervention
and external intervention in the system.
Just from the diagram perspective--
going back to our diagram--
we will actually have a dynamic causal model
that can have the parameter B that
controls the amount of collaboration
of two nodes with each other, or interactions of two nodes,
and coefficient C that controls the inputs, or external inputs,
to the system.
You would have the same type of behavior--
it's a nonlinear version of that dynamic causal model--
that actually performs the same thing.
And they have more sophisticated causal structures.
Now, with that, we did some experiments.
They are behavioral cloning kind of experiments
where we have drone agents that are moving in the environment.
And they are given--
visually, there is actually a target in the environment.
And we ask the drones-- so actually, we
drive the drones towards that target.
And with this visual demonstration,
what we want to do, we want to learn this behavior and gain
agents that are good in closed loop when they're interacting
with the environment.
We see that this is actually a learned behavior
of this system, where as soon as the target becomes apparent,
then we see that this neural network actually learned
to focus on that target.
Because that's the kind of important matter
in this kind of task process.
So basically, the causal structure of the task
is learned by these drone agents.
Now, if you compare the kind of focus, or attention,
of these networks to other neural networks,
we see that the only representation where
we actually see this type of process is
the liquid network-based solution, whereas
this attention is not persistent in the other ones.
So we cannot say that the other systems actually learned
to navigate towards the target and understood what they were
doing.
We also did that in multi-agent.
Right now, you're a follower drone.
And there is a leader drone in front of it.
And the target is basically to follow this drone.
And in this type of environment, also, we
observe that the attention of the network
is, actually, always on the second drone, basically.
So that means the causal structure is actually captured.
Now, how can you show this even more quantitatively?
Then we looked into closed-loop interaction.
We trained these networks in open loop
and from training data.
Now, we deploy them, actually, in that environment.
And we measure the amount of success rate
that they can have in different type of tasks in closed loop.
So if they do not have the true causal structure
of the task, they wouldn't be able to perform this task very
well.
And we did that across a spectrum of different kinds of perturbations
on the system.
We see that these systems are
able to perform much better than the other ones.
Of course, there is always room for improvement, even
for these systems.
Because we didn't add any kind of constraint
on helping these systems to learn more and more.
So we were just trying to see what's
the gap between these types of networks and the others.
So obviously, these types of networks
come with certain limitations.
So the complexity of the networks
is basically tied to the complexity of their ODE solver.
So as a result, you might have longer training times
and longer test times if you use these networks.
You can have a solution for that.
You can use the fixed-step ODE solvers.
You can use the sparse flows.
You can use a sparsity--
and the process that optimizes sparse neural networks-- on,
let's say, CPUs or any kind of hardware
that you're running or GPUs.
And then you can use hypersolvers.
And these are the class of solvers where they can actually
integrate everything together, and they can actually
run much faster when you have differential equations.
You can also use closed-form variants
in these kinds of scenarios.
So you can use the closed form--
if you solve these differential equations in closed form,
then you can end up with a nicer representation.
And that's one of the things that we did
and we're very excited about.
So there's another limitation of these ODE-based networks.
They might also exhibit the vanishing gradient problem.
Because they're continuous systems, and their memory
is given by an exponential decay.
So then, you would have trouble learning long-term dependencies.
So the solution is that you wrap it inside a well-behaved kind
of process-- for example, a gating mechanism that you can
actually put these networks together--
for example, if you have the state of an LSTM network
defined by an LTC network.
So if you do that, then you would have a gating mechanism,
and you have gradient propagation
that preserves the gradients.
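To see why a memory given by exponential decay leads to vanishing gradients, which is the problem the gating wrapper is meant to work around, here is a minimal sketch. It uses the simplest leaky state, dx/dt = -x / tau, as a stand-in for the continuous-time cell; `tau`, the step size, and the horizons are assumptions.

```python
import math

def state_sensitivity(horizon, tau=1.0):
    """For dx/dt = -x / tau, x(T) = x(0) * exp(-T / tau), so dx(T)/dx(0) = exp(-T / tau)."""
    return math.exp(-horizon / tau)

def finite_difference_sensitivity(horizon, tau=1.0, dt=0.001, eps=1e-6):
    """Check the same quantity numerically by Euler-integrating two nearby initial states."""
    def run(x):
        for _ in range(int(round(horizon / dt))):
            x += dt * (-x / tau)
        return x
    return (run(1.0 + eps) - run(1.0)) / eps

if __name__ == "__main__":
    for horizon in (1.0, 5.0, 10.0, 20.0):
        print(f"T={horizon:5.1f}  d x(T) / d x(0) = {state_sensitivity(horizon):.2e}")
```

The sensitivity of the final state to the initial state shrinks exponentially with the horizon, so a signal from far in the past contributes almost nothing to the gradient.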
Now, in summary,
I showed you that you can acquire knowledge
by these flexible neural models that can
perform inference model-free.
They can really capture the temporal aspects of the task
that is at hand better than--
the tasks that require temporal kind of data processing,
they can actually infer the--
and these are all thanks to their causal structure.
And they would be able to perform credit assignment
better than the other models that are out there.
So you might use them for generative modeling.
And if you want to model the world,
you basically can use these representations
or also get representation of your world
in order to do further inference from those kind of models.
So there are certain properties that I mentioned--
the compositionality of, layer-wise, these networks,
you can actually put them in different architectures.
And you can connect them in a sparse fashion.
And the network is actually differentiable.
And you can use this.
And if you're dealing with visual data or video data,
you would be adding CNN heads or perception modules.
And then this can act as your decision-making engine.
They're expressive, they're causal,
and they add more to the interpretability
of the networks.
So some of the perspectives that we have is that there is--
I just put two different hundred-year-old models
together, and these are all the kinds of properties that emerge
from those kind of things.
And you can see how much potential is actually
in this type of research that you can put,
and you can really explore what's going on in the brain.
And why do you need to do that?
Because, basically, the search space
is huge if you just want to algorithmically implement
something intelligent, right?
So you would narrow down if you actually focus on brains
and how they acquire knowledge.
And definitely, because we have these machine learning tools
these days, you would be able to actually do much more
than it was possible before.
We can also work with the objective functions.
In this talk, in this research that I showed,
we just focused on the model and the properties of the model
in a structured fashion.
So you can also work with the objective function
of your learning problem.
You can also, for learning processes,
you can use physics-informed kind
of learning processes in order to perform
this type of learning.
You can do causal entropic forces, for example.
This is like defining intelligence
as a force that maximizes the future freedom of action.
So that would be a new way of formulating intelligence.
And then, from there, you would be able to actually get
into much more.
So this is actually an exciting area of research
that could be enabled and scaled by what we showed today.
And as I said, one of the properties that we showed today
is that there are certain structures that can emerge
from these liquid networks.
And those structures are good.
So you would be able to use these for more complex tasks.
So these are good candidates--
this could be giving you some candidates
for performing decision-making, better decision-making, based
on these selective computations.
With that, I would like to thank you for your attention.
And all this technology is open source.
You can actually get them online.