Miles Cranmer - The Next Great Scientific Theory is Hiding Inside a Neural Network (April 3, 2024)
Summary
TLDR: The talk presents an emerging paradigm in scientific discovery: interpreting neural networks to extract physical insight. It highlights how neural networks can learn accurate models from limited data, with examples from fluid turbulence and instability prediction in planetary systems. The speaker emphasizes translating these models into interpretable mathematical language using symbolic regression. He also introduces polymathic AI, which involves building large, flexible neural networks trained on diverse data to serve as foundation models for many scientific tasks, promoting a new approach to building theories in the physical sciences.
Takeaways
- The concept of interpreting neural networks for physical insights represents a new paradigm in scientific exploration.
- Success in using neural networks for scientific insights includes predicting instability in planetary systems and modeling fluid turbulence with high accuracy.
- Traditional scientific methods involve building theories from low-dimensional data, while modern AI-driven approaches use high-dimensional data and flexible functions.
- The speaker's motivation is to understand how neural networks achieve accuracy and to use these insights to advance scientific understanding.
- The potential of machine learning in science is highlighted by the ability of neural networks to learn from data and find patterns not previously recognized.
- Symbolic regression is a technique used to interpret neural networks by finding analytic expressions that optimize to fit data sets.
- The use of genetic algorithms in symbolic regression is akin to evolving equations to fit data, providing a bridge between machine learning models and mathematical language.
- Foundation models, like GPT for language, are proposed for science as a way to train on diverse data and then specialize for specific tasks, improving performance.
- The concept of 'polymathic AI' is introduced as a foundation model for science that can incorporate data across disciplines and be fine-tuned for particular problems.
- The importance of simplicity in scientific models is questioned, with the suggestion that what is considered simple may be based on familiarity and utility rather than inherent simplicity.
Q & A
What is the main motivation behind interpreting neural networks for physical insights?
-The main motivation is to extract valuable scientific insights from neural networks, which can potentially advance our understanding of various phenomena and contribute to the development of new theories in the physical sciences.
How does the traditional approach to science differ from the new paradigm of using neural networks?
-The traditional approach involves building theories based on low-dimensional data sets or summary statistics, whereas the new paradigm uses massive neural networks to find patterns and insights in large, complex data sets, and then builds theories around what the neural networks have learned.
Can you explain the concept of symbolic regression in the context of interpreting neural networks?
-Symbolic regression is a machine learning task that aims to find analytic expressions that optimize some objective by searching over all possible expression trees. It is used to build surrogate models of neural networks, translating the model into a mathematical language that is interpretable and familiar to scientists.
What is the significance of the universal approximation theorem in relation to neural networks?
-The universal approximation theorem states that a shallow neural network with a single layer of activations can approximate any continuous 1D function to arbitrary accuracy, given enough hidden neurons. This highlights the power of neural networks in modeling complex relationships and functions in data.
How do foundation models like GPT differ from traditional machine learning models?
-Foundation models are trained on massive, diverse datasets and are flexible enough to serve as a basis for a wide range of tasks across different domains. They are first pre-trained on general data and then fine-tuned for specific tasks, whereas traditional models are often trained from scratch for a particular task.
What is the role of simplicity in the context of scientific discovery and interpretability?
-In the context of scientific discovery, simplicity often refers to the ability to describe complex phenomena with minimal assumptions or variables. It aids interpretability by providing clear, understandable explanations for observed data, which can lead to more effective models and theories.
How does the concept of pre-training neural networks relate to the development of polymathic AI?
-Pre-training neural networks on a broad range of data allows them to develop general priors for different types of problems, much like a well-rounded scientist. This approach is central to the development of polymathic AI, which aims to create models that can be fine-tuned for specific tasks across various scientific disciplines.
What are the potential challenges in training a foundation model for science, given the diversity of data types in different scientific fields?
-The main challenge lies in defining a general objective that can be applied to the diverse range of data types in science. The objective needs to be flexible enough to accommodate different data forms, such as sequences in molecular biology or images in astrophysics, while still enabling the model to learn broadly applicable concepts.
How does the concept of shared concepts across different physical systems relate to the training of foundation models?
-Shared concepts like causality and multiscale dynamics are common across various scientific disciplines. By training a foundation model on diverse datasets that encompass these shared concepts, the model can develop a general understanding of these principles, which can then be fine-tuned for specific tasks within particular fields.
What are the potential implications of polymathic AI for the future of scientific research?
-Polymathic AI has the potential to revolutionize scientific research by providing a generalizable foundation model that can quickly adapt to new tasks and problems. This could lead to faster discoveries, more efficient use of computational resources, and the development of new, broadly applicable scientific models.
Outlines
Introduction to Neural Network Interpretation
The speaker introduces the concept of interpreting neural networks to gain physical insights, which is seen as a new paradigm in science. The work involves collaboration with many individuals and focuses on extracting insights from neural networks, particularly in the context of fluid turbulence and planetary system instability. The speaker emphasizes the importance of understanding how neural networks achieve accuracy beyond traditional models and the potential for using these insights to advance scientific understanding.
Machine Learning Fundamentals and Activation Functions
The speaker delves into the fundamentals of machine learning, starting with linear regression and progressing to the concept of activation functions in neural networks. Activation functions introduce non-linearity, allowing the model to fit complex data. The speaker explains the construction of a shallow neural network and how it can be viewed as a piecewise linear model. The universal approximation theorem is mentioned, highlighting the ability of a neural network to approximate any 1D function to arbitrary accuracy.
Deepening Neural Networks and Function Composition
The speaker discusses the progression from shallow to deep neural networks, explaining how additional layers allow for the representation of functions in higher dimensional spaces. The concept of function composition is introduced, likening it to folding a piece of paper to create complex behaviors. The speaker emphasizes the efficiency of neural networks through shared computation and neurons, and how this relates to emulating physical processes and gaining interpretable insights through symbolic regression.
Symbolic Regression and Surrogate Models
The speaker explains the process of symbolic regression, a machine learning task aimed at finding analytic expressions that fit a dataset. This is done using a genetic algorithm-based search, which evolves equations to fit the data. The speaker discusses the use of symbolic regression to build surrogate models of neural networks, translating the model into a mathematical language. This technique is used to interpret and distill the behavior of neural networks into equations, providing a more interpretable approximation of the original model.
Polymathic AI and Foundation Models
The speaker introduces the concept of polymathic AI, focusing on building massive neural networks for science. Foundation models, which are trained on diverse data to learn general concepts, are discussed as a way to improve performance on downstream tasks. The speaker explains how these models can be fine-tuned for specific tasks, drawing parallels with language models. The potential for these models to discover new scientific insights and the importance of pre-training are highlighted, with examples of their application in physics.
Interpreting Neural Networks and Simplicity
The speaker discusses the interpretation of neural networks, particularly focusing on the concept of simplicity. The speaker argues that simplicity is based on familiarity and usefulness, and that broadly useful algorithms discovered by polymathic AI models may become familiar and simple over time. The speaker emphasizes the importance of understanding what is interpretable and how pre-training can provide a better starting point for neural networks in scientific applications.
Final Thoughts and Q&A
The speaker concludes with a discussion on the potential impact of polymathic AI on science and how it might change scientific teaching. The speaker also addresses questions about the scalability of training for foundation models, the potential for these models to discover new physics, and the challenges of symbolic regression with high-dimensional data. The speaker acknowledges that while pre-training has been beneficial in experiments so far, adversarial examples may exist that could hinder training in certain cases.
Keywords
Neural Networks
Physical Insight
Data-Driven Models
Symbolic Regression
Foundation Models
Interpretability
Polymathic AI
Subgrid Models
Planetary Instability
Genetic Algorithm
Highlights
The speaker discusses a new paradigm of interpreting neural networks to gain physical insights, marking a shift in the approach to science.
The motivation stems from examples like using neural networks to learn subgrid models for fluid turbulence with high accuracy, outperforming traditional models.
Another example is predicting instability in planetary systems, a centuries-old problem, where neural networks have shown better accuracy and generalization.
The traditional approach to science involves building theories from low-dimensional data sets, but the speaker suggests using neural networks as a tool to describe the data and its patterns.
Neural networks, especially those trained on massive amounts of data, can discover new things not present in existing theories.
The speaker introduces the concept of using neural networks as a compression tool to extract common patterns from a data set.
The potential of machine learning models to serve as surrogate models for complex physical processes is highlighted.
Symbolic regression is introduced as a method to interpret data-driven models by finding analytic expressions that optimize some objective.
The speaker discusses the use of genetic algorithms for symbolic regression, evolving equations to fit the data set.
Foundation models are presented as a shift in industrial machine learning, which are trained on massive, diverse data and then fine-tuned for specific tasks.
The idea of polymathic AI is introduced, aiming to build massive neural networks for science that incorporate data across many disciplines.
The potential of polymathic AI to discover broadly useful algorithms and new scientific models is emphasized.
The speaker discusses the importance of pre-training models, even on unrelated data, to provide a better starting point than random initialization.
The concept of simplicity is explored, suggesting that what is considered simple is often based on familiarity and usefulness in describing the world.
The speaker argues that polymathic models will be broadly useful and may drive the simplicity of newly discovered concepts or algorithms.
The potential impact of polymathic AI on how science is taught and done is discussed, suggesting a significant shift in the scientific process.
The speaker addresses concerns about whether neural networks are rediscovering known physics or discovering new insights, emphasizing the importance of verification.
The limitations of symbolic regression for high-dimensional problems are acknowledged, and the potential for incorporating more general algorithms is discussed.
The speaker concludes by expressing excitement about the direction of polymathic AI and its potential to revolutionize scientific discovery and understanding.
Transcripts
so uh I'm very excited today to talk to
you about uh this idea of kind of
interpreting neural networks to get uh
physical Insight which I view as as kind
of a new really kind of a new paradigm
of of doing science um so this is a this
is a work with huge number of people um
I can't individually mention them all
but um many of them are here at the
Flatiron Institute so I'm going to split this
up I'm going to do two parts the first
one I'm going to talk about kind of how
we go from a neural network to insights
how we actually get insights out of a
neural network the second part I'm going
to talk about this polymathic AI thing
um which is about basically building
massive uh neural networks for
science so
my motivation for this line of work is
uh examples like the
following so there was this paper led by
Kimberly Stachenfeld at DeepMind uh a
few a couple years ago on learning fast
subgrid models for fluid
turbulence um so what you see here is
the ground truth so this is kind of some
some box of a fluid uh the bottom row is
the the the Learned kind of subgrid
model essentially for this this
simulation um the really interesting
thing about this is that this
model was only trained on 16 simulations
but it it actually learned to be more
accurate than all traditional subgrid
models at that resolution um for fluid
dynamics so I think I think it's really
exciting kind of to figure out how did
the model do that and and kind of what
can we learn about science from this
from this uh neural
network uh another example is so this is
a work that uh I worked on with Dan Tamayo
and others on predicting instability in
planetary systems so this is a this is a
centuries old problem you have some you
know this this compact planetary system
and you want to figure out when does it
go unstable um there are literally I
mean people have literally worked on
this for
centuries um it's a fundamental problem
in chaos but this this neural network uh
trained on I think it was maybe 20,000
simulations um it's it's not only more
accurate at predicting instability but
it also seems to generalize better to
kind of different types of systems um so
it's it's really interesting to think
about okay this these neural networks
they've um they've seemed to have
learned something new how can we we
actually use that to advance our own
understanding so that's that's my
motivation here so the traditional
approach to science has been kind of you
have some low dimensional data set or
some kind of summary statistic and you
build theories to describe that uh
low-dimensional data um which might be
kind of a summary
statistic so you can look throughout the
history of science so maybe Kepler's Law
is an empirical fit to data
and then of course Newton's law of
gravitation was required to explain this
and another example is Planck's law so
this was an actually an empirical fit to
data um and quantum mechanics was
required uh partially motivated by this
to um explain it
so this is this is uh kind of the the um
the normal approach to building theories
um and of course some of these they
they've kind of I mean it's not only
this it also involves you know many
other things but um I I think it's
really exciting to think about how we
can
involve
interpretation of data-driven models in
this process very generally so
that's what I'm going to talk about
today uh I'm going to
conjecture that in this era of AI where
we have these massive neural networks
that kind of seem to outperform all of
our traditional theory um we might want
to consider this approach where we use a
neural network as essentially
compression
tool or some kind of uh tool that that
pulls apart common patterns um in uh a
data set and we build theories not to
describe the data directly but really
kind of to describe the neural network
and what the neural network has learned
um so I think this is kind of a exciting
new approach to I mean really really
science in general I think especially
the physical
sciences so the the key Point here is
neural networks trained on massive
amounts of data with with very
flexible functions they they seem to
find new things that are not in our
existing Theory so I showed you the
example with turbulence you know we can
find better subgrid models just from
data um and we can also do this with the
planetary
Dynamics so
I think our challenge as scientists for
those
problems is distilling those insights
into our language kind of incorporating
it in our Theory I think this is this is
a a really exciting way to kind of look
at these these
models so I'm going to break this down a
bit the first thing I would like to do
is just go through kind of what what
machine learning is how it works um and
then talk about this this uh kind of how
you apply them to different data
sets Okay so just going back to the very
fundamentals uh linear regression in
1D this is I would argue if you don't
really have physical meaning to these
parameters yet it is a kind of type of
machine learning um and so this is a
it's these are scalars right X and Y
those are scalars with two scalar parameters
linear
model you go One Step Beyond that and
you get this shallow Network so again
this has 1D input X 1D output y but now
we've
introduced this layer so we we have
these linear
models so we have
three hidden neurons here and they pass
through this function a so this is
called an activation function and what
this does is it gives the model a way of
uh including some
nonlinearity
so these are called activation functions
the the the one that most people would
reach for first is the rectified linear
unit or ReLU essentially what this does
is it says if the input is less than
zero drop it at zero greater than zero
leave it um this is a very simple way of
adding some kind of nonlinearity to my
flexible curve that I'm going to fit to
my data
right
um the next thing I do is I have these I
have these
different activation functions they have
this this kind of joint here at
different different points which depends
on the
parameters and I'm going to multiply the
output of these activations by number so
that's that's kind of the the output of
my kind of a layer of the neural network
um and this is going to maybe change the
direction of it um change the slope of
it the next thing I'm going to do is I'm
going to sum these up I'm going to
superimpose them and I get this is the
output of one layer in my network so
this is a shallow Network essentially
what it is it's a piecewise linear model
okay and the the joints here the parts
where it kind of switches from one
linear region to another those are
determined by the inputs to the the
first layer's activations so it's it's
basically a piecewise linear model okay
it's a piecewise linear model um
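The shallow network just described (linear models, a ReLU on each, a scale on each output, then a sum) can be sketched in a few lines. This is only an illustrative sketch: the parameter values in `theta` and `phi`, and the helper names `shallow_net` and `local_slope`, are made up for the example and do not come from the talk. It shows that the joints sit where each hidden unit's input crosses zero, and that the slope is constant between joints.

```python
import numpy as np

def relu(z):
    """Rectified linear unit: clip negative inputs to zero, pass the rest."""
    return np.maximum(z, 0.0)

def shallow_net(x, theta, phi):
    """1D-in, 1D-out network with one hidden layer of ReLU units.

    theta[i] = (offset, slope) of the linear model feeding hidden unit i.
    phi[0] is the output offset; phi[1:] scale each hidden unit before summing.
    """
    pre_activations = theta[:, :1] + theta[:, 1:] * x[None, :]  # shape (hidden, n)
    return phi[0] + phi[1:] @ relu(pre_activations)

# Hypothetical parameters for three hidden units (assumed for illustration).
theta = np.array([[-0.2, 1.0], [0.5, -1.0], [-0.7, 1.0]])
phi = np.array([0.1, 1.5, -2.0, 0.8])

# Each unit switches on or off where offset + slope * x = 0: these are the joints.
joints = np.sort(-theta[:, 0] / theta[:, 1])

def local_slope(x0, h=1e-4):
    """Central finite-difference slope of the network output at x0."""
    pair = shallow_net(np.array([x0 - h, x0 + h]), theta, phi)
    return (pair[1] - pair[0]) / (2 * h)

# One sample point inside each linear region (between consecutive joints).
midpoints = [0.1, 0.35, 0.6, 0.85]
slopes = [local_slope(m) for m in midpoints]
print("joints:", joints)
print("slope in each region:", np.round(slopes, 3))
```

Adding more hidden units adds more joints, which is what lets the same construction track a curve more closely.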
and the one cool thing about it is you
can use this piecewise linear model to
approximate any 1D function to arbitrary
accuracy so if I want to model this
function with five joints I can get an
approximation like this with 10 joints
like this 20 like that and I can just
keep increasing the number of these
neurons that gives me better and better
approximations um so this is called the
universal approximation theorem so it's
it's that my uh shallow neural network
right it just has one one kind of layer
of activations I can describe any
continuous function um to arbitrary
Precision now that's not I mean this
alone is not uh that exciting because
like I can do that with polynomials right like
I don't I don't need like the neural
network is not the only thing that does
that I think the exciting part about
neural networks is when you start making
them deeper so first let's look at what
if we had two inputs what would it look
like if we had two inputs now these
activations they are activated along
planes not not points they're activated
along planes so for this is my maybe my
input plane I'm basically chopping it
along the the Zero part and now I have
these 2D planes in
space okay and the next thing I'm going
to do I'm going to scale
these and then I'm going to superimpose
them and this gives me ways of
representing
kind of arbitrary functions in now a 2d
space rather than just a 1D space so it
gives me a way of
expressing um you know arbitrary
continuous functions okay now the cool
part oops the cool part here is when I
want to do two two layers okay so now I
have two layers so I have this this is
my first neural Network this is my
second neural network and my first
neural network looks like this okay if I
consider it alone it looks like this my
second um neural network it looks like
this if I just like I cut this neural
network out it looks like this okay when
I compose them
together I get this this this shared um
kind of behavior where so I'm I'm
composing these functions together and
essentially what happens
is it's almost like you
fold the functions together so that I
experience that function in this linear
region and kind of backwards and then
again so you can see there's there's
kind of like that function is mirrored
here right it goes goes back and forth
um so you can make this analogy to
folding a piece of paper so if I
consider my first neural network like
like this on a piece of paper I could
essentially Fold It draw my second
neural network the function over that
that first one and then expand it and
essentially now I have this this uh
function
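The folding picture can be checked numerically with a toy example. The sketch below assumes a hand-built "tent" network (a two-unit ReLU model chosen purely for illustration, not the networks shown on the slides) and composes it with itself. Each composition folds the input interval once more, so the number of linear pieces doubles while the number of neurons only grows linearly.

```python
import numpy as np

def tent(x):
    """A tiny ReLU network on [0, 1]: rises from 0 to 1, then falls back to 0."""
    return 2.0 * np.maximum(x, 0.0) - 4.0 * np.maximum(x - 0.5, 0.0)

def count_linear_regions(x, y):
    """Count linear pieces of a sampled curve by counting slope changes."""
    slopes = np.diff(y) / np.diff(x)
    changes = ~np.isclose(slopes[1:], slopes[:-1])
    return int(changes.sum()) + 1

# 1025 points gives a grid spacing of 1/1024, so every joint of these
# particular functions lands exactly on a grid point.
x = np.linspace(0.0, 1.0, 1025)
one_layer = tent(x)
two_layers = tent(tent(x))          # second network applied to the first
three_layers = tent(tent(tent(x)))  # one more fold

for name, y in [("one layer", one_layer),
                ("two layers", two_layers),
                ("three layers", three_layers)]:
    print(name, "->", count_linear_regions(x, y), "linear regions")
```

The region counter relies on the joints falling on grid points; for arbitrary networks a more careful method would be needed.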
so the the cool part about this is that
I'm sharing I'm kind of sharing
computation because I'm sharing neurons
in my neural network um so this is going
to come up again this is kind of a theme
we're we're doing efficient computation
in neural networks by sharing
neurons and it's it's useful to think
about it in this this this way kind of
folding paper drawing curves over it and
expanding
it um okay so let's go back to the
physics now neural
networks uh right they're efficient
Universal function approximators you can
think of them as kind of like a type of
data
compression the same neurons can be used
for different
calculations uh in the same network um
and a common use case uh in in physical
sciences especially what I work on is
emulating physical processes so if I
have some my my simulator is kind of too
expensive or I have like real world data
my simulator is not good at describing
it I can build a neural network
that maybe emulates it so like I have a
neural network that looks at kind of the
initial conditions in this model and it
predicts when it's going to go
unstable so this is a this is a good use
case for them um and once I have that so
maybe I have this I have this trained
piecewise linear model that kind of
emulates some physical
process now how do I take that and go to
uh interpret it how do I actually get
insight out of
it so this is where I'm going to talk
about symbolic regression so this is one
of my favorite things so a lot of the
interpretability work in uh industry
especially like computer vision language
there's not really like there's not a
good modeling language like if I have a
if I have a model that classifies cats
and dogs there's not really like there's
not a language for
describing every possible cat there's
not like a mathematical framework for
that but in science we do have that we
do have um
oops we do have a very
good
uh mathematical
framework let me see if this
works uh so in science right so we have
this you know in science we have this
very good understanding of the
universe and
um we have this language for it we have
mathematics which describes the universe
very well uh and I think when we want to
interpret these data-driven models we
should use this language because that
will give us results that are
interpretable if I have some piece-wise
linear model with different you know
like millions of parameters it's not
it's not really useful for me right I
want to I want to express it in the
language that I'm familiar with which is
uh
mathematics um so you can look at like
any cheat sheet and it's uh it's a lot
of you know simple algebra this is the
language of
science so symbolic regression is a
machine learning task where the
objective is to find analytic
Expressions that optimize some objective
so maybe I uh maybe I want to fit that
data set and uh what I could do is
basically try different trees so these
are like expression
trees right so this equation is that
tree and I basically find different
expression trees that uh match that data
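As a rough illustration of what an expression tree is, the sketch below stores an equation as nested tuples, evaluates it, measures its size, and scores it against a data set. The tuple encoding, the operator tables, and the helper names (`evaluate`, `complexity`, `mean_squared_error`) are assumptions made for this example; they are not the internal representation of any particular symbolic regression library.

```python
import math
import operator

BINARY = {"+": operator.add, "-": operator.sub,
          "*": operator.mul, "/": operator.truediv}
UNARY = {"cos": math.cos, "exp": math.exp}

def evaluate(tree, x):
    """Evaluate a tree: a leaf is "x" or a constant, a node is (op, *children)."""
    if tree == "x":
        return x
    if not isinstance(tree, tuple):
        return float(tree)
    op, *args = tree
    values = [evaluate(arg, x) for arg in args]
    return UNARY[op](*values) if op in UNARY else BINARY[op](*values)

def to_string(tree):
    """Render the tree as a readable equation."""
    if not isinstance(tree, tuple):
        return str(tree)
    op, *args = tree
    if op in UNARY:
        return f"{op}({to_string(args[0])})"
    return f"({to_string(args[0])} {op} {to_string(args[1])})"

def complexity(tree):
    """Number of nodes in the tree, a common proxy for 'simplicity'."""
    if not isinstance(tree, tuple):
        return 1
    return 1 + sum(complexity(arg) for arg in tree[1:])

def mean_squared_error(tree, xs, ys):
    return sum((evaluate(tree, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

# Synthetic data set generated from 2*cos(x) + x*x (made up for the example).
xs = [i / 10 for i in range(-20, 21)]
ys = [2.0 * math.cos(x) + x * x for x in xs]

candidate = ("+", ("*", 2.0, ("cos", "x")), ("*", "x", "x"))
simpler = ("*", "x", "x")
for tree in (candidate, simpler):
    print(to_string(tree), "| complexity", complexity(tree),
          "| mse", mean_squared_error(tree, xs, ys))
```

A search procedure then trades off the two printed numbers: lower error against lower complexity.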
so the point of symbolic regression I
want to find equations that fit the data
set so the symbolic form and the parameters
rather than just optimizing parameters
in some
model so the the the current way to do
this the the state-of-the-art way is a
genetic algorithm so it's it's kind of
um it's not really like a clever
algorithm it's it's uh I can say that
because I work on it it's a it's it's
pretty close to Brute Force essentially
what you do is you treat your equation
like a DNA sequence and you basically
evolve it so you do like mutations you
swap one operator to another maybe maybe
you crossbreed them so you have like two
expressions which are okay you literally
breed those together I mean not
literally but you conceptually breed
those together get a new expression um
until you fit the data set
um
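The mutate, crossbreed, and select loop described above can be sketched as a toy evolutionary search. This is a deliberately minimal illustration under stated assumptions: the operator set, mutation probabilities, population size, size limit, and target function are all arbitrary choices for the example, and real symbolic regression libraries use considerably more machinery than this.

```python
import operator
import random

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}
MAX_SIZE = 25  # expressions with more nodes than this are rejected

def random_tree(rng, depth=3):
    """Random expression: leaves are "x" or a constant, nodes are (op, left, right)."""
    if depth == 0 or rng.random() < 0.3:
        return "x" if rng.random() < 0.5 else round(rng.uniform(-2.0, 2.0), 1)
    return (rng.choice(list(OPS)), random_tree(rng, depth - 1), random_tree(rng, depth - 1))

def evaluate(tree, x):
    if tree == "x":
        return x
    if not isinstance(tree, tuple):
        return tree
    op, left, right = tree
    return OPS[op](evaluate(left, x), evaluate(right, x))

def size(tree):
    return 1 + size(tree[1]) + size(tree[2]) if isinstance(tree, tuple) else 1

def loss(tree, data):
    """Mean squared error on the data; oversized trees get infinite loss."""
    if size(tree) > MAX_SIZE:
        return float("inf")
    return sum((evaluate(tree, x) - y) ** 2 for x, y in data) / len(data)

def mutate(tree, rng):
    """Swap an operator, or replace a randomly chosen subtree with a new one."""
    if not isinstance(tree, tuple) or rng.random() < 0.2:
        return random_tree(rng, depth=2)
    op, left, right = tree
    r = rng.random()
    if r < 0.3:
        return (rng.choice(list(OPS)), left, right)
    if r < 0.65:
        return (op, mutate(left, rng), right)
    return (op, left, mutate(right, rng))

def random_subtree(tree, rng):
    while isinstance(tree, tuple) and rng.random() < 0.7:
        tree = rng.choice(tree[1:])
    return tree

def crossover(a, b, rng):
    """Graft a random subtree of b into a random position of a."""
    if not isinstance(a, tuple) or rng.random() < 0.3:
        return random_subtree(b, rng)
    op, left, right = a
    if rng.random() < 0.5:
        return (op, crossover(left, b, rng), right)
    return (op, left, crossover(right, b, rng))

def evolve(data, generations=40, pop_size=100, seed=0):
    rng = random.Random(seed)
    population = [random_tree(rng) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda t: loss(t, data))
        history.append(loss(population[0], data))
        survivors = population[: pop_size // 4]  # the fittest are kept unchanged
        children = []
        while len(survivors) + len(children) < pop_size:
            if rng.random() < 0.5:
                children.append(mutate(rng.choice(survivors), rng))
            else:
                children.append(crossover(rng.choice(survivors), rng.choice(survivors), rng))
        population = survivors + children
    best = min(population, key=lambda t: loss(t, data))
    return best, history + [loss(best, data)]

# Synthetic target y = x*x + x - 1 (chosen only for the demonstration).
data = [(x / 2, (x / 2) ** 2 + x / 2 - 1.0) for x in range(-4, 5)]
best, history = evolve(data)
print("best expression:", best)
print("loss: first generation", history[0], "-> final", history[-1])
```

Because the fittest expressions are carried over unchanged, the best loss can never get worse from one generation to the next; whether the search lands on the exact target depends on the seed and settings.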
so yeah so this is a genetic algorithm
based search uh for symbolic regression
now
the the point of this is uh to find
simple models in our language of
mathematics that describe uh a given
data
set so um so I've spent a lot of time
working on these frameworks so PySR and
SymbolicRegression.jl um
they they work like this so if I have
this expression I want to model that
data set essentially what I'm going to
do is just search over all possible
Expressions uh until I find one that
gets me closer to this ground truth
expression so you see it's kind of
testing different different branches in
evolutionary space I'm going to play
that
again until it reaches this uh ground
truth data set so this is this is pretty
close to how it
works uh you're essentially finding
simple Expressions that fit some data
set accurately
okay
so what I'm going to show you how to do
is this symbolic regression idea is
about fitting kind of finding models
symbolic models that I can use to
describe a data set I want to use that
to build surrogate models of my neural
network so this is this is kind of a way
of translating my model into my language
you could you could also think of it as
like a polynomial uh or like a Taylor expansion
in some
ways the way this works is as
follows if I have some neural network
that I've trained on my data set
whatever I'm going to train it normally
freeze the
parameters then what I do is
I record the inputs and outputs I kind
of treat it like a data generating
process I I try to see like okay what's
the behavior for this input this input
and so on then I stick those inputs and
outputs into PySR for example and I I
find some equation that models that
neural network or maybe it's like a
piece of my neural
network so this is a this is building a
surrogate model for my neural network
that kind of approximates the same
behavior now you wouldn't just do this
for like a standalone neural network
this this would typically be part of
like a larger model um and it would give
you a way of interpreting exactly what
it's doing for different
inputs so what I might have is maybe I
have like two two pieces like two neural
networks here maybe I think the first
neural network is like learning features
or it's learning some kind of coordinate
transform the second one is doing
something in that space uh it's using
those features for
calculation um and so I can using
symbolic regression uh which we call
symbolic distillation I can I can
distill this model uh into
equations so that's that's the basic
idea of this I
replace neural networks so I replaced
them with my surrogate model which is now
an equation
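The distillation recipe (freeze the network, record its inputs and outputs, fit an equation to those pairs) can be sketched end to end. Two stand-ins are assumed here so the example runs on its own: the "trained" network is a hand-built shallow ReLU model that emulates x squared, and the symbolic search is replaced by a least-squares fit over a tiny hypothetical library of candidate expressions. In practice the recorded pairs would go to a symbolic regression tool such as PySR.

```python
import numpy as np

def make_frozen_network(n_units=20):
    """Stand-in for a trained network with frozen parameters.

    A shallow ReLU model whose weights are set by hand so that it emulates
    y = x**2 on [-1, 1] (an assumption made for this sketch).
    """
    knots = np.linspace(-1.0, 1.0, n_units + 1)
    targets = knots ** 2
    segment_slopes = np.diff(targets) / np.diff(knots)
    weights = np.concatenate([segment_slopes[:1], np.diff(segment_slopes)])

    def network(x):
        hidden = np.maximum(x[:, None] - knots[None, :-1], 0.0)
        return targets[0] + hidden @ weights

    return network

network = make_frozen_network()

# Step 1: treat the frozen network as a data-generating process.
rng = np.random.default_rng(0)
inputs = rng.uniform(-1.0, 1.0, size=500)
outputs = network(inputs)

# Step 2: fit each candidate expression a * g(x) + b to the recorded pairs.
candidates = {
    "x": lambda x: x,
    "x**2": lambda x: x ** 2,
    "x**3": lambda x: x ** 3,
    "cos(x)": np.cos,
    "exp(x)": np.exp,
}

def fit(g):
    design = np.column_stack([g(inputs), np.ones_like(inputs)])
    (a, b), *_ = np.linalg.lstsq(design, outputs, rcond=None)
    mse = float(np.mean((a * g(inputs) + b - outputs) ** 2))
    return float(a), float(b), mse

results = {name: fit(g) for name, g in candidates.items()}

# Step 3: the surrogate is the expression that best reproduces the network.
best_name = min(results, key=lambda name: results[name][2])
a, b, mse = results[best_name]
print(f"surrogate: {a:.3f} * {best_name} + {b:.3f}   (mse {mse:.2e})")
```

The surrogate is fitted to the network's behaviour, not to the original data, so it describes what the network learned, including its approximation error.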
um you would typically do this for G as
well and now I have equations that
describe my
model um and this is kind of a a
interpretable approximation of my
original neural network now the reason
you wouldn't want to do this for like
just directly on the data is because
it's a harder search problem if you
break it into
pieces like kind of interpreting pieces
of a neural network it's easier because
you're only searching for
2n expressions rather than n² so it's
a it's a bit easier and you're kind of
using the neural network as a way of
factoring factorizing the system into
different pieces that you then
interpret um so we've we've used this in
in different papers so this is one uh
led
by Pablo Lemos on uh rediscovering
Newton's law of gravity from data
so this was a this was a cool paper
because we didn't tell it the masses of
the bodies in the solar system it had to
simultaneously find the masses of every
all of these 30 bodies we gave it and it
also found the law um so we kind of
train this neural network to do this and
then we interpret that neural network
and it gives us uh Newton's law of
gravity um now that's a rediscovery and
of course like we know that so I think
the discoveries are also cool so these
are not my papers these are other
people's papers I thought they were
really exciting so this is one a recent
one by Ben Davis and Zehao Jin where
they discover this new uh black hole mass
scaling
relationship uh so it's uh it relates
the I think it's the spirality or
something in a galaxy in the velocity
with the mass of a black hole um so they
they found this with this technique uh
which is exciting um and I saw this
other cool one recently um they found
this cloud cover model with this
technique uh using PySR um so they it
kind of gets you this point where it's a
it's a fairly simple model and it's also
pretty accurate um but again the the
point of this is to find a model that
you can understand right it's not this
blackbox neural network with with
billions of parameters it's a it's a
simple model that you can have a handle
on okay so that's part one now part two
I want to talk about polymathic AI so
this is kind of like the complete
opposite end we're going to go from
small models in the first part now we're
going to do the biggest possible models
um and I'm going to also talk about the
meaning of Simplicity what it actually
means so
the past few years you may have noticed
there's been this shift in
industrial machine learning to favor uh
foundation models so like ChatGPT is an
example of this a foundation model is a
machine learning model that serves as
the foundation for other
models these models are trained by
basically taking massive amounts of
General diverse data
uh and and training this flexible model
on that data and then fine-tuning them
to some specific task so you could think
of it as maybe teaching this machine
learning model English and French before
teaching it to do translation between
the two um so it often gives you better
performance on Downstream tasks I mean
you can also see that I mean ChatGPT is
uh I've heard that it's trained on um
GitHub and that kind of teaches it to uh
reason a bit better um and so the I mean
basically these models are trained on
massive amounts of data um and they form
this idea called a foundation
model so um the general idea is you you
collect you know you collect your
massive amounts of data you have this
very flexible model and then you train
it on uh you might train it to do uh
self supervised learning which is kind
of like you mask parts of the data and
then the model tries to fill it back in
uh that's a that's a common way you
train that so like for example GPT style
models those are basically trained on
the entire internet and they're trained
to predict the next word that's that's
their only task you get a input sequence
of words you predict the next one and
you just repeat that for uh massive
amounts of text and then just by doing
that they get really good at um General
language understanding then they are
fine-tuned to be a chatbot essentially
so they're they're given a little bit of
extra data on uh this is how you talk to
someone and be friendly and so on um and
and that's much better than just
training a model just to do that so it's
this idea of pre-training
models so I mean once you have this
model I I think like kind of the the the
cool part about these models is they're
really trained in a way that gives them
General priors for data so if I have
like some maybe I have like some artwork
generation model it's trained on
different images and it kind of
generates different art
I can fine-tune this model on like
Studio Ghibli artwork and it doesn't
need much training data because it
already knows uh what a face looks like
like it's already seen tons of different
faces so just by fine tuning it on some
small number of examples it can it can
kind of pick up this task much quicker
that's that's essentially the idea
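The self-supervised objectives mentioned earlier in this part (masking pieces of the data and filling them back in, or predicting the next word) can be sketched in a few lines. This is a minimal illustration only: the function names, the `[MASK]` symbol, the 30% mask rate and the example sentence are assumptions made for the sketch, not details from the talk, and no model is trained here. It only shows how training examples are built from raw, unlabeled data.

```python
import random


def next_token_pairs(tokens):
    """Build (context, target) pairs for next-word prediction."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


def mask_tokens(tokens, mask_rate=0.3, mask_symbol="[MASK]", seed=0):
    """Hide a random subset of tokens; return the masked sequence and the hidden targets."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_rate:
            masked.append(mask_symbol)
            targets[i] = tok  # the model would be trained to recover this token
        else:
            masked.append(tok)
    return masked, targets


# Hypothetical example sentence, chosen only for illustration.
sentence = "the cat does not teleport across the video".split()
pairs = next_token_pairs(sentence)
masked, targets = mask_tokens(sentence)

print(pairs[0])  # first (context, target) pair
print(masked)
```

In both cases the supervision signal comes from the data itself, which is why these objectives scale to very large unlabeled collections.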
now this is I mean the same thing is
true in language right like if I if I
train a model on uh if I train a model
just to do language
translation right like I just teach it
that it's kind of I start from scratch
and I just train it English to French um
it's going to struggle whereas if I
teach it English and French kind of I I
teach it about the languages first and
then I specialize it on translation um
it's going to do much
better so this brings us to science so
in
um in science we also have this we also
have this idea where there are shared
Concepts right like different languages
have shared there's shared concept of
grammar in different languages in
science we also have shared Concepts you
could kind of draw a big circle around
many areas of Science and causality is a
shared concept uh if you zoom in to say
dynamical systems um you could think
about like multiscale Dynamics is is
shared in many different disciplines uh
chaos is another shared concept
so maybe if we train a general
model uh you know over many many
different data sets the same way
ChatGPT is trained on many many different
languages and and text databases maybe
they'll pick up general concepts and
then when we finally make it specialize
to our particular problem uh maybe
they'll do it it'll find it easier to
learn so that's essentially the
idea so you can you can really actually
see this for particular systems so one
example is the reaction diffusion uh
equation this is a type of PDE um and the
shallow water equations another type of
PDE different fields different PDEs but
both have
waves so they both have wave-like
behavior so I mean maybe if we train
this massive flexible model on both of
these systems it's going to kind of learn
a general prior for uh what a wave looks
like and then if I have like some you
know some small data set I only have a
couple examples of uh maybe it'll
immediately identify oh that's a wave I
know how to do that um it's it's almost
like I mean I kind of feel like in
science today what we often do
is I mean we train machine learning
models from scratch it's almost like
we're taking uh Toddlers and we're
teaching them to do pattern matching on
like really Advanced problems like we we
have a toddler and we're showing them
this is a you know this is a spiral
galaxy this is an elliptical galaxy and
it it kind of has to just do pattern
matching um whereas maybe a foundation
model that's trained on broad classes of
problems um it's it's kind of like a
general uh science graduate maybe um so
it has a prior for how the world works
it has seen many different phenomena
before and so when it when you finally
give it that data set to kind of pick up
it's already seen a lot of that
phenomena that's really the idea of
this uh that's why we think this will
work
well okay so we we created this
collaboration last year uh so this
started at Flatiron Institute um led by
Shirley Ho to
build this thing a foundation model for
science so this uh this is across
disciplines so we want to you know build
these models to incorporate data across
many different disciplines uh across
institutions um and uh so we're we're
currently working on kind of scaling up
these models right now the
final I think the final goal of this
collaboration is that we would release
these open-source Foundation models so
that people could download them and and
fine-tune them to different tasks so
it's really kind of like a different
Paradigm of doing machine learning right
like rather than the current Paradigm
where we take a model randomly
initialize it it's kind of like a like a
toddler doesn't know how the world Works
um and we train that this Paradigm is we
have this generalist science model and
you start from that it's kind of a
better initialization of a
model that's that's the that's the pitch
of
polymathic okay so we have results so
this year we're kind of scaling up but
uh last year we had a couple papers so
this is one uh led by Mike McCabe called
Multiple Physics
Pretraining this paper looked at what
if we have this general PDE simulator
this model that learns to
essentially run fluid dynamics
simulations and we train it on many
different PDEs will it do better on new
PDEs or will it do worse
uh so what we found is that
a single model is not only able to match
uh you know single models
trained on like specific tasks it can
actually outperform them in many cases
so it it does seem like if you take a
more flexible model you train it on more
diverse
data uh it will do better in a lot of
cases I mean it's it's not
unexpected um because we do see this
with language and vision um but I I
think it's still really cool to uh to
see
this so um I'll skip through some of
these so this is like this is the ground
truth data and this is the
Reconstruction essentially what it's
doing is it's predicting the next step
all right it's predicting the next
velocity the next density and pressure
and so on and you're taking that
prediction and running it back through
the model and you get this rollout
simulation so this is a this is a task
people work on in machine
learning um I'm going to skip through
these uh and essentially what we found
is that uh most of the time by uh using
this multiple physics pre-training so by
training on many different PDEs you do
get better performance so the ones at
the right side are the uh multiple
physics pre-trained models those seem to
do better in many cases and it's really
because I mean I think because they've
seen you know so many different uh PDEs
it's like they have a better prior for
physics
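The rollout described a little earlier (predict the next step, then feed that prediction back through the model) can be sketched as follows. This is a toy under stated assumptions: `step` is a hand-written explicit diffusion update on a periodic 1D grid standing in for a learned next-step predictor, and the grid size, `alpha` and step count are arbitrary choices for the illustration. It is not the architecture or data from the paper.

```python
def step(state, alpha=0.1):
    """Stand-in for a learned next-step model: one explicit diffusion update
    on a periodic 1D grid (hypothetical, for illustration only)."""
    n = len(state)
    return [
        state[i] + alpha * (state[(i - 1) % n] - 2 * state[i] + state[(i + 1) % n])
        for i in range(n)
    ]


def rollout(model, initial_state, n_steps):
    """Autoregressive rollout: each prediction becomes the next input."""
    trajectory = [initial_state]
    for _ in range(n_steps):
        trajectory.append(model(trajectory[-1]))
    return trajectory


# A single spike that spreads out over time.
initial = [0.0] * 8
initial[4] = 1.0
trajectory = rollout(step, initial, n_steps=20)

print(len(trajectory))   # initial state plus 20 predicted states
print(trajectory[-1])
```

Because every step consumes the previous prediction, any error the model makes is carried forward, which is part of why long rollouts are a demanding test.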
um skip this as well so okay this is a
funny thing that we observed is that
so during talks like this one thing that
we get asked is how similar do the PDEs
need to be like do the PDEs need to be
you know like Navier-Stokes but a
different
parameterization or can they be like
completely different physical systems so
what we found
is uh
really uh hilarious is that okay so the
bottom line here this is the error of the
model
uh over different number of training
examples so this model was trained on a
bunch of different PDEs and then it was
introduced to this new PDE problem and
it's given that amount of data okay so
that does the best this model it's
already it already knows some Physics
that one does the best the one at the
top is the worst this is the model
that's trained from scratch it's never
seen anything uh this is like your
toddler right like it's never it doesn't
know how the physical world Works um it
was just randomly initialized and it has
to learn physics okay the middle models
those are pre-trained on General video
data a lot of which is cat videos so
even pre-training this model on cat
videos actually helps you do much better
than this very sophisticated
Transformer architecture that just has
never seen any data and it's really
because I mean we think it's because of
shared concepts of spatiotemporal
continuity right like videos of cats
there's a you know there's a
spatiotemporal
continuity like the cat does not
teleport across the video unless it's a
very fast cat um there's related
Concepts right so I mean that's that's
what we think but it's it's really
interesting that uh you know
pre-training on completely unrelated
systems still seems to help
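The comparison on the slide (a pre-trained model versus one trained from scratch, both given a small amount of data on a new problem) can be illustrated with a deliberately tiny toy. Everything here is an assumption made for the sketch: a two-parameter linear model, a made-up "related" task, two training examples and a budget of five gradient steps. It does not reproduce the experiments from the talk. It only shows the mechanism of starting from a better initialization.

```python
def fit(w, b, data, steps, lr=0.05):
    """Plain gradient descent on mean squared error for y = w * x + b."""
    n = len(data)
    for _ in range(steps):
        grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in data) / n
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b


def mse(w, b, data):
    """Mean squared error of the linear model on a data set."""
    return sum((w * x + b - y) ** 2 for x, y in data) / len(data)


xs = [i / 10 for i in range(-10, 11)]
pretrain_data = [(x, 2.0 * x + 1.0) for x in xs]           # plentiful data, related task
target_train = [(x, 2.2 * x + 0.9) for x in (-0.5, 0.5)]   # only two examples of the new task
target_test = [(x, 2.2 * x + 0.9) for x in xs]

# Pre-train on the related task, then fine-tune briefly on the new one.
w_pre, b_pre = fit(0.0, 0.0, pretrain_data, steps=500)
w_ft, b_ft = fit(w_pre, b_pre, target_train, steps=5)

# Train from scratch with the same small budget.
w_sc, b_sc = fit(0.0, 0.0, target_train, steps=5)

err_finetuned = mse(w_ft, b_ft, target_test)
err_scratch = mse(w_sc, b_sc, target_test)
print(err_finetuned, err_scratch)
```

In this convex toy the from-scratch model would catch up given enough steps, so the sketch says nothing about final accuracy. It only illustrates why a pre-trained starting point helps when data and training budget are limited.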
um and so the takeaway from this is that
you should always pre-train your model
uh even if the physical system is not
that related you still you still see
benefit of it um now obviously if you
pre-train on related data that helps you
more but anything is basically better
than than nothing you could basically
think of this as the
default initialization for neural
networks is garbage right like just
randomly initializing a neural network
that's a bad starting point it's a bad
prior for physics you should always
pre-train your model that's the takeaway
of this okay so um I want to finish up
here with kind of rhetorical questions
so I started the talk about um
interpretability and kind of like how do
we extract insights from our model now
we've we've kind of gone into this
regime of these very large very flexible
Foundation models that seem to learn
general
principles so okay my question for you
you don't have to answer but just think
it over is do you think 1 + 1 is
simple it's not a trick question do you
think 1 + 1 is simple so I think most
people would say yes 1+ 1 is
simple and if you break that down into
why it's simple you say okay so X Plus Y
is simple for like X and Y integers
that's a simple relationship okay why
is X plus Y
simple and and you break that down it's
because plus is simple like plus is a
simple operator okay why why is plus
simple it's a very abstract
concept okay it's it's we we don't
necessarily have plus kind of built into
our brains um it's it's kind of I mean
it's it's really
uh so I'm going to show this this might
be controversial but I think that
Simplicity is based on familiarity
we are used to plus as a concept we are
used to adding numbers as a concept
therefore we call it
simple you can go back another step
further the reason we're familiar with
addition is because it's useful adding
numbers is useful for describing the
world I count things right that's useful
to live in our universe it's useful to
count things to measure things addition
is
useful and it's it's it's really one of
the most useful things so that is why we
are familiar with it and I would argue
that's why we think it's
simple but the the Simplicity we have
often argued is uh if it's simple it's
more likely to be useful I think that is
actually not a statement about
Simplicity it's actually a statement
that if if something is useful for
problems like a b and c then it seems it
will also be useful for another problem
the the the world is compositional if I
have a model that works for this set of
problems it's probably also going to
work for this one um so that's that's
the argument I would like to make so
when we interpret these models I think
it's important to kind of keep this in
mind and and and really kind of probe
what is simple what is
interpretable
so I think this is really exciting for
polymathic AI because these models that
are trained on many many systems they
will find broadly useful algorithms
right they'll they'll they'll have these
neurons that share calculations across
many different disciplines so you could
argue that that is the utility and I
mean like maybe we'll discover new kind
of operators and be familiar with those
and and and we'll start calling those
simple so it's not necessarily that all
of the uh things we discover in machine
learning will be uh simple it it's uh
kind of that by definition the polymath
models will be broadly
useful and if we know they're broadly
useful we might we might might get
familiar with those and and that might
kind of Drive the Simplicity of them um
so that's my note on Simplicity and so
the the takeaways here are that I think
interpreting a neural
network trained on some data sets um
offers new ways of discovering
scientific insights from that data um
and I think foundation models like
Polymathic AI I think that is a very
exciting way of discovering new broadly
applicable uh scientific models so I'm
really excited about this direction uh
and uh thank you for listening to me today
[Applause]
great uh so three
questions one was the
running
yeah when it's fully built out is to be
free
yeah please use your seat
mic
yeah and three
you're pretty
young okay so I'll try to
compartmentalize those okay so the first
question was the scale of training um
this is really an open research question
we don't have the scaling law for
science yet we have scaling laws for
language we know that if you have this
many gpus you have this size data set
this is going to be your performance we
don't have that yet for science cuz
nobody's built this scale of model um so
that's something we're looking at right
now is what is the tradeoff of scale and
if I want to train this model on many
many gpus is it is it worth it um so
that's an that's an open research
question um I do think it'll be large
you know
probably order hundreds of gpus uh
trained for um um maybe a couple months
um so it's going to be a very large
model um that's that's kind of assuming
the scale of language models um now the
model is going to be free definitely
we're we're uh we're all very Pro open
source um and I think that's I mean I
think that's really like the point is we
want to open source this model so people
can download it and use it in science I
think that's really the the most
exciting part about this um and then I
guess the Third question you had was
about the future um and how it
changes uh how we
teach um I mean I guess uh are you are
you asking about teaching science or
teaching machine learning teaching
science I see
um I mean yeah I mean I don't know it
depends if it if it works I think if it
works it it might very well like change
how how science is taught
um yeah I mean so I don't I don't know
the impact of um language models on
computational Linguistics I'm assuming
they've had a big impact I don't know if
that's affected the teaching of it yet
um but if if you know scientific
Foundation models had a similar impact
I'm sure I'm sure it would impact um I
don't know how much it probably depends
on the success of the
models I I have a question about your
foundation models also so in different
branches of science the data sets are
pretty different in molecular biology or
genetics the data sets you know is a
sequence of DNA versus astrophysics
where it's images of stars so how do you
plan to you know use the same model you
know for different different form of
data sets input data sets uh so you mean
how to pose the objective yes so I I
think the most I mean the most General
objective is self-supervised learning
where you basically mask parts of the
data and you predict the missing part if
you can you know optimize that problem
then you can solve tons of different
ones you can do uh regression predict
parameters or go the other way and
predict rollouts of the model um it's a
really General problem to mask data and
then fill it back in that kind of is a
superset of uh many different prediction
problems yeah and I think that's why
like language models are so broadly
useful even though they're trained just on
next word prediction or like BERT is a
masked
model thanks uh can you hear me all
right so um that was a great talk um I'm
Victor uh so uh I'm actually a little
bit uh worried and this is a little bit
of a question whenever you have models
like this um you said that you train
this on many examples right so imagine
you have already embedded the laws of
physics here somehow like let's say the
law of gravitation but when you
think about like discovering new physics we
always have this question whether we are
you know actually Reinventing the wheel
or like the uh the network is kind of
really giving us something new or
it's giving us
something that you know it learned
but it's kind of wrong so sometimes
we have the answer to know you know
which one is which but if you don't have
that let's say for instance you're
trying to discover what dark matter is
which you know something I'm working on
how would you know that the network is
actually giving you something new and
not you know just trying to set this
into one of the many parameters that it
has I see um
so okay
so so if you want to test the model by
letting it ReDiscover something then I
don't think you should use this I think
you should use the scratch model like
from scratch and train it because if you
use a pre-trained model it's
probably already seen that physics so
it's biased towards it in some ways so
if you're rediscovering something I
don't think you should use this if
you're discovering something new um I do
think this is more useful um so I think
a like a a
misconception of of uh I think machine
learning in general is that scientists
view machine learning for uninitialized
models like randomly initialized weights
as a neutral prior but it's not it's a
very uh it's a very explicit prior um
and it happens to be a bad prior um so
if you train from a a randomly
initialized model it's it's kind of
always going to be a worse prior than
training from a pre-trained model which
has seen many different types of physics
um I think I think we can kind of make
that statement um so if you're if you're
trying to discover new physics I I mean
I mean like if it if you train it on
some data set um I guess you can always
verify that it that the predictions are
accurate so that would be um I guess one
way to to verify it um but I I do think
like the fine-tuning here so like taking
this model and training it on the task I
think that's very important I think in
language models it's not it's not as
emphasized like people will just take a
language model and and tweak the prompt
to get a better result I think for
science I think the prompt is I mean I
think like the equivalent of the prompt
would be important but I think the fine
tuning is much more important because
our data sets are so much different
across
science the
back that the
symbolic lied the dimensionality of the
system so are you introducing also the
fine-tuning and transfer learning a
way
en uh yeah
so so the symbolic regression I mean I
would consider that it it's not used
inside the foundation model part I think
it's
interesting to interpret the foundation
model and see if there's kind of more
General physical Frameworks that it
comes up with
um I think yeah symbolic regression is
very limited in that it's bad at high
dimensional
problems I think that might
be because of the choice of operators um
like I think if you can consider maybe
High dimensional operators you you might
be uh a bit better off I mean symbolic
regression it it's uh it's an active
area of research and I think the hardest
the biggest hurdle right now is it's uh
it's not good at finding very complex
symbolic
models
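The remark that symbolic regression struggles with high dimensional problems can be made concrete with a toy search. This sketch is an assumption-laden stand-in: it exhaustively tries every expression of the form `x_i <op> x_j` over three hypothetical operators, which is far simpler than the evolutionary search discussed in the first part of the talk and is not how PySR works internally. The data, names and operator set are made up for illustration.

```python
import itertools
import operator

# Hypothetical operator set for the toy search.
OPERATORS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def candidates(n_features):
    """Yield (name, function) for every expression of the form x_i <op> x_j."""
    for (name, fn), i, j in itertools.product(
        OPERATORS.items(), range(n_features), range(n_features)
    ):
        yield f"x{i} {name} x{j}", (lambda row, fn=fn, i=i, j=j: fn(row[i], row[j]))


def best_expression(rows, targets):
    """Return the name of the candidate with the lowest squared error."""
    def loss(expr):
        return sum((expr(r) - t) ** 2 for r, t in zip(rows, targets))

    return min(candidates(len(rows[0])), key=lambda c: loss(c[1]))[0]


def count_candidates(n_features):
    """Number of depth-one expressions the toy search has to evaluate."""
    return len(OPERATORS) * n_features * n_features


# Made-up data generated from y = x0 * x2.
rows = [(1.0, 2.0, 3.0), (2.0, 5.0, 1.0), (4.0, 0.5, 2.0), (3.0, 3.0, 7.0)]
targets = [r[0] * r[2] for r in rows]

found = best_expression(rows, targets)
print(found)
print(count_candidates(3), count_candidates(100))
```

Even at depth one the number of candidates grows with the square of the number of input features, and deeper expression trees grow much faster, which is one way to see why operators that aggregate many inputs into a few could help.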
comp so um I guess uh you
could it depends like on the
dimensionality of the data
um I guess if it's very high dimensional
data you're always kind of um like
symbolic regression is not good to high
dimensional data unless you can have
kind of some operators that aggregate to
lower dimensional uh
spaces um I don't yeah I don't know if
I'm answering your question or
not okay I wanted to ask a little bit so
like when you were showing the
construction of these trees each
generation in the different operators I
think this is related to kind of General
themes of the talk and other questions
but often in doing science when you're
learning it you're presented with kind
of like algorithms to solve problems like you
know diagonalize a Hamiltonian or something like
that how do you encapsulate that
aspect of doing science that is kind of
the algorithmic side of solving problems
rather right please use your mic oh yeah
uh yeah so the question was about um how
do you incorporate kind of more General
uh not analytic operators but kind of
more General algorithms like a
Hamiltonian operator um I think that I
mean like in principle symbolic
regression is part of a larger
family of algorithms called program
synthesis where the objective is to find
a program you know like code that
describes a given data set for example
so
if you can write your
operators into your symbolic regression
approach and your symbolic regression
approach has that ground truth model in
there somewhere then I think it's
totally possible I think like it's it's
uh it's harder to do I think like even
symbolic regression with scalars is uh
it's fairly it's fairly difficult to to
actually set up an algorithm um I think
I don't know I think it's really like an
engineering problem but the the the
conceptual part is uh is totally like
there for this
yeah thanks um oh
sorry okay um this this claim uh that
random initial weights are always bad or
pre-training is always good I don't know
if they're always bad but um it seems
like from our
experiments it's we've never seen a case
where
pre-training um on some kind of physical
data hurts like the cat video is an
example we thought that would hurt the
model it didn't that is a cute example
weird I'm sure there's cases where some
pre-training hurts yeah so that that's
essentially my question so we're aware
of like adversarial examples for example
you train on MNIST add a bit of noise it
does terrible compared to what a human
would do what do you think adversarial
examples look like in science yeah yeah
I mean I don't I don't know what those
are but I'm sure they exist somewhere
where pre-training on certain data types
kind of messes with training a bit um we
don't know those yet but uh yeah it'll
be interesting do you think it's a
pitfall though of like the approach
because like I have a model of the sun
and a model of DNA you know it's yeah
yeah I mean um I don't know like um I
guess we'll see um yeah it's it's hard
to it's hard to know like I guess from
language we've seen you can pre-train
like a language model on video data and
it helps the language which is really
weird but it it does seem like if
there's any kind of Concepts it does if
it's flexible enough it can kind of
transfer those in some ways so we'll see
I mean there's I mean presumably we'll
find some adversarial examples there so
far we haven't we thought the cat was
one but it wasn't it it
helped