Foundation models and the next era of AI
Summary
TLDR: The video discusses recent advances in AI, focusing on large language models like GPT-3 and OpenAI's ChatGPT. It outlines the key innovations enabling progress: Transformer architectures, massive scale, and few-shot in-context learning. Models now solve complex benchmarks rapidly and power products like GitHub Copilot, but open challenges remain around trust, safety, personalization, and more. We are still early in realizing AI's full potential; more accelerated progress lies ahead as models integrate with search, tools, and experiences, creating ample research opportunities.
Takeaways
- 😲 AI models have made huge advances in generative capabilities recently, with high quality text, image, video and code generation
- 😎 Transformers have come to dominate much of AI, with their efficiency, scalability and attention mechanism
- 🚀 Scale of models and data keeps increasing, revealing powerful 'emergent capabilities' at critical mass
- 💡 In-context learning allows models to perform new tasks well with no gradient updates, just prompts
- 👍 Chain of Thought prompting guides models to reason step-by-step, greatly improving performance
- 📈 Benchmarks are being solved rapidly, requiring constant refresh and expansion to track progress
- 🤖 Large language models integrated into products are transforming user experiences e.g. GitHub Copilot
- 🔎 Integrating LLMs with search and other tools has huge potential but also poses big challenges
- ☁️ We're still at the very start of realizing AI's capabilities - more advances coming quickly!
- 😊 AI progress is accelerating and affecting everyday products - exciting times ahead!
Q & A
What architectural innovation allowed AI models to achieve superior performance on perception tasks?
-The Transformer architecture, which relies on an attention mechanism to model interdependence between different components in the input and output.
How did the introduction of in-context learning change the way AI models can be applied to new tasks?
-In-context learning allows models to perform new tasks directly from pretrained versions, without additional training data or tuning. This expands the range of possible applications and reduces the effort to deploy models on new tasks.
What training innovations were introduced in ChatGPT compared to previous self-supervised models?
-ChatGPT introduced instruction tuning on human-generated prompt-response examples and reinforcement learning from human preferences over model responses.
Why is benchmarking progress in AI becoming increasingly challenging?
-Benchmarks are being solved at an accelerating pace by advancing models, often within months or even weeks of release, limiting their usefulness for measuring ongoing progress.
How does GitHub Copilot demonstrate the rapid transition of AI from research to product?
-GitHub launched Copilot, which assists developers by generating code, shortly after the underlying AI model was created. Studies show it makes developers 55% more productive on coding tasks.
What are some limitations of language models that can be addressed by connecting them with search engines or other external tools?
-Language models have limitations relating to reliability, factual correctness, access to recent information, provenance tracking, etc. Connecting them to search engines and knowledge bases can provide missing capabilities.
What user experience challenges are introduced when language models are integrated into existing products like search engines?
-Challenges include revisiting assumptions about metrics, evaluation, personalization, user satisfaction, intended usage patterns, unintended behavior changes, and how to close feedback loops.
What evidence suggests we are still in the early stages of realizing AI's capabilities?
-The rapid pace of recent innovations, waves of new applications to transform existing products, and many remaining open challenges around aspects like safety and reliability indicate the technology still has far to progress.
How did training language models jointly on text and code give better performance?
-Training on code appeared to help ground models' reasoning and understanding of structured relationships between elements, transferring benefits to other language tasks.
What techniques have researchers proposed for further improvements by training AI systems on human feedback?
-Ideas include prompt-based training, preference learning over model responses, and reinforcement learning from human judgments.
Outlines
📈 Overview of AI progress in recent years
The paragraph provides an overview of the major advances in AI over the past 5-10 years. It discusses progress in perception tasks like image/speech recognition and then shifts to recent breakthroughs in generative AI for text, images and video. It highlights models like DALL-E 2, Imagen, and Stable Diffusion that showcase high-quality image generation capabilities.
📊 Key factors behind current AI capabilities
The paragraph discusses 3 key factors that have led to current AI capabilities - Transformer architectures, scale/compute, and in-context learning. It provides details on how Transformers have come to dominate NLP and other modalities. It also covers how scale leads to emerging capabilities, using arithmetic word problem solving as an example that shows performance jumps at a critical scale.
👩💻 New paradigm of in-context learning
The paragraph explains the shift to in-context learning, where pre-trained models can be used to perform new tasks just with prompting and examples instead of fine-tuning. This reduces data needs, effort, and allows models to be applied to more tasks. Performance has been strong in few-shot settings across various tasks. It adapts tasks to models vs models to tasks.
🤖 Novel aspects of ChatGPT training process
The paragraph analyzes key aspects of ChatGPT's training - use of text and code, instruction tuning on human demonstrations, and reinforcement learning from human preferences. These align the model better to generate high-quality responses tuned to human judgments and interactions. Training on code especially helps with following instructions and reasoning.
🚀 Foundation models transforming products
The paragraph discusses examples of foundation models driving impact in products - GitHub Copilot for coding and Bing search integration. Studies show Copilot drives 55% higher developer productivity. Search integration handles complex, multi-step tasks automatically by orchestrating queries in the background and synthesizing an answer.
🌄 Opportunities and challenges moving forward
The paragraph concludes by summarizing tremendous progress but noting we are still early in realizing AI's full potential. Many challenges remain around trust, bias, user experience evaluation etc. But also opportunities to improve models further and apply them to transform more products used daily.
Keywords
💡Foundation models
💡Transformers
💡Scale
💡In-context learning
💡Instruction tuning
💡Reinforcement learning
💡Benchmarking
💡GitHub Copilot
💡Search
💡Challenges
Highlights
AI has been making significant impact on perception tasks like image recognition, speech recognition and language understanding.
The frontier of AI has changed toward generative AI, with progress in areas like text generation, image generation, video generation and code generation.
Transformers have been dominating the field of AI, relying on the attention mechanism to model interdependence between input and output data.
Scale of compute used for training has led to emerging capabilities, where models demonstrate new abilities only when reaching critical mass.
In-context learning allows models to perform new tasks out of the box without additional data or training, just using prompts.
Chain of Thought prompting shows models the steps to solve a problem, significantly improving performance on complex tasks.
Training language models on both text and code seems to ground them, allowing better reasoning and understanding structured relations.
Instruction tuning exposes models to human generated prompt-response pairs, aligning them to respond appropriately.
Reinforcement learning from human preferences on model responses further adapts models to produce favored outputs.
Rapid progress has made benchmarks obsolete quickly, requiring new coordinated benchmarking efforts.
Advances are changing products like GitHub Copilot, boosting developer productivity by over 50% in studies.
Connecting language models to search and other tools is promising but raises new research questions around reliability, behavior modeling, personalization, evaluation metrics, etc.
The incredible pace of AI progress is outpacing expectations on academic benchmarks and leading to new applications.
There are still many challenges and opportunities at this beginning stage of a new AI era that will shape future advances.
Practical applications of AI advances will increasingly impact products people use every day.
Transcripts
Hello everyone, my name is Ahmad and I am a researcher here at Microsoft Research. Today I am going to be talking about foundation models and the impact they are having on the current era of AI.

If we look back at the last 5 to 10 years, AI has been making a significant impact on many perception tasks like image and object recognition, speech recognition, and most recently on language understanding tasks, where we have seen different AI models achieving superior performance, in many cases reaching performance equal to what a human annotator would do on the same task.

Over the last couple of years, though, the frontier of AI has shifted toward generative AI. We have had quite good text generation models for some time: you could prompt a model by asking it to describe an imaginary scene, and it would produce a very good description of what you asked for.
Then we started making a lot of progress on image generation as well. With models like DALL-E 2 and Imagen, and even models coming out of startups like Midjourney and Stability AI, we have been getting to a level of quality in image generation that we have never seen before. Inspired by that, there has also been a lot of work on animating generated images, or even generating videos from scratch.

Another frontier for generative models has been code: not only generating code from a text prompt, but also explaining the code or, in some cases, even debugging it.

I was listening to an episode of Morning Edition on NPR when it aired at the beginning of February, where they were attempting to use several AI models to produce a schematic design of a rocket and to come up with some equations for the rocket design. Of course, the hypothetical design would have crashed and burned, but I couldn't help but think how exciting it is that AI has become so good that we are even attempting to measure its proficiency in a field as complex as rocket science.
If we look back, we will find that there are three main components that led to the current performance we are seeing from AI models: the Transformer architecture, scale, and in-context learning.

The Transformer in particular has been dominating the field of AI in recent years. This started with natural language processing, and the architecture was so effective that it took over the field of natural language processing within a very short amount of time. The Transformer is a very efficient architecture that is easy to scale, easy to parallelize, and relies at its heart on the attention mechanism, a technique that allows us to model the interdependence between different components, or different tokens, in our input and output data.
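The attention mechanism described here can be sketched in a few lines of NumPy. This is a minimal, illustrative scaled dot-product self-attention on made-up toy inputs, not the exact formulation of any production model (real Transformers add learned projections, multiple heads, and masking).

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention.

    Each output row is a weighted average of the value rows in V,
    where the weights reflect how strongly each query token attends
    to each key token: the interdependence between tokens that the
    talk describes.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # pairwise query-key similarity
    # Row-wise softmax turns similarities into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V  # mix the values by the attention weights

# Self-attention over three toy tokens with 4-dimensional vectors.
rng = np.random.default_rng(0)
tokens = rng.standard_normal((3, 4))
out = scaled_dot_product_attention(tokens, tokens, tokens)
print(out.shape)  # (3, 4): one mixed vector per input token
```

Because the whole computation reduces to a handful of matrix multiplications, it maps naturally onto parallel hardware, which is part of why the architecture is so easy to scale.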
Transformers started off mostly in natural language processing, but slowly and surely they made their way to pretty much every modality, so now we are seeing that models operating on images, video, audio, and many other modalities are also using Transformers. Five years since their inception, Transformers have changed surprisingly little, despite many attempts at producing better and more efficient variants. Perhaps that is because the gains of those variants were limited to certain use cases, or perhaps because the gains did not persist at scale. Another potential reason is that such changes can make the architecture less universal, which has been one of its biggest advantages.
The next component is scale, and when we talk about scale we really mean the amount of compute used to train the model. That can translate into training bigger and bigger models with larger and larger numbers of parameters, and we have seen a steady increase in that over recent years. But scale can also mean more data: using larger and larger amounts of data to train the model. Different models over the past few years have taken different approaches in deciding how much data to use and how large the model should be, but the consistent trend is that we have been scaling larger and larger and using more and more compute.

Scale has also led to what are being called emergent capabilities, one of the most interesting properties of scale described over the last year or so. By emergent capability we mean that a model starts to show a certain ability only when it reaches a critical scale; before that, the model does not demonstrate the ability at all. For example, look at the figures here: on the left-hand side we see arithmetic. If we try to use language models to solve arithmetic word problems, up until a certain scale they simply cannot solve the problem and do not perform any better than random. But at a certain critical point we start seeing improved performance, and that performance just keeps getting better and better. We have seen the same pattern on many other tasks as well, ranging from arithmetic to transliteration to multi-task learning.

Perhaps one of the most exciting emergent capabilities of language models recently is their ability to learn in context, which has introduced a new paradigm for using these models.
If we look back at how we have been practicing machine learning with deep learning in general, you would start by choosing an architecture, a Transformer or, before that, an RNN or a CNN, and then train your model with full supervision: you have a lot of labeled data, and you train your model on that data. When we started using pre-trained models, instead of training models from scratch, we would start from a pre-trained model and then fine-tune it, still on a lot of fully supervised labeled data for the task at hand.

But with in-context learning, suddenly we can use the models out of the box. We can take a pre-trained model and use a prompt to get it to perform a new task without actually doing any learning. We can do that in a zero-shot setting, meaning we do not provide any examples at all, just instructions or a description of what the task is, or in a few-shot setting, where we provide a small handful of examples to the model.

For example, if we are interested in text classification, in this case sentiment analysis, we can just provide the text to the model and ask it to classify the text as either positive or negative. If the task is a little harder, we can provide a few-shot sample: just a few examples of how we want the model to classify things into, say, positive, negative, or neutral, and then ask the model to reason about a new piece of text. It actually does a pretty good job at it.
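The zero-shot and few-shot setups described here amount to prompt construction. Below is a sketch of a few-shot sentiment prompt; the reviews, labels, and the `complete` function named in the final comment are hypothetical placeholders, and the point is only the shape of the prompt.

```python
# Hypothetical few-shot examples for sentiment classification.
FEW_SHOT_EXAMPLES = [
    ("The food was amazing and the staff were friendly.", "positive"),
    ("Waited an hour and the order was still wrong.", "negative"),
    ("It was fine, nothing special either way.", "neutral"),
]

def build_prompt(text):
    """Frame classification as text completion: the task is adapted
    to the model rather than the model to the task."""
    lines = ["Classify the sentiment of each review as positive, "
             "negative, or neutral.", ""]
    for review, label in FEW_SHOT_EXAMPLES:
        lines += [f"Review: {review}", f"Sentiment: {label}", ""]
    # The model is expected to continue from the trailing "Sentiment:".
    lines += [f"Review: {text}", "Sentiment:"]
    return "\n".join(lines)

prompt = build_prompt("Great value, I would definitely come back.")
print(prompt)
# A completion endpoint would then be called on this string, e.g.:
#   label = complete(prompt).strip()   # hypothetical API call
```

No gradient updates happen anywhere in this flow; the "learning" is entirely in the conditioning text.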
And it's not only simple tasks like text classification; we can do translation, summarization, and much more complex tasks with this paradigm. We can even try things like arithmetic, where we give the model a word problem and ask it to come up with the answer. In the example we are showing right now, we gave the model just one sample to show it how we would solve a problem and then asked it to solve another one, but in that particular case the model actually failed: it did produce an answer, but not the correct one.

Then came the idea of chain-of-thought prompts, where instead of just showing the model the input and the output, we also show it the steps it can take to get from that input to that output. In this case we solve the arithmetic word problem step by step and show an example of that to the model. When we do that, the models are not only able to produce the correct answer, but they can also walk us step by step through how they produced it. That mechanism is referred to as chain-of-thought prompting, and it has been used prominently across many tasks, showing superior performance. It has also been used in many different ways, including in fine-tuning and training some of the models.
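The difference between a plain few-shot exemplar and a chain-of-thought exemplar can be made concrete with a standard arithmetic word problem; the exemplar text and the toy answer parser below are illustrative, not taken from the talk.

```python
# A plain few-shot exemplar shows only the input and the output.
plain_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: 11"
)

# A chain-of-thought exemplar shows the intermediate steps as well,
# prompting the model to reason step by step before answering.
cot_exemplar = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger starts with 5 balls. 2 cans of 3 balls is 2 * 3 = 6 balls. "
    "5 + 6 = 11. The answer is 11."
)

def extract_answer(completion):
    """Toy parser: read the number after 'The answer is'."""
    tail = completion.split("The answer is", 1)[1]
    return int("".join(ch for ch in tail if ch.isdigit()))

print(extract_answer(cot_exemplar))  # 11
```

Ending the reasoning with a fixed phrase like "The answer is" also makes the final answer easy to extract programmatically.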
The pre-train-then-fine-tune paradigm had been the established paradigm for years, since perhaps the inception of BERT and similar pre-trained language models. But now there has been an increasing shift toward using models by prompting them instead of fine-tuning them. That is evident in a lot of the practical usage of these models, and even in publications in machine learning areas that use natural language processing tasks, which are switching to prompting instead of fine-tuning.
In-context learning and prompting matter a lot because they change the way we apply models to new tasks. The ability to apply models to new tasks out of the box, without collecting additional data and without any additional training, is remarkable: it increases the number of tasks the models can be applied to and reduces the effort needed to build models for those tasks. The performance has also been impressive with only a few examples provided.

In this setting, the tasks are being adapted to the models rather than the models being adapted to the tasks. In the fine-tuning paradigm, we had a pre-trained model and we fine-tuned it to adapt it to the task. Now we try to frame the task in a way that is closer to how the model was trained, so that the model can perform well on the task even without any fine-tuning.
Finally, this allows humans to interact with the models in their normal form of communication: natural language. We can just give instructions describing the tasks we want, and the model will perform them. That blurs the line between who is an ML user and who is an ML developer, because now anyone can prompt, describe different tasks to the language model, and get it to do a large number of tasks without any training or development involved.
Looking back at the last three months or so, we have seen the field change quite a bit, with a tremendous amount of excitement around the release of the ChatGPT model. If we think about ChatGPT as a generative model, there have been other generative models out there, from the GPT family and elsewhere, that do a decent job at text generation. You can take one of these models, in this case GPT-3, and prompt it with a question asking it to explain what a foundational language model means, and it will give you a pretty decent answer. You can ask the same question to ChatGPT and find that it is able to provide a much better answer: it's longer, more thorough, more structured. You can ask it to style the answer in different ways, or to simplify it in different ways, and these are all capabilities that the previous generation of models could not really deliver.
If we look at how ChatGPT is described, the description lists several things, but it is mostly optimized for dialogue, allowing humans to interact in natural language; it is much better at following instructions; and so on. If we look step by step at how this manifested in the training, we will see from the description that, for the base models ChatGPT is built on and other models before it, language model training followed a self-supervised pre-training approach: we have a large amount of unsupervised, web-scale language data that we train the models on, and the models in this particular case are trained with an autoregressive next-word-prediction objective, looking at an input context, a sentence or part of a sentence, and trying to predict the next word.
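The autoregressive next-word-prediction objective can be illustrated with a toy count-based bigram model: given the current word, predict the word most often seen after it. The corpus below is made up, and real models replace the counts with a neural network over much longer contexts, but the prediction task has the same shape.

```python
from collections import Counter, defaultdict

# Toy training text for the bigram "language model".
corpus = "the cat sat on the mat and the cat ate the fish".split()

# Count which words follow which: the model's entire "training".
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def predict_next(word):
    """Return the most frequent word observed after `word`."""
    return following[word].most_common(1)[0][0]

def generate(word, n):
    """Autoregressive generation: feed each prediction back in."""
    out = [word]
    for _ in range(n):
        word = predict_next(word)
        out.append(word)
    return " ".join(out)

print(predict_next("the"))    # 'cat' ("cat" follows "the" twice)
print(generate("sat", 3))     # 'sat on the cat'
```

The `generate` loop is the essential autoregressive idea: each predicted word becomes part of the context for predicting the next one.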
But over the last year or so we have seen a shift where models are being trained not just on text but also on code. For example, the GPT-3.5 models are trained on both text and code. Surprisingly, training the models on both text and code improves their performance on many tasks that have nothing to do with code. In the figure we see right now, models trained with code are compared against models trained without it, and the models trained on both text and code show better performance at following task instructions and at reasoning compared to similar models trained on text only. So training on code seems to ground the models in different ways, allowing them to learn a little more about how to reason and how to look at structured relations between different parts of the text.
The second main difference is the idea of instruction tuning, which we have seen becoming more and more popular across different models over the last year, starting perhaps with InstructGPT, which introduced the idea of training the models on human-generated data. This is a departure from the traditional self-supervised approach, where we were only training the models on unsupervised, free, unstructured text. Now there is an additional step in the training process that trains the models on human-generated data. That data takes the format of a prompt and a response, and it tries to teach the model to respond in a particular way given a prompt. This instruction-tuning step has been helping the models get a lot better, especially in zero-shot performance, and we see here that instruction-tuned models tend to perform much better than their non-instruction-tuned counterparts, especially in zero-shot settings.

The last step of the training process introduces yet another kind of human-generated data. In this case we have different responses generated by the model, and a human provides preferences over these responses, in a sense ranking them and choosing which response is better than the others. This data is used to train a reward model, which can then be used to train the main model with reinforcement learning. This approach further aligns the model to respond in ways that correspond to the feedback the humans provided.

This notion of training the model with human feedback data is very interesting and is creating a lot of traction, with many people thinking about the best technique for training on human feedback and the best form of human feedback to collect. It will probably help us improve the models even further in the near future.
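The preference step can be sketched with the standard pairwise objective for reward models: push the human-preferred response to score higher than the rejected one, via loss = -log(sigmoid(r_chosen - r_rejected)). The reward values below are made-up scalars standing in for a real reward model's outputs.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def preference_loss(r_chosen, r_rejected):
    """Pairwise preference loss: small when the reward model ranks the
    chosen response above the rejected one, large otherwise."""
    return -math.log(sigmoid(r_chosen - r_rejected))

# Hypothetical reward scores for two responses to the same prompt.
agree = preference_loss(r_chosen=2.0, r_rejected=-1.0)     # matches the human ranking
disagree = preference_loss(r_chosen=-1.0, r_rejected=2.0)  # contradicts it
print(round(agree, 3), round(disagree, 3))  # 0.049 3.049
```

Minimizing this loss over many human comparisons trains the reward model, which then supplies the training signal for the reinforcement learning step the talk describes.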
With all these advances, the pace of innovation and the acceleration of progress have been moving so fast that it has been challenging in many ways, and perhaps one of the most profound challenges is the notion of benchmarking. Research in machine learning has traditionally depended on solid benchmarks for measuring the progress of different approaches, but the pace of innovation has been really straining that recently.

To understand how fast the progress has been, let's look at data from Hypermind, a forecasting company that uses crowd forecasting and has been tracking some of the AI benchmarks. The first benchmark is the Massive Multitask Language Understanding benchmark, a large collection of language understanding tasks. In June of 2021, a forecast was made that in a year, by June 2022, we would get to around 57% performance on this task. In reality, by June 2022 we were at around 67%, a couple of months later we were at 75%, and we keep seeing faster and faster improvements after that.
A second task is the MATH benchmark, a collection of middle and high school math problems. Here the prediction was that in a year we would get to around 13%, but in reality we ended up going well beyond that within one year, and we still see more and more advances happening at a faster-than-expected pace.

That rate of improvement is resulting in a lot of benchmarks being saturated really fast. If we look back at benchmarks like MNIST and Switchboard, it took the community 20-plus years to fully saturate them. That has been accelerating to the point where we now see benchmarks being saturated in a year or less. In fact, many benchmarks are becoming obsolete: only 66 percent of machine learning benchmarks have received more than three results at different time points, and many of them are solved or saturated soon after they are released.

That motivated the community to come together in very large efforts to design benchmarks meant specifically to challenge large language models. In one particular case, BIG-bench, more than 400 authors from over 100 institutions came together to create it. But even with such an elaborate effort, we are seeing very fast progress: with large language models and the chain-of-thought prompting we discussed earlier, we are making very fast progress on the hardest tasks in BIG-bench, and on many of them models are already performing better than humans right now.
Foundation models are not only getting better and better at benchmarks; they are actually changing many products we use every day. We mentioned code generation earlier, so let's talk a little bit about Copilot. GitHub Copilot is a new experience that helps developers write code. Copilot is interesting from many perspectives: one is how fast it went from the model being created in research to being generally available as a product in GitHub, but also how much user value it has been generating.

A study done by the GitHub Copilot team looked at quantifying the value these models provide to developers. In the first part of the study, they asked the developers questions to assess how useful the models are, and 88 percent of the participants reported that they feel much more productive when using Copilot than before, along with many other positive implications for their productivity.

Perhaps even more interesting, the study also ran a controlled experiment with two groups of developers trying to solve the same set of tasks. One group had access to Copilot and the other did not. Interestingly, the group with access to Copilot not only finished the tasks at a higher success rate but also much more efficiently: overall they were 55 percent more productive. Fifty-five percent more productivity in a coding scenario is amazing progress, and a lot of people would have been very surprised to see a model like Copilot deliver such value so fast.
Beyond code generation and text generation, another frontier where these models are starting to shine is when we connect them with external knowledge sources and external tools. Language models that have been optimized for dialogue have amazing language capabilities: they are really good at understanding language and following instructions, and they also do really well at synthesizing and generating answers. They are conversational in nature, and they do store knowledge from the data they were trained on. But they have a lot of limitations around reliability, factuality, staleness, access to more recent information that was not part of their training data, provenance, and so on. That's why connecting these models to external knowledge sources and tools could be super exciting.
Let's talk, for example, about connecting language models to search, as we have seen recently with the new Bing. If we look back years ago, there were many studies of web search and of the tasks people try to complete in web search scenarios. Many of these tasks were deemed complex search tasks: tasks that are not navigational, as in trying to go to a particular website, and that are not simple informational tasks, where you are trying to look up a fact you can quickly get with one query, but more complex tasks that involve multiple queries. Maybe you are planning a trip, or maybe you are trying to buy a product, and as part of your research process there are multi-faceted queries you would like to look at. There has been a lot of research on understanding user behavior with such tasks: how prevalent they are, and how much time and effort people spend performing them. They typically involve spending a significant amount of time with the search engine, reading and synthesizing information from different sources with different queries.

But with a new experience like the one Bing is providing, we can take one of these tasks and provide a much more complex, longer query to the search engine, and the search engine uses both search and the power of the language model to generate multiple queries, get the results of all of those queries, and then synthesize a detailed answer back to the searcher. Not only that, but it can recommend additional searches and additional ways you could interact with the search engine in order to learn more. That has the potential to save a lot of time and effort for many searchers and to support these complex search tasks in a much better way.
Not only that, but some of these complex search tasks are multi-step in nature, where I would start with one query and then follow up with another query based on the information I get from the first. Imagine I am searching before the Super Bowl, trying to compare the stats of the two quarterbacks who are going to face each other, and I start with that query. What the search engine did in that particular case is that it actually started with a query to identify the two quarterbacks who are going to be playing in the Super Bowl. If I had done that as a human, I would identify the teams and the two quarterbacks, then maybe follow up with another query to search for the stats of the two quarterbacks I'm asking about, and then synthesize the information, maybe from different results, to get the answer I'm looking for. But with the new Bing experience, I can just issue the query, and all of that happens in the background: different search queries are generated and submitted to the search engine, the results are collected, and a single answer is synthesized and displayed, making me as a searcher much more productive and much more efficient.
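The multi-step flow described here, one user query fanned out into background sub-queries and synthesized into a single answer, can be sketched as follows. The tiny in-memory index, the team and player names, and the numbers are all invented for illustration; a real system would call a search API and use the language model to generate both the sub-queries and the final synthesis.

```python
# Mock "search engine": a hard-coded lookup standing in for a real API.
MOCK_INDEX = {
    "who is playing in the super bowl":
        "Team A (quarterback Alice) will face Team B (quarterback Bob).",
    "quarterback stats alice": "Alice: 4,500 passing yards this season.",
    "quarterback stats bob": "Bob: 4,100 passing yards this season.",
}

def search(query):
    """Stand-in for a search-engine call."""
    return MOCK_INDEX.get(query.lower(), "no results")

def answer_complex_query(user_query):
    # Step 1: a first query resolves the entities the question depends on.
    context = search("who is playing in the super bowl")
    # Step 2: follow-up queries (hard-coded here; a real system would
    # derive them from the step-1 result with the language model).
    stats = [search("quarterback stats alice"),
             search("quarterback stats bob")]
    # Step 3: synthesize one answer from all collected results.
    return " ".join([context] + stats)

answer = answer_complex_query("compare the stats of the two Super Bowl quarterbacks")
print(answer)
```

The user issues one query; the decomposition, the follow-up searches, and the synthesis all happen behind the scenes, which is the productivity gain the talk describes.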
The potential of LLMs, large language models, integrated with search and other tools is huge and can add a great deal of value in many scenarios. But there are also a lot of challenges, opportunities, and limitations that need to be addressed. Reliability and safety are among them: making the models more accurate, and thinking about trust, provenance, and bias.

User experience and behavior, and how a new experience affects how users interact with the search engine, is another, with new and different tasks, different user interfaces, and even different behavior models. Search has been a very well studied experience; we have a very good understanding of how users interact with the search engine and very reliable behavior models to predict that. Changing this experience will require a lot of additional studies. Personalization, and managing user preferences, search history, and so on, has also been a very well studied field in web search, and with new experiences like this we have many opportunities to think about personalization and user experience again.

There is also evaluation: what do metrics mean, how do we measure user satisfaction, and how do we understand good and bad abandonment? Good abandonment is when people are satisfied with the results without having to click on anything on the search result page, and bad abandonment is the opposite of that. And there are feedback loops, which have played a large part in improving search engines: how can we apply them to new experiences and new scenarios? So while integrating language models with experiences like search and other tools is very exciting, it is also creating many opportunities for new research problems, and for revisiting earlier search problems that we had a very good understanding of.

To conclude, we have been seeing incredible advances in AI over the past couple of years. The progress has been accelerating and outpacing expectations in many ways, and the advances are not only in terms of academic benchmarks and publications; we are also seeing an explosion of applications that are changing the products we use every day. However, we are much closer to the beginning of a new era of AI than we are to the end state of AI's capabilities. There are many opportunities, and we will probably see a lot more advances, and even more accelerated progress, over the coming months and years. Many challenges remain, and many new opportunities are arising because of where these models are. It's a very exciting time for AI, and we are really looking forward to the advances that will happen moving forward, to the applications that will result from them, and to how they will affect every one of us through the products we use every day. Thank you so much.