Foundation models and the next era of AI

Microsoft Research
23 Mar 2023 · 28:37

Summary

TL;DR: The video discusses recent advances in AI, focusing on large language models like GPT-3 and OpenAI's ChatGPT. It outlines the key innovations enabling progress: Transformer architectures, massive scale, and few-shot in-context learning. Models now solve complex benchmarks rapidly and power products like GitHub Copilot, but open challenges remain around trust, safety, personalization, and more. We are still early in realizing AI's full potential; more accelerated progress lies ahead as models integrate with search, tools, and experiences, creating ample research opportunities.

Takeaways

  • 😲 AI models have made huge advances in generative capabilities recently, with high-quality text, image, video and code generation
  • 😎 Transformers have come to dominate much of AI, with their efficiency, scalability and attention mechanism
  • 🚀 Scale of models and data keeps increasing, revealing powerful 'emergent capabilities' once models pass a critical scale
  • 💡 In-context learning allows models to perform new tasks well with no gradient updates, just prompts
  • 👍 Chain of Thought prompting guides models to reason step-by-step, greatly improving performance
  • 📈 Benchmarks are being solved rapidly, requiring constant refresh and expansion to track progress
  • 🤖 Large language models integrated into products are transforming user experiences e.g. GitHub Copilot
  • 🔎 Integrating LLMs with search and other tools has huge potential but also poses big challenges
  • ☁️ We're still at the very start of realizing AI's capabilities - more advances coming quickly!
  • 😊 AI progress is accelerating and affecting everyday products - exciting times ahead!

Q & A

  • What architectural innovation allowed AI models to achieve superior performance on perception tasks?

    -The Transformer architecture, which relies on an attention mechanism to model interdependence between different components in the input and output.
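
    For reference, the attention mechanism this answer refers to is usually written as scaled dot-product attention. This is the standard formulation from the Transformer literature, not an equation shown in the video:

    ```latex
    % Scaled dot-product attention: Q, K, V are the query, key, and
    % value matrices; d_k is the dimensionality of the keys.
    \[
    \mathrm{Attention}(Q, K, V) =
      \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
    \]
    ```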

  • How did the introduction of in-context learning change the way AI models can be applied to new tasks?

    -In-context learning allows models to perform new tasks directly from pretrained versions, without additional training data or tuning. This expands the range of possible applications and reduces the effort to deploy models on new tasks.
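
    To make the paradigm concrete, here is a minimal sketch of zero-shot versus few-shot prompting. The prompts are illustrative; any text-completion API could consume them, and no gradient updates are involved.

    ```python
    # Minimal sketch of in-context learning: the task is specified
    # entirely in the prompt, with no fine-tuning of the model.

    def zero_shot_prompt(text: str) -> str:
        # Zero-shot: only an instruction describing the task.
        return (
            "Classify the sentiment of the text as Positive or Negative.\n"
            f"Text: {text}\n"
            "Sentiment:"
        )

    def few_shot_prompt(text: str) -> str:
        # Few-shot: a handful of worked examples precede the new input.
        return (
            "Text: I loved this movie.\nSentiment: Positive\n\n"
            "Text: The food was cold and bland.\nSentiment: Negative\n\n"
            f"Text: {text}\nSentiment:"
        )
    ```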

  • What training innovations were introduced in ChatGPT compared to previous self-supervised models?

    -ChatGPT introduced instruction tuning on human-generated prompt-response examples and reinforcement learning from human preferences over model responses.

  • Why is benchmarking progress in AI becoming increasingly challenging?

    -Benchmarks are being solved at an accelerating pace by advancing models, often within months or even weeks of release, limiting their usefulness for measuring ongoing progress.

  • How does GitHub Copilot demonstrate the rapid transition of AI from research to product?

    -GitHub launched Copilot, which assists developers by generating code, shortly after the underlying AI model was created. Studies show it makes developers 55% more productive on coding tasks.

  • What are some limitations of language models that can be addressed by connecting them with search engines or other external tools?

    -Language models have limitations relating to reliability, factual correctness, access to recent information, provenance tracking, etc. Connecting them to search engines and knowledge bases can provide missing capabilities.

  • What user experience challenges are introduced when language models are integrated into existing products like search engines?

    -Challenges include revisiting assumptions about metrics, evaluation, personalization, user satisfaction, intended usage patterns, unintended behavior changes, and how to close feedback loops.

  • What evidence suggests we are still in the early stages of realizing AI's capabilities?

    -The rapid pace of recent innovations, waves of new applications to transform existing products, and many remaining open challenges around aspects like safety and reliability indicate the technology still has far to progress.

  • How did training language models jointly on text and code give better performance?

    -Training on code appeared to help ground models' reasoning and understanding of structured relationships between elements, transferring benefits to other language tasks.

  • What techniques have researchers proposed for further improvements by training AI systems on human feedback?

    -Ideas include prompt-based training, preference learning over model responses, and reinforcement learning from human judgments.

Outlines

00:00

📈 Overview of AI progress in recent years

The paragraph provides an overview of the major advances in AI over the past 5-10 years. It discusses progress in perception tasks like image/speech recognition and then shifts to recent breakthroughs in generative AI for text, images and video. It highlights models like DALL-E 2, Imagen, Stable Diffusion that showcase quality image generation capabilities.

05:01

📊 Key factors behind current AI capabilities

The paragraph discusses three key factors behind current AI capabilities: Transformer architectures, scale/compute, and in-context learning. It details how Transformers came to dominate NLP and then other modalities, and covers how scale leads to emergent capabilities, using arithmetic word problem solving as an example where performance jumps at a critical scale.

10:02

👩‍💻 New paradigm of in-context learning

The paragraph explains the shift to in-context learning, where pre-trained models can be used to perform new tasks just with prompting and examples instead of fine-tuning. This reduces data needs and effort, and allows models to be applied to more tasks. Performance has been strong in few-shot settings across various tasks. It adapts tasks to models rather than models to tasks.

15:04

🤖 Novel aspects of ChatGPT training process

The paragraph analyzes key aspects of ChatGPT's training - use of text and code, instruction tuning on human demonstrations, and reinforcement learning from human preferences. These align the model better to generate high-quality responses tuned to human judgments and interactions. Training on code especially helps with following instructions and reasoning.

20:06

🚀 Foundation models transforming products

The paragraph discusses examples of foundation models driving impact in products - GitHub Copilot for coding and Bing search integration. Studies show Copilot drives 55% higher developer productivity. Search integration handles complex, multi-step tasks automatically by orchestrating queries in the background and synthesizing an answer.

25:07

🌄 Opportunities and challenges moving forward

The paragraph concludes by summarizing tremendous progress but noting we are still early in realizing AI's full potential. Many challenges remain around trust, bias, user experience evaluation etc. But also opportunities to improve models further and apply them to transform more products used daily.

Keywords

💡Foundation models

Foundation models refer to large, pretrained models that can be adapted to many downstream tasks through transfer learning. As discussed in the video, foundation models like GPT-3 and ChatGPT have enabled new capabilities like few-shot in-context learning. Their scale and versatility make them a foundation for further AI progress.

💡Transformers

Transformers are a type of neural network architecture that has become dominant in NLP tasks. As stated in the video, transformers are efficient, easy to scale and parallelize, and rely on an attention mechanism to model relationships between input and output tokens. They now are being applied not just in NLP but also for images, video, and other modalities.

💡Scale

Scale refers to the computational resources used to train AI models, both in terms of model size (number of parameters) and amount of training data. As discussed in the video, scale unlocks emergent capabilities at critical points; for example, arithmetic word problem solving sharply improves above a certain scale threshold.

💡In-context learning

In-context learning is a new paradigm enabled by foundation models where they can be applied to downstream tasks by simply providing a natural language prompt describing the task, without any gradient-based fine-tuning. As noted in the video, this blurs the line between ML developers and users.

💡Instruction tuning

Instruction tuning refers to additional training of foundation models on human-generated demonstration data in a prompt-response format. As stated, this further aligns models to generate responses suited for the task based on instructions, improving zero-shot capabilities.
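
A minimal sketch of what instruction-tuning data looks like, assuming an illustrative prompt/response schema (the actual datasets and field names are not shown in the video):

```python
# Instruction tuning: supervised fine-tuning on human-written
# prompt-response pairs. The field names here are illustrative
# assumptions, not an actual dataset schema.
instruction_data = [
    {
        "prompt": "Explain what a foundation model is in two sentences.",
        "response": "A foundation model is a large model pretrained on "
                    "broad data that can be adapted to many downstream "
                    "tasks. Large language models like GPT-3 are examples.",
    },
    {
        "prompt": "Summarize: 'The meeting moved to Tuesday at 3pm.'",
        "response": "The meeting is now on Tuesday at 3pm.",
    },
]
# Fine-tuning maximizes the likelihood of each response given its
# prompt, teaching the model to follow instructions.
```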

💡Reinforcement learning

Reinforcement learning trains models to choose actions based on feedback. As discussed, models like ChatGPT use human preferences over model responses to train a reward model, which is then used to further optimize the main model via RL.
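
The reward model in this pipeline is commonly trained with a pairwise preference loss. This is a standard formulation from the RLHF literature, stated here as an assumption since the video does not give the equation:

```latex
% Pairwise preference loss for a reward model r_theta:
% x is the prompt, y_w the human-preferred response, y_l the other.
\[
\mathcal{L}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}
  \left[ \log \sigma\big( r_\theta(x, y_w) - r_\theta(x, y_l) \big) \right]
\]
```

The trained reward model then scores new responses, and reinforcement learning optimizes the language model to produce high-reward outputs.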

💡Benchmarking

Measuring progress in AI through standardized benchmarks has been integral to research but rapid advances have resulted in saturation of benchmarks. Creating comprehensive benchmarks like BIG-bench was an attempt to measure foundation model capabilities.

💡GitHub Copilot

GitHub Copilot is a prime example of rapid deployment of foundation models discussed in the video - it went from research concept to widely used tool providing 55% productivity gains for developers through code suggestions.

💡Search

Integrating foundation models with search and other tools to provide multi-turn conversational search experiences is an exciting opportunity highlighted. It can enhance complex, multi-step search tasks by automatically generating queries and summarizing information.
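
A minimal sketch of the orchestration loop described here, under the assumption of three hypothetical helpers (in a real system the first and last would call a language model and the middle a search API):

```python
# Sketch of LLM + search orchestration for a complex, multi-step task.
# All three helpers are illustrative stand-ins, not a real API.

def generate_queries(task: str) -> list[str]:
    # Stand-in: an LLM would decompose the task into sub-queries.
    return [f"{task} (sub-query {i})" for i in range(1, 3)]

def web_search(query: str) -> str:
    # Stand-in: a search API would return result snippets.
    return f"results for: {query}"

def synthesize(task: str, results: list[str]) -> str:
    # Stand-in: an LLM would write one answer grounded in the results.
    return f"Answer to '{task}' based on {len(results)} result sets."

def answer_complex_query(user_query: str) -> str:
    queries = generate_queries(user_query)        # 1. decompose the task
    results = [web_search(q) for q in queries]    # 2. search in background
    return synthesize(user_query, results)        # 3. synthesize one answer
```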

💡Challenges

As noted in the concluding remarks, foundation models introduce challenges around safety, fairness, and provenance, while evaluation metrics and user-experience considerations for search integration present fresh research opportunities.

Highlights

AI has been making significant impact on perception tasks like image recognition, speech recognition and language understanding.

The frontier of AI has shifted toward generative AI, with progress in areas like text generation, image generation, video generation and code generation.

Transformers have been dominating the field of AI, relying on the attention mechanism to model interdependence between input and output data.

Scale of compute used for training has led to emergent capabilities, where models demonstrate new abilities only after reaching a critical scale.

In-context learning allows models to perform new tasks out of the box without additional data or training, just using prompts.

Chain of Thought prompting shows models the steps to solve a problem, significantly improving performance on complex tasks.
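
To illustrate, a chain-of-thought prompt embeds the intermediate reasoning steps in the worked example, as in this standard-style arithmetic prompt (illustrative; the exact prompts from the talk are not reproduced here):

```python
# Chain-of-thought prompt: the worked example includes the reasoning
# steps, so the model imitates step-by-step reasoning on the new problem.
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 cans of 3 balls each. "
    "How many balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 balls is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
    "Q: The cafeteria had 23 apples. It used 20 and bought 6 more. "
    "How many apples are there now?\n"
    "A:"
)
```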

Training language models on both text and code seems to ground them, allowing better reasoning and understanding of structured relations.

Instruction tuning exposes models to human generated prompt-response pairs, aligning them to respond appropriately.

Reinforcement learning from human preferences on model responses further adapts models to produce favored outputs.

Rapid progress has quickly made benchmarks obsolete, requiring new coordinated benchmarking efforts.

Advances are changing products like GitHub Copilot, boosting developer productivity by over 50% in studies.

Connecting language models to search and other tools is promising but raises new research questions around reliability, behavior modeling, personalization, evaluation metrics, etc.

The incredible pace of AI progress is outpacing expectations on academic benchmarks and leading to new applications.

There are still many challenges and opportunities at this beginning stage of a new AI era that will shape future advances.

Practical applications of AI advances will increasingly impact products people use every day.

Transcripts

00:00

Hello everyone, my name is Ahmad. I am a researcher here at Microsoft Research. Today I am going to be talking about foundation models and the impact they are having on the current era of AI.

00:11

If we look back at the last 5 to 10 years, AI has been making a significant impact on many perception tasks like image and object recognition, speech recognition, and most recently on language understanding tasks, where we have seen different AI models achieving superior performance, in many cases reaching performance equal to what a human annotator would do on the same task.

00:38

Over the last couple of years, though, the frontier of AI has shifted toward generative AI. We have had quite good text generation models for some time: you could prompt a model asking it to describe an imaginary scene, and it would produce a very good description of what you asked for.

00:58

Then we started making a lot of progress on image generation as well, with models like DALL-E 2 and Imagen, and even models coming out of startups like Midjourney and Stability AI. We have been getting to a level of quality in image generation that we have never seen before, and inspired by that there has also been a lot of work on animating the generated images, or even generating videos from scratch.

01:24

Another frontier for generative models has been code: not only generating code from a text prompt, but also explaining the code, or in some cases even debugging it.

01:38

I was listening to an episode of Morning Edition on NPR when it aired at the beginning of February, where they were attempting to use a bunch of AI models to produce a schematic design of a rocket and to come up with some equations for the rocket design. Of course the hypothetical design would have crashed and burned, but I couldn't help but think how exciting it is that AI has become so good that we are even attempting to measure its proficiency on a field as complex as rocket science.

02:12

If we look back, we will find that there are three main components that led to the current performance we are seeing from AI models: the Transformer architecture, scale, and in-context learning.

02:25

The Transformer in particular has dominated the field of AI in recent years. This started with natural language processing, where the architecture was so effective that it took over the field within a very short amount of time. A Transformer is a very efficient architecture that is easy to scale, easy to parallelize, and relies at its heart on the attention mechanism, a technique that allows us to model interdependence between different components, or different tokens, in our input and output data.

03:00

Transformers started off mostly in natural language processing, but slowly but surely they made their way to pretty much every modality, so now we are seeing that models operating on images, video, audio, and many other modalities are also using Transformers.

03:20

Five years after their inception, Transformers have surprisingly changed little, despite many attempts at producing better and more efficient variants. Perhaps that is because the gains were limited to certain use cases, or because the gains did not persist at scale. Another potential reason is that the modifications may have made the architecture less universal, which has been one of its biggest advantages.

03:53

The next point is scale, and when we talk about scale we really mean the amount of compute used to train the model. That can translate into training bigger and bigger models with larger and larger numbers of parameters, which we have seen steadily increase over the previous years, but scale can also mean more data: using larger and larger amounts of data to train the model. Different models over the previous few years have taken different approaches in deciding how much data to use and how large the model should be, but the consistent trend is that we have been scaling larger and larger and using more and more compute.

04:36

Scale has also led to what are being called emergent capabilities, one of the most interesting properties of scale described over the previous year or so. By emergent capability we mean that the model starts to show a certain ability only when it reaches a critical scale; before that, the model does not demonstrate the ability at all. For example, let's look at the figures here. On the left-hand side we see arithmetic: if we try to use language models to solve arithmetic word problems, up until a certain scale they absolutely cannot solve the problem in any way and do not perform any better than random. But at a certain critical point we start seeing improved performance, and that performance just keeps getting better and better. We have seen the same on many other tasks as well, ranging from arithmetic to transliteration to multi-task learning.

05:37

And perhaps one of the most exciting emergent capabilities of language models recently is their ability to learn in context, which has introduced a new paradigm for using these models.

05:51

If we take a look back at how we have been practicing machine learning in general: with deep learning, you would start by choosing an architecture (a Transformer, or before that an RNN or a CNN) and then train your model in a fully supervised way: you have a lot of labeled data and you train your model on that data.

06:12

When we started getting into pre-trained models, instead of training models from scratch we would start off with a pre-trained model and then fine-tune it, still on a lot of fully supervised labeled data for the task at hand.

06:27

But with in-context learning, suddenly we can use the models out of the box. We can just take a pre-trained model and use a prompt to perform a new task, without actually doing any learning. We can do that in zero-shot settings, meaning we do not provide any examples at all, just instructions or a description of what the task is, or in a few-shot setting, where we provide a small handful of examples to the model.

06:57

For example, if we are interested in text classification, in this case sentiment analysis, we can just provide the text to the model and ask it to classify the text as either positive or negative.

07:11

If the task is a little bit harder, we can provide few-shot samples: just a few examples of how we want the model to classify things into, say, positive, negative, or neutral, and then ask the model to reason about a new piece of text. It actually does a pretty good job at it.

07:29

And it's not only simple tasks like text classification; we can do translation or summarization and much more complex tasks with that paradigm.

07:39

We can even try things like arithmetic, where we give the model a word problem and ask it to come up with the answer.

07:47

In the example we are showing right now, we gave the model just one sample to show it how we would solve a problem and then asked it to solve another problem. In that particular case the model actually failed: it did produce an answer, but it was not the correct one.

08:03

But then came the idea of chain-of-thought prompts, where instead of just showing the model the input and the output, we also show it the steps it can take to get to that output from that particular input. In this case we solve the arithmetic word problem step by step and show an example of that to the model. When we do that, the models are not only able to produce the correct answer, but they are also able to walk us step by step through how they produced it.

08:35

That mechanism is referred to as chain-of-thought prompting, and it has been used very prominently, showing superior performance on multiple tasks. It has also been used in many different ways, including in fine-tuning and training some of the models.

08:53

The pre-train-then-fine-tune paradigm had been the established paradigm for years, since maybe the inception of BERT and similar pre-trained language models. But now you see an increased shift toward using the models by prompting them instead of having to fine-tune them. That is evident in a lot of practical usage of the models, and even in publications in the machine learning areas that use natural language processing tasks, which are switching to prompting instead of fine-tuning.

09:26

In-context learning and prompting matter a lot because they are changing the way we apply the models to new tasks. The ability to apply the models to new tasks out of the box, without collecting additional data and without doing any additional training, is an amazing ability that increases the number of tasks the models can be applied to and reduces the effort needed to build models for these tasks.

09:57

The performance has also been amazing with only a few examples provided.

10:02

And the tasks in this setting are being adapted to the models rather than the models being adapted to the tasks. In the fine-tuning paradigm, we already had a pre-trained model and we fine-tuned it to adapt to the task; now we are trying to frame the task in a way that is more friendly to how the model was trained, so that the model can perform well on the task even without any fine-tuning.

10:28

Finally, this allows humans to interact with the models in their normal form of communication: natural language. We can just give instructions describing the tasks that we want and the model will perform them. That blurs the line between who is an ML user and who is an ML developer, because now anyone can prompt and describe different tasks to the language model and get it to do a large number of tasks without having any training or development involved.

11:02

Now, looking back at the last three months or so, we have seen the field changing quite a bit, and a tremendous amount of excitement happening around the release of the ChatGPT model.

11:15

If we think about ChatGPT as a generative model, there have been other generative models out there, from the GPT family and others, that have been doing a decent job at text generation. You can take one of these models, in this case GPT-3, and prompt it with a question asking it to explain what a foundational language model means, and it will give you a pretty decent answer.

11:41

You can ask the same question to ChatGPT and find that it is able to provide a much better answer: it is longer, more thorough, more structured. You can ask it to style the answer in different ways, or to simplify it in different ways, and all of these are capabilities that the previous generation of models could not really do.

12:02

If we look at how ChatGPT is described, the description lists several things, but mostly: it is optimized for dialogue, allowing humans to interact in natural language; it is much better at following instructions; and so on. If we look step by step at how this was manifested in the training, we will see from the description that the base models ChatGPT is built on, and other models before ChatGPT, followed a self-supervised pre-training approach: we have a lot of unsupervised, web-scale language data that we train the models on, and the models in this particular case are trained with an autoregressive next-word-prediction approach, looking at an input context (a sentence or part of a sentence) and trying to predict the next word.

12:59

But over the last year or so we have seen a shift where models are trained not just on text but also on code. For example, the GPT-3.5 models are trained on both text and code. And surprisingly, training the models on both text and code improves their performance on many tasks that have nothing to do with code. In the figure we see right now, models that were trained with code are compared with models that were not, and the models trained on both text and code show better performance at following task instructions and better performance at reasoning than similar models trained on text only.

13:44

So the training on code seems to be grounding the models in different ways, allowing them to learn a little bit more about how to reason and how to look at structured relations between different parts of the text.

13:57

The second main difference is the idea of instruction tuning, which has become more and more popular across different models over the last year, maybe starting with InstructGPT, which introduced the idea of training the models on human-generated data. This is a departure from the traditional self-supervised approach, where we trained the models only on unsupervised, free, unstructured text. Now there is an additional step in the training process that trains the models on human-generated data. The human-generated data takes the format of a prompt and a response, and it tries to teach the model to respond in a particular way given a prompt.

14:43

This step of instruction tuning has been helping the models get a lot better, especially in zero-shot performance, and we see here that instruction-tuned models tend to perform a lot better than their non-instruction-tuned counterparts, especially in zero-shot settings.

15:01

The last step of the training process introduces yet more human-generated data. In this case we have different responses generated by the model, and a human provides preferences over these responses, in a sense ranking them and choosing which response is better than the others. This data is used to train a reward model, which can then be used to train the main model with reinforcement learning. This approach further aligns the model to respond in ways that correspond to the feedback the humans provided.

15:40

This notion of training the model with human feedback data is very interesting, and it is getting a lot of traction, with many people thinking about the best technique for training on human feedback data and the best form of human feedback to collect. It will probably help us improve the models even further in the near future.

16:00

Now, with all these advances, the pace of innovation and the acceleration of the advances have been moving so fast that it has been very challenging in many ways, but perhaps one of the most profound is the notion of benchmarking. Traditionally, research in machine learning has depended on solid benchmarks for measuring the progress of different approaches, but the pace of innovation has really been challenging that recently.

16:34

To understand how fast the progress has been, let's look at data from Hypermind, a forecasting company that uses crowd forecasting and has been tracking some AI benchmarks recently. The first benchmark is the Massive Multitask Language Understanding (MMLU) benchmark, a large collection of language understanding tasks.

16:56

In June of 2021 a forecast was made that in a year, by June 2022, we would get to around 57% performance on this task. In reality, by June 2022 we were at around 67%, and a couple of months later we were at 75%, and we keep seeing more fast improvements after that.

17:21

A second task is the MATH test, a collection of middle and high school math problems. Here the prediction was that in a year we would get to around 13%, but in reality we ended up going far beyond that within one year, and we still see more and more advances happening at a faster-than-expected pace.

17:43

That rate of improvement is resulting in a lot of the benchmarks being saturated really fast. If we look back at benchmarks like MNIST and Switchboard, it took the community 20-plus years to fully saturate them. That has been accelerating to the point where we now see benchmarks saturated in a year or less.

18:09

In fact, many of the benchmarks are becoming obsolete, to the point that only 66 percent of machine learning benchmarks have received more than three results at different time points, and many of them are solved or saturated soon after they are released.

18:28

That motivated the community to come together in very large efforts to design benchmarks specifically meant to challenge large language models. In that particular case, BIG-bench, more than 400 authors from over 100 institutions came together to create it. But even with such an elaborate effort, we are seeing very fast progress: with large language models and the chain-of-thought prompting we discussed earlier, we are making very fast progress against the hardest tasks in BIG-bench, and on many of them models are already performing better than humans right now.

19:10

The foundation models are not only getting better and better at benchmarks; they are actually changing many products that we use every day.

19:20

We mentioned code generation earlier, so let's talk a little bit about Copilot. GitHub Copilot is a new experience that helps developers write code. Copilot is very interesting from many perspectives: one is how fast it went from the model being created in research to being generally available as a product in GitHub Copilot, but also in how much user value it has been generating.

19:50

This is a study done by the GitHub Copilot team that looked at quantifying the value these models provide to developers. In the first part of the study, they asked developers different questions to assess how useful the models are, and 88 percent of the participants reported that they feel much more productive when using Copilot than before, along with many other positive implications for their productivity.

20:21

Perhaps even more interesting, the study included a controlled experiment where two groups of developers tried to solve the same set of tasks; one group had access to Copilot and the other did not. Interestingly, the group with access to Copilot not only finished the tasks at a higher success rate but also much more efficiently: overall, they were 55 percent more productive. Fifty-five percent more productivity in a coding scenario is amazing progress; a lot of people would have been very surprised to see a model like Copilot delivering such value so fast.

21:09

Now, beyond code generation and text generation, another frontier where these models are starting to shine is when we start connecting them with external knowledge sources and external tools.

21:24

Language models that have been optimized for dialogue have amazing language capabilities: they do really well at understanding language and following instructions, and they do really well at synthesizing and generating answers. They are also conversational in nature, and they do store knowledge from the data they were trained on. But they have a lot of limitations around reliability, factualness, staleness, access to more recent information that was not part of their training data, provenance, and so on, and that is why connecting these models to external knowledge sources and tools could be super exciting.

22:05

Let's talk about, for example, connecting language models to search, as we have seen recently with the new Bing.

22:14

If we look back years ago, there were many, many studies of web search, studying the tasks people try to complete in web search scenarios, and many of these tasks were deemed complex search tasks: tasks that are not navigational (trying to go to a particular website) and that are not simple informational tasks where you look up a fact you can quickly get with one query, but more complex tasks that involve multiple queries. Maybe you are planning a trip, or maybe you are trying to buy a product, and as part of your research process there are multiple facets you would like to look at.

22:56

There has been a lot of research on understanding user behavior with such tasks, how prevalent they are, and how much time and effort people spend performing them. They typically involve spending a significant amount of time with the search engine, reading and synthesizing information from different sources with different queries.

23:19

But with a new experience like the one Bing is providing, we can take one of these tasks and provide a much more complex, long query to the search engine. The search engine uses both search and the power of the language model to generate multiple queries, get the results of all of those queries, and then synthesize a detailed answer back to the searcher. Not only that, but it can recommend additional searches and additional ways you could interact with the search engine to learn more.

23:51

That has the potential of saving a lot of time and effort for many searchers and supporting these complex search tasks in a much better way.

24:01

Not only that, but some of these complex search tasks are multi-step in nature, where I would start with one query and then follow up with another query based on the information I get from the first. Imagine that I am searching before the Super Bowl, trying to compare the stats of the two quarterbacks who are going to face each other, and I start with that query.

24:28

What the search engine did in that particular case is that it started with a query trying to identify the two quarterbacks who are going to be playing in the Super Bowl. If I had done that as a human, I would identify the teams and the two quarterbacks, and then follow up with another query where I would search for the stats of the two quarterbacks, get them, synthesize the information, maybe from different results, and then get to the answer I am looking for. But with the new Bing experience I can just issue the query and all of that happens in the background: different search queries are generated and submitted to the search engine, the results are collected, and a single answer is synthesized and displayed, making me as a searcher much more productive and much more efficient.

25:21

The potential of large language models integrated with search and other tools is very large and can add much value to many scenarios.

25:33

But there are also a lot of challenges, opportunities, and limitations that need to be addressed. Reliability and safety are among them: making the models more accurate, and thinking about trust, provenance, and bias. User experience and behavior, and how the new experience will affect how users interact with the search engine, is another, with new and different tasks, different user interfaces, or even different behavior models. Search has been a very well-studied experience; we have a very good understanding of how users interact with the search engine and very reliable behavior models to predict that, and changing this experience will require a lot of additional study.

26:18

Personalization and managing user preferences, search history, and so forth have also been very well studied in web search, and with new experiences like this we have many opportunities to think about things like personalization and user experience again.

26:34

There is also evaluation, and what the metrics mean: how do we measure user satisfaction, and how do we understand good and bad abandonment (good abandonment as in when people are satisfied with the results but do not have to click anything on the search result page, and bad abandonment being the opposite of that). And there are feedback loops, which have played a large part in improving search engines, and the question of how we can apply them to new experiences and new scenarios.

27:03

So while integrating language models with an experience like search, and with other tools and experiences, is very exciting, it is also creating many opportunities for new research problems, and for revisiting previous search problems that we had a very good understanding of.

27:19

To conclude, we have been seeing incredible advances with AI over the past couple of years. The progress has been accelerating and outpacing expectations in many ways, and the advances are not only in terms of academic benchmarks and publications; we are also seeing an explosion of applications that are changing the products we use every day.

27:47

However, we are much closer to the beginning of a new era of AI than we are to the end state of AI capabilities. There are so many opportunities, and we will probably see a lot more advances and even more accelerated progress over the coming months and years.

28:06

And there are so many challenges that remain, and many new opportunities arising because of the state of where these models are.

28:14

It is a very exciting time for AI, and we are really looking forward to the advances that will happen moving forward, to the applications that will result from these advances, and to how they will affect every one of us through the products we use every day. Thank you so much.