Andrew Ng - Why Data Engineering is Critical to Data-Centric AI

Monday Morning Data Chat
16 Sept 2024 · 27:45

Summary

TL;DR: In this podcast, Andrew and Joe discuss the paradigm shift from model-centric to data-centric AI. They explore how data engineering is pivotal to successful AI implementation, especially with the rise of generative AI and large language models. Andrew highlights the importance of curating high-quality data for training and fine-tuning models. The conversation also touches on emerging AI applications in education and the future of work, emphasizing the potential for generative AI to transform these sectors.

Takeaways

  • 📈 Data Centric AI is gaining momentum as a shift from the traditional model-centric approach, emphasizing the importance of data quality and engineering.
  • 🔄 The process of training AI models involves a significant amount of data handling, from selection to fine-tuning, which is often more labor-intensive than model engineering itself.
  • 🌐 Data issues are prevalent in generative AI, affecting training, fine-tuning, and practical deployment, highlighting the need for robust data strategies.
  • 🔄 An example of data curation in AI is Meta's use of an earlier model, Llama 3, to generate coding puzzles that trained the subsequent model, Llama 3.1.
  • 📚 Licensing and procurement of data are significant challenges, especially when aligning data with human values and labeling schemas.
  • 🔧 Data engineering is mission-critical for AI, with decisions around data storage, cloud services, and database schemes being vital for performance.
  • 🏢 Companies are urged to invest in data infrastructure with specific use cases in mind to avoid over-optimization and to build a strong foundation for AI applications.
  • 🤖 Agentic workflows, where AI models engage in a process of thinking and refining their outputs, are emerging as a key strategy to enhance AI output quality.
  • 📊 The intelligence of AI models is largely derived from the data they've been trained on, suggesting that data curation is paramount for achieving intelligent behavior.
  • 🎓 Generative AI has the potential to transform education, making coding companions and customized courses more accessible and effective.

Q & A

  • What is the main topic of discussion in the transcript?

    -The main topic of discussion is Data Centric AI and its significance in contrast to the traditional model-centric approach in AI development.

  • What is the significance of data engineering in Data Centric AI?

    -Data engineering is critical to Data Centric AI as it involves structuring the data effectively to build successful AI systems, including error analysis, data curation, and making trade-offs between data quality and quantity.

  • How does Andrew view the shift from model-centric to data-centric AI?

    -Andrew sees the shift as a practical evolution where focusing on data engineering and curation is more fruitful than solely engineering the mathematical models or AI algorithms.

  • What is the role of data in training large language models?

    -Data plays a significant role in training large language models, with a large fraction of effort dedicated to acquiring the right data to feed into these models for effective training and fine-tuning.

  • What is an agentic workflow in the context of AI?

    -An agentic workflow in AI involves a series of steps where an AI model generates content, reflects on it, possibly does additional research, and iteratively improves the output before finalizing it.

  • How does Andrew perceive the current momentum in the Data Centric AI community?

    -Andrew finds the momentum in the Data Centric AI community quite exciting, noting the increased focus on data issues across various stages of AI model development.

  • What is the importance of data infrastructure in building AI applications?

    -Data infrastructure is mission critical in building AI applications as it provides the foundation for data storage, management, and accessibility, which are essential for developing and deploying AI models effectively.

  • What challenges do companies face regarding data in AI, according to Andrew?

    -Companies face challenges such as deciding how to store data, choosing cloud services, database schemes, and making trade-offs between cost and performance. Additionally, they struggle with architecting data for specific AI purposes.

  • How does Andrew suggest companies should approach improving their data infrastructure?

    -Andrew advises companies to invest in data for specific purposes, using AI wins as a way to drive data architecture improvements, rather than attempting to fix all data issues at once.

  • What is Andrew's perspective on the future of AI and its impact on education?

    -Andrew anticipates a transformation in education with generative AI playing a significant role, potentially acting as a coding companion or customizing courses, but he emphasizes the need for people to understand coding concepts to effectively use these tools.

  • What are Andrew's thoughts on the current state and future of Transformer architectures in AI?

    -While Andrew acknowledges the dominance of Transformer architectures, he also expresses interest in alternatives such as state space models (SSMs) and diffusion models, suggesting that the field should continue exploring new architectures.
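The generate-reflect-revise loop that Andrew describes can be sketched as a small Python driver. This is a minimal illustration, not a real system: the `draft`, `critique`, and `revise` functions are hypothetical stand-ins for calls to a large language model.

```python
# Minimal sketch of an agentic workflow: draft, critique, revise, repeat.
# The three "model" functions below are hypothetical stand-ins for LLM calls.

def draft(topic: str) -> str:
    return f"A first draft about {topic}."

def critique(text: str) -> str:
    # A real critic model would return actionable feedback; here we use
    # a trivial rule so the loop has something to act on.
    return "Add a concrete example." if "example" not in text else ""

def revise(text: str, feedback: str) -> str:
    # A real reviser would apply the feedback; we just append one sentence.
    return text + " For example, consider data curation in practice."

def agentic_write(topic: str, max_rounds: int = 3) -> str:
    text = draft(topic)
    for _ in range(max_rounds):
        feedback = critique(text)
        if not feedback:      # critic is satisfied: stop iterating
            break
        text = revise(text, feedback)
    return text

print(agentic_write("data-centric AI"))
```

The point of the structure is that output quality comes from the loop, not from any single generation step, which mirrors the contrast Andrew draws with one-shot "direct generation".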

Outlines

00:00

🤖 Introduction to Data Centric AI

The paragraph introduces a conversation between Andrew and Joe about Data Centric AI. Andrew explains the shift from model-centric AI, where progress was made by inventing new models and algorithms, to data-centric AI, where the focus is on curating and engineering data to improve AI performance. He discusses the importance of data in training large language models and how data issues persist through various stages of AI development, including fine-tuning and deployment. Andrew also highlights the Meta Llama 3.1 paper, where an earlier model was used to generate coding puzzles that trained the next version, showcasing the potential of using AI to create training data.
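The synthetic-data recipe described above (let an earlier model iterate slowly toward a verified answer, then keep the pair as training data for the next model) can be sketched as follows. Everything here is illustrative: `weak_model_attempt` is a hypothetical stand-in for an LLM call, and the "puzzles" are trivial arithmetic so the verification check is executable, loosely analogous to running unit tests on generated code.

```python
import random

# Sketch of the Llama 3 -> Llama 3.1 synthetic-data idea: an earlier,
# weaker model is allowed many attempts at a puzzle, an automatic check
# keeps only verified answers, and the surviving (puzzle, answer) pairs
# become training data so the next model can produce good answers directly.

def weak_model_attempt(puzzle: int, rng: random.Random) -> int:
    # A noisy guess at the answer (here the target is the puzzle squared).
    return puzzle * puzzle + rng.choice([-1, 0, 1])

def passes_check(puzzle: int, answer: int) -> bool:
    # An executable verifier, analogous to running tests on generated code.
    return answer == puzzle * puzzle

def generate_training_pair(puzzle: int, max_tries: int = 50):
    rng = random.Random(puzzle)  # seeded for reproducibility
    for _ in range(max_tries):
        answer = weak_model_attempt(puzzle, rng)
        if passes_check(puzzle, answer):
            return (puzzle, answer)  # keep only verified pairs
    return None  # discard puzzles the weak model never solves

dataset = [p for p in (generate_training_pair(n) for n in range(1, 6)) if p]
print(dataset)
```

The key design choice is that compute is spent once, at data-generation time, so the student model can learn to reach the verified answer in a single pass.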

05:00

🔍 The Role of Data Engineering in AI

In this section, Andrew and Joe discuss the critical role of data engineering in AI. Andrew emphasizes the importance of data infrastructure and how it affects the success of AI systems. He mentions the challenges companies face in architecting data and the need for a purpose-driven approach to data improvement. The conversation also touches on the talent shortage in data engineering and how companies should approach investing in data for AI. Andrew suggests a cyclical approach where initial AI successes fund further data infrastructure improvements.

10:02

🧠 Exploring Agentic Workflows and Data Centric AI

The discussion continues with agentic workflows and their impact on AI. Andrew explains how agentic workflows, where AI models are prompted to think and refine their outputs iteratively, can lead to higher quality outputs compared to direct generation. He also talks about the use of these workflows in various applications and the importance of data curation to avoid model collapse when training AI on machine-generated data. The conversation highlights the need for a thoughtful approach to data generation and usage in AI training.
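The "re-recording a noisy signal" intuition behind model collapse, and why a verification step helps, can be illustrated with a toy numeric simulation. This is not a real training run: the numbers stand in for model outputs, and the drift and clipping constants are arbitrary assumptions chosen to make the effect visible.

```python
# Toy illustration of model collapse as compounding noise: each
# generate-and-retrain cycle adds a small systematic error. Without a
# check, errors accumulate across generations; with a verification step
# (here, clipping against a known ground truth), the error stays bounded.

TRUE_ANSWER = 1.0
GEN_ERROR = 0.05  # error introduced by each generate-and-retrain cycle

def retrain(current: float, verify: bool) -> float:
    output = current + GEN_ERROR            # each generation drifts a little
    if verify and abs(output - TRUE_ANSWER) > 0.1:
        output = TRUE_ANSWER + 0.1          # a check catches large drift
    return output

naive, curated = TRUE_ANSWER, TRUE_ANSWER
for _ in range(10):
    naive = retrain(naive, verify=False)    # copy-paste outputs into training
    curated = retrain(curated, verify=True) # curate/verify before training

print(round(naive - TRUE_ANSWER, 2), round(curated - TRUE_ANSWER, 2))
```

After ten generations the unverified chain has drifted ten times as far as the verified one, which is the shape of the argument for curating machine-generated training data rather than consuming it raw.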

15:02

📚 Generative AI in Education

Andrew shares his thoughts on the potential of generative AI in education. He envisions a transformation in education where generative AI could serve as a coding companion, making coding easier and more accessible. He suggests that learning to code with the assistance of AI could align educational practices with the future of work, where coding companions are commonplace. Andrew also addresses the challenge of teaching students to use AI tools effectively while understanding the fundamentals of coding to avoid common errors.

20:04

🚀 The Future of AI Applications

In the final paragraph, Andrew expresses his excitement about the future of AI applications. He believes that while foundational work in AI, such as training models and improving data engineering, is essential, the real value will come from practical applications of AI. Andrew notes that despite the high costs associated with training foundation models, the economics at the application layer are favorable for innovation. He anticipates a surge in AI applications that will drive the field forward.

Keywords

💡Data Centric AI

Data Centric AI is an approach that emphasizes the importance of high-quality data over the model itself in building AI systems. It suggests that refining the data can often lead to better AI outcomes than solely focusing on the mathematical models or algorithms. In the transcript, Andrew discusses how the AI community is shifting from a model-centric world to one that values data curation and engineering, highlighting the momentum of the Data Centric AI community.

💡Model Centric

Model Centric refers to the traditional approach in AI where the focus was on developing and improving the algorithms or models to process data. The transcript mentions a historical perspective where AI progress was driven by people downloading datasets and then creating new models to perform better on those datasets.

💡Data Engineering

Data Engineering involves the practices and processes of managing and preparing data for use in AI systems. It is highlighted in the transcript as critical to Data Centric AI because it deals with the infrastructure, storage, and management of data, which directly impacts the effectiveness of AI applications.

💡Foundation Models

Foundation Models, also known as large pre-trained models, are AI models like large language models that have been trained on a vast amount of data. The transcript discusses the effort that goes into training these models and how data issues are a significant part of the process, from training to fine-tuning.

💡Generative AI

Generative AI refers to AI systems that can create new content, such as text, images, or music, that resembles the training data they were fed. The script mentions how Data Centric AI principles relate to the training of large language models, which are a key component of Generative AI.

💡Agentic Workflow

Agentic Workflow is a process where AI models are used iteratively to refine and improve outputs, similar to how a human might revise their work. The transcript gives an example where an earlier version of a model is used to generate coding puzzles, which are then used to train a newer version of the model.

💡Data Infrastructure

Data Infrastructure encompasses the systems, tools, and processes used to manage data within an organization. The transcript emphasizes the importance of getting data infrastructure right to support AI applications and how it can drive improvements in data architecture.

💡Model Fine-tuning

Model Fine-tuning is a process where a general AI model is adjusted or customized for a specific task or dataset. The transcript discusses how fine-tuning large language models involves a lot of data-oriented thinking and decision-making.

💡Data Curation

Data Curation is the selection and preparation of data for use in AI systems. The transcript mentions the ongoing challenge of curating high-quality data from the internet and how it's an important aspect of training better AI models.
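A minimal sketch of what such curation can look like in code: deduplicate documents and keep only those passing simple quality heuristics. Real pipelines use far richer signals (classifiers, perplexity filters, licensing checks); the threshold and example corpus here are illustrative assumptions.

```python
# Minimal data-curation sketch: normalize, deduplicate, and filter a
# corpus by a crude quality heuristic (minimum length in words).

def curate(docs, min_words: int = 5):
    seen = set()
    kept = []
    for doc in docs:
        text = " ".join(doc.split())        # normalize whitespace
        if text.lower() in seen:            # drop exact duplicates
            continue
        if len(text.split()) < min_words:   # drop very short fragments
            continue
        seen.add(text.lower())
        kept.append(text)
    return kept

corpus = [
    "Data engineering is critical to data-centric AI systems.",
    "Data engineering is critical to data-centric AI systems.",
    "Buy now!!!",
    "Curating high quality data often beats tweaking the model.",
]
print(curate(corpus))
```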

💡Transformer Networks

Transformer Networks are a type of deep learning architecture that has become the standard for many AI models, especially in natural language processing. The transcript discusses how the intelligence of AI models comes not just from the architecture of Transformer Networks, but also from the data they've been trained on.

💡Educational Transformation

Educational Transformation refers to the changes and innovations happening in the field of education, often driven by technology. The transcript mentions the potential for Generative AI to transform education, suggesting that learning to code with AI companions could become the norm.

Highlights

Discussion on the shift from model-centric to data-centric AI.

Importance of data engineering in the practical application of AI.

The role of data in training large language models.

Data curation challenges in training AI models with machine-generated content.

Innovative use of an earlier version of a model to generate training data for its successor.

The significance of data in driving the intelligence of AI models.

The necessity of error analysis in data-centric AI systems.

The trade-offs between quantity and quality of data in AI applications.

The impact of generative AI on the future of work.

The role of data infrastructure in supporting AI applications.

The potential of agentic workflows in improving AI output quality.

The use of AI in education and the transformation it might bring.

The importance of understanding coding concepts when using AI coding companions.

The economic implications of AI applications and the potential for capital efficiency.

The future of AI and the focus on developing useful applications.

The potential of diffusion models for text generation as an alternative to Transformers.

Transcripts

play00:01

hi

play00:02

Andrew hey good to see you Joe good to

play00:04

see you too how's

play00:06

things uh the uh exciting times

play00:12

say awesome well yeah uh thanks for uh

play00:15

joining the show today it's good to see

play00:17

you um yeah so we're here to talk about

play00:20

a topic I think near and dear to you uh

play00:22

which is data Centric AI uh but also how

play00:24

data engineering um is critical to data

play00:27

Centric AI so um I guess to back up uh

play00:31

walk us through the beginnings of your

play00:33

of your um thoughts around data Centric

play00:35

AI because I I think when before that

play00:37

there was a more of a model Centric

play00:39

world we were in yeah so I think for

play00:42

many decades um AI progress was driven

play00:45

it feels like primarily by people say

play00:47

downloading data sets off the internet

play00:49

and then spending a lot of time trying

play00:51

to invent new map or invent new models

play00:53

to make it do better on that data and

play00:55

that's fine nothing's wrong with that

play00:56

because of that recipe AI made a lot of

play00:59

progress but I think many practitioners

play01:01

of AI including you and me and many

play01:03

others have known that if we're trying

play01:06

to build something you know practical

play01:08

ship it um sometimes uring the data is

play01:11

much more fruitful than trying to engine

play01:13

in the math or the model and so I wanted

play01:15

to uh creating and trying to popularize

play01:17

this term data Centric AI to coales a

play01:20

lot of the already ongoing work um on

play01:23

entering the data rather than the model

play01:25

and it's been quite uh exciting actually

play01:28

to see how much momentum to dat the

play01:30

centri AI Community

play01:34

has I think we we have to ask because I

play01:37

mean this is the main thing in the air

play01:38

right now generative AI does data

play01:41

Centric AI have something to do say

play01:43

about how we train large language models

play01:45

for example what are your thoughts on

play01:46

that I think very much so in fact um uh

play01:50

everything from training the foundation

play01:52

model to you know post training maybe

play01:55

fine-tuning to even some of the uh

play01:57

deployment usage seems to keep on

play01:59

running into data issues I know that in

play02:01

the popular press people talk a lot

play02:03

about scaling laws and building you know

play02:06

bigger Transformer networks or whatever

play02:08

to train even more data on and that is a

play02:10

key part of it but when I talk to my

play02:12

friends they're involved in you know the

play02:15

actual dayto day of how do you get these

play02:17

models to work um a large fraction of

play02:19

the efforts I'm tempted to say more than

play02:21

50% but L very large fraction of their

play02:24

fers is actually thinking through how to

play02:26

get the right data to feed into these

play02:28

Foundation models um and then of course

play02:31

you know even after someone else has

play02:33

trained a large language model large

play02:35

Foundation Model A lot of data workers

play02:37

go into fine-tuning it and then also in

play02:39

Practical deployment and usage um you

play02:42

know if you're doing a few shot learning

play02:44

lot lot of data oriented thinking there

play02:47

as well so not everything is data

play02:48

Centric AI but even in gen and

play02:51

Foundation models a much larger fraction

play02:53

of it is then I think people wiely

play02:54

appreciate I actually talk about this

play02:56

for a long time but you guys are oh no

play02:58

please do keep going yeah we're here to

play03:00

listen oh so maybe one one fun thing

play03:03

when uh in the Llama 3.1 paper uh The

play03:06

Meta release I think one of the coolest

play03:08

things you a lot of cool things in that

play03:10

paper but one of the coolest things was

play03:12

meta used an earlier version of the

play03:14

model used llama 3 um and then an

play03:16

agentic workflow to basically use llama

play03:19

3 to generate coding puzzles that were

play03:22

then used as training data to train

play03:24

llama 3.1 and I think this has always

play03:27

been one of the puzzles of how you get

play03:28

synthetic data work to train Foundation

play03:30

models and um using an agentic workflow

play03:33

we use the early model but let it think

play03:35

for a long time iterate over something

play03:37

over and over to come over good result

play03:39

and you train the Next Generation model

play03:41

to come over the equally good answer

play03:43

very quickly rather than need to think

play03:45

of over and over I thought that was um

play03:48

one really nice recipe um uh you know

play03:52

for for creating data to train

play03:54

Foundation models and really when I

play03:56

think when I talk my friends training

play03:57

you know some of the very large

play03:59

Foundation models um a lot of the head

play04:02

space is boy can I sign the right

play04:04

licensing deals with the Publishers to

play04:06

get the data and of all these datas

play04:08

which one do I invest dollars in to buy

play04:11

um or for the preference tuning be rhf

play04:13

or you know DPO to to line it with human

play04:16

values what's the labeling schema how do

play04:19

I get that data and then it turns out

play04:21

that while there's certain Innovations

play04:23

on trading Transformer networks and all

play04:25

that it feels like uh there's there's at

play04:27

least as much maybe even more hot to

play04:29

compare that there Sly a lot of thinking

play04:32

dayto day on um how to get the data to

play04:35

chain these

play04:36

models when you came up with the um I

play04:39

think the original uh article on um data

play04:41

Centric AI um and you're thinking around

play04:44

that that was in the time before chat

play04:47

GPT and I and I think that um how how do

play04:51

you differentiate between uh or is there

play04:53

a differentiation between data Centric

play04:54

AI um and maybe more classical um I I

play04:58

kind of let deep Lear into that to like

play05:00

classical like three gen Ai and then um

play05:02

gen AI is there is there a difference in

play05:04

how you approach data Centric um AI or

play05:07

is it all sort of the um same thing at

play05:09

the end of the day I don't know maybe

play05:11

are related to set of techniques on the

play05:12

Continuum um feel like you know data CI

play05:15

for vision is different than for text

play05:17

it's different for audio it's different

play05:19

when is a input modality like structure

play05:22

data that humans kind process that well

play05:24

ourselves so I think it is very I think

play05:26

of is different for um different types

play05:29

of data and types of modality but then

play05:31

there are underlying principles um

play05:34

really how do you systat engineer the

play05:36

data to build a successful AI system so

play05:39

things like error analysis to figure out

play05:41

where are the gaps um in order to try to

play05:43

get more data as well as what the

play05:44

techniques for datation to get high

play05:47

quality data and then the trade-offs

play05:49

between you know small amounts of high

play05:51

quality data versus large amounts of

play05:53

slightly lower quality data I see I see

play05:55

these themes that seem to be pervasive

play05:58

um Al the way from you know training

play06:01

computer vision models to the way we are

play06:04

trying to figure out which small handful

play06:06

of examples to put into a large language

play06:09

model prompt because we're doing F short

play06:11

learning um yeah I I think and then of

play06:15

course um underlying that I see with you

play06:19

know supervised learning and generative

play06:21

AI I see a lot of businesses um actively

play06:24

thinking about how to get the data

play06:26

infrastructure sorted out and gen

play06:29

certainly given Tailwinds to a lot of

play06:32

companies and boards and cosos wanting

play06:35

to get that data infrastructure right

play06:37

and I think that's been a that's been a

play06:39

positive change actually just increased

play06:41

urgency when people say we got got to

play06:43

sort out our data so that certainly has

play06:45

uh create a lot more urgency you know to

play06:48

to to the kind of stuff that you guys do

play06:50

I I I imagine it would been good for

play06:52

sales of your book as

play06:54

well very much so yeah we we came off of

play06:57

one hype cycle and now we we're kind of

play06:58

riding another

play07:00

yeah but it's it's it's interesting

play07:01

though because I feel like the

play07:04

um yeah I feel like there's a focus on

play07:06

the fundamentals and making sure you can

play07:08

invest in the the foundation to enable

play07:11

um you know analytics and and certainly

play07:13

machine learning and AI uh that was um

play07:16

but I feel like it's a hard realization

play07:18

for companies to get to because it's

play07:19

like I think initially it's a uh

play07:21

especially in the 2010s it's like well

play07:22

let's just jump straight into doing

play07:23

machine Learning Without uh much data at

play07:26

all um and I think all of us kind of saw

play07:28

how that worked out and so um yeah so

play07:32

with data engineering I guess is ends up

play07:34

being I guess fairly important to to all

play07:36

this so understanding it a bit

play07:40

um I guess what are your thoughts on uh

play07:43

the the role of data engineering um with

play07:45

uh with AI these days Andrew I think as

play07:48

Mission critical I am seeing uh

play07:51

significant you know uh Talent shortages

play07:54

uh as in I I I you know travel around

play07:57

the US travel around to different

play07:58

countries is now than usual to visit

play08:01

sometimes very large very profitable

play08:03

companies um not not tech companies

play08:05

sometimes sometimes tech companies

play08:06

sometimes not tech companies and um

play08:09

there are really smart people doing some

play08:11

business but kind of struggling to Think

play08:14

Through how the architect the data and I

play08:16

think it is difficult you know I think

play08:18

the number of decisions we have to make

play08:20

um in terms of how to store the data

play08:24

which cloud services to use you know

play08:26

what what's the right database scheme

play08:27

what are the trade-offs between cost and

play08:29

performance um and then also one of the

play08:32

other challenges is most companies it

play08:35

turns out don't want to spend you know

play08:38

whatever a million dollars to just to

play08:40

improve the data because you you want to

play08:42

improve the data for a purpose and so

play08:45

hoping businesses get in that cycle of

play08:48

when you can hopefully start to deliver

play08:51

wins on the AI machine learning

play08:53

generative AI site but concurrently even

play08:55

as you're building valuable applications

play08:57

um use that as a way to Drive how you

play09:00

improve your data architecture so that

play09:03

your foundations to keep building on top

play09:05

of become even better so so I have say

play09:09

there are there are some cosos that have

play09:10

said great let me invest a lot of money

play09:12

to fix all my data and then it'll be

play09:14

beautiful and then I'll have wonderful

play09:15

Ai and I think you would both advice

play09:17

people like please don't do that right

play09:19

you do need to invest in the data but

play09:21

the data is usually built out for a

play09:23

specific purpose because otherwise there

play09:25

too many things we can optimize you make

play09:26

it faster speedier more distributive

play09:28

more robust what whatever it's two IND

play09:30

decisions but if you have one or a

play09:32

handful of use cases that you're driving

play09:34

toward then that helps the data team um

play09:38

make the right priorization decisions uh

play09:41

in order to improve the data

play09:42

infrastructure and then that creates an

play09:44

amazing foundation for lots of people to

play09:46

build lots of exciting applications on

play09:49

top of but I know I I feel like this is

play09:50

advice I give to kind of companies quite

play09:53

often I I I'm I'm guessing you give very

play09:55

similar advice to many businesses all

play09:57

the time we do yeah and it's yeah it is

play10:01

interesting I suppose but you know I

play10:03

think every company we talk to has to

play10:05

have an AI story right now so um whether

play10:07

you like it or not you're going to have

play10:08

to figure out a way as a company to to

play10:11

um you know start working with AI on

play10:13

that note too I would like your your um

play10:15

thoughts on what you're seeing or

play10:17

hearing when you talk to your friends

play10:19

about uh small language models and also

play10:22

AI agents um what are you seeing out

play10:25

there so I think uh agentic workflows is

play10:29

is one of the most exciting um

play10:32

directions for AI uh so you know I think

play10:36

the way a lot of people use the large

play10:38

language model is we prompt it and then

play10:40

we expect it to write an essay for us on

play10:43

whatever we ask and that's a bit like

play10:45

going to a person and saying hey buddy

play10:47

please write an essay for me by typing

play10:49

the essay from the first word to the

play10:51

last word all in one go without ever

play10:52

using back space and you know maybe you

play10:55

can write like that but most of us don't

play10:56

do our best writing that way in contrast

play10:59

to agentic workflow might ask the large

play11:01

language model to um brainstorm and

play11:03

outline and then ask if uh you need to

play11:07

do any online research if so go download

play11:09

a few web pages and put them into the

play11:11

large language model context then write

play11:12

the first draft they read the first

play11:14

draft in critique it and and and so on

play11:17

um and so uh we're seeing with many

play11:20

agentic workflows the quality of output

play11:24

is much higher than you know anyone than

play11:26

than possible with just having with

play11:28

direct Generation Um and I'm seeing this

play11:31

useful for for many applications so

play11:33

actually you know de AI interally as

play11:35

some agentic workflows AI fund which

play11:38

also lead um has many projects in kind

play11:40

of healthcare you know legal compliance

play11:43

processing various types of complex

play11:45

documents where we couldn't do the job

play11:47

without an agentic workflow and then the

play11:49

most interesting thing I think um uh was

play11:52

it uh GP 401 uh uh preview uh just

play11:57

released recently and then also

play11:59

anthropic also been doing something

play12:01

related for months uh which is a

play12:04

fine-tune the large language model um to

play12:06

generate you know thinking tokens um

play12:10

along the way and I think this

play12:12

incorporating an agentic workflow where

play12:14

the Lun langage model is um pre-trained

play12:17

or sorry fine tuned to do sort of Chain

play12:19

of Thought reasoning so it spends more

play12:21

time thinking uh before it outputs you

play12:24

know the final answer I think

play12:26

incorporating this directly into the LA

play12:28

langage model is exciting Direction

play12:30

actually I think for a few months uh uh

play12:33

I I I I I think it's been you know

play12:36

jailbroken for months right so that's

play12:37

what we know anthropic uses a tag um XML

play12:40

tag and thinking I think but it's you

play12:43

know various people have jailbroken it

play12:44

to make it basically review the tag and

play12:47

show his internal thinking dialogue so

play12:49

that this so so I I think this is public

play12:52

I don't think I'm reviewing new already

play12:55

the internet so I thought I thought

play12:56

actually really clever but I think the

play12:59

open eyes taken it to a whole other

play13:01

level with this new release model and

play13:03

then I think it was just one or two

play13:05

weeks ago there was a um overhyped

play13:08

reflection 70b model with the initial

play13:11

claims turned out not to be quite

play13:12

accurate but that was also exploring a

play13:14

similar technique and even though the

play13:16

hype and the inaccuracy of the initial

play13:18

results was unfortunate you know the the

play13:20

underly technique it seemed like an

play13:22

interesting very interesting Direction

play13:23

takes well so I think this is in the air

play13:26

um agentic work those that people can

play13:28

implement but in also Al uh that is

play13:31

possible to fine-tune a large language

play13:34

model to basically do Chain of Thought

play13:36

reasoning internally and maybe use some

play13:38

tags to have some thinking you know that

play13:41

may or may not need to be reviewed to

play13:42

the end user so I I think there's a

play13:45

fascinating direction oh and then of

play13:47

course uh a lot of the work to do this

play13:50

is you know come of the data set right

play13:52

to show it how to Think Through what's

play13:54

the right thinking process different

play13:56

types of tasks so but so so exciting

play13:59

times that's interesting I guess what

play14:01

are you gonna ask Matt oh I I was going

Oh, I was going to ask something that's kind of related to what you're talking about. I think one of the problems that's in the air right now is this: theoretically, if you take the output of a model and then retrain on that output, you degrade the weights, basically, right? It's like re-recording a signal that has extra noise in it. There have also been research papers that have shown this experimentally. What are your thoughts on how we solve this problem, especially as more and more of the content on the web that we're using to train is machine-generated and deep-learning-generated?

Yeah, good question.

I think the data curation of selecting high-quality data off the internet, that's an ongoing thing. And it turns out that if you use a large language model to generate text and you train a different model on that, that is a good idea if you're applying model distillation: if you have a large model generating very thoughtful text and you want to train a small model to mimic the thoughtfulness of the big model, that works. That's model distillation. But I think, as you're saying, Matt, what doesn't work is if you train a model, use that to generate data, and then use that data to try to train an even better model. In fact, a few researchers have shown that if you do this process enough times (generate data, train a new model, have the new model generate data, use that to train the next model) then you actually end up with model collapse, where the model starts generating very uninteresting things.
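The collapse dynamic described here shows up even in a toy setting. The following is only an illustrative analogy, not from the conversation: the "model" is just a Gaussian fit, refit each generation on samples drawn from the previous generation's fit, and its diversity (the fitted standard deviation) decays toward zero:

```python
# Toy model-collapse analogy: repeatedly refit a Gaussian "model" on its
# own generated samples. Finite-sample refitting shrinks the estimated
# spread on average, so over many generations diversity collapses.
import random
import statistics

random.seed(0)
mean, std = 0.0, 1.0  # generation 0: the "real" data distribution
for generation in range(500):
    samples = [random.gauss(mean, std) for _ in range(20)]
    mean = statistics.fmean(samples)      # refit on model-generated data
    std = statistics.pstdev(samples)
print(f"std after 500 generations: {std:.4f}")  # collapses toward 0
```

Real collapse in language models is far more complex, but the mechanism (training on your own finite, slightly degraded samples) rhymes with this.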

But where this technique does really work well is if, instead of using direct generation, instead of copy-pasting one model's output into the training data of the next model, you instead use this type of agentic workflow: you might have the first model write an essay, reflect on it, do some web search, then critique it and improve it, so it does a lot of work to come up with a pretty good essay. Then you try to get the next generation of model to generate that essay directly, with much less work than the first model needed, and that does seem to work. Maybe analogies to human thinking are always dangerous, so I was nervous about making this one, but I remember when I was a kid practicing for math competitions or whatever, I would spend a long time trying to solve some math problem, but having solved it myself, I'd go: oh, next time maybe that's a shortcut I could use to solve the next problem. So training on your own thinking is okay if you learn to do quickly what took you yourself a long time to do, and this does seem to work for large language models as well.
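A skeleton of that draft-critique-revise loop might look like the following; `call_model` is a hypothetical stand-in for a real LLM call, and the prompt wording is an illustrative assumption:

```python
# Sketch of the agentic data-generation loop described above: the current
# model drafts, critiques, and revises; the final (task, refined draft)
# pair becomes training data teaching the next model to produce the
# refined answer directly, with far less work.
def call_model(prompt: str) -> str:
    # Hypothetical stand-in: a real system would call an LLM API here.
    return f"[model response to: {prompt[:40]}...]"

def make_training_pair(task: str, rounds: int = 2) -> dict:
    draft = call_model(f"Write an essay on: {task}")
    for _ in range(rounds):
        critique = call_model(f"Critique this essay:\n{draft}")
        draft = call_model(f"Rewrite the essay to address:\n{critique}")
    # The next-generation model is trained to map task -> refined draft.
    return {"prompt": task, "completion": draft}

pair = make_training_pair("why data quality matters")
print(sorted(pair))  # -> ['completion', 'prompt']
```

The key design point is that only the final refined output, not the intermediate critiques, is copied into the next model's training data.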

Interesting. So it's almost like a self-governing approach that you're looking for: not just training on data, but actually critiquing, and then, like you're saying, an agentic, almost thoughtful process you go through, yeah?

Yeah, yeah. And maybe just to add: I think a lot of the large companies training large foundation models keep a lot of the details of what they do proprietary, but I definitely get a strong sense, talking to multiple people from multiple companies, that a lot of the headspace is not just in tuning the foundation model and making sure the GPUs are reliable and so on. There is a lot of that too, but a lot of the headspace is in figuring out the data.

I think AI models are really the model plus the data. In fact, here's a common experience for a lot of people: if you go through the math of what a Transformer neural network does, go through the attention mechanism and so on, I've had a really interesting experience where a lot of people learning that for the first time go: what? I don't get it. How could a few lines of math demonstrate this weird intelligent behavior of a large language model? And I think the answer is that the magic is not only in the Transformer neural network, which is very clever. A lot of the intelligence of the large language model comes not from the neural network architecture but from the data. This is why, when people wonder, "I want to understand large language models, let me study the Transformer neural network," a very common reaction is: okay, I finally worked through all this math, but I don't get it; it doesn't make sense why this math would be so intelligent. And I think the gap is that a lot of that intelligence comes from these models having sucked in massive text data sets, generated mostly by humans, we hope, and that data is what's creating a lot of the intelligence, or appearance of intelligence, in these models.
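For reference, the attention mechanism mentioned above is usually written in its scaled dot-product form (standard notation, not from the transcript):

```latex
% Scaled dot-product attention: queries Q, keys K, values V,
% with key dimension d_k scaling the logits before the softmax.
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V
```

These few lines are the "few lines of math" in question; the point being made is that the observed intelligence comes mostly from the data the model was trained on, not from this formula alone.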

play18:44

models I guess you

play18:47

um uh you have a lot of friends in the

play18:49

space are you seeing um maybe an

play18:52

evolution of something outside of the

play18:53

Transformer architecture right now or

play18:55

are we uh or are we stuck with this for

play18:57

a while you know it's a great question

I think the Transformer architecture seems to have strong tailwinds. SSMs, state space models, have been around for a little bit; they have not really taken off yet, but there are still enough researchers working on them, and I think it's fascinating to keep an eye on whether they scale to very long input contexts in very interesting ways. And then the other one, which I don't know will ever take off, is diffusion models for text generation. Just very recently I saw a paper on this; fascinating. Today the dominant approach for image generation is diffusion models, where you generate a blurry image and then repeatedly, quote, remove noise to sharpen up the image. I think Stefano Ermon and some folks came up with a way to generate a, quote, blurry piece of text and then slowly sharpen it up, and when trained at GPT-2 size it seems to outperform GPT-2. But the jury's still out on what will happen as this scales up, so I don't know if it will work. I think if we're stuck with Transformers and nothing else for a long time, we'll be fine, but with SSMs and diffusion models and others trying other things, I think collectively our community has a few shots at coming up with something even better. No pun intended.
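The sharpening process sketched here can be caricatured numerically. This is only a loose analogy (real text diffusion operates on token or embedding distributions, and the update rule below is an invented simplification): start from pure noise and repeatedly remove a fraction of the remaining error:

```python
# Loose numeric caricature of diffusion-style sampling: begin with noise
# and iteratively "denoise" toward the target signal, the way image
# diffusion sharpens a blurry picture step by step. The linear update is
# an illustrative simplification, not a real diffusion sampler.
import random

random.seed(1)
target = [1.0, -2.0, 0.5]                     # the "clean" signal
x = [random.gauss(0.0, 1.0) for _ in target]  # start from pure noise
for step in range(100):
    x = [xi + 0.1 * (ti - xi) for xi, ti in zip(x, target)]  # remove some noise
print([round(v, 2) for v in x])  # converges to the target
```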

Kind of switching gears a bit: education is also a big passion of yours; you started companies around it before. Where do you think generative AI fits into education these days? Say I want to learn a topic, or I'm teaching a topic: where does it fit in?

Yeah, so I feel like there is a coming transformation of education, though I don't feel like I know exactly what it will be yet. There's stuff that a few companies have done: I think Khan Academy built Khanmigo, which seems to work well. Coursera has Coursera Coach, which actually works really well; if people haven't tried it, I've used it a bunch, and it's actually surprisingly good. But that's just one product idea. Coursera also uses generative AI for course building, and quite a lot of companies are using course-builder tools to customize courses for specific enterprises' needs. So there's a bunch of ideas like this, like AI teaching assistants and so on. I think there could be a bigger transformation in education coming, and I don't feel like I know exactly what it is. But maybe, when I chat with university leaders, one other thing I often end up talking about is the future of work, because of the challenges there; maybe take coding as an example.

I think we should all just learn to code with generative AI as a coding companion. I know that some schools are still debating whether or not to ban ChatGPT for their programming courses, but honestly, I think it's clear the future of software engineering will be coding alongside a coding companion. We can decide whether it's GitHub Copilot or Cursor or copy-pasting directly from GPT-4 or Claude or Gemini, which I still do a lot of, actually, but you never have to learn to code alone. And so I think aligning the way we teach programmers with the future, not where the field has been but where the field is going, is something we need to do. And that's just computer science. Across education, thinking through what a future chemical engineer will do, what a future doctor will do, I think that's actually a big challenge for academic institutions. But on coding specifically, I would like to see pretty much everyone learn to code, because with a coding companion, with generative AI, I hope coding is easier than ever before, and the value of someone being able to write a little bit of code is higher than ever before. So I'm seeing software engineers get meaningful productivity boosts; we just do a lot more when we use generative AI. But then also, among my teams, I'm seeing marketers and investors, people whose job role is not software engineer, write just a little bit of code to download web pages, synthesize them, and get insights. And I find that people who know just a little bit of code can, and often do, do a lot more. That's why we released this free sequence of courses, AI Python for Beginners, to help people learn coding for the first time.
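The "little bit of code" in question might be as small as this sketch, which pulls a quick insight (the page title) out of fetched HTML; the sample HTML and the commented-out URL are illustrative assumptions:

```python
# Sketch: the kind of small script a non-engineer might write to download
# a page and extract an insight. Uses only the standard library; the HTML
# below stands in for a real fetched page so the sketch runs offline.
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text inside the page's <title> tag."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        if self.in_title:
            self.title += data

html = "<html><head><title>Quarterly Report</title></head><body>Revenue grew.</body></html>"
# html = urllib.request.urlopen("https://example.com/").read().decode()  # real fetch
parser = TitleParser()
parser.feed(html)
print(parser.title)  # -> Quarterly Report
```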

So, one of the complaints I'm seeing from open-source maintainers is that you have people who will use, say, ChatGPT to generate code, and then they try to run it and it doesn't work correctly, and it turns out there's a really basic error: a variable is named inconsistently, maybe it's even just capitalization, so the variables don't match and it doesn't work. And they know no coding, so they don't know how to debug those errors. So how do we teach students to use the tool but also have enough insight into how coding actually works, so they're actually learning the minutiae they need to know to write good code?

Yeah, I know. I know social media tends to explode with stories of people who don't code building things, and it's very cool that they did that. I think at least for now those are the exceptions; I'm seeing that, but hopefully there'll be more and more of these exceptions. Still, I find people get a lot more traction with low code than with no code, if you know just a little bit about what the word "exception" even means, and these concepts. I think for quite a long time, someone who knows a little bit of programming concepts will be able to do a lot more than someone who doesn't know coding at all and is just prompting. The boundary is shifting because of improvements in technology. I think Anthropic's Artifacts was really clever, helping take things to deployment, and I know Replit is also making it easy, reducing friction in deployment. And I personally build a lot of Streamlit apps, because it turns out GPT-4 is really good at writing Streamlit code, so I don't worry about the syntax and just throw stuff up in minutes on a cloud or whatever. But I find that people who understand a bit of the coding concepts just get much further, much quicker, and are less likely to hit a dead end, I think.
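The point about knowing "what the word exception even means" is the kind of concept a tiny example teaches; this snippet is illustrative, not from the conversation:

```python
# Why low code beats no code: knowing what an exception is lets you handle
# a crash in generated code instead of hitting a dead end. Here a bad
# entry raises ValueError, which we catch and skip.
def parse_prices(raw):
    prices = []
    for item in raw:
        try:
            prices.append(float(item))
        except ValueError:  # non-numeric text can't be converted
            print(f"skipping bad value: {item!r}")
    return prices

print(parse_prices(["3.50", "oops", "2.25"]))  # -> [3.5, 2.25]
```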

We have about three minutes left. What are you most excited about over the next year?

I'm excited about many, many things, but if you made me pick one, it would be applications. I think there's a lot of foundational work to be done, including training foundation models and building better technology, and tons of work to be done in data engineering, but where I think many people will get the most value out of this will be the applications. And we do need to work on the foundations: there's a shortage of people who understand how to build the data, the foundation models, the infrastructure, and we've got to keep working on that. But I think ultimately our field will be judged by our success at delivering useful applications, so I spend a lot of my time focusing on applications. And it's actually working on applications that then sometimes causes me to have a strong view, maybe right, maybe wrong, that boy, we really need to get this data infrastructure right, or boy, we really need to get this orchestration layer right. But I think we're actually starting to see a rising flood of applications. Oh, and one interesting thing about applications as well: I know people read in the news about the billions of dollars, more than single-digit billions of dollars, spent on GPUs to train foundation models, and people think doing AI is really expensive. But it turns out that because someone else has spent those tens of billions of dollars, it is now very capital-efficient to start to work on some data infrastructure, to start to work on some applications. So I think at the application layer, the economics look very favorable to people building and wanting to deploy stuff.

That's awesome. Andrew, it's great to talk with you as always. Thanks for joining the podcast and taking the time.

Yeah, thank you. Great time, you guys. Thank you, Joe; thank you, Matt. All right, take care.
