What we see and what we value: AI with a human perspective—Fei-Fei Li (Stanford University)

Paul G. Allen School
20 Jan 2024, 60:25

Summary

TLDR: In this talk, Professor Fei-Fei Li traces the development of computer vision and artificial intelligence, highlighting the key roles of deep learning and big data in driving progress. She presents her idea of "human-centered AI", explores AI applications in healthcare, home care, and social services, and looks ahead to AI's potential to understand and improve the world.

Takeaways

  • 🎓 Fei-Fei Li is a professor at Stanford University and the director of the Stanford Institute for Human-Centered AI.
  • 🚀 She served as a Vice President at Google and Chief Scientist of AI and Machine Learning at Google Cloud from 2017 to 2018.
  • 📚 She is widely known for ImageNet, the project that helped launch the deep learning AI revolution.
  • 📖 She published the book "The World I See", which discusses curiosity, exploration, and discovery in AI.
  • 👀 Through the lens of computer vision, she shares her research over the years and the development of AI.
  • 🌿 She recounts the history of natural vision, beginning with the Cambrian Explosion about 540 million years ago, and stresses the role of visual intelligence in the evolution of animal intelligence.
  • 🤖 Computer vision evolved from early hand-designed features to machine learning and then to deep learning with convolutional neural networks.
  • 🔍 She emphasizes the importance of data, in particular the contributions of the ImageNet and Visual Genome datasets to AI's progress.
  • 🧠 She describes AI's potential in healthcare, in assisting human work, and in social care.
  • 🌐 She discusses the importance of handling privacy and bias in AI, and how technology can protect human dignity and identity.
  • 🔮 She looks ahead to AI's future, including applications in scientific discovery, personalized education, healthcare, and biodiversity research.

Q & A

  • What is Professor Fei-Fei Li's position at Stanford University?

    -She is a professor at Stanford University and the director of the Stanford Institute for Human-Centered AI.

  • What roles did Professor Li hold at Google?

    -From 2017 to 2018 she was a Vice President at Google and the Chief Scientist of AI and Machine Learning at Google Cloud.

  • What work is Professor Li best known for?

    -She is best known for ImageNet, the project that helped drive the deep learning AI revolution.

  • What is Professor Li's role at the United Nations?

    -She serves as a special advisor to the Secretary-General of the United Nations.

  • What is the title of Professor Li's recently published book?

    -Her recent book is titled "The World I See: Curiosity, Exploration, and Discovery at the Dawn of AI".

  • What historical event does Professor Li mention, and how did the Cambrian Explosion affect evolution?

    -She refers to the Cambrian Explosion, which occurred roughly 540 million years ago. It triggered an explosive increase in the number of animal species, and the period is sometimes called the "Big Bang" of evolution.

  • How does Professor Li view the history of computer vision?

    -She notes that the history of computer vision is far shorter than biological evolution, beginning about 60 years ago. She recalls an ambitious MIT professor who hoped to solve vision in a single summer; that goal was not achieved, but 60 years later vision has grown into a thriving field.

  • What does "what we see and what we value" mean?

    -It is the title of her talk, "What We See & What We Value: AI with a Human Perspective." Through the development of computer vision, she explores how AI began by imitating human vision and has moved toward serving human needs and values, and how to develop AI in a human-centered way.

  • How does Professor Li view the role of deep learning in computer vision?

    -She regards deep learning, and convolutional neural networks in particular, as a major breakthrough for computer vision. ImageNet helped drive deep learning forward and led to innovations such as ResNet, which in turn laid the groundwork for later techniques such as attention mechanisms.

  • What does Professor Li mean by "what we don't see"?

    -She means the areas that human vision cannot reach: fine-grained object recognition, details missed in visual illusions, and problems of bias and privacy in medicine, health, and society. She stresses AI's potential in these areas and the need to ensure that AI development aligns with human values and needs.

  • How does Professor Li view the future of AI?

    -She believes AI will continue to be inspired by brain science, cognitive science, and human intelligence. She emphasizes a human-centered approach that develops AI to augment human capabilities while recognizing AI's impact on society, and she points to the work of the Stanford Institute for Human-Centered AI, including its education and policy engagement.

  • What is the "BEHAVIOR" project mentioned in the talk?

    -BEHAVIOR is a project by Professor Li and her research team to build a benchmark for evaluating and training robots on everyday household activities. Through surveys they identified 1,000 tasks that people want robots to help with, scanned multiple real-world environments, and created a simulation environment for training and testing robots.

Outlines

00:00

🎤 Opening and Introduction

The host warmly welcomes Professor Fei-Fei Li and reviews her many achievements and contributions. She is a professor at Stanford University, a former Chief Scientist and Vice President at Google Cloud, a special advisor to the UN Secretary-General, and a member of the National AI Research Resource Task Force. She has also published the book "The World I See". Her talk is titled "What We See & What We Value: AI with a Human Perspective".

05:01

🌿 The History and Development of Computer Vision

Professor Li reviews the history of computer vision, from the first attempts in the 1960s to the modern deep learning revolution. She emphasizes the importance of datasets such as ImageNet and describes three phases in the field: hand-designed features, machine learning, and the rise of deep learning. She also discusses the importance of visual intelligence and how visual understanding strengthens artificial intelligence.

10:03

🧠 The Evolution of Human Vision and the Challenges of Computer Vision

Professor Li traces the evolution of vision from the Cambrian Explosion about 540 million years ago to the modern human visual system. She explains why vision matters for animal survival and discusses the challenges computer vision faces in emulating human vision, including object recognition and scene understanding. She also notes the brain's remarkable efficiency at visual recognition and how computer vision imitates it through big data and deep learning.

15:06

📈 The Creation and Impact of ImageNet

Professor Li describes the origins of ImageNet and its impact on computer vision. ImageNet is a large-scale image database built to advance visual recognition algorithms. She explains how it propelled deep learning forward and led to many major breakthroughs in the field.

20:07

🔍 Beyond Object Recognition: Scene Understanding and Visual Relationships

Professor Li discusses the next challenge for computer vision: moving beyond object recognition to understanding and interpreting the relationships and structure within a scene. She introduces scene graph representation, a visual representation that captures the relationships among objects in a scene, and describes the Visual Genome dataset and how these methods improve visual understanding and reasoning.

25:08

🤖 Applications of Computer Vision: From Healthcare to Robotics

Professor Li explores real-world applications of computer vision, especially in healthcare and robotics. She shows how computer vision can be used to reduce medical errors, improve surgical efficiency, monitor patient mobility, and support the daily lives of seniors. She also discusses the importance of privacy protection and how to design systems that provide useful information while preserving privacy.

30:13

🌐 The Future of AI and Human-Centered AI

Professor Li looks ahead to the future of artificial intelligence, emphasizing the importance of human-centered AI. She discusses how to align AI development with human needs and values and how to use AI to augment and enhance human capabilities. She also describes the work of the Stanford Institute for Human-Centered AI and how education and policy engagement can promote the responsible use of AI.

35:14

💡 Q&A: The Future and Challenges of AI

In the Q&A session, Professor Li responds to questions about future AI breakthroughs, predicting that AI will have deep impact across many application areas and anticipating the next revolution in computer vision. She discusses the achievements of large language models in semantic understanding and the outlook for images and computer vision, and she responds to a question about English dominating AI research by stressing the importance of dataset diversity.

Keywords

💡Artificial Intelligence

Artificial intelligence refers to intelligence exhibited by machines built by humans, able to learn, judge, and act in ways comparable to people. In the video, Professor Fei-Fei Li discusses the development of AI, particularly its applications in computer vision, and how AI can be used to augment and enhance human capabilities.

💡Computer Vision

Computer vision is a branch of artificial intelligence that enables computers to recognize and process visual information from images and other multidimensional data. In the video, Professor Li emphasizes the importance of computer vision for understanding the structure of the world and notes how computer vision can go beyond the limits of human vision.

💡Deep Learning

Deep learning is a machine learning technique that learns representations and patterns in data using networks loosely modeled on the structure and function of neural circuits in the brain. In the video, Professor Li discusses deep learning's key role in advancing computer vision, especially for image recognition and classification.

💡Dataset

A dataset is a collection of data used to train and test machine learning models. In the video, Professor Li highlights the importance of datasets in AI development, particularly in computer vision, such as the role of the ImageNet dataset in advancing visual recognition.

💡Cognitive Science

Cognitive science is the interdisciplinary study of human thought and perception, drawing on psychology, neuroscience, artificial intelligence, and other fields. In the video, Professor Li emphasizes its importance for understanding and developing AI, especially for modeling human vision and cognition.

💡Neural Network

A neural network is a computational model inspired by the way neurons connect in the brain, used for machine learning and artificial intelligence. In the video, Professor Li discusses the use of neural networks in computer vision, especially the key role of convolutional neural networks (CNNs) in image recognition and classification.
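
To make the convolutional idea concrete, here is a minimal sketch of an image classifier in PyTorch. It is an illustrative toy, not any model discussed in the talk; the layer sizes, the 32x32 input resolution, and the 10-class output are arbitrary assumptions.

```python
# Minimal CNN sketch (illustrative only; not a model from the talk).
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    def __init__(self, num_classes: int = 10):  # num_classes is an arbitrary assumption
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, padding=1),   # learn local visual features
            nn.ReLU(),
            nn.MaxPool2d(2),                               # downsample spatially
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(32 * 8 * 8, num_classes)  # assumes 32x32 RGB inputs

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.features(x)
        return self.classifier(h.flatten(1))

# Usage: classify a batch of four 32x32 RGB images into 10 categories.
logits = TinyCNN()(torch.randn(4, 3, 32, 32))
print(logits.shape)  # torch.Size([4, 10])
```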

💡Scene Graph Representation

A scene graph is a way of describing the objects in an image and the relationships among them: it captures the richness of the visual world by identifying the entities in a scene (such as people and objects), their attributes, and the relationships between them. In the video, Professor Li highlights the importance of scene graphs for understanding and analyzing visual content.

💡Healthcare

Healthcare refers to the medical and health services provided to individuals or groups to maintain or improve their health. In the video, Professor Li discusses how artificial intelligence and computer vision can strengthen healthcare, for example by monitoring hand hygiene around surgery or tracking patient mobility in ICU rooms.

💡Augmented Reality

Augmented reality (AR) enhances perception of the real world by overlaying computer-generated imagery and information on the user's view. In the video, Professor Li mentions related applications in surgical settings, such as using computer vision to help doctors and nurses keep track of small instruments during surgery.

💡Privacy Protection

Privacy protection means taking measures to keep personal or sensitive information from unauthorized access, use, or disclosure. In the video, Professor Li stresses the importance of protecting human dignity and identity when developing AI applications, especially in healthcare.

💡Robot Learning

Robot learning refers to using artificial intelligence so that robots can learn and perform tasks, typically by integrating perception, decision-making, and action. In the video, Professor Li discusses the importance of robot learning and how closing the loop between perception and action can lead to more capable robots.

Highlights

Fei-Fei Li is a professor at Stanford and a leading figure in the field of AI, with extensive experience and contributions.

She directs the Human Centered AI Institute at Stanford and was a VP at Google and Chief Scientist of AI and Machine Learning at Google Cloud from 2017 to 2018.

Fei-Fei Li is known for her work on ImageNet, which significantly contributed to the deep learning AI revolution.

She serves as a special advisor to the Secretary-General of the United Nations and is a member of the National AI Research Resource Task Force.

Li's book, 'The World I See: Curiosity, Exploration, and Discovery at the Dawn of AI', reflects her perspective on AI development.

The talk focuses on the history and evolution of computer vision and AI from a human perspective.

The Cambrian Explosion in evolution is linked to the development of vision, which drastically changed the course of life on Earth.

Computer vision has come a long way from its early days, with deep learning and big data playing pivotal roles in its progress.

Fei-Fei Li's research emphasizes the importance of object recognition in computer vision and its challenges.

The ImageNet dataset was a game-changer, enabling a new level of object recognition and categorization in AI.

Li's work on scene graph representation and the Visual Genome dataset expanded AI's understanding of visual relationships.

Advancements in computer vision have allowed AI to recognize and understand dynamic relationships and activities in images and videos.

AI can now perform tasks that are beyond human capabilities, such as fine-grained object categorization.

Computer vision can reveal societal trends and patterns by analyzing everyday images, such as those from street views.

Li's work on privacy computing aims to protect human dignity and identity while still utilizing AI's capabilities.

The talk highlights the importance of addressing biases in AI systems to ensure ethical and fair outcomes.

Fei-Fei Li discusses the potential of AI to augment human capabilities, especially in healthcare, and improve patient outcomes.

The BEHAVIOR project aims to create a benchmark for everyday household activities to advance robotic learning.

Li emphasizes the need for AI to be inspired by human intelligence and to focus on augmenting human capabilities.

Transcripts

play00:03

PRESENTER: Welcome, everybody.

play00:04

I'm very excited to welcome Fei-Fei Li today.

play00:07

And of course, judging by how packed this room is,

play00:10

Fei-Fei doesn't really need an introduction.

play00:12

And of course, if I actually were

play00:14

to introduce her by reading her bio,

play00:16

it would take a majority of today's time.

play00:19

So I'll keep this brief.

play00:20

Fei-Fei is a professor at Stanford, where

play00:23

she was also my PhD advisor.

play00:26

She's the director of the Human Centered AI Institute

play00:29

at Stanford.

play00:30

And during the years of 2017 and 2018,

play00:33

she was also the Vice President at Google,

play00:36

as well as the Chief Scientist of AI and Machine

play00:39

Learning at Google Cloud.

play00:41

She, of course, has published hundreds of papers.

play00:43

And perhaps one of the ones that a lot of people know her for

play00:46

is ImageNet, which as you all know,

play00:50

has ushered in the deep learning AI revolution

play00:52

that we're all in today.

play00:54

She also serves as a special advisor

play00:56

to the Secretary General of the United Nations,

play00:59

and is also a member of the National AI Resource Task Force

play01:05

for the White House Office of Science and Technology.

play01:09

And recently, she also has published her book

play01:12

titled, The World I See--

play01:13

Curiosity, Exploration, and Discovery at the Dawn of AI.

play01:17

And I'm sure she'll be talking about parts of that book today.

play01:21

So with that, Fei-Fei, welcome to the University

play01:23

of Washington.

play01:24

FEI-FEI LI: Thank you.

play01:25

[APPLAUSE]

play01:29

Thank you.

play01:30

Thank you.

play01:31

Well, it's quite an honor to be here.

play01:33

Actually it's as a professor one of the greatest joys and honor

play01:38

is to work with the best students,

play01:40

and see how their career has grown.

play01:42

And so being invited by Ranjay and his colleagues

play01:46

is really very special.

play01:48

And I'm just loving all the energy I've

play01:51

seen throughout the day today.

play01:53

So OK, I want to share with you a talk that

play01:58

is a little bit meant at the high level

play02:00

and an overview of what I have done over the years,

play02:04

through the lens of computer vision

play02:06

and the development of AI.

play02:09

So the title is "What We See & What We Value--

play02:12

AI with a Human Perspective."

play02:15

I'm going to take you back to history a little bit.

play02:18

And when I say history, I meant 540 million years ago.

play02:23

So 540 million years ago, what was that?

play02:26

Well, the Earth is a primordial soup.

play02:30

And it's all living things live in the water.

play02:34

And there aren't that many of them.

play02:36

They are just simple animals floating around.

play02:40

But something really strange happened

play02:42

in geologically a very short period of time,

play02:46

about 10 million years, is from fossil studies scientists have

play02:51

found there is an explosion of the number of animal

play02:55

species around that time.

play02:57

So much that that period is called the Cambrian

play03:01

Explosion or some people call it the Big Bang of evolution.

play03:06

And so what happened?

play03:08

Why suddenly when life was so chill and simple,

play03:12

not too many animals, why life went from that

play03:15

picture to an explosive number of animal species?

play03:20

Well, there are many theories, from climate change

play03:24

to chemical composition of the water, to many things.

play03:28

But one of the leading theories of that Cambrian Explosion

play03:33

is by Andrew Parker, a zoologist from Australia.

play03:37

He conjectured that this speciation explosion

play03:40

is triggered by the sudden evolution of vision, which

play03:44

sets off an evolutionary arms race where animals either

play03:49

evolved or died.

play03:51

Basically, he's saying as soon as you see the first light,

play03:56

you see the world in fundamentally different ways.

play04:00

You can see food.

play04:02

You can see shelter.

play04:03

You can become someone's food.

play04:06

And they would actively prey on you.

play04:08

And you have to actively interact and engage

play04:12

with the world in order to survive and reproduce,

play04:14

and so on.

play04:16

So from that point on, 540 million years to today, vision,

play04:22

visual intelligence has become a cornerstone of the development

play04:26

and the evolution of nervous system of animal intelligence.

play04:30

All the way to, of course, the most incredible visual machine

play04:35

we know in the universe which is the human vision.

play04:39

And whether we're talking about people and many animals,

play04:43

we use vision to navigate the world, to live life,

play04:49

to communicate, to entertain ourselves,

play04:52

to socialize, to do so many things.

play04:56

Well, that was a very brief history of nature's vision.

play05:01

What about computer vision?

play05:03

The history of computer vision is a little shorter

play05:06

than evolution.

play05:07

Urban legend goes around 60 years ago, 1966 I think,

play05:13

that there was one ambitious MIT professor who said, well,

play05:18

AI as a field has been born.

play05:20

And it looks like it's going well.

play05:23

I think we can just solve vision in a summer.

play05:26

In fact, we'll solve vision by using our summer workers,

play05:30

undergrads, and we'll just spend this one

play05:34

summer to create or construct a significant part

play05:39

of visual system.

play05:42

This is not a frivolous conjecture.

play05:45

I actually sympathize with him.

play05:47

Because for humans when you open your eyes,

play05:51

it feels so effortless to see.

play05:55

It feels that as soon as you open your eyes,

play05:59

the whole world's information is in front of you.

play06:02

So it turned out to be an underestimation

play06:08

of how hard it is to construct the visual system.

play06:11

But it was a heroic effort.

play06:14

They didn't solve vision in a summer, not even

play06:18

a tiny bit of vision.

play06:19

But 60 years later, vision today has become a very thriving

play06:25

field, both academically as well as in our technology world.

play06:30

I'm just showing you a couple of examples of where we are.

play06:33

Right?

play06:34

We have visual applications everywhere.

play06:37

We're dreaming of self-driving cars, which hopefully

play06:40

will happen in our lifetime.

play06:42

We are using image classification or image

play06:45

recognition and so many image technologies for many things

play06:49

from, health care to just daily lives.

play06:52

And generative AI has brought a whole new wave

play06:57

of visual applications and breakthroughs.

play07:00

So the rest of the talk is organized

play07:03

to answer this question.

play07:04

Where have we come from and where are we heading to?

play07:08

And I want to share with you three major theses

play07:13

of the work that I have been doing

play07:15

in my career in recent few years,

play07:22

and just to share with you what I think.

play07:25

Let's begin with building AI to see what humans see.

play07:30

Why do we do that?

play07:31

Because humans are really good at seeing.

play07:33

This is a 1970s cognitive science experiment

play07:37

to show you how good humans are.

play07:39

Every frame is refreshed at 10 Hertz, 100

play07:43

milliseconds of presentation.

play07:45

If I ask you as audience, I assume given how young you

play07:48

are-- you're not even born then--

play07:50

you've never seen this video.

play07:52

Nod your head when you see one frame that has a person in it.

play07:57

You will see it.

play07:58

Yeah, OK.

play08:00

You've never seen this video.

play08:02

I didn't tell you what the person looked like.

play08:03

I didn't tell you which frame it will appear.

play08:06

You have no idea-- the gesture, the clothes,

play08:09

everything about this.

play08:10

Yet, you're so good at detecting this person.

play08:14

Around the turn of the century, a group of French researchers

play08:18

have put a time on this effortlessness.

play08:22

It turned out seeing complex objects or complex categories

play08:27

for humans is not only effortless and accurate,

play08:31

it's fast.

play08:33

150 milliseconds after the onset of a complex photo,

play08:39

either containing animals or not containing

play08:41

animals, humans you can measure brain signal that

play08:46

shows that differential signal of pictures, of scene pictures

play08:50

with animals and scene pictures without animals.

play08:53

It means that it takes about 150 milliseconds in our wetware,

play08:58

right here, from the photons landing on your retina

play09:03

to the decision that you can make accurately.

play09:06

I know this sounds slow for silicons.

play09:10

But for our brain, for those of you

play09:12

who come from a little bit of neuroscience background,

play09:15

this is actually super fast.

play09:17

It takes about 10 stages of spikes

play09:21

from passing from neuron to neuron to get here.

play09:24

So it's a very interesting measurement.

play09:28

At around the same time, neurophysiologists,

play09:32

so we've had psychologists telling us humans

play09:36

are really good at seeing objects.

play09:38

We've got neuroscientists telling us not only

play09:43

we're good at it, we're fast.

play09:45

Now, this last set of study, also neurophysiologists

play09:49

use MRI study to tell us, because evolution

play09:52

has optimized recognition so much that we have dedicated

play09:57

neural correlates in the brain, areas that specializes

play10:02

in visual recognition.

play10:04

For example, the fusiform face area,

play10:07

or the parahippocampal place area--

play10:10

these are areas that we see objects and scenes.

play10:14

So what all this has told us, this research from the '70s,

play10:20

'80s, and '90s have told us, is that objects

play10:23

are really important for visual intelligence.

play10:26

It's a building block for people.

play10:29

And it's become a North Star for what vision needs to do.

play10:35

It's not all the North Stars,

play10:37

but it's one important North Star.

play10:39

And that has guided the early phase of my own research

play10:46

as well as the field of computer vision.

play10:48

As a field, we identified that object recognition, object

play10:52

categorization, is an important problem.

play10:55

And it's a mathematically really challenging problem.

play10:58

It's effortless for us.

play10:59

But to recognize, say, a cute animal wombat,

play11:03

you actually have mathematically infinite

play11:07

way of rendering this animal wombat from 3D to the 2D

play11:13

pixels, whether it's lighting and texture

play11:18

variations, or background clutter and occlusion

play11:21

variations, or viewing angle camera angle occlusions,

play11:25

and so on.

play11:25

So it's mathematically a really hard problem.

play11:28

So what did we do as a field?

play11:31

I summarized the progress of object recognition

play11:36

in three phases.

play11:38

The first phase was concurrent.

play11:42

It's a very early phase, concurrent with this cognitive

play11:45

studies is what I call the hand-designed features

play11:49

of models.

play11:49

This is where very smart researchers use

play11:53

their own sheer power of their brain

play11:55

to design the kind of building blocks of objects,

play11:59

as well as the model, the parameters, and so on.

play12:03

So we see Geon theory.

play12:04

We see generalized cylinder.

play12:06

We see parts and springs models.

play12:11

And these are in the '70s, '80s, or early '90s.

play12:14

They're beautiful theory.

play12:16

They're mathematically beautiful models.

play12:18

But the thing is, they don't work.

play12:21

They're theoretically beautiful.

play12:24

Then there's a second phase, which

play12:26

I think is the most important phase actually,

play12:29

leading up to deep learning, which is machine learning.

play12:33

It's when we have introduced machine learning

play12:36

as a statistical modeling technique,

play12:40

but the input of these models are hand-designed features

play12:44

like patches, and parts of objects

play12:47

that are meant to carry a lot of semantic information.

play12:51

And the idea is that in order to recognize something

play12:55

like a human body, or a face, or whatever, a chair--

play12:59

it's important to get these patches that contains

play13:03

ears and eyes and whatever.

play13:04

And then you use machine learning models

play13:07

to learn the parameters that stitch them together.

play13:10

And this is when the whole field has experimented

play13:15

with many different kinds of statistical models

play13:18

from Bayes Net, support vector machine, boosting,

play13:24

conditional random field, random forest, and neural network.

play13:28

But this is the first phase of that.

play13:30

Something also important happened concurrently

play13:33

with this phase is actually the recognition of data.

play13:38

In the early years of the 21st century,

play13:44

the field of computer vision recognized

play13:46

it's important to have benchmarking data sets,

play13:49

data sets like the PASCAL VOC data

play13:52

set, the Caltech 101 data set.

play13:54

That is meant to measure the progress of the field.

play13:59

And it turned out they can also become

play14:03

some level of training data.

play14:06

But they're very small.

play14:07

They're in the order of hundreds and thousands of pictures,

play14:11

and a handful of object categories.

play14:16

Personally for me, this was around

play14:19

the time I stumbled upon a very incredible number.

play14:23

I call it, if you read my book, I call it the Biederman number.

play14:26

Professor Biederman who sadly just passed away a year ago,

play14:30

is a cognitive psychologist studying vision and thinking

play14:34

about the scale and scope of human visual intelligence.

play14:38

And back of envelope, he put a guesstimate of humans

play14:43

can recognize 30,000 to 100,000 object categories

play14:48

in their lives.

play14:50

And it's not a verified number.

play14:54

It's very hard to verify.

play14:55

This is a conjecture in one of his papers.

play14:58

And he also went on to say that by age 6,

play15:02

you actually learn pretty much all the visual categories

play15:05

that a grown-up has learned.

play15:07

This is an incredible speed of learning, a dozen a day or so.
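
A quick back-of-envelope check of that learning rate, assuming roughly 30,000 categories acquired over the first six years of life:

```python
# Back-of-envelope check of the Biederman number: ~30,000 categories learned by age 6.
categories = 30_000
days = 6 * 365
print(categories / days)  # ~13.7 new visual categories per day, i.e. roughly "a dozen a day"
```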

play15:12

So this number bugged me a lot because it just

play15:17

doesn't compare to all the data sets we've seen at that point.

play15:21

And that was the reason, the inception of ImageNet,

play15:25

that we recognized, my students, Jordan, and collaborators,

play15:29

and I recognize that there's a new way of thinking

play15:33

about visual intelligence.

play15:34

It's deeply, deeply data driven.

play15:37

And it's not just the size of the data.

play15:39

It's the diversity of data.

play15:41

And this is really history.

play15:42

You all know what ImageNet is.

play15:44

And it also brought back the most important family

play15:50

of algorithm that is high capacity,

play15:53

and needs to be data driven, which

play15:55

is convolutional or neural network algorithm.

play15:58

And in the case of vision, we started

play16:01

with convolutional neural network.

play16:03

For those of you who are very young students,

play16:06

you probably don't remember this.

play16:08

But even when I was a graduate student

play16:11

at the turn of the century, convolutional neural network

play16:14

was considered a classic algorithm,

play16:18

meaning it was pretty old.

play16:20

And it didn't work.

play16:21

But we still studied it when I was a graduate student.

play16:24

It was incredible to see how data and the new techniques

play16:29

revitalized this whole family of algorithms.

play16:32

And for this audience, I'm going to skip.

play16:34

This is really too trivial.

play16:36

But what happened is that this brought us the third phase

play16:41

of object recognition.

play16:43

And I would say more or less, quite a triumphant phase

play16:47

of object recognition, where using big data as training

play16:51

and convolutional neural network,

play16:53

we're able to recognize objects in the wild in a way that

play16:58

the first two phases couldn't.

play17:00

And these are just examples.

play17:02

And of course, the most incredible moment,

play17:06

even for myself who was behind ImageNet,

play17:10

was 2012 when Professor Geoff Hinton and his students,

play17:14

very famous students, have written this defining

play17:19

paper as the beginning of the deep learning revolution.

play17:23

And ever since then, vision as a field and ImageNet as a data

play17:29

set has really been driven a lot of the algorithm advances

play17:34

in the pre-transformer era of deep learning.

play17:39

And very proudly as a field, even work like ResNet,

play17:44

were the precursors of many of the attention

play17:48

is all you need paper.

play17:52

So vision as a field has contributed a lot

play17:55

to deep learning evolution.

play17:58

OK, so let me fast forward.

play18:01

As researchers, after ImageNet, we

play18:04

were thinking about what is beyond object recognition.

play18:08

And this is really Ranjay's thesis work,

play18:10

is that the world is not just defined by object identities.

play18:16

If it were, these two pictures both contain

play18:20

a person and a llama, would mean the same thing.

play18:23

But they don't.

play18:27

I'd rather be the person on the left

play18:30

than the person on the right.

play18:32

Actually, I'd rather be the llama

play18:33

on the left than the llama on the right as well.

play18:37

So objects are important, but relationships, context,

play18:44

and structure and compositionality of the scene

play18:47

are all part of the richness of visual intelligence.

play18:51

And the image, that was not enough to push forward

play18:55

this kind of research.

play18:56

So again, heroically Ranjay was really

play19:00

the key student who was pushing a new way of thinking

play19:04

about images and visual representation,

play19:08

mostly focusing on visual relationships.

play19:11

So the way Ranjay and we put together the next wave of work

play19:19

was through scene graph representation.

play19:21

We recognize the entities of the scene in the unit of objects,

play19:28

but also their own attributes as well as

play19:31

the inter-entity relationships.

play19:33

And we made a data set-- it was a lot of work--

play19:38

called Visual Genome.

play19:40

That consisted of hundreds of thousands of images,

play19:46

but millions of relationships, attributes, objects,

play19:50

and even natural language descriptions

play19:54

of the images as a way to capture the richness

play19:58

of the visual world.

play20:00

There are many works that came out of Visual Genome,

play20:04

and a lot of them were written by Ranjay.

play20:06

But one of my favorite works is this one-shot learning

play20:10

of visual relationships that Ranjay

play20:14

did where you use the compositionality of the objects

play20:21

to learn relationships like people riding

play20:24

horse, people wearing hats.

play20:26

But what comes out of it with the compositionality is almost

play20:30

for free, is the capability of recognizing

play20:33

long-tail relationships that you will never

play20:36

have enough training examples.

play20:38

But you're able to do it during inference,

play20:40

which is like horse wearing hat, or person

play20:44

sitting on fire hydrant.

play20:46

And that really taps into the relationship as well as

play20:50

the compositionality of images.
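
As a rough illustration of scene-graph triples and of why compositionality helps with long-tail relationships, here is a small sketch. The triple format and the toy "composable" check are my own simplification, not the Visual Genome schema or the actual model from this work.

```python
# Sketch of scene-graph triples and compositional reuse (an illustrative simplification).
from dataclasses import dataclass

@dataclass(frozen=True)
class Relationship:
    subject: str    # e.g. "person"
    predicate: str  # e.g. "riding"
    obj: str        # e.g. "horse"

# Relationships that plausibly have many training examples.
seen = {
    Relationship("person", "riding", "horse"),
    Relationship("person", "wearing", "hat"),
}

def is_composable(r: Relationship, seen: set) -> bool:
    """A novel triple is composable if each of its parts was observed in *some* seen triple."""
    entities = {s.subject for s in seen} | {s.obj for s in seen}
    predicates = {s.predicate for s in seen}
    return r.subject in entities and r.predicate in predicates and r.obj in entities

# A long-tail relationship never seen during training can still be assembled from known parts:
print(is_composable(Relationship("horse", "wearing", "hat"), seen))  # True
```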

play20:53

And yeah, there were some quantitative measurement that

play20:57

shows our work at that time-- now it's ancient time--

play21:00

that does better than the state of the art.

play21:04

We also went beyond just a contrived labeling

play21:09

of objects or relationships that went into natural language.

play21:13

And there was a series of papers started with my former student

play21:17

Andrej Karpathy, many of you know, Justin Johnson,

play21:20

on image captioning, dense captioning, paragraph

play21:24

generation.

play21:25

I want to say one thing that shows you

play21:27

how badly at least me or oftentimes scientists

play21:32

predicts the future.

play21:33

When I was a graduate student, when

play21:36

I was about to graduate, 2005, I remember

play21:40

it was very clear to me my life dream as a computer vision

play21:45

scientist was to, when I die, I want

play21:49

to see computers can tell a story from a picture.

play21:55

That was my life's dream.

play21:56

I feel that if we can put a picture into the computer

play22:00

and it will tell us what's happening, a story,

play22:04

we've achieved the goal of computer vision.

play22:08

I never dreamed less than 10 years, just around 10 years

play22:12

after my graduation, this dream was realized collectively,

play22:17

including my own lab, by LSTM at that point, and CNNs.

play22:25

It was just quite a remarkable moment for me to realize.

play22:29

First of all, it's kind of the wrong dream

play22:31

to say that that's the end of the computer vision

play22:34

achievement.

play22:35

Second, I didn't know how fast it would come.

play22:38

So be careful what you dream of.

play22:40

That was the moral of the story.

play22:43

But static relationships are easier.

play22:48

Real world is full of dynamic relationships.

play22:51

Dynamic relationships are much more nuanced

play22:54

and more difficult. So this is fairly recent work.

play22:58

It was I think at NeurIPS two years ago.

play23:01

And we're still doing this work on multi-object, multi-actor

play23:07

activity recognition or understanding.

play23:10

And that is an ongoing work.

play23:12

I'm not going to get into the technical details.

play23:15

But the video understanding, especially

play23:19

with this level of nuance and details, still excites me.

play23:23

And it's an unsolved problem.

play23:26

I also want to say that vision as a field has been exciting,

play23:30

not only because I'm doing some work in it.

play23:32

It's because some other people's work.

play23:34

And none of these are my own work.

play23:37

But I find that the recent progress

play23:41

in 3D vision, in pose estimation,

play23:44

in image segmentation, with Facebook SAM and all

play23:49

the generative AI work has been just incredible progress.

play23:55

So we're not done with building AI to see what humans see.

play23:59

But we have gone a long way.

play24:01

And part of that is the result of data,

play24:04

compute, algorithms, like neural networks

play24:07

that really brought this deep learning revolution.

play24:10

And as a computer vision scientist,

play24:12

I'm very proud that our field has contributed to this.

play24:16

And AI's development has been and I

play24:19

continue to believe will be inspired by brain

play24:22

sciences and human cognition.

play24:23

And for this section, I'm very appreciative of all

play24:28

the collaborators, current and former students,

play24:31

and Ranjay you're a part of them, who has contributed.

play24:35

Let's just fast forward to building AI

play24:37

to see what humans don't see.

play24:39

Well, I just told you humans are super good.

play24:42

But I didn't tell you that we're not good enough.

play24:45

For example, I don't know about you,

play24:48

but I don't think I can recognize all these dinosaurs.

play24:52

And in fact, recognizing very fine-grained objects is not

play24:59

something humans are good at.

play25:01

There are more than 10,000 types of birds in the world.

play25:05

We put together or we got our hands

play25:08

on a data set of 4,000 types of birds.

play25:11

And humans typically fail miserably

play25:13

in recognizing all species of birds.

play25:17

And this is an area called fine-grained object

play25:20

categorization.

play25:21

And in fact, it's quite exciting to think

play25:23

about computers at this point can go beyond human ability

play25:27

to train detectors, object detectors, that

play25:30

can do much finer grain understanding of objects

play25:35

beyond humans.

play25:36

And one of the application papers

play25:38

we did which I find very fascinating,

play25:42

is a fine-grained car recognition.

play25:44

We downloaded 3,000 types of cars, separated by make, model,

play25:50

year that's ever been built, starting from the 1970s.

play25:55

We stopped before Tesla was popular.

play25:58

So people always ask me this question.

play26:00

Where's Tesla?

play26:01

We don't have Tesla.

play26:02

And after we trained the fine-grained object detector

play26:07

for thousands of cars, 3,000 of cars,

play26:09

we downloaded street view pictures

play26:13

of 100 American cities, most populated cities,

play26:17

two per state.

play26:19

And we also correlated this with all the census

play26:24

data that came out of 2010.

play26:26

And it's incredible to see the world through vision as a lens,

play26:31

the correlation between car detection and human society

play26:37

is stunning, including income, including education level,

play26:43

including voting patterns.

play26:45

We have a long paper that has dozens and dozens

play26:48

of these correlations.

play26:49

So I just want to show you that even though we don't see it

play26:54

with our individual eyes, but computers can help us see

play26:58

our world, see our society through these kind of lenses

play27:01

in ways that humans can't.
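
For readers curious about the shape of that analysis, here is a hedged sketch of correlating a per-city car-detection statistic with a census variable. The column names and numbers below are invented placeholders, and Pearson correlation is just one plausible choice; the paper's actual methodology is more involved.

```python
# Sketch: correlate a per-city car-detection statistic with a census variable.
# The columns (e.g. "pct_sedans", "median_income") and values are hypothetical.
import pandas as pd

cities = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "pct_sedans": [0.62, 0.55, 0.71, 0.48],             # fraction of detected cars that are sedans
    "median_income": [54_000, 61_000, 48_000, 75_000],  # from census data
})

r = cities["pct_sedans"].corr(cities["median_income"])  # Pearson correlation
print(f"correlation between detected car mix and income: {r:.2f}")
```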

play27:04

OK, to drive home this idea that humans are not that good, even

play27:08

though 10 minutes ago I told you're so good,

play27:11

is this visual illusion called Stroop test.

play27:15

Try to read out to yourself the color of the word, not

play27:21

the word itself.

play27:23

Go left to right and top to bottom, as fast as possible.

play27:27

It's really hard, right?

play27:28

I have to do red, orange, green, blah, blah, blah.

play27:35

That's a fun visual illusion.

play27:37

This one some of you probably have seen.

play27:40

These are two alternating pictures.

play27:42

They look like the same but there's

play27:44

one big chunk that's different.

play27:46

Can you tell?

play27:49

Raise your hand if you can.

play27:52

It's an IQ test.

play27:53

[LAUGHTER]

play27:55

Um, so all the faculty were thinking, oh no.

play28:01

I didn't raise my hand.

play28:04

OK, so it's the engine.

play28:09

Oh.

play28:10

OK, so it's a huge chunk.

play28:13

This has landed on your retina.

play28:16

And you completely missed it.

play28:19

OK, good job.

play28:20

[LAUGHTER]

play28:22

It's not that funny, if it's in the real world, when

play28:26

it's a high stake situation.

play28:28

Whether you're passing through airport security or doing

play28:32

surgeries.

play28:34

So actually not seeing can have dire consequences.

play28:38

Medical error is the third-leading cause

play28:40

of American patients' deaths annually.

play28:44

And in surgery rooms, accounting for all the instruments

play28:49

and glasses and all that is actually a critical task.

play28:53

If something is missing, on average

play28:56

a surgery will stop for more than one hour,

play28:59

so that the nurses and doctors have

play29:02

to identify where the thing is, and think about all the life

play29:05

risk to the patient.

play29:06

And what do we do today?

play29:08

We use hand and count.

play29:11

And imagine if we can use computer vision

play29:15

to automatically assist our doctors

play29:17

and nurses to account for small instruments

play29:21

in a surgical setting.

play29:23

That would be very helpful.

play29:25

And this is an ongoing collaboration

play29:27

between my lab's health care team and Stanford Hospital

play29:33

Surgery Department.

play29:34

This is a demo of accounting for these glasses

play29:38

during a surgical scenario.

play29:42

And this would, if this becomes mature technology,

play29:45

I really hope this would have really good application

play29:48

for these kind of uses.

play29:51

Sometimes seeing is not just attention.

play29:56

Every example I just showed you there seemed

play29:59

to be attentional deficit.

play30:00

But sometimes seeing is more profound, or not seeing

play30:03

is more profound.

play30:04

This is my really favorite visual illusion,

play30:07

since I was a graduate student, made by Ted Adelson at MIT.

play30:12

And I'm just showing you the answer.

play30:14

This checkerboard illusion, if you

play30:16

look at the top graph checkerboard A and B,

play30:21

no matter what I tell you they look

play30:23

like different gray scales, right?

play30:27

I mean, how could they on Earth have the same gray scale.

play30:31

But if I added this, you see that they're

play30:35

the same gray scale.

play30:36

So this is a visual illusion.

play30:38

Even if you know the answer, It's

play30:40

hard to not be tricked by your eyes.

play30:44

For those of you who are old enough, who do you see here?

play30:49

AUDIENCE: Bill Clinton and Al Gore.

play30:51

FEI-FEI LI: Clinton and Gore, right?

play30:53

Is it?

play30:55

Is it Clinton and Gore?

play31:00

So it turned out they are Clinton and Clinton.

play31:03

And it's a copy of Clinton's face in Gore's hair,

play31:07

and in a context, that it is very primed

play31:14

for all of us to see them as Clinton and Gore.

play31:18

So being primed is a fundamental thing of human bias.

play31:23

And in computer vision, we have also inherited,

play31:27

if we're not careful, computer vision

play31:29

has inherited human bias, especially through data sets.

play31:34

So Joy Buolamwini used to be at MIT,

play31:38

had written this beautiful poem that exposes

play31:42

the bias of computer vision.

play31:45

So I'm not nearly as a leading expert

play31:49

as Joy and many other people are.

play31:51

But it's important to point out that not seeing has

play31:55

consequences.

play31:56

And we need to work really hard to combat

play32:00

these biases that creep into computer vision and AI systems.

play32:04

And these are just really examples of hundreds

play32:07

and hundreds of thousands of papers

play32:09

and work people are doing in combating biases.

play32:12

Now on the flip side, sometimes not seeing is a must,

play32:18

as seeing too much is also really bad.

play32:22

This brings us to the value of privacy.

play32:24

And my lab has been actually doing

play32:27

quite a bit in the context of health care,

play32:30

but quite a bit of privacy computing in the past few years

play32:33

in terms of how we can protect human dignity, human identity,

play32:38

in computer vision context.

play32:40

One of my favorite works that's not led by me

play32:43

is by Juan Carlos Niebles.

play32:44

That combines both hardware and software

play32:48

to try to protect human privacy while still recognizing

play32:53

human behaviors that are important.

play32:56

The idea is the following.

play32:57

If you want to look at what humans do,

play33:00

you take a camera you shoot a video and you analyze it.

play33:03

In this case, a baby is pushing a box.

play33:06

What if you don't want to reveal this kid?

play33:10

What if you don't want to reveal the environment?

play33:13

Can you design a lens that blurs the raw signal, like you never

play33:18

take the pure pixel signal?

play33:21

What if the designed lens gives you a signal like that?

play33:24

So for humans, you don't even see the baby.

play33:27

Well, that's exactly what they did.

play33:29

They designed a warped lens.

play33:33

And the lens gives you a raw signal in the top row.

play33:37

But they also designed an algorithm

play33:40

that retrieves not super resolution,

play33:44

they have no intention to recover

play33:47

the identity of the people, but just to recover

play33:50

the activity they need to know.

play33:52

This way their combined hardware-software approach

play33:57

not only protects privacy, but also reads out

play34:00

the insight that whether you're in transportation cases

play34:03

or health care cases, that is relevant to the application

play34:07

users.
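
A minimal sketch of the software half of that idea, with one big assumption: the published system uses a purpose-designed optical element, whereas here aggressive downsampling stands in for the lens. The point is only that an activity classifier can be trained on signals too degraded to reveal identity.

```python
# Sketch: classify activity from heavily degraded frames (identity is not recoverable).
# The downsampling here is a stand-in for the purpose-built privacy-preserving lens.
import torch
import torch.nn as nn
import torch.nn.functional as F

def degrade(frames: torch.Tensor, size: int = 8) -> torch.Tensor:
    """Reduce each frame to a size x size blur, discarding identity-level detail."""
    return F.interpolate(frames, size=(size, size), mode="bilinear", align_corners=False)

class ActivityHead(nn.Module):
    def __init__(self, num_activities: int = 5):  # number of activity classes is an assumption
        super().__init__()
        self.net = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 64), nn.ReLU(),
                                 nn.Linear(64, num_activities))

    def forward(self, degraded_frames: torch.Tensor) -> torch.Tensor:
        return self.net(degraded_frames)

# Usage: raw RGB frames are degraded before any model (or person) ever sees them.
raw = torch.randn(2, 3, 224, 224)
logits = ActivityHead()(degrade(raw))
print(logits.shape)  # torch.Size([2, 5])
```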

play34:08

So building AI to see what humans don't see

play34:12

is part of computer vision's quest.

play34:16

It's also important to recognize sometimes what humans

play34:20

don't see is bad, like bias.

play34:23

But we also want to make computer not

play34:29

see the things that we want to preserve privacy for.

play34:33

So in general, AI can amplify and exacerbate

play34:38

many profound issues that has plagued human society for ages,

play34:44

and we must commit to study and forecast

play34:47

and guide AI's impact for human and society.

play34:51

And many students and former students

play34:53

have contributed to this part of the work.

play34:56

Let's talk about building AI to see what humans want to see.

play35:01

And this is where really putting humans

play35:04

more in the center of designing technology to truly help us.

play35:10

When you hear the word AI, well, you're

play35:13

kind of a biased audience.

play35:15

But when the general public hears

play35:17

about AI today, what is the number one

play35:19

thing that comes to their mind?

play35:22

Anxiety, right?

play35:24

A lot of that anxiety is labor landscape, jobs.

play35:30

And this is very important.

play35:32

And if you go to headlines of news,

play35:35

every other day we see that.

play35:37

But there is actually a lot of cases

play35:40

where human labor is in dire shortage.

play35:44

And again, this brings me back to the health care industry

play35:47

that I also work with.

play35:50

America was missing at least 1 million nurses last year.

play35:56

And the situation is just worse and worse.

play35:59

I talked about the medical error situation

play36:02

in our health care system.

play36:04

The aging society is exacerbating the issue

play36:08

of lack of caretakers.

play36:09

And a lot of these burdens fell on women and people of color

play36:14

in very unfair ways.

play36:16

Care-taking is not even counted in GDP.

play36:19

So instead of thinking about AI replacing human capability,

play36:24

it is really valuable to think about AI augmenting humans,

play36:28

and to lift human jobs, and to also

play36:31

give human a hand, especially health care

play36:35

from a vision perspective.

play36:37

There are so many times and so many scenarios

play36:40

that we're in the dark.

play36:42

We don't know how the patient is doing.

play36:44

We don't know if the care delivery is high quality.

play36:50

We don't know where that small instrument was

play36:53

missing in the surgical room.

play36:55

We don't know if we're making a pharmaceutical error that

play36:58

might have dire consequences.

play37:01

So in the past 10 years, my lab and I and my collaborators

play37:07

have started this fairly new area of research called

play37:10

ambient intelligence for health care, where

play37:13

we use smart sensors, mostly depth sensors and cameras,

play37:17

and machine learning algorithms to glean

play37:19

health critical insights.

play37:21

Most of this earlier work was summarized

play37:24

in this Nature article called "Illuminating

play37:27

the Dark Spaces of Healthcare with Ambient Intelligence."

play37:30

I'll just give you a couple of quick examples.

play37:33

One case study is hand hygiene.

play37:36

We started this work way before COVID.

play37:39

Everybody thought this is the most boring project.

play37:42

But when COVID came, it became so important.

play37:45

It turned out that hospital acquired

play37:47

infection kills three times more people in America

play37:53

than car accidents every year.

play37:55

And a lot of that is because of doctors and nurses

play38:00

carrying germs and bacteria from room to room.

play38:03

So WHO has very specific protocols for hand hygiene.

play38:09

But humans make mistakes.

play38:11

And now the way to monitor that by hospitals

play38:15

is very expensive, sparse, and disruptive.

play38:19

They put humans in front of--

play38:21

I don't know the patient rooms, and try to remind the doctors

play38:25

and nurses.

play38:26

You can see this is completely non-scalable.

play38:28

So my students and I have been collaborating

play38:33

with both Stanford Children's Hospital and Utah's

play38:36

Intermountain Hospital by putting depth sensors in front

play38:39

of these hand hygiene gel dispensers,

play38:43

and then using video analysis and activity recognition

play38:48

system to watch if the health care

play38:51

workers are doing the right thing for hand hygiene.

play38:54

And quantitatively, the bottom line

play38:57

is the ground truth of human behavior.

play39:00

You can see that the computer vision algorithm's precision

play39:04

and recall is very high compared to even human observers

play39:09

that we put in the hospital in front of the hospital room.
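
Precision and recall here are the standard detection metrics. A minimal sketch of how they would be computed against ground-truth hand-hygiene events follows; the event IDs are invented for illustration.

```python
# Sketch: precision/recall of predicted hand-hygiene events vs. ground truth.
def precision_recall(predicted: set, ground_truth: set):
    true_positives = len(predicted & ground_truth)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall

# Hypothetical event IDs (e.g. "clinician used dispenser before entering room 12").
ground_truth = {"e1", "e2", "e3", "e4"}
predicted    = {"e1", "e2", "e3", "e5"}
print(precision_recall(predicted, ground_truth))  # (0.75, 0.75)
```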

play39:13

Another example is ICU Patient Mobility Project

play39:18

where getting patients to move in the right way in the ICU

play39:24

is really important.

play39:26

It helps our patients to recover.

play39:27

And on top of that, ICU is so important.

play39:30

It's 1% of US GDP is spent in ICU.

play39:35

Health care is 18%.

play39:37

So this is where patients fight for death and life.

play39:41

And we want to help them to recover.

play39:45

We work with Stanford Hospital to put these sensors,

play39:49

again RGBD sensors in ICU rooms.

play39:53

And we study how the patients are being moved.

play39:58

Some of the important movements that doctors want patients

play40:03

to do include getting out of bed, getting in bed,

play40:06

getting in chair, getting out of chair, these things.

play40:09

And we can use computer vision algorithm

play40:11

to help the doctors and nurses to track these movements and so

play40:16

on.

play40:16

So this is, again, a preliminary work.

play40:19

Last but not least, aging in place.

play40:21

Aging is very important.

play40:23

But how do we keep our seniors safe, healthy, but also

play40:28

independent in their living?

play40:30

How do we call out early signs of whether it's

play40:35

infection or mobility change, sleep disorder, dietary issues?

play40:41

There are so many things.

play40:43

Computer vision plays a big role in this.

play40:46

We are just starting to collaborate actually

play40:49

with Thailand and Singapore right now

play40:51

to get these computer vision algorithms

play40:56

into the homes of seniors, but also keeping in mind

play40:59

the privacy concerns.

play41:01

So these are just examples.

play41:03

Last but not the least, I'm actually still

play41:06

very excited by the long future where I think no matter

play41:12

what we do, we probably will enter a world

play41:17

where robots collaborate with humans

play41:20

to make our lives better.

play41:22

So ambient intelligence is passive sensors.

play41:24

It can do certain things.

play41:26

But eventually I think embodied AI will be very, very important

play41:32

in helping people, whether it's firefighters, or doctors,

play41:36

or caretakers, or teachers, or so on.

play41:40

And technically, we need to close

play41:43

the loop between perception and action

play41:46

to bring robots or embodied AI to the world.

play41:49

Well, the gap is still pretty high.

play41:55

This is a robot.

play41:57

I think-- I don't know.

play42:00

It's a Boston Dynamics robot or some kind of robot.

play42:03

It's a pretty miserable robot trying to put a box

play42:08

and miserably failed.

play42:10

And I know there are so many-- robotic research is also

play42:15

really progressing really fast.

play42:17

So it's not fair to just show that one example.

play42:19

But in general, we are still a lot of robotic learning

play42:24

and robotic research right now is still

play42:26

on skill level tasks, short horizon goals,

play42:29

and closed world instruction.

play42:32

I want to share with you one work that

play42:35

at least was attempted towards robotic learning

play42:38

to open world instruction.

play42:40

It's still not fully closing all the gap,

play42:44

and I don't claim to do so.

play42:45

But at least we're working on one dimension.

play42:49

And that is some of you know our work

play42:52

VoxPoser, just released half a year ago.

play42:56

Where we look at a typical robotic task

play43:01

such as open the door, or whatever,

play43:04

a robotic task in the wild.

play43:06

And the idea in today's robotic learning is you give a task,

play43:13

and you try to give a training set,

play43:15

and then you try to train an action model.

play43:21

And then you test it.

play43:23

But the problem is, how do you generalize?

play43:27

How do you hope in the wild generalization?

play43:31

And how do you hope that instruction can be open world?

play43:34

And here's the result. The focus of this work

play43:41

is motion planning in the wild or using open vocabulary.

play43:46

And the idea is to actually borrow

play43:49

from large language models.

play43:52

From large language model, to compose the task,

play43:56

and from also a visual language model

play43:59

to identify the goal and also the obstacles,

play44:02

and then use a code generated 3D value map

play44:07

to guide to do motion planning.

play44:09

And I'm not going to get into this.

play44:10

But quickly, so once the robot takes the instruction,

play44:14

open the top drawer, you use LLM to compose the instruction.

play44:20

And because the LLM helps you to identify the objects as well

play44:26

as the actions, you can go use a VLM, visual language model,

play44:32

to identify the objects that you need in the world.

play44:36

Every time you do that, you're starting

play44:37

to update a planning map.

play44:41

And it helps to, in this case you identify the drawer.

play44:46

The maps sets some values and it focuses on the drawer.

play44:50

And if you give it an additional instruction of watch out

play44:54

for the vase, and it goes back to LLM and goes back to VLM,

play44:59

and they identify the vase.

play45:01

And then it identifies the planning path

play45:08

with the obstacle, and updates the value map,

play45:12

and recomputes the motion map, and do it

play45:19

recursively till it has more optimized this.

play45:23

So this is the example we see in simulation in real world.

play45:30

And there are several examples of doing

play45:34

this for articulated objects, deformable manipulations,

play45:40

as well as just everyday manipulation tasks.
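
Below is a heavily hedged sketch of that pipeline. Every helper is a hypothetical stub standing in for a real component (the language model, the vision-language model, the planner, the robot controller); the real VoxPoser system generates code that writes 3D value maps, which this sketch only gestures at.

```python
# Sketch of an LLM + VLM + value-map planning loop in the spirit of VoxPoser.
# Every helper below is a hypothetical stub, not a real API.
import numpy as np

def llm_decompose(instruction):
    # Stand-in for the LLM: split an instruction into (target, things_to_avoid) steps.
    return [("top drawer handle", ["vase"])]

def vlm_locate(name):
    # Stand-in for the VLM: return a voxel coordinate for a named object.
    lookup = {"top drawer handle": (40, 25, 10), "vase": (30, 25, 10)}
    return lookup.get(name, (0, 0, 0))

def build_value_map(goal, obstacles, shape=(50, 50, 50)):
    # Voxel value map: high value at the goal voxel, low value at obstacle voxels.
    value = np.zeros(shape)
    value[goal] = 1.0
    for obs in obstacles:
        value[obs] = -1.0
    return value

def plan_path(value_map, start):
    # Stand-in planner: greedily step toward the highest-value voxel, axis by axis.
    # A real planner would also steer away from the low-value (obstacle) regions.
    goal = np.unravel_index(np.argmax(value_map), value_map.shape)
    path, pos = [start], list(start)
    for axis in range(3):
        while pos[axis] != goal[axis]:
            pos[axis] += 1 if goal[axis] > pos[axis] else -1
            path.append(tuple(pos))
    return path

# "Open the top drawer, and watch out for the vase."
for target, avoid in llm_decompose("open the top drawer"):
    vmap = build_value_map(vlm_locate(target), [vlm_locate(a) for a in avoid])
    trajectory = plan_path(vmap, start=(0, 0, 0))
    print(f"{target}: {len(trajectory)} waypoints, ends at {trajectory[-1]}")
```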

play45:44

OK, in the last three minutes, let

play45:46

me just share with you one more project, then we're done.

play45:50

Is that even with VoxPoser, which

play45:54

I just showed you, and many other projects in my lab,

play45:57

I always feel in the back of my mind

play46:00

that compared to where I come from,

play46:02

which is the visual world, is these

play46:04

are very small scale data.

play46:06

Very small scale anecdotal experimental setup,

play46:10

and there is no standardization, and the tasks

play46:15

were more or less lab specific.

play46:17

And compared to the real world which

play46:23

is so complex, so dynamic, so variable, so interactive,

play46:28

and so multitasking it's just unsatisfying.

play46:31

And how do we make progress in robotic learning?

play46:34

Vision and NLP has already shown us

play46:39

that large data drives learning so much,

play46:43

and the kind of effective benchmarking drives learning.

play46:47

So how do we combine the goal of large data

play46:51

and effective benchmarking for robotic learning

play46:54

has been something on my mind.

play46:57

And this is the new project that we have been doing.

play47:00

Actually, it's not so new anymore,

play47:01

for the past three years called BEHAVIOR,

play47:05

benchmark for everyday household activities

play47:08

in virtual interactive ecological environments.

play47:12

And let me just cut to the chase.

play47:16

Instead of small anecdotal tasks that we want to train robots

play47:22

on, we want to do 1,000 tasks, 1,000 tasks that

play47:26

matter to people.

play47:27

So we started actually by a human centered approach.

play47:31

We literally go to thousands of people and ask them, would

play47:35

you like a robot to help you with--

play47:38

so let's try this.

play47:39

Would you like a robot to help you

play47:41

with cleaning kitchen floor?

play47:46

Yeah, sort of, mostly.

play47:48

OK.

play47:49

Shoveling snow?

play47:51

Yeah.

play47:52

Folding laundry?

play47:54

AUDIENCE: Yeah.

play47:55

FEI-FEI LI: Yeah, OK.

play47:56

Cooking breakfast?

play47:57

[INTERPOSING VOICES]

play47:59

FEI-FEI LI: OK, I don't know.

play48:00

I get mixed--

play48:02

Ranjay wants everything.

play48:06

I get mixed reviews.

play48:08

OK, this one, opening Christmas gift?

play48:11

AUDIENCE: No.

play48:11

FEI-FEI LI: Right, yeah exactly.

play48:13

OK, I'm glad you're not a robot, Ranjay.

play48:16

So we actually took this human centered approach.

play48:19

We went to the government data of American

play48:23

and other countries human's daily activities.

play48:28

We go to crowdsourcing platform like Amazon Mechanical Turk.

play48:32

We ask people what they want robots to do.

play48:35

And we rank thousands of tasks.

play48:38

And then we look at what people want help with,

play48:41

and what people don't want help with.

play48:43

It turned out cleaning, all kinds of cleaning people hate.

play48:46

But opening Christmas gift or buying a ring, or mix baby

play48:52

cereals, is actually really important for humans.

play48:55

We don't want robots help.

play48:57

So we took the top 1,000 tasks that people

play49:03

want robots help, and put together

play49:06

the list for BEHAVIOR data set.
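
A toy sketch of that ranking step, assuming the survey responses are scores of how much people want robot help with each task; the task names and numbers are invented for illustration.

```python
# Sketch: rank candidate tasks by average "I want robot help with this" survey score.
# Task names and scores are invented; BEHAVIOR keeps the top 1,000 ranked tasks.
survey = {
    "cleaning kitchen floor":  [5, 4, 5, 4],
    "shoveling snow":          [5, 5, 4, 5],
    "folding laundry":         [4, 5, 3, 4],
    "opening Christmas gifts": [1, 2, 1, 1],
}

ranked = sorted(survey, key=lambda t: sum(survey[t]) / len(survey[t]), reverse=True)
print(ranked)  # cleaning/shoveling near the top; opening gifts near the bottom
```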

play49:09

And then we actually scanned 50 real world environments

play49:16

across eight different things, like apartments, restaurants,

play49:19

grocery stores, offices, and so on.

play49:22

And this compared to one of my favorite works from UW,

play49:28

Objaverse, is very small.

play49:30

But we got thousands and thousands of object assets.

play49:34

And we created a simulation environment.

play49:42

OK, all right.

play49:45

I want to actually give credits to a lot of good work

play49:49

that came out of UW and many other places.

play49:52

So robotic simulation is actually

play49:54

a very interesting area of research and excellent work,

play49:58

like AI2-THOR, Habitat, SAPIEN have also been making

play50:04

a lot of contribution.

play50:05

We collaborated with NVIDIA, especially the Omniverse group,

play50:10

to try to focus on creating a realistic simulation

play50:16

environment for robotic learning that

play50:18

has the good physics, like thermal transitional lighting

play50:22

and all that; good perception which we did some user

play50:26

studies to show that we have very

play50:30

good perceptual experience; and also just interactions.

play50:36

And I'm not going to get into all the details.

play50:39

We did some comparisons and show the strength

play50:42

of this BEHAVIOR environment for training 1,000 robotic tasks.

play50:49

And right now we are working on a whole bunch of work

play50:54

that is involving benchmarking, robotic learning,

play50:58

multi-sensory robotics, and even economic studies

play51:05

on the impact of household robots.

play51:10

And OK, I actually want to say one thing I'm not showing here.

play51:17

Is that we are actually doing brain

play51:21

robotic interfacing, using BEHAVIOR environment

play51:25

to use EEG to drive robotic arms to show the brain robot

play51:32

interface.

play51:33

And that was just published this quarter.

play51:35

So I didn't include this slide.

play51:38

So BEHAVIOR is becoming a very rich research environment

play51:45

hopefully for our community, but at least for our lab's

play51:49

robotic work.

And of course, the goal is that one day we'll close the gap between robotics research and collaborative robots-- home robots that can help people. This part of the research is really about identifying problems, whether in health care or embodied AI, where we want to build AI that can see and also do what humans want it to do, whether that's helping patients or helping the elderly. I think the key emphasis is really augmentation. A lot of collaborators have participated in this part of the work.

This really summarizes the three phases of our work-- or three different types of our work-- and all of it has accumulated into what I would call a human-centered AI approach, where we recognize that it's so important to develop AI with a concern for human impact, it's so important to focus AI on augmenting and enhancing humans, and it's intellectually still important to be inspired by human intelligence, cognitive science, and neuroscience. That was really the foundation of Stanford's Human-Centered AI Institute, which I co-founded and launched five years ago with faculty from English, Medicine, Economics, Linguistics, Philosophy, Political Science, the Law School, and so on. HAI has been around for almost five years now.

We do work ranging from the digital economy to the Center for Research on Foundation Models, where some of our faculty-- like Percy and Chris, you all know them-- are at the forefront of benchmarking and evaluating today's LLMs. We also work with faculty like Michael Bernstein, whom some of you know very well, on creating an ethics and society review process for AI research. And we focus on education: not only ethics-focused AI for our undergrads, but also bringing that education to the outside world, especially to policymakers as well as business executives. We directly engage with national policy-- Congress, the Senate, and the White House-- to advocate for public-sector AI investment, especially right now. In fact, UW is one of the partners, and senators from Washington state are extremely important for this, advocating for the next bill for a national AI research cloud.

So this really concludes my talk. That was a pretty dense, quick overview of a human-centered approach to AI, and I'm happy to take questions.

[APPLAUSE]

One more slide.

PRESENTER: We have time for maybe two questions.

AUDIENCE: What do you think the most interesting breakthrough in the next 5 or 10 years is going to be in computing?

FEI-FEI LI: The question is, what do I think the most interesting breakthrough in the next 5 or 10 years will be. I just told you in the talk, I'm so bad at predicting. So there are two things that do excite me. One is really just deepening AI's impact on so many applications in the world. It's not necessarily yet another transformer or anything; it's that we have gotten to a point where the technology has so much power and capability. We can use it to do scientific discovery, to make education more personalized, to help health care, to map out the biodiversity of our globe. So that deepening and widening of AI applications-- or, from an academic point of view, that deepening and widening of interdisciplinary AI-- is one thing that really excites me for the next 5 to 10 years. On the technology side, I'm totally biased: I think computer vision is due for another revolution. We're at the cusp of it. There's just so much that is converging, and I'm really excited to see the next wave of vision breakthroughs.

PRESENTER: Go ahead.

AUDIENCE: Large language models have been impressive because of what they have been able to do with semantic understanding. What do you think the frontier for images and computer vision is in that respect?

FEI-FEI LI: Yeah, this is a very good question. The question is: large language models encode semantics so well-- what's the frontier for images? So let me just say something. First of all, the world is fundamentally very rich. It's-- language-- Ranjay, don't yell at me-- I still think language is a lossy compression of the world. Language is very rich; it goes beyond just describing the world into reasoning, abstraction, creativity, intention, and all of this. But much of language is symbolic-- it is a compression-- whereas the world itself, in 3D and 4D, is very, very rich. And I think there needs to be a model: the world deserves a model, not just language. There needs to be a new wave of technology that really, fundamentally understands the structure of the world.

PRESENTER: OK, we have time for one more. Go ahead.

AUDIENCE: I really agree that language can be lossy, like a compression of the real world. I'm just wondering what your opinion is on how much English dominates the research field itself: all these labeled data sets are labeled in English, while other languages might have different ways of describing objects and the relationships between objects. How do you feel about that lack of diversity?

FEI-FEI LI: Right. So the question is about the bias toward English in the data sets dominating our AI. I think you're calling out a very important aspect of what I call inherited human bias, right? Our data sets inherit that kind of bias. I do want to say one thing-- this is not meant as a defense, it's a fun fact. When we were constructing ImageNet, because ImageNet was built on WordNet-- George Miller's lexical taxonomy, which has versions in many languages-- it was so nice and easy to map the synsets of English ImageNet to French, Italian, Spanish, and Portuguese. I think there were also Asian languages we used. So even though ImageNet may seem English to you, the data comes from all the languages we could get our hands on a license for.
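The cross-lingual mapping works because ImageNet categories are WordNet synsets, and multilingual wordnets attach lemmas in other languages to the same synset IDs. A minimal sketch with NLTK's Open Multilingual WordNet-- assuming the "wordnet" and "omw-1.4" corpora are available, and using "dog.n.01" purely as an example category:

```python
import nltk
from nltk.corpus import wordnet as wn

# One-time downloads for WordNet and the Open Multilingual WordNet.
nltk.download("wordnet", quiet=True)
nltk.download("omw-1.4", quiet=True)

synset = wn.synset("dog.n.01")  # example ImageNet-style category (a WordNet synset)
for lang in ["fra", "ita", "spa", "por"]:  # French, Italian, Spanish, Portuguese
    print(lang, synset.lemma_names(lang=lang))
```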

But that doesn't really solve the problem you're describing. I think you're right. We have to be really mindful. Even in the BEHAVIOR data set, when we were looking at human daily activities, we started with US government data, and we realized we were very biased-- first of all, you realize you're biased because there's so much TV watching in the data. Then we went to Europe, but that still does not include the Global South. So we're definitely still very biased.

PRESENTER: OK, I think that's all the time we have. Let's thank Fei-Fei.

FEI-FEI LI: Thank you.

[APPLAUSE]
