Tesla Autopilot's "General World Model" and Video Generation | Ashok Elluswamy's CVPR 2023 Keynote

瓦砾村夫
17 Mar 2024 · 19:54

Summary

TLDR: Ashok Elluswamy of Tesla's Autopilot team presented the team's progress in autonomous driving. The Full Self-Driving Beta software now runs on roughly 400,000 vehicles in the United States and Canada, which together have driven about 250 million miles on it. The driving stack is built primarily around eight cameras providing full 360-degree coverage and relies on modern machine learning, especially neural networks, to handle turns, traffic lights, and interactions with other objects. The team has also developed occupancy networks that predict which parts of 3D space are occupied, along with models that forecast the future motion of pedestrians and vehicles. Tesla is building a general world model, trained on huge numbers of video clips with state-of-the-art generative models, to predict the future more accurately. In addition, Tesla is developing Dojo, custom training hardware, to supply the massive compute these foundation models require.

Takeaways

  • 🚗 Tesla's Full Self-Driving (FSD) Beta software has been rolled out to roughly 400,000 vehicles in the US and Canada whose owners purchased it; together they have driven about 250 million miles on it.
  • 📸 At its core, FSD is a modern machine-learning system that relies mainly on the 360-degree view from the car's 8 cameras, unlike traditional self-driving stacks built on localization maps and radar.
  • 🧠 Tesla folds many self-driving components into neural networks, including large transformer models that perform spatial and temporal attention.
  • 🛣️ Tesla uses state-of-the-art generative models to predict lanes and moving objects in real time; the predictions are conditioned not just on camera video streams but also on the vehicle's own kinematics and navigation instructions.
  • 🔮 Tesla is developing a more general world model that predicts future states conditioned on the past, which could have a major impact on autonomous driving.
  • 🎓 This work depends on Tesla's powerful auto-labeling system, which processes millions of video clips from around the world to build precise 3D scene reconstructions and labels.
  • 🚦 The auto-labeling pipeline can accurately label traffic lights, lane lines, and other key elements without human input, greatly improving data-processing efficiency and precision.
  • 🌍 The technology is not limited to cars; it is designed to span different robotics platforms, demonstrating strong generality and adaptability.
  • 💻 To train these advanced models, Tesla is building toward world-leading compute capacity, including custom training hardware called Dojo.
  • 🤖 Tesla emphasizes that the core of this effort is a set of foundation models that understand the world's complexity, and that these models will advance significantly over the next 12 to 18 months.

Q & A

  • What is the core research direction of Tesla's Autopilot team?

    -Building foundation models for autonomous driving and robot autonomy, on a modern machine-learning stack that achieves full 360-degree coverage from cameras.

  • How many vehicles are running the Tesla FSD Beta software?

    -Roughly 400,000 vehicles.

  • How does Tesla's self-driving approach differ from traditional approaches?

    -It relies primarily on cameras and modern machine learning, rather than on the localization, maps, radar, and ultrasonic sensors that traditional stacks fuse together.

  • What is an occupancy network, and what role does it play in autonomous driving?

    -An occupancy network is a model that predicts whether each voxel of 3D space is occupied. It can represent arbitrary scenes without any specific labels or ontology design, and it is a key part of Tesla's self-driving stack.

  • How does Tesla predict and represent lanes?

    -Tesla uses state-of-the-art generative modeling, such as autoregressive transformers, to model lanes in a GPT-like way, representing them as vectors (polylines, splines, or polynomials) so they are easy to use in real time.

  • How does Tesla understand and predict moving objects?

    -By jointly considering camera video streams and other inputs, such as the ego vehicle's kinematics and navigation instructions, the system builds a full picture of moving objects, including their shapes and future motion.

  • How does Tesla's auto-labeling pipeline work?

    -It aggregates video clips and other data uploaded from many Tesla vehicles, reconstructs the complete 3D scene, and then runs further neural networks on the reconstruction to generate labels automatically.

  • How does Tesla handle emergency braking?

    -The system automatically detects potential collision risks, such as a vehicle running a stop sign or cutting across the lane, and brakes automatically to avoid a crash.

  • How will Tesla's future world model help autonomous driving?

    -Tesla is developing a neural-network world model that predicts the future from the past and can simulate different future scenarios, which will greatly strengthen the system's ability to handle complex scenes and unknown variables.

  • How does Tesla ensure it has enough compute to train its foundation models?

    -Tesla is putting its custom training hardware, Dojo, into production and plans to become one of the world's leading compute platforms, ensuring enough compute for training and experimentation.

Outlines

00:00

🚗 An innovative path to autonomous driving

Ashok Elluswamy introduces himself as a member of Tesla's Autopilot team and presents the team's latest work on autonomous driving and robotics. They have shipped the Full Self-Driving Beta software to users in the United States and Canada who purchased it, covering roughly 400,000 vehicles that have driven about 250 million miles. The system relies mainly on the 360-degree view from the car's eight cameras; unlike traditional self-driving approaches, it leans on modern machine learning rather than localization, radar, or ultrasonic sensors. He also introduces occupancy networks as a core part of the stack, emphasizing their generality and their understanding of 3D space.

05:01

🌐 Advanced models for handling road uncertainty

In the second segment, Ashok Elluswamy digs into the challenges of predicting lanes, particularly given their uncertainty and complexity. He describes using state-of-the-art generative modeling, such as autoregressive transformers applied in a GPT-like fashion, to attack the problem; this predicts lanes accurately in vector form, which is essential for real-time operation. He also covers how moving objects such as vehicles and pedestrians are handled, and stresses that the whole system is a modern machine-learning stack that enables end-to-end perception, greatly improving efficiency and accuracy.

10:02

🔮 Building foundation models for the future

The third segment describes improving the system's precision by reconstructing accurate 3D scenes and automatically generating labels. Ashok Elluswamy explains how neural networks run offline to produce vector representations of roads, traffic lights, and other elements, all derived from the huge number of video clips collected from Tesla's fleet. This provides a deep understanding of the world and serves as a foundation model for both autonomous driving and manual-driving assistance, such as automatic emergency braking. He also emphasizes the importance of predicting the future motion of crossing objects, and notes Tesla's leading position in automatic emergency braking.

15:04

🌟 Pushing the boundaries of new technology

In the final segment, Ashok Elluswamy discusses building a more broadly applicable world model by predicting how video sequences evolve. He presents a neural network that predicts the future conditioned on the past and can understand the dynamics of complex scenes. Such a model applies not only to self-driving cars but extends to other domains, such as robotics. He stresses that to train these advanced models, Tesla is building toward world-leading compute, highlighting the introduction of the Dojo training hardware, and outlines how this technology enables knowledge sharing across car and robot platforms, pointing to an innovative future.

Keywords

💡Autonomous Driving

Autonomous driving means a vehicle can operate and navigate without a human driver. In the video, Tesla's Autopilot team is developing a foundation model for autonomous driving and robotics, making this technology one of the core topics of the talk.

💡Foundation Model

A foundation model is a general, scalable model that serves as the basis for developing other models. In the video, foundation models are described as the key to autonomous driving and robotics, because they can understand and handle many different driving scenarios and objects.

💡Machine Learning

Machine learning is an artificial-intelligence technique that lets computer systems learn and improve from data and algorithms. In the video, Tesla's self-driving technology is based primarily on modern machine learning, especially neural networks, which enables the car to handle complex driving tasks.

💡Neural Network

A neural network is a computational model loosely modeled on the structure of neurons in the brain, used to recognize patterns and process complex data. In the video, neural networks power the car's visual perception system, interpreting the surrounding environment.

💡Real-Time Processing

Real-time processing means a system responds to input immediately and produces output without perceptible delay. In the video, Tesla's system must complete lane prediction and vehicle control within tens of milliseconds, which demands highly efficient real-time processing.

💡Self-Driving Stack

The self-driving stack is the full collection of technologies and components that make up an autonomous driving system, spanning perception, decision-making, and control. In the video, Tesla's stack is built on cameras and machine learning, unlike traditional approaches that rely on radar and ultrasonic sensors.

💡Auto-Labeling

Auto-labeling is the process of automatically generating labels for datasets that are then used to train machine-learning models. In the video, Tesla builds an auto-labeling pipeline from the huge number of video clips collected by its fleet, supplying training data for its models.

💡Trajectory Calibration

Trajectory calibration means adjusting and aligning data collected from different sources or at different times so that it is consistent in space and time. In the video, trajectory calibration lets Tesla reconstruct precise 3D scenes from the viewpoints of many vehicles.

💡Motion Planning

Motion planning means designing the sequence of actions and the path that takes a robot or autonomous vehicle from a starting point to a destination. In the video, Tesla's system not only perceives the environment but also performs the full motion planning.

💡Compute Platform

A compute platform is the hardware and software environment that supplies the computing resources software needs to run. In the video, Tesla is developing custom training hardware called Dojo, aiming to become one of the world's leading compute platforms.

Highlights

A talk by Tesla's Autopilot team introducing their research on foundation models for autonomy and robotics.

Tesla has shipped the Full Self-Driving Beta software to everyone who purchased it in the United States and Canada.

Roughly 400,000 vehicles have driven up to 250 million miles on FSD Beta.

Tesla's self-driving stack is scalable and can be used anywhere in the US.

The system relies primarily on the car's eight cameras, which provide full 360-degree coverage.

Tesla's approach differs from traditional self-driving methods: it is based mainly on machine learning and neural networks.

Tesla uses occupancy networks as an important component of its stack.

Occupancy networks predict whether each voxel of 3D space is occupied, and with what probability.

The system also predicts occupancy flow and future motion in real time.

The architecture looks complicated but actually is not.

Tesla uses state-of-the-art generative modeling, similar to GPT, to predict lanes.

The system predicts the full kinematic state of moving objects in real time.

Tesla has built a sophisticated auto-labeling pipeline from video clips across its entire fleet.

Multi-trip reconstruction precisely aligns data from different vehicles to rebuild entire 3D scenes.

Tesla is learning a more general world model that can represent arbitrary things.

Tesla's neural network predicts the future from past video and can be action-conditioned.

The technology applies not only to cars but also to robots.

Tesla's Dojo training hardware is about to enter production, aiming to be a world-leading compute platform.

Major progress is expected over the next 12 to 18 months.

Transcripts

play00:00

Great, thank you so much for the introduction. Hi everyone, my name is Ashok Elluswamy and I work at Tesla on the Autopilot team. Hopefully you're able to hear my voice and see my screen and video; please let me know if that's not the case. Today I would like to present our work on what we think is going to be the foundation model for autonomy and robotics. This is not just my own work; I'm representing a large team of talented engineers. Let's get started.

play00:35

Our team has shipped the Full Self-Driving Beta software to everyone who has purchased it in the United States and Canada. That's roughly 400,000 vehicles, and today they have driven up to 250 million miles on FSD Beta. I think the cool thing about this is that it is a scalable self-driving stack: you can take the car anywhere in the US, turn it on, put in a destination, and the car will attempt to navigate to the destination, handling all of the turns, stopping at traffic lights, and interacting with other objects. All of this is driven primarily by the eight cameras on the car, which give full 360-degree coverage around it.

play01:24

The reason this works well is that our stack is a really modern, machine-learning-based stack, where a lot of the components of the self-driving stack are just folded into neural networks. I would say this is different from the more traditional approach to self-driving, which fuses together localization, maps, lots of radar, ultrasonics, etc. Instead, this is driven primarily by just cameras. If you want to test it yourself, you can obviously buy the car and experience it; otherwise, take my word for it or look at some videos. It works quite well, and we are in the process of making it even better.

play02:05

I've shared before about these occupancy networks, which are one of the more important pieces in our stack. I would consider this one of the foundation-model tasks, because it is very general: it doesn't have any specific ontology, or is at least robust to ontology errors. It really just predicts whether some voxel in 3D space is occupied or not, and the probability of that, so it can represent arbitrary scenes; there's no labeling or ontology design required, and it can apply anywhere. In addition to occupancy, we also predict the flow of voxels into the future, which captures arbitrary motion as well, and everything runs in real time. This is quite similar to NeRF in general, but unlike NeRF, or multi-view reconstruction, which is usually done offline for a single scene, we predict the occupancy from the eight cameras in real time as the video streams in. We predict, for all the space around the car, whether each voxel is occupied or not, as opposed to doing this as an offline post-processing step.
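
The talk describes occupancy networks only at a high level. As an editor's illustration (not Tesla's code), here is a minimal PyTorch sketch of the output side of such a model: fused multi-camera features decoded into a per-voxel occupancy probability plus an occupancy-flow vector. The module name, channel counts, and grid shape are all assumptions.

```python
import torch
import torch.nn as nn

class OccupancyHead(nn.Module):
    """Toy sketch: fused camera features -> per-voxel occupancy and flow.

    Assumes some multi-camera backbone has already produced a 3D feature
    grid of shape (B, C, X, Y, Z); that part is omitted here.
    """
    def __init__(self, channels: int = 64):
        super().__init__()
        self.occ = nn.Conv3d(channels, 1, kernel_size=1)   # occupancy logit per voxel
        self.flow = nn.Conv3d(channels, 3, kernel_size=1)  # 3D motion vector per voxel

    def forward(self, fused: torch.Tensor):
        occ_prob = torch.sigmoid(self.occ(fused))  # P(voxel is occupied)
        flow = self.flow(fused)                    # predicted voxel flow
        return occ_prob, flow

# Usage with a made-up 200 x 200 x 16 voxel grid around the car.
head = OccupancyHead(channels=64)
fused = torch.randn(1, 64, 200, 200, 16)
occ_prob, flow = head(fused)
print(occ_prob.shape, flow.shape)  # (1, 1, 200, 200, 16), (1, 3, 200, 200, 16)
```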

play03:18

So, the architecture looks very complicated, but it's actually not that complicated in the end. Videos from multiple cameras stream in, and you can choose whatever backbone you want: ResNets, whatever the latest architectures are, you can throw anything in there. Then everything comes together in a large transformer block that does a sort of spatial attention to build up features, and also does temporal attention, with some geometry thrown in, to form features that can then be upsampled into the actual predictions. It's quite straightforward, even though the diagram looks a bit complicated.
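
The spatial-then-temporal attention described above can be sketched as follows. This is an editor's illustration only, assuming per-camera backbone features have been flattened into tokens; dimensions, layer counts, and the omission of the geometry terms the talk mentions are all simplifications.

```python
import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Illustrative fusion block: self-attention across cameras/space,
    then self-attention across time. Not Tesla's actual layout."""
    def __init__(self, dim: int = 128, heads: int = 8):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, T, S, D) with S = n_cameras * tokens_per_camera
        B, T, S, D = feats.shape
        x = feats.reshape(B * T, S, D)
        x, _ = self.spatial(x, x, x)           # mix information across cameras
        x = x.reshape(B, T, S, D).permute(0, 2, 1, 3).reshape(B * S, T, D)
        x, _ = self.temporal(x, x, x)          # mix information across frames
        return x.reshape(B, S, T, D).permute(0, 2, 1, 3)  # back to (B, T, S, D)

block = FusionBlock()
feats = torch.randn(2, 4, 8 * 50, 128)  # 2 clips, 4 frames, 8 cams x 50 tokens
print(block(feats).shape)               # torch.Size([2, 4, 400, 128])
```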

play04:00

The same architecture and modeling can be used not just for occupancy but for other tasks that are needed for driving. Obviously, lanes and roads are very important for the driving task, but I'd say lanes are quite obnoxious to predict. First of all, lanes are higher-dimensional objects, definitely not 1D or 2D, and they have a graph structure. Objects such as vehicles are for the most part self-contained and local, whereas lanes can span the entire road: you can see multiple miles of lanes in your view, and they can fork and merge and cause all kinds of trouble in the modeling. They also have large uncertainty: you might not be able to see the lanes because they're occluded, or it's nighttime and only part of the lane is visible. And sometimes, even when everything is visible, even humans cannot agree on whether the thing you're looking at is two lanes or one lane, for instance. So there's a ton of uncertainty in what the lanes are. It's also not sufficient to just predict them as some kind of raster, which is very hard to use downstream; it's better to predict them in some kind of vector representation (polylines, splines, polynomials, etc.) for ease of use. And all of this needs to happen within tens of milliseconds, in real time. Like I said, it's a very difficult problem to predict lanes in real time in the real world.

play05:31

Nonetheless, we use state-of-the-art generative modeling techniques. In this case we use autoregressive transformers, quite similar to GPT in terms of how we model the lanes: you can tokenize the lanes and predict them one token at a time. Unlike language, which is mostly linear, we have to predict the full graph structure, hence we come back and predict the forking points, the merging points, etc. And everything is done end to end using neural networks, with little to no post-processing required afterwards.
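
To make the GPT analogy concrete, here is an editor's sketch (not Tesla's code) of decoding a lane graph one token at a time with a causal transformer. The vocabulary of quantized lane points plus special FORK/MERGE/END tokens is invented for illustration, and positional encodings and the cross-attention to camera features that a real model would need are omitted.

```python
import torch
import torch.nn as nn

# Hypothetical vocabulary: 1000 quantized lane-point positions plus
# special tokens marking graph structure.
FORK, MERGE, END = 1000, 1001, 1002

class LaneDecoder(nn.Module):
    def __init__(self, vocab: int = 1003, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # tokens: (B, T)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.shape[1])
        h = self.blocks(self.embed(tokens), mask=causal)  # attend only to the past
        return self.head(h)                               # next-token logits

@torch.no_grad()
def decode_lane_graph(model: LaneDecoder, start: int, max_len: int = 64):
    """Greedy GPT-style decoding: emit one token at a time until END."""
    tokens = torch.tensor([[start]])
    for _ in range(max_len):
        nxt = model(tokens)[:, -1].argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, nxt], dim=1)
        if nxt.item() == END:
            break
    return tokens

print(decode_lane_graph(LaneDecoder(), start=0).shape)
```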

play06:10

Another important task for driving is obviously moving objects: vehicles, trucks, pedestrians, what have you. It's not sufficient to just detect them; you need their full kinematic state, and you also need to predict their shape information, their futures, etc. All of the models I've described, even the lanes one and the objects one, are in some ways multimodal models, in the sense that they take in not just camera video streams but also other inputs, such as the ego vehicle's own kinematics: its velocity, acceleration, jerk, etc. all go in. We also provide navigation instructions to the lane models, to guide things like which lane to use. So everything is done within the network. That's why I say it's a modern machine learning stack: instead of doing this in post-processing, we just try to combine everything and do perception end to end, so to speak.
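
The multimodal conditioning is only described verbally in the talk; an editor's sketch of one plausible interface is below, where ego kinematics and a navigation command are projected into extra tokens that the same transformer can attend over alongside camera tokens. All sizes and names are placeholders.

```python
import torch
import torch.nn as nn

class MultimodalTokens(nn.Module):
    """Illustrative only: append ego-kinematics and navigation tokens to the
    camera tokens so one network sees all inputs jointly."""
    def __init__(self, dim: int = 128, n_nav_cmds: int = 16):
        super().__init__()
        self.kin_proj = nn.Linear(9, dim)               # e.g. vel/accel/jerk, 3 axes each
        self.nav_embed = nn.Embedding(n_nav_cmds, dim)  # e.g. a "turn left ahead" command

    def forward(self, cam_tokens, kinematics, nav_cmd):
        extra = torch.stack(
            [self.kin_proj(kinematics), self.nav_embed(nav_cmd)], dim=1)
        return torch.cat([cam_tokens, extra], dim=1)    # (B, S + 2, dim)

m = MultimodalTokens()
out = m(torch.randn(1, 400, 128),   # fused camera tokens
        torch.randn(1, 9),          # ego kinematics vector
        torch.tensor([3]))          # navigation command id
print(out.shape)                    # torch.Size([1, 402, 128])
```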

play07:06

Here you can see the predictions of these models. The lanes and the vehicles that you see here are all predicted directly by these networks, without a lot of post-processing; there's no tracking or anything like that in what you're seeing. Overall, I would say it's quite stable. The green splines coming out of the vehicles are their forecast futures. That's kind of a standard task at this point, I would say, but it all works quite nicely, and in real time, in the car.

play07:40

It doesn't have to stop with just perception. Once we have all of these percepts, lanes, occupancy, objects, and even a few more like traffic controls, you can do the entire motion planning also using just a network. I won't go into too many details on how we do that, but essentially it can just be thought of as one more task, instead of being a separate thing.

play08:05

So how is all of this possible? I think it's because we have built a sophisticated auto-labeling pipeline that gives us data from the entire fleet: millions of video clips from across the entire world can be tapped. On the left side, what you're seeing is an example of multi-trip reconstruction, where we choose some location, and multiple Tesla vehicles driving through that location upload their video clips and other data, such as vehicle kinematics, to us. We bring everything together and reconstruct the entire 3D scene. The polylines that you see, the cyan-colored one and a few other colored ones, are all different cars doing different trips through the world, and they are actually all very well aligned. Let me see if I can play it again: the pink line and the cyan line that you see there are different trips of different cars driving around, and everything is aligned very nicely. This multi-trip reconstruction has enabled us to get all the lanes, road lines, everything, directly from the fleet, in the millions, anywhere on Earth, essentially.

play09:16

Once you have this base structure of trajectories and calibration from all these cameras, you can really do a lot of cool things to reconstruct the entire scene. I'm not sure the video plays smoothly here, but on my screen it looks very smooth. You can see the ground surface is reconstructed quite nicely; there are no artifacts such as double vision or blurring, things are crisp, and the geometry is correct. This is a hybrid approach between NeRF and general 3D reconstruction. Sometimes in NeRF, even though the rendered visuals might look very nice, the underlying geometry might be very fuzzy and cloudy, so we use a hybrid approach that works quite nicely. You can see here that all the barriers, the vehicles, even trucks, etc. are reconstructed pretty accurately.
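
Tesla doesn't detail how trips are aligned into a common frame. As a stand-in illustration of the kind of geometry involved, here is a least-squares rigid alignment (the Kabsch algorithm) of corresponding trajectory points from two trips; the real pipeline is surely a much larger joint optimization.

```python
import numpy as np

def rigid_align(src: np.ndarray, dst: np.ndarray):
    """Kabsch: find rotation R and translation t so R @ src + t ~ dst.
    src, dst are (N, 3) arrays of corresponding trajectory samples."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))      # avoid reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = mu_d - R @ mu_s
    return R, t

# Toy usage: trip B is trip A rotated and shifted; align B onto A's frame.
rng = np.random.default_rng(0)
trip_a = np.cumsum(rng.normal(scale=0.1, size=(100, 3)), axis=0)
th = 0.3
R_true = np.array([[np.cos(th), -np.sin(th), 0],
                   [np.sin(th),  np.cos(th), 0],
                   [0,           0,          1]])
trip_b = trip_a @ R_true.T + np.array([5.0, -2.0, 0.0])
R, t = rigid_align(trip_b, trip_a)
aligned = trip_b @ R.T + t
print(np.abs(aligned - trip_a).max())  # ~1e-15: trajectories now overlap
```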

play10:20

Once we have these reconstructions, we then run even more neural networks, just offline, to produce the labels that we want. Like I mentioned earlier, for lanes we need some kind of vector representation to make them very easy to use. So instead of using the raster directly, we have further neural networks that run on top of it and produce the vector representation, which can then be used as labels for the online stack.

play10:48

Similar to the lanes, once you have the lanes and the roads reconstructed, you can also auto-label traffic lights. Here you're seeing traffic lights auto-labeled by our system without any human input, and they are all multi-view consistent. Let me try to play it again: we can predict their shape, color, and relevancy. You can see these white traffic lights on the side also reproject correctly into all the camera views, and that's because we have this really good auto-labeling system that calibrates everything jointly and is pixel-perfect in 3D space.

play11:23

So all of these predictions together give us a really solid understanding of the world from cameras, and I would already call this a sort of foundation model that can be used in a lot of different places. These predictions really help FSD drive in any place: you don't have to be geographically restricted, and you can even sometimes construct a new road, turn it on, and it would work quite nicely there. In addition, they also help manual driving, because humans are not perfect drivers and need some help every now and then. In this case, on the left side, the ego driver for some reason blew past the stop sign and was almost about to crash into this red car, and our system detected this and braked automatically. Similarly, on the right side, we're driving straight and someone just comes in and cuts us off. It's quite dangerous, but the system applied the brakes quite early.

play12:25

The reason this is different from the AEB systems that have been around since the 1980s, what is new about this, is that Tesla is, to the best of my knowledge, the first company to ship emergency braking for crossing vehicles. Crossing objects are harder than, say, vehicles in your own lane, because for crossing objects you need to know whether they're going to stop in time or not: what is the stop line, do they have traffic lights, and if they were to turn, which lanes would they turn into, etc. There's a ton of work that needs to happen to understand where the other objects are likely to go, whether they have room to stop, and so on. It's not as simple as just detecting a vehicle and having its velocity. So, like I said, I believe Tesla is the first company to ship crossing AEB, and it's already been in customers' hands for the last several months.
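
As a drastically reduced illustration of why crossing objects require more reasoning than in-lane objects, the toy check below asks whether a crossing vehicle can still stop before the conflict point and whether arrival times overlap. This is an editor's example, not Tesla's logic; real systems reason over far richer state.

```python
from dataclasses import dataclass

@dataclass
class Agent:
    dist: float       # distance to the conflict point along its path (m)
    speed: float      # current speed (m/s)
    max_brake: float  # maximum deceleration (m/s^2)

def can_stop_before(a: Agent) -> bool:
    """Stopping distance v^2 / (2*a) must fit before the conflict point."""
    return a.speed ** 2 / (2 * a.max_brake) < a.dist

def should_brake(ego: Agent, crossing: Agent, margin_s: float = 0.5) -> bool:
    """Toy crossing-AEB trigger: brake if the crossing agent cannot stop
    and both agents reach the conflict point at about the same time."""
    if can_stop_before(crossing):
        return False                            # they can still yield
    t_ego = ego.dist / max(ego.speed, 1e-6)
    t_cross = crossing.dist / max(crossing.speed, 1e-6)
    return abs(t_ego - t_cross) < margin_s      # overlapping arrival times

ego = Agent(dist=14.0, speed=15.0, max_brake=8.0)
runner = Agent(dist=15.0, speed=16.0, max_brake=6.0)  # blowing the stop sign
print(should_brake(ego, runner))  # True: ~0.93 s vs ~0.94 s to the conflict point
```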

play13:17

But is the foundation model really just a bunch of these tasks concatenated together, or can there be more to it? We think there can be more to it. These tasks, like occupancy for example, while quite general, still leave some things hard to represent even in that space; I'll go into more detail shortly. That's why we are working on learning a more general world model that can really just represent arbitrary things. In this case, what we do is we have a neural network that can be conditioned on the past, or on other things, to predict the future. Obviously everyone has wanted to work on this forever, and I think with the recent rise of generative models, transformers, diffusion, etc., we finally have a shot at it. What you're seeing here are purely generated video sequences: given the past videos, the network predicts some sample from the future, hopefully the most likely sample. And you can see that it is predicted not just for one camera; it predicts all eight cameras around the car jointly. See how the car colors are consistent across the cameras, and the motion of objects is consistent in 3D, even though we have not explicitly asked it to do anything in 3D, or even baked in any 3D priors. This is just the network understanding depth and motion on its own, without us informing it of any of that.

play14:48

And since this is all just predicting future RGB values, the ontology is quite general: you can throw in any video clip, from driving, from YouTube, or from your own phone; anything can be used to train this general dynamics model of the world.

play15:07

Additionally, it can also be action-conditioned. To show a few examples: here on the left side, the car is driving in its lane and we're asking it to just keep to this lane and keep driving; like I said earlier, the model is able to predict all of the geometry flowing by very nicely, and it understands 3D. On the right, we're asking it to change lanes to the right side. Maybe we'll go back and play it again: on the left we ask it to go straight, and the model goes straight; on the right side we ask it to make a lane change, and it makes a lane change. The past context is the same for both of these outputs. So given the same past, when we ask for different futures, the model is able to produce, or imagine, different futures.

play16:00

This is super powerful, because now you essentially have a neural-network simulator that can simulate different futures based on different actions. And unlike a traditional game simulator, this is way more powerful, because it can represent things that are very hard to describe in an explicit system. I'll show a few more examples, but the intentions and the natural behavior of other objects, such as vehicles, are very hard to represent explicitly, whereas in this world model they are very easy to represent.

play16:35

It doesn't have to stop with just RGB. You can obviously do this kind of future-prediction task not just in RGB but also in semantic segmentation, or you can extend it to 3D spaces, where you can imagine entire future 3D scenes based on just the past and your action prompting, or even without prompting you can predict different futures. Personally, I'm amazed by how well this works, and it's a very exciting future that we are working on here.

uh that we are working on

play17:08

here yeah here are some examples where I

play17:10

think you know something like this is

play17:11

going to be needed to represent um like

play17:15

what's happening in the scene like

play17:16

there's a lot of smoke coming in one of

play17:17

the uh pictures like there's paper

play17:19

flying everywhere you know that's going

play17:21

to be tough for you know even um

play17:24

occupancy where okay you have paper

play17:26

flying everywhere there's occupancy

play17:27

there's occupancy flow but then how do

play17:29

you know it's paper what do you know the

play17:30

material properties of it um there's

play17:34

like smoke obviously you can drive

play17:35

through it but you know it is occupying

play17:36

space and light does not transmit

play17:38

through um you know there's a lot of

play17:40

nuances to driving and we have to really

play17:42

solve all of these problems to build a

play17:44

general driving stack that can drive

play17:46

anywhere in the world and be humanlike

play17:48

might be fast efficient uh at all speeds

play17:52

yet very safe uh and I think you know we

play17:54

are working on the right recipe for um

play17:56

building this

play18:00

Obviously, training all these models takes a ton of compute, and that's why Tesla is aiming to become a world leader in compute. Dojo is our training hardware, custom-built at Tesla, and it is starting production next month, essentially. With that, we think we are on the way to becoming one of the top compute platforms in the entire world. We also think that in order to train these foundation models for vision, we need a lot of compute: not just compute for training this one model, but compute to try a lot of different experiments to see which models actually work well. That's why it's super exciting for us to be in this spot where compute is going to be abundant, and progress is just going to be bounded by the ideas of the engineers.

play18:49

And at Tesla, this is not just being built for the car but also for the robot. We already have the occupancy networks, for example, and a few other networks, all shared between the car and the robot, and it actually works quite well and generalizes across these platforms. We want to extend that to all the tasks we have. Even lanes and vehicles, for example, should not be specific to cars: if the robot happens to walk up to the road and look around, it should understand roads and vehicles and how vehicles move, etc. All of this will just be built for both platforms, and for any other future robotics platform that would also need it.

play19:29

That's basically it. To summarize again: we're building these super cool foundation models for vision that really understand everything, and they should generalize across cars and robots. They're going to be trained on tons of diverse data from the fleet, on tons of compute. I'm really excited for the next 12 to 18 months and what's going to happen here. Thank you.


Related Tags
Autonomous driving, machine learning, Tesla, visual perception, neural networks, real-time prediction, vehicle control, robotics, compute platforms, Dojo hardware