Sora Creator “Video generation will lead to AGI by simulating everything” | AGI House Video

AI Unleashed - The Coming Artificial Intelligence Revolution and Race to AGI
7 Apr 2024 · 32:36

Summary

TLDR: This talk introduces Sora, a new video generation system that uses deep learning and Transformer models to generate high-resolution, minute-long videos. Sora is presented not only as a leap forward for content creation but also as an important step on the path to AGI. The speakers show a diverse set of Sora-generated samples, including complex 3D scenes, object permanence, and special effects, highlighting its potential for creative expression and for simulating the real world. They also discuss technical details such as training data, model scalability, and future possibilities, along with potential applications in art and the entertainment industry.

Takeaways

  • 🌟 Progress in video generation, especially AI-generated video, is one of the key paths toward AGI (artificial general intelligence).
  • 📹 The AI video generation model Sora can create HD, minute-long videos, a major leap for the field.
  • 🤖 By learning from video, Sora comes to understand the complexity of the physical world, including object permanence and 3D space.
  • 🎨 Sora has broad applications, including content creation, special effects, and animation, giving creators a new tool.
  • 🔄 Sora uses a Transformer model trained on a large variety of videos and images, learning to generate diverse content.
  • 🧠 As compute and data increase, Sora's performance and the quality of its generated videos keep improving.
  • 🎥 Sora handles complex scenes and animal behavior well, showing a deep understanding of the physical world.
  • 🏙️ Sora can generate high-quality aerial videos with consistent 3D geometry.
  • 🔄 Sora can interpolate between different subjects in videos, creating seamless visual transitions.
  • 📈 The Sora project aims to raise the quality of video generation while keeping the method simple and scalable.
  • 🚧 Despite significant progress, Sora still struggles with certain physical interactions and basic physics.

Q & A

  • How does AGI House honor people like those in the audience?

    -By inviting people like those in the audience to take part and to create and experience new technology together, AGI House shows its respect and appreciation for them.

  • What is the Sora project mentioned in the video?

    -Sora is a video generation project built by the team together with many outstanding collaborators; using advanced techniques, it can create video content with complex detail and an understanding of the physical world.

  • What has the Sora project achieved in video generation?

    -Sora generates high-definition videos up to a minute long, maintains object permanence and consistent 3D geometry, keeps the same character across multiple generated shots, and supports a range of styles and aspect ratios.

Outlines

00:00

🎬 Introduction at AGI House

This section opens the talk at AGI House, emphasizing the shared values and goals with the audience. It introduces Bill and Tim, who, together with an outstanding team, built technology aimed at changing content creation. A special video is shown to demonstrate progress in video generation, such as clarity, duration, and complexity. Some technical challenges are discussed, including object permanence and consistency, and how training on video teaches a model about the physical world.

05:01

🌟 The potential and applications of video generation

This part discusses the potential of video generation, particularly its short- and long-term impact on content creation and on AGI. It covers how video generation can enable entirely new forms of media and entertainment and let people without professional tools realize their creative ideas. Samples made by artists using the technology are shown, underscoring its versatility and creativity. It also touches on technical details behind video generation, such as Transformer models and how scaling compute improves model performance.

10:02

📈 Technical progress and model scaling

This section focuses on technical progress in the video model, including how it handles videos of different resolutions and aspect ratios, and how video editing and style transfer can be done zero-shot. It discusses how the model can interpolate and transform videos to create new visual effects, and how scaling the video generation system improves its performance and creativity. It also mentions the model's potential to simulate real-world physics and how learning the rules of different worlds can extend its capabilities.

15:04

🚀 How video models contribute to AGI

This part discusses how video models can help achieve artificial general intelligence (AGI). It stresses the importance of video models understanding human behavior by modeling human interaction and physical contact. It notes that scaling the model improves its grasp of complex scenes, and that learning animal behavior and 3D consistency strengthens it further. It also discusses the model's potential to simulate other worlds (such as Minecraft) and how learning a variety of rules and objects improves generalization.

20:05

🛠️ Technical challenges and outlook

This section discusses the technical challenges the video model still faces, such as difficulty with certain physical interactions, and how working with artists and safety experts helps improve it. It mentions artists' demand for controllability and the possibility of fine-tuning for a specific character or IP. It also covers how generation works: videos are created by denoising, and shorter videos can be generated first and then extended for efficiency. Finally, it offers an outlook on AGI and how data and technical limitations can be overcome with creative problem-solving.

Keywords

💡AI

Artificial intelligence (AI) refers to intelligent behavior exhibited by engineered systems. In the video, AI is the core of both video generation and understanding the physical world. For example, AI can generate video content with complex physical interactions, which is presented as a key step toward artificial general intelligence (AGI).

💡Video generation

Video generation is the process of automatically creating video content with computer algorithms. In the video, this is done by the AI model Sora, which can generate highly complex, detailed video clips from a given prompt.

💡Object permanence

Object permanence is the ability to keep objects present and consistent across consecutive frames of a generated video. This is a challenge for AI because it requires understanding that objects persist over time and remain stable in space.

💡3D geometry

3D geometry is the mathematical study of shape, size, and spatial relationships in three dimensions. In the video, Sora learns 3D geometry to understand the three-dimensional structure of scenes and objects.

💡Content creation

Content creation is the process of producing new text, video, audio, or other media. In the video, Sora is presented as transforming content creation, enabling creators to generate videos and media experiences that were not possible before.

💡Transformer model

A Transformer is a deep learning architecture that has been highly successful in natural language processing and is applied to many sequence-to-sequence tasks. In the video, Sora uses a Transformer-style architecture to process and generate video content.

💡Diffusion model

A diffusion model is a generative model that produces samples by gradually removing noise from noisy data. In the video, Sora uses diffusion to generate video: it starts from pure random noise and, step by step, denoises it into a clear video.
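To make the denoising idea concrete, here is a minimal, generic sketch of reverse diffusion in Python (PyTorch): start from pure noise and repeatedly ask a noise-prediction model to remove a little of it. The `model` call, the linear noise schedule, and the tensor shape are placeholder assumptions for illustration; Sora's actual sampler and interfaces are not public.

```python
import torch

def sample_video(model, steps=50, shape=(1, 16, 3, 64, 64), device="cpu"):
    """Generic diffusion sampling sketch: start from pure noise and iteratively denoise.
    `model(x_t, t)` is assumed to predict the noise at step t (a placeholder interface)."""
    betas = torch.linspace(1e-4, 0.02, steps, device=device)   # simple linear noise schedule
    alphas = 1.0 - betas
    alpha_bar = torch.cumprod(alphas, dim=0)                   # cumulative signal fraction

    x = torch.randn(shape, device=device)                      # (batch, frames, C, H, W) of pure noise
    for t in reversed(range(steps)):
        eps = model(x, torch.tensor([t], device=device))       # predicted noise component
        # Standard DDPM update: estimate the mean of the slightly-less-noisy previous step.
        coef = betas[t] / torch.sqrt(1.0 - alpha_bar[t])
        x = (x - coef * eps) / torch.sqrt(alphas[t])
        if t > 0:                                               # add fresh noise except at the last step
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x                                                    # the denoised video tensor
```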

💡Zero-shot learning

Zero-shot learning means a model can recognize or generate examples of a category without having been trained on samples of that specific category. In the video, Sora demonstrates this ability: it can generate videos from a text prompt even without having been trained on that specific kind of example.

💡Interactive video

Interactive video is video that lets the user interact with the content in some way. In the video, although Sora cannot yet let users interact directly with generated video, it points to the potential for AI to enable such interaction in the future.

💡Dataset

A dataset is a collection of data samples used to train a machine learning model. In the video, Sora's training involved large datasets of videos and images, which are essential for the model to learn to generate high-quality video.

Highlights

AGI House honors people like you; that's why we invited you here.

Together with an outstanding team of collaborators, we built Sora at OpenAI, and today we're excited to tell you a bit about it.

We'll discuss what it does at a high level, its impact on content creation, the technology behind it, and why it's an important step on the path to AGI.

The videos we generate are HD and a minute long, which was a major goal of ours when we built it.

The complexity in the video, such as reflections and shadows, along with object permanence and consistency, is a very hard problem in video generation.

Sora doesn't just generate content; it has also learned a great deal of intelligence about the physical world from its training videos.

Video generation gives us a revolutionary opportunity for content creation, and we're very excited about it.

Sora can generate videos in different styles, such as a papercraft world, which is really cool.

Sora understands full 3D space, so as the camera moves through 3D it genuinely understands the geometry and physical complexity.

One Sora sample is a movie trailer about the adventures of a 30-year-old spaceman; the astronaut persists across multiple shots, all generated by Sora.

Sora can bring a lot of innovation to special effects; one of our favorite samples is an alien blending naturally into New York City.

This technology will have a big impact on effects that are usually very expensive in Hollywood's traditional CGI pipelines.

Sora can generate animated content; this is one of our favorite samples, and it has a bit of charm.

Sora can generate scenes that would be hard to shoot with traditional Hollywood infrastructure, such as the Blom Zoo shop in New York City.

We work with artists to explore what's possible with Sora; this is part of our research, and we think it's the best way to understand the value of the technology.

Sora is learning to simulate real-world physics, which is a key part of the project, but it can also simulate other kinds of worlds.

We think of Sora as the GPT-1 of video; we believe the technology will get much better soon, and people will build amazing things on top of it.

Transcripts

00:01

Here's one thing about AGI House: we honor the people like you guys. That's why we have you here, that's why you're all here. So without further ado...

[Applause]

00:17

Awesome, what a big fun crowd. I'm Tim, this is Bill, and we made Sora at OpenAI together with a team of amazing collaborators. We're excited to tell you a bit about it today. We'll talk, at a high level, about what it does, some of the opportunities it has to impact content creation, some of the technology behind it, as well as why this is an important step on the path to AGI.

00:43

So without further ado, here is a Sora-generated video. This one is really special to us because it's HD and a minute long, and that was a big goal of ours. When we were trying to figure out what would really make a leap for video generation, you want 1080p videos that are a minute long, and this video does that. You can see it has a lot of complexity too, like in the reflections and the shadows. One really interesting point is that blue sign she's about to walk in front of: after she passes, the sign still exists. That's a really hard problem for video generation, getting this type of object permanence and consistency over a long duration.

01:24

It can do a number of different styles too. Here we go: this is a papercraft world that it can imagine, so that's really cool. And it can also learn about a full 3D space. Here the camera moves through 3D as people are moving, but it really understands the geometry and the physical complexities of the scene. So Sora learned a lot: in addition to just being able to generate content, it has actually learned a lot of intelligence about the physical world just from training on videos.

02:02

Now we'll talk a bit about some of the opportunities for video generation to revolutionize creation. As alluded to, we're really excited about Sora not only because we view it as being on the critical path towards AGI, but also in the short term for what it's going to do for content. So this is one sample we like a lot; the prompt in the bottom left here is a movie trailer featuring the adventures of the 30-year-old spaceman. The hardest part of doing video, by the way, is always just getting PowerPoint to work with it... there we go.

[Music]

02:43

All right. What's cool about this sample in particular is that this astronaut persists across multiple shots, which are all generated. We didn't stitch this together, we didn't have to do a bunch of outtakes and then create a composite shot at the end. Sora decides where it wants to change the camera, but it does know that it's going to put the same astronaut in a bunch of different environments. Likewise, we think there are a lot of cool implications for special effects. This is one of our favorite samples too: an alien blending in naturally with New York City, paranoia thriller style, 35mm film. Already you can see that the model is able to create these very fantastical effects which would normally be very expensive in traditional CGI pipelines for Hollywood, so there are a lot of implications for what this technology is going to bring short-term. Of course we can do other kinds of effects too; this is more of a sci-fi scene: a scuba diver discovering a hidden futuristic shipwreck with cybernetic marine life and advanced alien technology.

[Music]

03:37

As someone who's seen so much incredible content from people on the internet who don't necessarily have access to tools like Sora to bring their visions to life (they come up with cool ideas and post them on Reddit or something), it's really exciting to think about what people are going to be able to do with this technology. Of course it can do more than just a photorealistic style; you can also do animated content. This one's really adorable; my favorite part of this one is the spell otter, it's got a little bit of charm. And I think another example of just how cool this technology is, is when we start to think about scenes which would be very difficult to shoot with traditional Hollywood infrastructure. The prompt here is the Blom Zoo shop in New York City: it's both a jewelry store and a zoo, with saber-tooth tigers with diamond and gold adornments, turtles with glistening emerald shells, etc. What I love about this shot is that it's photorealistic, but this is something that would be incredibly hard to accomplish with the traditional tools they have in Hollywood today. This kind of shot would of course require CGI, and it would be very difficult to get real-world animals into these kinds of scenes, but with Sora it's pretty trivial; it can just do it.

04:49

So I'll hand it over to Tim to chat a bit about how we're working with artists today with Sora, to see what they're able to do. Yeah, so we just came out with this pretty recently: we've given access to a small pool of artists. And maybe to take a step back, this isn't yet a product or something that is available to a lot of people, it's not in ChatGPT or anything, but this is research that we've done, and we think that the best way to figure out how this technology will be valuable, and also how to make it safe, is to engage with people external to the organization. So that's why we came out with this announcement, and when we did, we started working with small teams of red teamers who helped with the safety work, as well as artists and people who will use this technology. shy kids is one of the artists that we work with, and I really like this quote from them: "As great as Sora is at generating things that appear real, what excites us is the ability to make things that are totally surreal." I think that's really cool, because when you immediately think about generating videos, we have all these existing uses of videos that we know of in our lives, and we quickly think about turning those into, say, stock videos or existing films. But what's really exciting to me is what totally new things people are going to do: what completely new forms of media and entertainment, and just new experiences for people that we've never seen before, are going to be enabled by Sora and by future versions of video and media generation technology.

06:18

And now I want to show this fun video that shy kids made using Sora when we gave access to them. Oh, okay, it has audio; unfortunately I guess we don't have that hooked up. It's this cute plot about this guy with the balloon head. You should really go and check it out: we came out with this blog post, "Sora: First Impressions," and we have videos from a number of artists that we've given access to. There's this really cute monologue of this guy talking about life from the different perspective of "me as a guy with a balloon head," and this is just awesome and so creative. The other artists we've given access to have done really creative and totally different things from this too; the way each artist uses this is just so different from every other artist, which is really exciting, because that says a bit about the breadth of ways that you can use this technology. But this is just really fun, and there are so many people with such brilliant ideas, as Bill was talking about, for whom it might be really hard to do things like this, to make their film, or their thing that's not a film, that's totally new and different. Hopefully this technology will really democratize content creation. In the long run it enables so many more people with creative ideas to bring those to life and show them.

07:45

Now I want to talk a bit about some of the technology behind Sora. I'll talk about it from the perspective of language models. What has made them work so well is the ability to scale, and the bitter lesson: methods that improve with scale are, in the long run, the methods that will win out as you increase compute, because over time we have more and more compute, and if methods utilize it well then they will get better and better. Language models are able to do that in part because they take all different forms of text (math and code and prose and whatever is out there) and turn it all into this universal language of tokens, and then you train these big Transformer models on all these different types of tokens, this kind of universal model of text data. By training on this vast array of different types of text, you learn these really generalist models of language. You can do all these things, right: you can use ChatGPT, or whatever your favorite language model is, for all different kinds of tasks, and it has such a breadth of knowledge that it has learned from the combination of this variety of data.

08:58

And we want to do the same thing for visual data; that's exactly what we did with Sora. We take vertical videos and images and square images, low resolution, high resolution, wide aspect ratio, and we turn them into patches. A patch is a little cube in spacetime: you can imagine a stack of frames (a video is like a stack of images that are all the frames), so we have this volume of pixels, and then we take these little cubes from inside it. You can do that on any volume of pixels, whether that's a high-resolution image or a low-resolution image, regardless of the aspect ratio, long videos, short videos. You turn all of them into these spacetime patches, and those are our equivalent of tokens. Then we train Transformers on these spacetime patches, and Transformers are really scalable, and that allows us to think of this problem in the same way that people think about language problems: how do we get really good at scaling them, and make methods such that as we increase the compute and the data, they just get better and better.
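The spacetime-patch idea above maps naturally onto a few lines of tensor code. Below is a hypothetical sketch (not Sora's actual architecture) of cutting a pixel volume into flattened patch tokens and running a standard Transformer encoder over them; the patch sizes, embedding width, and use of `nn.TransformerEncoder` are illustrative assumptions.

```python
import torch
import torch.nn as nn

def video_to_patches(video, pt=2, ph=16, pw=16):
    """Cut a video tensor (C, T, H, W) into flattened spacetime patches.
    Each patch is a pt x ph x pw cube of pixels: the video analogue of a token."""
    C, T, H, W = video.shape
    patches = video.unfold(1, pt, pt).unfold(2, ph, ph).unfold(3, pw, pw)
    # shape is now (C, T//pt, H//ph, W//pw, pt, ph, pw); group patch dims and flatten
    patches = patches.permute(1, 2, 3, 0, 4, 5, 6).reshape(-1, C * pt * ph * pw)
    return patches                                        # (num_patches, patch_dim)

# Hypothetical Transformer backbone over patch tokens.
patch_dim, d_model = 3 * 2 * 16 * 16, 256
embed = nn.Linear(patch_dim, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=d_model, nhead=8, batch_first=True),
    num_layers=4,
)

video = torch.randn(3, 16, 128, 128)                      # any duration/resolution works
tokens = embed(video_to_patches(video)).unsqueeze(0)      # (1, num_patches, d_model)
out = encoder(tokens)                                     # processed sequence, same shape
print(out.shape)
```

Because every resolution, duration, and aspect ratio reduces to the same kind of token sequence, one backbone can train on all of it, which is the scaling point being made above.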

10:07

Training on multiple aspect ratios also allows us to generate with multiple aspect ratios. There we go: here's the same prompt, and you can generate vertical and square and horizontal. That's also nice because, in addition to letting you use more data, which is really valuable (you want to use all the data in its native format as it exists), it gives you more diverse ways to use the model. I actually think vertical videos are really nice; we look at content all the time on our phones, right? So it's nice to be able to generate vertical and horizontal and a variety of things.

10:44

We can also use some zero-shot video-to-video capabilities. This uses a method that's commonly used with diffusion, and we can apply it here. Our model uses diffusion, which means that it denoises the video starting from noise: to create the video, it iteratively removes noise. So we use this method called SDEdit and apply it, and this allows us to change an input video. The one on the left is all generated, but it could be a real video. Then we say: rewrite the video in pixel-art style, put the video in space with a rainbow road, or change the video to a medieval theme, and you can see that it edits the video but keeps the structure the same. In a second it will go through a tunnel, for example, and it interprets that tunnel in all these different ways. This medieval one is pretty amazing, because the model is also intelligent, so it's not just changing something shallow about it: it's medieval, we don't really have a car, so I'm going to make a horse carriage.
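SDEdit, the method Tim names here, edits by partially noising an existing clip and then denoising it under the new prompt, so the output keeps the input's overall structure. The sketch below illustrates that recipe under placeholder interfaces; the `model` signature, the schedule, and the `strength` knob are assumptions for illustration, not Sora's API.

```python
import torch

def sdedit_video(model, source, prompt_emb, steps=50, strength=0.6):
    """SDEdit-style editing sketch: noise a source video up to an intermediate step,
    then denoise it conditioned on the new prompt. Lower `strength` keeps more of
    the original structure."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)

    t_start = int(strength * (steps - 1))          # how far into the noise schedule to jump
    noise = torch.randn_like(source)
    # Forward-noise the source to step t_start, i.e. sample q(x_t | x_0).
    x = torch.sqrt(alpha_bar[t_start]) * source + torch.sqrt(1 - alpha_bar[t_start]) * noise

    for t in reversed(range(t_start + 1)):         # partial reverse diffusion only
        eps = model(x, torch.tensor([t]), prompt_emb)      # noise prediction, prompt-conditioned
        x = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(1.0 - betas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x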

11:45

Another fun capability the model has is to interpolate between videos. Here we have two different creatures, and this video in the middle starts with the left one and ends with the right one, and it's able to do it in this really seamless and amazing way.

[Music]

12:06

I think something that the past slide and this slide really point out is that there are so many unique and creative things you can potentially do with these models. Similar to when we first had language models: obviously people said, oh, you can use it for writing, right? Okay, yes, you can, but there are so many other things you can do with language models, and even now, every day, people come up with some creative new cool thing you can do with a language model. The same thing is going to happen for these visual models: there are so many creative, interesting ways in which we can use them, and I think we're only starting to scratch the surface of what we can do with them.

12:43

Here's one I really love: there's a video of a drone on the left, and a butterfly underwater on the right, and we're going to interpolate between the two. Some of the nuance it gets is remarkable; for example, it makes the Colosseum in the middle slowly start to decay as it goes. It's really spectacular, some of the nuance that it gets right. And here's one that's really cool too: how can you possibly go from this kind of Mediterranean landscape to this gingerbread house in a way that is consistent with physics in the 3D world? It comes up with this really unique solution: the house is actually occluded by the building, and behind it you start to see this red house.

[Music]
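The talk does not say how Sora blends two clips. As a generic illustration only: diffusion models are often used for this kind of interpolation by spherically interpolating the starting noise (and blending the conditioning) between the two endpoints, then sampling each intermediate point. The sketch below shows that common recipe with placeholder `sample_fn`, latents, and conditioning; it is not a description of Sora's internals.

```python
import torch

def slerp(a, b, w):
    """Spherical interpolation between two latent tensors, weight w in [0, 1]."""
    a_flat, b_flat = a.flatten(), b.flatten()
    omega = torch.acos(torch.clamp(
        torch.dot(a_flat, b_flat) / (a_flat.norm() * b_flat.norm()), -1.0, 1.0))
    if omega.abs() < 1e-6:                         # nearly parallel: fall back to lerp
        return (1 - w) * a + w * b
    return (torch.sin((1 - w) * omega) * a + torch.sin(w * omega) * b) / torch.sin(omega)

def interpolate_videos(sample_fn, latent_a, latent_b, cond_a, cond_b, n_points=9):
    """Generic latent-interpolation sketch: blend the starting noise and the conditioning,
    then run the (assumed) sampler at each intermediate point."""
    clips = []
    for i in range(n_points):
        w = i / (n_points - 1)
        z = slerp(latent_a, latent_b, w)           # blended starting noise
        cond = (1 - w) * cond_a + w * cond_b       # linear blend of prompt embeddings
        clips.append(sample_fn(z, cond))           # sample_fn is a placeholder sampler
    return clips
```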

13:56

So I encourage you, if you haven't: in addition to our main blog post, we also came out with a technical report, and the technical report has these examples plus some other cool examples that we don't have in these slides. Again, I think it's really scratching the surface of what we could do with these models, but check that out if you haven't. There are some other fun things you can do, like extending videos forward or backwards. I think we have one example here where this is an image we generated (this one with DALL·E 3), and then we're going to animate this image using Sora.

14:40

All right, now I'm going to pass it off to Bill to talk a bit about why this is important on the path to AGI.

14:51

All right. Of course, everyone's very bullish on the role that LLMs are going to play in getting to AGI, but we believe that video models are on the critical path to it. Concretely, we believe that when we look at very complex scenes that Sora generates, like that snowy scene in Tokyo that we saw at the very beginning, Sora is already beginning to show a detailed understanding of how humans interact with one another, how they have physical contact with one another. As we continue to scale this paradigm, we think eventually it's going to have to model how humans think: the only way you can generate truly realistic video, with truly realistic sequences of actions, is if you have an internal model of how all objects, humans, environments, etc. work. So we think this is how Sora is going to contribute to AGI.

15:33

Of course the name of the game here, as it is with LLMs, is scaling, and a lot of the work that we put into this paradigm in order to make this happen was, as Tim alluded to earlier, coming up with this Transformer-based framework that scales really efficiently. So we have here a comparison of different Sora models where the only difference is the amount of training compute that we put into the model. On the far left you can see Sora with the base amount of compute: it doesn't really even know how dogs look; it has a rough sense that the camera should move through scenes, but that's about it. If you 4x the amount of compute that we put in for that training run, then you can see it now knows what a dog looks like, it can put a hat on it, and it can put a human in the background. And if you really crank up the compute and you go to 32x base, then you begin to see these very detailed textures in the environment, you see this very lifelike movement with the feet and the dog's legs as it's navigating through the scene, and you can see that the woman's hands are beginning to interact with that knitted hat. So as we continue to scale up Sora, just as we find emergent capabilities in LLMs, we believe we're going to find emergent capabilities in video models as well. And even with the amount of compute that we put in today, at that 32x mark, we think there are already some pretty cool things happening, so I'm going to spend a bit of time talking about that.

16:42

The first one is complex scenes and animals. This is another sample of this beautiful snowy Tokyo city, and again you see the camera flying through the scene, it's maintaining this 3D geometry, this couple is holding hands, you can see people at the stalls: it's able to simultaneously model a very complex environment with a lot of agents in it. Today it can only do pretty basic things, like these fairly low-level interactions, but as we continue to scale the model, we think this is indicative of what we can expect in the future: more conversations between people which are actually substantive and meaningful, and more complex physical interactions. Another thing that's cool about video models compared to LLMs is we can do animals; we've got a great one here. There's a lot of intelligence beyond humans in this world, and we can learn from all that intelligence; we're not limited to one notion of it. You can do animals, we can do dogs, and we really like this one: this is a dog in Burano, Italy, and you can see it just wants to go to that other windowsill, it stumbles a little bit, but it recovers. So it's beginning to build a model not only of how, for example, humans move through scenes, but how any animal does.

17:49

Another property that we're really excited about is this notion of 3D consistency. There was, I think, a lot of debate at one point within the academic community about the extent to which we need inductive biases in generative models to really make them successful. With Sora, one thing that we wanted to do from the beginning was come up with a really simple and scalable framework that completely eschews any kind of hard-coded inductive biases from humans about physics. And what we found is that this works: as long as you scale up the model enough, it can figure out 3D geometry all by itself, without us having to bake 3D consistency into the model directly. So here's an aerial view of Santorini during the blue hour, showcasing the stunning architecture of white Cycladic buildings with blue domes; all these aerial shots, we found, tend to be pretty successful. You don't have to cherry-pick too much; it really does a great job at consistently coming up with good results here. And here's an aerial view of Yosemite, with both hikers and a gorgeous waterfall; they do some extreme hiking.

[Music]

19:06

Another property which has been really hard for video generation systems in the past, but which Sora has mostly figured out (it's not perfect), is object permanence. We can go back to our favorite little scene of the Dalmatian in Burano, and you can see that even as a number of people pass by it, the dog is still there. So Sora not only gets these very short-term interactions, like we saw earlier with the woman passing by the blue sign in Tokyo, but even when you have multiple levels of occlusion it can still recover. In order to have a really awesome video generation system, by definition what you need is for there to be non-trivial and really interesting things that happen over time. In the old days, when we were generating four-second videos, usually all we saw were very light animated GIFs; that was what most video generation systems were capable of, and Sora is definitely a step forward. Now we're beginning to see signs that you can actually do actions that permanently affect the world state. This is, I'd say, one of the weaker aspects of Sora today, it doesn't nail this 100% of the time, but we do see glimmers of success here, so I'll share a few. This is a watercolor painting, and you can see that as the artist is leaving brush strokes, they actually stick to the canvas, so the artist is able to make a meaningful change to the world and you don't just get a blurry nothing. And this older man with hair is devouring a cheeseburger; wait for it... there we go, he actually leaves a bite in it. These are very simple kinds of interactions, but this is really essential for video generation systems to be useful, not only for content creation but also in terms of AGI and being able to model long-range dependencies: if someone does something in the distant past and we want to generate a whole movie, we need to make sure the model can remember that, and that the state is affected over time. So this is a step toward that with Sora.

20:59

When we think about Sora as a world simulator, of course we're excited about modeling our real world's physics, and that's been a key component of this project, but at the same time there's no real reason to stop there. There are lots of other kinds of worlds: every single laptop we use, every operating system we use, has its own set of physics, its own set of entities and objects and rules, and Sora can learn from everything; it doesn't just have to be a real-world physics simulator. So we're really excited about the prospect of simulating literally everything, and as a first step towards that, we tried Minecraft. This is Sora, and the prompt is "Minecraft with the most gorgeous high-res 8k texture pack ever," and you can see already that Sora knows a lot about how Minecraft works: it's not only rendering this environment, it's also controlling the player with a reasonably intelligible policy (it's not too interesting, but it's doing something), and it can model all the objects in the scene as well. We have another sample with the same prompt, and it shows a different texture pack this time. We're really excited about this notion that one day we can just have a singular model which can encapsulate all the knowledge across all these worlds; one joke we like to say is that you can run ChatGPT in the video model, eventually.

22:11

And now let's chat a bit about failure cases, because of course Sora has a long way to go. Sora still has a really hard time today with certain kinds of physical interactions that we would think of as very simple, like this chair object in Sora's mind. Even simpler kinds of physics than this: if you drop a glass and it shatters, and you try to do a sample like that, Sora will get it wrong almost every time. So it really has a long way to go in understanding very basic things that we take for granted; we're by no means anywhere near the end of this yet. To wrap up, we have a bunch of samples here before we go to questions. I think overall we're really excited about where this paradigm is going.

23:07

We don't know what's next, how to extend it. So we really view this as being like the GPT-1 of video, and we think this technology is going to get a lot better very soon. There are some signs of life and some cool properties we're already seeing, like I just went over, but we're really excited about this. We think the things that people are going to build on top of models like this are going to be mind-blowing and really amazing, and we can't wait to see what the world does with it. So thanks a lot.

23:40

We have 10 minutes; who goes first?

Q: All right, so a question about understanding the agents, or having the agents interact with each other within the scene: is that piece of information already explicit, or is it just the pixels, and you'd have to run something on top to extract it?

A: Good question. So all of this is happening implicitly. When we see these Minecraft samples, we don't have any notion of where it's actually modeling the player and where it's explicitly representing actions within the environment. You're right that if you wanted to exactly describe what is happening, or somehow read it off, you would currently need some other system on top of Sora to extract that information. Currently it's all implicit, and for that matter everything is implicit: 3D is implicit, everything; there's no explicit anything.

Q: So basically the things you just described are all cool properties derived from the model after training? Cool, thanks.

24:39

Q: Could you talk a little bit about the potential for fine-tuning? If you have a very specific character or IP (I know for the wave one you used an input image), how do you think those plug in, or get built into the process?

A: Yeah, great question. This is something we're really interested in. In general, one piece of feedback we've gotten from talking with artists is that they just want the model to be as controllable as possible. To your point, if they have a character they really love and that they've designed, they would love to be able to use it across Sora generations. It's something that's actively on our mind. You could certainly do some kind of fine-tuning with the model if you had a specific dataset of your content that you wanted to adapt the model for. We don't currently; we're really at a stage where we're just finding out exactly what people want, so this kind of feedback is actually great for us. We don't have a clear roadmap for exactly that, but in theory it's probably possible.

25:35

Q: All right, in the back. So with language Transformers you're autoregressively predicting in this sequential manner, but with vision Transformers we do this scan-line order, or maybe a snake through the spatial domain. Do you see this as a fundamental constraint of vision Transformers? Does the order in which you predict tokens matter?

A: Yeah, good question. In this case we're actually using diffusion, so it's not an autoregressive Transformer in the same way that language models are; we're denoising the videos that we generate. We start from a video that's entirely noise, and we iteratively run our model to remove the noise, and when you do that enough times you remove all the noise and end up with a sample. So we actually don't have this scan-line order, for example, because you can do the denoising across many spacetime patches at the same time, and for the most part we actually just do it across the entire video at the same time. We also have a way, and we get into this a bit in the technical report, where if you want, you can first generate a shorter video and then extend it. So that's also an option; it can be used either way: either you generate the video all at once, or you generate a shorter video and extend it if you like.
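The mechanism for "generate a shorter video, then extend it" is not spelled out here. One common, generic way diffusion models extend a sequence is inpainting-style: re-noise the already-generated frames to the current step, denoise the whole sequence, but only update the appended frames. The sketch below illustrates that approach with placeholder interfaces; it is an assumption for illustration, not Sora's documented procedure.

```python
import torch

def extend_video(model, context, new_frames=16, steps=50):
    """Inpainting-style extension sketch: keep `context` (already-generated frames,
    shape (B, T, C, H, W)) fixed and denoise only the appended segment."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas, alpha_bar = 1.0 - betas, torch.cumprod(1.0 - betas, dim=0)

    B, T, C, H, W = context.shape
    x_new = torch.randn(B, new_frames, C, H, W)            # new segment starts as pure noise

    for t in reversed(range(steps)):
        # Re-noise the known context to the current step so both parts match statistically.
        ctx_t = torch.sqrt(alpha_bar[t]) * context + \
                torch.sqrt(1 - alpha_bar[t]) * torch.randn_like(context)
        x = torch.cat([ctx_t, x_new], dim=1)               # full sequence seen by the model
        eps = model(x, torch.tensor([t]))                   # placeholder noise-prediction call
        eps_new = eps[:, T:]                                # only update the appended frames
        x_new = (x_new - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps_new) / torch.sqrt(alphas[t])
        if t > 0:
            x_new = x_new + torch.sqrt(betas[t]) * torch.randn_like(x_new)
    return torch.cat([context, x_new], dim=1)               # original frames plus the extension
```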

26:56

Q: Yeah, so the internet innovation was mostly driven by porn. Do you feel a need to pay the adult industry back?

A: I feel no need. Also... yeah. All right.

27:21

Q: Do you generate at 30 frames a second, or do you do frame generation at something like four times slower?

A: We generate 30 FPS.

27:35

Q: Okay, have you tried things like colliding cars, or rotations, to see if the generation fits a physical model of the world it observes?

A: We've tried a few examples like that. I'd say rotations generally tend to be pretty reasonable; it's by no means perfect. I've seen a couple of samples from Sora of colliding cars, and I don't think it's quite got the three laws down yet.

28:08

Q: So what are the areas that you are trying to fix right now with Sora?

A: The engagement with people externally right now is mainly focused on artists, how they would use it and what feedback they have about using it, and on red teamers for safety. Those are really the two types of feedback that we're looking for right now. As Bill mentioned, a really valuable piece of feedback we're getting from artists is the type of control they want: for example, artists often want control of the camera and the path of the camera. And then on the safety side, we want to make sure that if we were to give wider access to this, it would be responsible and safe; there are lots of potential misuses, and disinformation is among the many concerns we focus on.

29:00

Q: Is it possible to make videos that a user could actually interact with, like through VR or something? Say a video is playing, halfway through I stop it, I change a few things around within the video; would the rest of the video incorporate those changes?

A: It's a great idea. Right now Sora is still pretty slow from a latency perspective. What we've generally said publicly is that it depends a lot on the exact parameters of the generation, the duration and resolution; if you're cranking out this thing, it's going to take at least a couple of minutes. So we're still, I'd say, a ways off from the kind of experience you're describing, but I think it'd be really cool. Thanks.

29:34

Q: What were your stated goals in building this first version, and what were some problems that you had along the way that you learned from?

A: I'd say the overarching goal was really always to get to 1080p, at least 30 seconds, from the early days of the project. We felt like video generation was stuck in the rut of this four-second GIF generation, and so that was really the key focus of the team throughout the project. Along the way, I think we discovered how painful it is to work with video data: it's a lot of pixels in these videos, and a lot of very detailed, boring engineering work needs to get done to really make these systems work. I think we knew going into it that it would involve a lot of elbow grease in that regard, but it certainly took some time. I don't know, any other findings along the way? Yeah, I mean, we tried really hard to keep the method really simple, and that is sometimes easier said than done, but I think that was a big focus: let's do the simplest thing we possibly can, really scale it, and do the scaling properly.

30:43

Q: Did you do a prompt and look at the output, and if it's not good enough you train again, do the same prompt, and then it's there; that's the first video, then you do more training, then a new prompt and a new video? Is that the process you used to evaluate the videos?

A: That's a good question; evaluation is challenging for videos. We use a combination of things. One is your actual loss, and low loss is correlated with models that are better, so that can help. Another is that you can evaluate the quality of individual frames using image metrics, so we do use standard image metrics to evaluate frame quality. And then we also spent quite a lot of time generating samples and looking at them ourselves, although in that case it's important that you do it across a lot of samples and not just individual prompts, because sometimes this process is noisy, so you might randomly get a good sample and think that you've made an improvement. So it would be like comparing lots of prompts and outputs.
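The answer lists three signals: held-out loss, image metrics on individual frames, and looking at many samples per prompt rather than one. A minimal sketch of such a checkpoint-comparison loop (with placeholder model, loss, and sampler interfaces) might look like this:

```python
import torch
from torchvision.utils import make_grid, save_image

@torch.no_grad()
def compare_checkpoints(models, val_batches, prompts, sample_fn, samples_per_prompt=4):
    """Evaluation sketch: (1) average held-out denoising loss per checkpoint,
    (2) dump a grid of several samples per prompt for side-by-side human review.
    `models` (dict of name -> model), `model.training_loss`, and `sample_fn` are
    placeholder interfaces, not a real API."""
    for name, model in models.items():
        losses = [model.training_loss(batch) for batch in val_batches]   # assumed loss hook
        print(f"{name}: mean held-out loss = {torch.stack(losses).mean().item():.4f}")

        for p_idx, prompt in enumerate(prompts):
            # Several samples per prompt: a single sample is too noisy to judge a change.
            frames = []
            for _ in range(samples_per_prompt):
                video = sample_fn(model, prompt)            # assumed (T, C, H, W) in [0, 1]
                frames.append(video[video.shape[0] // 2])   # middle frame for the review grid;
                # per-frame image metrics (e.g., FID) would be computed over frames like these
            grid = make_grid(torch.stack(frames), nrow=samples_per_prompt)
            save_image(grid, f"{name}_prompt{p_idx}.png")
```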

31:48

[Inaudible question.] We can't comment on that. One last question.

31:54

Q: Thanks for a great talk. My question is on the training data: how much training data do you estimate is required for us to get to AGI, and do you think we have enough data on the internet?

A: Yeah, that's a good question. I think we have enough data to get to AGI, and I also think people always come up with creative ways to improve things; when we hit limitations, we find creative ways to improve regardless. So I think that whatever data we have will be enough to get to AGI.

Wonderful, okay. That's it: to AGI! Thank you.

[Applause]


Related tags
AI video · content creation · technological innovation · artificial intelligence · video generation · 3D space · object permanence · creative expression · future media · compute scaling · world simulation