Sora Creator “Video generation will lead to AGI by simulating everything” | AGI House Video

AI Unleashed - The Coming Artificial Intelligence Revolution and Race to AGI
7 Apr 2024 · 32:36

Summary

TLDR: The transcript discusses the development of a video generation model named Sora, which aims to revolutionize content creation and contribute to the path towards Artificial General Intelligence (AGI). Sora demonstrates the ability to generate high-definition, minute-long videos with complex scenes and object permanence. The model is trained on a diverse range of visual data, scaling up to improve its capabilities. The potential applications of Sora are vast, from creating realistic special effects to animating content and even simulating different worlds. The developers are engaging with artists and red teamers to refine the technology and ensure its responsible use.

Takeaways

  • 🌟 Tim and Bill of OpenAI's Sora team, presenting at AGI House, introduced a new AI video generation model that can produce high-definition, minute-long videos with complex details like reflections and shadows.
  • 🚀 A significant goal was achieved with the creation of videos that are 1080p and a minute long, marking a leap forward in video generation technology.
  • 🎨 The technology allows for various styles, including a paper craft world and the ability to understand and generate content in a full 3D space, showcasing the geometry and physical complexities.
  • 🤖 The AI has learned intelligence about the physical world through training on videos, indicating its potential to revolutionize content creation and contribute to the path towards Artificial General Intelligence (AGI).
  • 🎬 The technology can generate content with consistent character appearances across multiple shots without the need for manual editing or compositing.
  • 🏙️ There are implications for special effects and Hollywood, as the AI can create fantastical effects that would typically be expensive in traditional CGI pipelines.
  • 💡 The technology's potential extends beyond photorealistic content, as it can also generate animated content and scenes that would be difficult to shoot with traditional infrastructure.
  • 🎨 Artists have been given access to the technology, and their feedback highlights the desire for more control over the generated content, such as camera control and character representation.
  • 🔧 The technology is still in the research phase and is not yet a product available to the public, with the team focusing on artist engagement and safety considerations.
  • 📈 As with language models, the key to improving the technology is scaling, with the expectation that increased compute and data will lead to better performance and more emergent capabilities.

Q & A

  • What was the primary goal for the Sora team in developing their video generation technology?

    -The primary goal was to create high-definition, minute-long videos, marking a significant leap in video generation capabilities.

  • What challenges did the team face in achieving object permanence and consistency in their generated videos?

    -Object permanence and consistency over long durations were challenging because the model needed to understand that an object, such as a blue sign, remains present even after a character walks in front of it and passes by.

  • How does the video generation technology impact content creation and special effects?

    -The technology has the potential to revolutionize content creation and special effects by enabling the generation of complex scenes and fantastical effects that would normally be expensive to produce using traditional CGI pipelines in Hollywood.

  • What is the significance of the video generation technology in the path towards Artificial General Intelligence (AGI)?

    -The video generation technology is seen as a critical step towards AGI because it not only generates content but also learns intelligence about the physical world, contributing to a more comprehensive understanding of environments and interactions.

  • How does the technology handle different video styles and 3D spaces?

    -The technology can adapt to various video styles, such as paper craft worlds, and understand 3D spaces by comprehending geometry and physical complexities, allowing for camera movements through 3D environments with people moving within them.

  • What are some of the unique capabilities of the video generation model 'Sora'?

    -Sora can generate videos with different aspect ratios, perform zero-shot video style transfers, interpolate between different videos, and even simulate different worlds, such as Minecraft, with a high level of detail and understanding of the environment's physics.

  • How does the Sora team engage with external artists and red teamers to refine the technology?

    -The team provides access to a small pool of artists and red teamers to gather feedback on how the technology can be made more valuable and safe, ensuring responsible use and addressing potential misuses.

  • What are some examples of creative applications of the video generation technology?

    -Examples include creating a movie trailer featuring a 30-year-old spaceman, an alien blending in New York City, a scuba diver discovering a futuristic shipwreck, and a variety of animated content with unique styles and themes.

  • How does the training process for the video generation model differ from language models?

    -While language models generate auto-regressively, the video generation model uses diffusion: it starts from noise and iteratively removes it to produce a video, and it can either generate a whole video at once or extend a shorter one.

  • What are the future prospects for the 'Sora' video generation technology?

    -The future prospects for Sora include further refinement of the model, increased control for artists over specific elements like camera paths, and the potential for simulating a wide range of worlds and environments beyond the physical world.

Outlines

00:00

🌟 Introduction at AGI House and Video Generation Achievements

This segment opens at AGI House, whose hosts say they honor people like the attendees present. It then highlights the achievements in video generation, emphasizing the creation of an HD, minute-long video, which was a major goal. The complexity of the video, including reflections, shadows, and object permanence, is discussed, as is the model's ability to learn about 3D space and physical complexities. The speakers, Tim and Bill of OpenAI, express excitement about the opportunities video generation presents for content creation and its role in advancing toward AGI.

05:01

🎨 Artistic Collaborations and the Potential of Video Generation

This paragraph covers the collaboration with artists and the work done to understand the value and safety of the video generation technology. It emphasizes the importance of engaging people outside the organization and the role of red teamers in the safety work. The artist collective shy kids is highlighted for their quote about the technology's ability to create surreal imagery, and a video they made demonstrates its potential for new and unique forms of media and entertainment. The blog post 'Sora First Impressions' and the creative works of various artists illustrate the technology's versatility.

10:02

🚀 Scaling Video Models for AI and Content Creation

The focus of this paragraph is the technology for scaling video models and its importance on the journey towards AGI. It covers the methodology of turning diverse visual data into spacetime patches that serve as the training units for Transformers, analogous to tokens in language models. Generating videos at multiple aspect ratios and zero-shot video-to-video editing are also covered. The paragraph highlights the creative potential of these models, comparing their future impact to the evolution of language models, and points to the technical report with additional examples of the technology's capabilities.
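To make the spacetime-patch idea concrete, here is a rough sketch in Python (the function name, patch sizes, and array layout are illustrative assumptions, not Sora's implementation): a video is treated as a volume of pixels and cut into little space-time cubes, each of which is flattened into one token.

```python
import numpy as np

def video_to_spacetime_patches(video, t=4, p=16):
    """Cut a video volume into space-time patches (illustrative sketch).

    video : array of shape (T, H, W, C) -- a stack of frames, i.e. a volume of pixels.
    t, p  : temporal and spatial patch sizes (arbitrary example values).
    Returns an array of shape (num_patches, t * p * p * C), one flattened "token" per cube.
    """
    T, H, W, C = video.shape
    assert T % t == 0 and H % p == 0 and W % p == 0, "pad or crop the video first"
    cubes = video.reshape(T // t, t, H // p, p, W // p, p, C)
    cubes = cubes.transpose(0, 2, 4, 1, 3, 5, 6)   # group the three grid axes together
    return cubes.reshape(-1, t * p * p * C)        # flatten each cube into one token

# Any resolution, aspect ratio, or duration works as long as it divides evenly:
clip = np.random.rand(16, 256, 256, 3).astype(np.float32)   # 16 frames of 256x256 RGB
print(video_to_spacetime_patches(clip).shape)                # (1024, 3072)
```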

15:04

🤖 Understanding Human Interaction and 3D Consistency

This paragraph discusses the detailed understanding of human interaction and 3D geometry that Sora is beginning to exhibit. It emphasizes the importance of scaling the model to achieve realistic video generation, which requires an internal model of how objects, humans, and environments work. The paragraph also covers the potential for more complex interactions and the ability of video models to learn from various forms of intelligence. The concept of 3D consistency and object permanence is explored, along with the challenges that remain in perfecting these aspects. The paragraph concludes with a discussion on the potential of Sora as a world simulator, including its application in environments like Minecraft.

20:05

🛠️ Fine-Tuning, Industry Applications, and Future Directions

The final paragraph focuses on the potential for fine-tuning the model for specific characters or IPs, the importance of artist control, and the current stage of development. It explains that because generation uses diffusion rather than auto-regression, there is no fundamental scanline-order constraint, and it confirms that videos are generated at 30 FPS. The paragraph also discusses the engagement with artists and red teamers, the focus on safety and responsible use, and the potential for interactive videos. The challenges of working with video data and the engineering effort required are acknowledged, along with the goals achieved in building the first version of the technology.

25:05

🌐 Training Data and the Path to AGI

The paragraph concludes the discussion by addressing the question of training data requirements for AGI and whether the internet provides enough data. The speaker expresses confidence in the sufficiency of available data and the creativity of people in overcoming limitations. The talk ends with a positive outlook on the potential of AI and the impact of the technology discussed.

Keywords

💡AI

Artificial Intelligence (AI) refers to the simulation of human intelligence in machines that are programmed to think like humans and mimic their actions. In the context of the video, AI is central to the discussion about the development of advanced video generation capabilities and its role in achieving AGI (Artificial General Intelligence), which is the intelligence of a machine that could successfully perform any intellectual task that a human being can.

💡Video Generation

Video generation is the process of creating or synthesizing video content using computational methods. In the video, this term is used to describe the technology's ability to produce complex, minute-long, high-definition videos with intricate details and realistic movements, which was a significant goal for the developers.

💡Object Permanence

Object permanence is the concept that objects continue to exist even when they are not within the observer's line of sight or immediate sensory range. In the context of the video, this is a critical aspect of video generation technology, as it ensures that objects in the generated video maintain their existence and properties consistently, regardless of changes in perspective or scene transitions.

💡3D Geometry

3D Geometry refers to the mathematical study of shapes, sizes, positions, and properties of objects in three-dimensional space. In the video, the ability to understand and accurately represent 3D geometry is crucial for the video generation technology to create realistic and immersive virtual environments where the spatial relationships between objects and the camera are correctly portrayed.

💡Content Creation

Content creation involves the production of original content, such as videos, images, or text, for various platforms and purposes. In the video, content creation is a key application of the discussed technology, as it has the potential to revolutionize how media is produced by enabling the generation of complex and engaging video content without the need for traditional, resource-intensive methods.

💡Transformers

Transformers are a type of deep learning model architecture that is primarily used for natural language processing tasks but has been adapted for other applications, such as video generation. In the video, the term refers to the underlying technology that allows the model to process and generate video content by training on 'SpaceTime patches,' which are akin to tokens in language models.
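As a minimal illustration of how such patch tokens could then be processed (the dimensions, layer count, and use of a stock PyTorch encoder are assumptions made for this sketch, not Sora's architecture), flattened spacetime patches can be linearly embedded and attended over by a standard Transformer, exactly the way word tokens are in a language model:

```python
import torch
import torch.nn as nn

patch_dim, model_dim, num_patches = 3072, 512, 1024   # illustrative sizes only

embed = nn.Linear(patch_dim, model_dim)                      # patch -> token embedding
pos = nn.Parameter(torch.zeros(1, num_patches, model_dim))   # learned position embeddings
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=model_dim, nhead=8, batch_first=True),
    num_layers=6,
)

patches = torch.randn(2, num_patches, patch_dim)   # a batch of two patchified videos
tokens = encoder(embed(patches) + pos)             # attend over all space-time tokens at once
print(tokens.shape)                                # torch.Size([2, 1024, 512])
```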

💡Denoising

Denoising is the process of removing noise from a signal or data, such as a video, to enhance its quality. In the context of the video, denoising is part of the generative process where the model starts with a noisy video and iteratively applies transformations to reduce the noise, ultimately producing a clear and detailed video output.
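To make the iterative-removal idea concrete, here is a minimal sampling-loop sketch in Python. It is a generic, deliberately simplified diffusion-style sampler, not Sora's actual method; `predict_noise` is a hypothetical model call, and the update rule omits the noise schedules that real samplers (DDPM, DDIM, etc.) use.

```python
import numpy as np

def sample_video(predict_noise, shape, steps=50, seed=0):
    """Generate a sample by starting from pure noise and iteratively denoising.

    predict_noise(x, t) is assumed to return the model's estimate of the noise
    still present in x at step t.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)            # a video volume that is entirely noise
    for t in reversed(range(steps)):
        x = x - predict_noise(x, t) / steps   # strip away a fraction of the estimated noise
    return x                                  # the final, denoised sample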

💡Stable Diffusion

Stable Diffusion is a widely used open-source latent diffusion model for image generation; the name refers to that specific system rather than to a 'stable version' of diffusion in general. Diffusion models create new samples by starting from noise and iteratively denoising, which is the same class of generative process the speakers describe Sora using to produce smooth and coherent video sequences.
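On the same theme, the talk mentions applying an SDEdit-style method for zero-shot video-to-video editing. Below is a hedged sketch of that idea, reusing the hypothetical `predict_noise` callable from the previous snippet: instead of starting from pure noise, the source video is partially noised and then denoised under the new prompt, so the overall structure is preserved while the style changes. This is illustrative only, not Sora's actual code; prompt conditioning is assumed to live inside the model call.

```python
import numpy as np

def edit_video(predict_noise, source_video, strength=0.6, steps=50, seed=0):
    """SDEdit-style editing sketch (illustrative only).

    strength in (0, 1]: how much of the source to destroy with noise. Higher
    values give the model more freedom to restyle; lower values preserve more
    of the original structure.
    """
    rng = np.random.default_rng(seed)
    start = int(steps * strength)
    x = source_video + strength * rng.standard_normal(source_video.shape)  # partial noising
    for t in reversed(range(start)):
        x = x - predict_noise(x, t) / steps   # the usual denoising loop, run for fewer steps
    return x
```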

💡Collaborators

Collaborators in this context refer to the team of individuals who have worked together to develop the video generation technology. They are likely a diverse group of experts in fields such as AI, computer vision, and software engineering, all contributing their unique skills to achieve the common goal of creating advanced video generation capabilities.

💡Special Effects

Special effects, or visual effects, are the processes by which filmmakers create illusions or visual enhancements that are not possible to achieve in live-action shooting. In the video, the term is used to discuss how the video generation technology can create fantastical and complex special effects that would otherwise be very expensive or challenging to produce using traditional methods.

💡Photorealistic

Photorealistic refers to the creation of images or videos that are incredibly realistic and indistinguishable from photographs or real-life scenes. In the video, the technology's ability to generate photorealistic content is highlighted, emphasizing its potential to produce high-quality, believable virtual environments and characters.

Highlights

AGI House honors people like the attendees, underscoring their importance in the creation process.

The introduction of an AI-generated video that is HD and a minute long, marking a significant leap in video generation technology.

Complexity in the video is noted, such as reflections and shadows, indicating advancements in video generation capabilities.

The video generation technology demonstrates object permanence and consistency over long durations, which is a challenging problem to solve.

The technology can produce various styles, such as a paper craft world, showcasing its versatility.

The AI understands the geometry and physical complexities of 3D spaces, indicating its learning capabilities about the physical world.

The potential of video generation to revolutionize content creation is discussed, with a focus on its impact on various industries.

A sample movie trailer is shown, featuring an astronaut persisting across multiple shots, demonstrating the technology's ability to create coherent narratives.

The technology's implications for special effects in the film industry are explored, highlighting its potential to reduce costs and increase creativity.

The technology can generate photorealistic content as well as animated content, expanding the range of creative possibilities.

A unique scene with a blend of animals and a jewelry store is presented, showcasing the technology's ability to create complex, fantastical environments.

The technology's potential to democratize content creation is discussed, enabling more people to bring their creative visions to life.

The process of training language models on a vast array of text data is explained, drawing a parallel between language models and the visual model being developed, which is trained on diverse visual data.

The concept of aspect ratios and how they can be used to generate content in various formats is discussed, highlighting the technology's adaptability.

The use of zero-shot video capabilities is described, allowing for the generation of videos in different styles from a single input.

The ability to interpolate between videos and create seamless transitions between different creatures or scenes is showcased.

The potential for the technology to contribute to AI by modeling how humans think and interact is discussed, emphasizing its importance on the path to AGI.

The importance of scaling the model and increasing compute is emphasized as a key factor in improving the technology.

The technology's ability to model complex scenes and interactions between agents, such as people and animals, is highlighted as a sign of future capabilities.

Transcripts

play00:01

here's here's one thing about AGI House

play00:04

we honor the people like you guys that's

play00:06

why we have you here that's why you're

play00:08

all here without the further Ado we're

play00:12

[Applause]

play00:17

going awesome what a big fun craft i'm

play00:20

Tim this is Bill and we made Sora at

play00:23

OpenAI together with a team of amazing

play00:26

collaborators we're excited to tell you

play00:28

a bit about it today we'll talk a bit

play00:30

about at a high level what it does some

play00:33

of the opportunities it has to impact

play00:35

content creation some of the technology

play00:38

behind it as well as why this is an

play00:40

important step on the path to

play00:43

AGI so without further ado here is a Sora

play00:46

generated video this one is really

play00:49

special to us because it's HD and a

play00:51

minute long and that was a big goal of

play00:53

ours when we're trying to figure out

play00:56

like what would really make a leap for

play00:57

video generation you want 1080p

play01:00

videos that are a minute long this video

play01:02

does that you can see it has a lot of

play01:03

complexity too like in the reflections

play01:06

and the Shadows one really interesting

play01:08

point that sign that blue sign she's

play01:09

about to walk in front of

play01:12

it and after she passes the sign still

play01:15

exists afterwards that's a really hard

play01:18

problem for video generation to get this

play01:19

type of object permanence and

play01:21

consistency over a long

play01:24

duration so I can

play01:27

do here we go a number of different St

play01:29

Styles too this is a paper craft world

play01:33

that I can imagine so that's really cool

play01:39

and and also it can learn about a full

play01:43

3d space so here the camera moves

play01:45

through 3D as people are moving but it

play01:48

really understands the geometry and the

play01:50

physical complexities of the scene so Sora

play01:53

learned a lot in addition to just being

play01:56

able to generate content it's actually

play01:58

learned a lot of intelligence about the

play02:00

physical world just from training when

play02:02

videos and now we'll talk a bit about

play02:06

some of the opportunities with video

play02:07

generation for revolutionizing creation

play02:10

alluded to we're really excited about

play02:12

Sora not only because we view it as

play02:15

being on the critical path towards AGI

play02:17

but also in the short term for what it's

play02:19

going to do for Content so this is one

play02:21

sample we like a lot so the front in the

play02:23

bottom left here

play02:25

is a movie trailer featuring The

play02:27

Adventures of the 30-year-old Spaceman

play02:29

hardest part of doing video by the way

play02:31

is always just getting PowerPoint to

play02:33

work with

play02:40

it there we

play02:41

[Music]

play02:43

go all right now we're what's cool about

play02:46

this sample in particular is that this

play02:48

astronaut is persisting across multiple

play02:50

shots which are all generated by we

play02:51

didn't Stitch this together we didn't

play02:52

have to do a bunch of outakes and then

play02:54

create a composite shot at the end Sora

play02:57

decides where it wants to change the

play02:58

camera but it does know that it's going to

play02:59

put the same astronaut in a bunch of

play03:01

different environments likewise we think

play03:03

there's a lot of cool implications for

play03:05

special effects this is one of our

play03:06

favorite samples too an alien blending

play03:08

in naturally New York City paranoia

play03:09

Thriller style 35 mil and already you

play03:12

can see that the model is able to create

play03:14

these very Fantastical effects which

play03:16

would normally be very expensive in

play03:17

traditional CGI pipelines for Hollywood

play03:19

there's a lot of implications here for

play03:21

what this technology is going to bring

play03:23

shortterm of course we can do other

play03:25

kinds of effects too so this is more of

play03:28

a Sci-Fi scene so it's a scuba diver

play03:30

discovering a hidden futuristic

play03:31

shipwreck with cybernetic marine life

play03:33

and advanced alien

play03:34

[Music]

play03:37

technology as someone who's seen so much

play03:40

incredible content from people on the

play03:42

internet who don't necessarily have

play03:43

access to tools like Sora to bring their

play03:45

Visions to life they come up with like

play03:47

cool phosy post them on Reddit or

play03:48

something it's really exciting to think

play03:50

about what people are going toble to do

play03:52

with this

play03:52

technology of course it can do more than

play03:55

just photo realistic style you can also

play03:58

do animated content

play04:00

really ADOT my favorite part of this one

play04:02

is the spell

play04:04

otter a little bit of

play04:10

charm and I think another example of

play04:14

just how cool this technology is is when

play04:16

we start to think about scenes which

play04:18

would be very difficult to shoot with

play04:20

traditional Hollywood kind of

play04:22

infrastructure the problems here is the

play04:23

Blom Zoo shop in New York City is with

play04:25

the jewelry store and Zoo saber-tooth

play04:27

tigers with diamond and gold adornments

play04:29

Turtles with listening Emerald shells

play04:30

Etc and what I love about this shot is

play04:33

it's photo realistic but this is

play04:35

something that would be incredibly hard

play04:36

to accomplish with traditional tools

play04:38

that they have in Hollywood today this

play04:39

kind of shot would of course require CGI

play04:41

it would be very difficult to get real

play04:43

world animals into these kinds of scenes

play04:45

but with Sora it's pretty trivial and it

play04:46

can just do it

play04:49

on so I'll hand it over to Tim to chat a

play04:52

bit about how we're working with artists

play04:53

today with Sora to see what they're able

play04:55

to do yeah so we just came out with

play04:58

pretty recently we given access to a

play05:00

small pool of artists and maybe even to

play05:03

take a step back this isn't yet a

play05:05

product or something that is available

play05:07

to a lot of people it's not in ChatGPT

play05:09

or anything but this is research that

play05:11

we've done and we think that the best

play05:13

way to figure out how this technology

play05:16

will be valuable and also how to make it

play05:17

safe is to engage with people external

play05:19

from oration so that's why we came out

play05:22

with this announcement and when we came

play05:24

out with the announcement we started

play05:25

working with small teams of red teamers

play05:27

who helped with the safety work as well

play05:29

as artist and people who will use this

play05:31

technology so shy kids is one of the

play05:33

artists that we work with and I really

play05:36

like this quote from them as great as

play05:38

Sora is at generating things that appear

play05:40

real what excites us is the ability to

play05:42

make things that are totally surreal and

play05:45

I think that's really cool because when

play05:47

you immediately think about oh

play05:49

generating videos we have all these

play05:52

existing uses of videos that we know of

play05:54

in our lives and we quickly think about

play05:56

turning those oh maybe stock videos or

play05:58

existing films but what's really

play06:00

exciting to me is what totally new

play06:02

things are people into what completely

play06:04

new forms of media and entertainment and

play06:07

just new experiences for people that

play06:08

we've never seen before are going to be

play06:10

enabled by by Sora and by Future

play06:12

versions of video and media generation

play06:15

technology and now I want to show this

play06:18

fun video that shy kids made using Sora

play06:22

when we gave access to

play06:24

them oh okay it has audio unfortunately

play06:27

I guess we don't have that hooked up

play06:37

it's this cute

play06:39

plot about this guy with the balloon

play06:43

head you should really go and check it

play06:45

out we came out with this blog post Sora

play06:47

First Impressions and we have videos

play06:49

from a number of artists that we've

play06:51

access to and there's this really cute

play06:53

monologue of this guy talking about life

play06:56

from the different perspective of me as

play06:58

a guy with a balloon head right and this

play07:00

is just awesome and so creative and the

play07:03

other artists we've been access to have

play07:05

done really creative and totally

play07:06

different things from this too like the

play07:08

way each artist uses this is just so

play07:11

different from each other artist which

play07:12

is really exciting because that says a

play07:13

bit about the breath of ways that you

play07:16

can use this technology but this is just

play07:19

really fun and there are so many people

play07:21

with such brilliant ideas as Bill was

play07:24

talking about that maybe it would be

play07:26

really hard to do things like this or to

play07:27

make their film or their thing that's

play07:29

not a film that's totally new and

play07:31

different and hopefully this technology

play07:33

will really democratize content Creation

play07:35

in the long run it enables so many more

play07:37

people with creative ideas to be able to

play07:39

bring those to life and show

play07:45

them I'm want to talk a bit about some

play07:47

of the technology behind Sora so I'll

play07:51

talk about it from the perspective of

play07:53

language models and what has made them

play07:57

work so well is the ability to scale and

play08:00

bitter lesson that methods that improve

play08:02

with scale in the long run are the

play08:04

methods that will win out as you

play08:05

increase compute because over time we

play08:08

have more and more compute and if

play08:09

methods utilize that well then they will

play08:12

get better and better and language

play08:15

models are able to do that in part

play08:17

because they take all different forms of

play08:19

text you take math and code and prose and

play08:23

whatever is out there and you turn it

play08:25

all into this universal language of

play08:27

tokens and then you train

play08:30

these big Transformer models on all

play08:32

these different types of tokens this

play08:34

this kind of universal model of Text

play08:37

data and by training on this vast array

play08:40

of different types of text you learn

play08:43

these really generalist models of

play08:45

language you can do all these things

play08:47

right you can use ChatGPT or whatever

play08:49

your favorite language model is to do

play08:51

all different kinds of tasks and it has

play08:53

such a breadth of knowledge that it's

play08:55

learn from the combination of this

play08:58

variety of data and we want to do the

play09:00

same thing for visual here that's

play09:01

exactly what we did with Sora so we take

play09:03

vertical videos and images and square

play09:07

images low resolution high resolution

play09:10

wide aspect ratio and we turn them into

play09:12

patches and a patch is this little cube

play09:16

in SpaceTime that you can imagine a

play09:18

stack of frames a video is like a stack

play09:21

of images that are all the frames and we

play09:23

have this volume of pixels and then we

play09:27

take these little cubes from inside and

play09:29

you can do that on any volume of pixels

play09:31

whether that's a high resolution image a

play09:33

low resolution image regardless of the

play09:35

aspect ratio long videos short videos

play09:38

you turn all of them into these

play09:40

SpaceTime patches and those are our

play09:42

equivalent of tokens and then we train

play09:45

Transformers on these SpaceTime patches

play09:49

and Transformers are really scalable and

play09:51

that allows us to think of this problem

play09:53

in the same way that people think about

play09:55

language problems of how do we get

play09:58

really good at scaling them and making

play10:00

methods such that as we increase the

play10:02

compute as we increase the data they

play10:03

just get better and

play10:07

better training on multiple aspect

play10:10

ratios also allows us to generate with

play10:12

multiple aspect

play10:15

ratios there we go so here's the same

play10:17

prompt and you can generate vertical and

play10:20

square horizontal that's also a nice it

play10:23

in addition to the fact that allows you

play10:24

to use more data which is really valuable

play10:26

you want to use all the data in its

play10:28

native format as it exists it also gives

play10:31

you more diverse ways to use the V so I

play10:34

actually think vertical videos are

play10:37

really nice like we look at content all

play10:38

the time on our phones right so it's

play10:39

nice to actually be able to generate

play10:41

vertical and horizontal and a variety of

play10:44

things and we can also use zero shot

play10:47

some video-to-video capabilities so this

play10:50

uses a method which is a method that's

play10:51

commonly used with diffusion we can

play10:53

apply that our model uses diffusion

play10:55

which means that it denoises the video

play10:58

starting from noise

play10:59

in order to create the video iteratively

play11:02

so we use this method called SDEdit and

play11:04

apply it and this allows us to change an

play11:07

input video the one on the left it's all

play11:10

generated but it could be a real image

play11:11

then we say rewrite the video in pixel

play11:14

art style put the video in space with

play11:16

the Rainbow Road or change the video to

play11:18

a Medieval theme and you can see that it

play11:20

edits the video but it keeps the

play11:22

structure the same so in in a second it

play11:24

will go through a tunnel for example and

play11:27

it interprets that tunnel in all these

play11:28

different ways this medieval one is

play11:30

pretty amazing right because the model

play11:32

is also intelligent so it's not just

play11:34

changing something shallow about it but

play11:36

it's medieval we don't really have a car

play11:38

so I'm going to make a horse

play11:41

carriage and another fun capability that

play11:45

the model has is to interpolate between

play11:49

videos so here we have two different

play11:52

creatures and this video in the middle

play11:55

starts with the left and it's going to

play11:56

end with the right and it's able to do

play11:59

it in this really

play12:01

seamless and amazing

play12:03

[Music]

play12:06

way so I think something that the past

play12:09

slide in this slide really point out is

play12:11

that there are so many unique and

play12:13

creative things you can potentially do

play12:15

with these models and similar to how

play12:18

when we first had language models

play12:20

obviously people were like oh like you

play12:22

can use it for writing right okay yes

play12:24

you can but there are so many other

play12:26

things you can do with language models

play12:27

and we're only we're even now like every

play12:29

day people coming up with some creative

play12:31

new cool thing you can do with the

play12:33

language model the same thing's going to

play12:34

happen for these visual models there are

play12:36

so many creative interesting ways in

play12:38

which we can use them and I think we're

play12:39

only starting to scratch the surface of

play12:41

what we can do with

play12:43

them here's one I really love so there's

play12:45

a video of a drone on the left and this

play12:47

is a

play12:48

butterfly underwater on the right and

play12:52

we're going to interpolate between the

play12:58

two

play13:09

and some of the Nuance it gets like for

play13:11

example that it makes the coliseum in

play13:14

the middle as it's going slowly start to

play13:17

Decay actually and going like it's

play13:19

really spectacular some of the Nuance

play13:22

that it gets right and and here's one

play13:25

that's really cool too because it's like

play13:26

how can you possibly go in the kind of

play13:29

Mediterranean landscape to this

play13:31

gingerbread house in a way that is like

play13:35

consistent with physics in the 3D world

play13:38

and it comes up with this really unique

play13:40

solution to do it that it's actually

play13:42

occluded by the building and behind it you

play13:45

start to see thiser red

play13:50

[Music]

play13:56

house so I encourage you if you haven't

play13:58

we have in addition to when we released

play14:00

our main blog post we also came up with

play14:02

a technical report and the technical

play14:03

report has these examples and it has

play14:05

some other cool examples that we don't

play14:07

have any these slides too again I think

play14:09

it's really scratching the surface of

play14:10

what we could do with these models but

play14:12

check that out if you haven't there are

play14:14

some other fun things you can do like

play14:16

extending videos forward or backwards I

play14:18

think we have here one example where

play14:20

this is an image we generated this one

play14:23

with DALL·E 3 and then we're going to

play14:25

animate this image using

play14:28

Sora

play14:40

sh all right now I'm going to pass it

play14:42

off to Bill to talk a bit about why this

play14:45

is important on the path to

play14:48

AGI all right of course everyone's very

play14:51

bullish on the role that LLMs are going

play14:53

to play in getting to AGI but we believe

play14:55

that video models are on the critical

play14:57

path to it and concretely

play14:59

we believe that when we look at very

play15:01

complex scenes that Sora generates like that

play15:04

snowy scene in Tokyo that we saw at the

play15:06

very beginning that Sora is already

play15:08

beginning to show a detailed

play15:09

understanding of how humans interact

play15:11

with one another how they have physical

play15:13

contact with one another and as we

play15:15

continue to scale this Paradigm we think

play15:17

eventually it's going to have to model

play15:18

how humans think right the only way you

play15:20

can generate truly realistic video with

play15:22

truly realistic sequences of actions is

play15:24

if you have an internal model of how all

play15:26

objects humans Etc environments work and

play15:29

so we think this is how Sora is going to

play15:31

contribute to AI so of course the name

play15:33

of the game here as it is with LLMs is

play15:35

scaling and a lot of the work that we

play15:38

put into this Paradigm in order to make

play15:40

this happen was as Tim alluded to

play15:41

earlier coming up with this Transformer

play15:43

based framework that scales really effectively

play15:46

and so we have here a comparison of

play15:48

different Sora models where the only

play15:50

difference is the amount of training

play15:51

compute that we put into the model so on

play15:53

the far left there you can see Sora with

play15:54

the base amount of compute it doesn't

play15:56

really even know how dogs look it has a

play15:58

rough sense that like camera should move

play15:59

through scenes but that's about it if

play16:01

you 4x the amount of compute that we

play16:03

put in for that training one then you

play16:04

can see it now a she know what's like

play16:07

can put a hat on it and it can put a

play16:08

human in the background and if you

play16:10

really crank up the compute and you go

play16:11

to 32x base then you begin to see these

play16:14

very detailed Textures in the

play16:15

environment you see this very last

play16:16

movement with the feet and the dog's

play16:18

legs as it's navigating through the

play16:19

scene you can see that the woman's hands

play16:21

are beginning to interact with that

play16:22

knitted hat and so as we continue to scale

play16:25

up Sora just as we find emergent

play16:27

capabilities in llms we we believe we're

play16:29

going to find emergent capabilities in

play16:30

video models as well and even with the

play16:33

amount of compute that we put in today

play16:35

not that 32x Mark we think there's

play16:37

already some pretty cool things that are

play16:38

happening so I'm going to spend a bit of

play16:39

time talking about that so the first one

play16:42

is complex scenes and animals so this is

play16:45

another sample for this beautiful snowy

play16:47

Tokyo City M and again you see the

play16:50

camera flying through the scene it's

play16:52

maintaining this 3D geometry this

play16:54

couple's holding hands you can see

play16:55

people at the Stalls it's able to

play16:57

simultaneously model very complex

play16:59

environment with a lot of agents in it

play17:01

so today can only do pretty basic things

play17:04

like these fairly like low-level

play17:06

interactions but as we continue to scale

play17:08

the model we think this is indicative of

play17:10

what we can expect in the future you

play17:11

know more kind of conversations between

play17:13

people which are actually substantive

play17:15

and meaningful and more complex physical

play17:17

interactions another thing that's cool

play17:19

about video models compared to llms is

play17:21

we can do animals got a great one here

play17:23

there's a lot of intelligence Beyond

play17:25

humans in this world and we can learn

play17:27

from all that intelligence we're not

play17:28

limited to one notion of it and you can

play17:31

do animals we can do dogs we really like

play17:33

this one this is a dog in Burano Italy

play17:36

and you can see it's wants to just go to

play17:38

that other window it stumbles a little

play17:40

bit but it recovers so it's beginning to

play17:43

build the model not only about for

play17:44

example humans and local through scenes

play17:46

but how any

play17:49

animal another property that we're

play17:51

really excited about is this notion of

play17:53

3D consistency so there was I think a

play17:56

lot of debate at one point within the

play17:57

academic Community about the extent to

play17:59

which we need inductive biases and

play18:01

generative models to really make them

play18:03

successful and with Sora one thing that

play18:05

we wanted to do from the beginning was

play18:07

come up with a really simple and

play18:08

scalable framework that completely eschews

play18:12

any kind of hard-coded inductive biases

play18:14

from humans about physics and so what we

play18:17

found is that this works so as long as

play18:18

you scale up the model enough it can

play18:20

figure out 3D geometry all by itself

play18:22

without us having to bake and break

play18:23

consistency into the model

play18:25

correctly so here's an aerial view of

play18:29

during the blue hour showcasing the

play18:30

stunning architecture of white Cycladic

play18:33

buildings with Blue Domes and all these

play18:35

aerial shots we found T to be like

play18:38

pretty successful this s like you don't

play18:39

have to cherry pick too much to get it

play18:41

really does a great job at consistently

play18:43

coming up with good results

play18:45

here aerial view of Y both hikers as

play18:49

well as a g water pole they do some

play18:52

extreme hiking

play18:57

at

play19:03

[Music]

play19:06

so another property which has been

play19:08

really hard for video generation systems

play19:09

in the past but Sora has mostly figured

play19:11

out it's not perfect is object

play19:13

permanence and so we can go back to our

play19:15

favorite little scene of the Dalmatian

play19:17

in Burano and you can see even as a

play19:19

number of people pass by

play19:21

it the dog is still there so Sora not only

play19:25

gets these kind of very like shortterm

play19:27

interactions direct like saw earlier

play19:29

with the woman passing by the blue sign

play19:30

in Tokyo but even when you have multiple

play19:32

levels of occlusion it can still

play19:36

Rec in order to have like a really

play19:38

awesome video generation system by

play19:40

definition what you need is for there to

play19:42

be non-trivial and really interesting

play19:43

things that happen over time in the old

play19:45

days when we were generating like 4-second

play19:47

videos uh usually all we saw were like

play19:49

very light animated gifs that was what

play19:51

most video generation systems were

play19:53

capable of and Sora is definitely a step

play19:56

forward and now we're beginning to see

play19:58

signs that you can actually do like

play20:00

actions that permanently affect the

play20:02

world State and so this is i' say one of

play20:05

like the weaker aspects of Sora today it

play20:07

doesn't nail this 100% of the time but

play20:09

we do see Lems of success here so I'll

play20:10

share a few here so this is a watercolor

play20:13

painting and you can see that as the

play20:16

artist is leaving brush Strokes they

play20:17

actually stick to the canvas so they're

play20:19

actually able to make a meaningful

play20:20

change to the world and you don't just

play20:21

get kind of like a blurry

play20:25

nothing so this older man with hair is

play20:29

devouring a cheeseburger wait for it

play20:32

there we go so he actually leaves a bite

play20:34

in it so these are very simple kinds of

play20:36

interactions but this is really

play20:38

essential for video generation systems

play20:40

to be useful not only for Content

play20:42

creation but also in terms of AGI and

play20:44

being able to model long range

play20:45

dependencies if someone does something

play20:47

in the distant past and you we want to

play20:48

generate a whole movie we need to make

play20:50

sure the model can remember that and

play20:51

that state is affected over time so this

play20:54

is a step for that with

play20:57

s

play20:59

when we think about Sora as a world

play21:00

simulator of course we're so excited

play21:02

about modeling our real world's physics

play21:04

and that's been a key component of this

play21:06

project but at the same time there's no

play21:08

real reason to stop there so there's

play21:10

lots of other kinds of Worlds right

play21:11

every single laptop we use every

play21:13

operating system we use has its own set

play21:15

of physics it has its own set of

play21:16

entities and objects and rules and Sora

play21:19

can learn from everything it doesn't

play21:20

just have to be a real world physics

play21:22

simulator so we're really excited about

play21:24

the prospect ass simulating literally

play21:25

everything and as a first step towards

play21:28

that

play21:29

we tried Minecraft so this is Sora and

play21:31

the prompt is Minecraft with the most

play21:33

gorgeous high-res 8k texture pack ever

play21:36

and you can see already Sora knows a lot

play21:38

about how Minecraft works so it's not

play21:40

only rendering this environment but it's

play21:42

also controlling the player with the

play21:44

reasonably intelligible policy it's not

play21:45

too interesting but it's doing something

play21:47

and it can model all the objects in the

play21:49

scene as well so we have another sample

play21:51

with the same

play21:53

prompt it shows a different texture pack

play21:56

this time and we're really excited about

play21:58

this notion that one day we can just

play22:00

have a singular model which really can

play22:02

encapsulate all the knowledge across all

play22:04

these world so one joke we like to say

play22:05

is you can run ChatGPT in the video model

play22:11

eventually and now let's chat a bit about

play22:13

failure cases so of course Sora has a

play22:16

long way to go this is

play22:21

really Sora has a really hard time with

play22:23

certain kinds of physical interactions

play22:24

still today that we would think as being

play22:26

very simple so like share object in Sor

play22:30

mind even simpler kinds of physics than

play22:34

this if you drop a glass and shatter if

play22:35

you try to do a sample like that Sora

play22:37

will get it wrong almost every time so it

play22:39

really has a long way to go and

play22:41

understanding very basic things that we

play22:43

take for granted so we're by no means

play22:45

anywhere near the end of this yet and to

play22:48

wrap up we have a bunch of samples here

play22:50

and we go to questions I think overall

play22:52

we're really excited about where this

play22:54

Paradigm is

play22:57

going

play23:07

we don't know

play23:09

next to extend

play23:12

it so we really view this as being like

play23:14

the gpt1 of video and we think this

play23:18

technology is going to get a lot better

play23:19

very soon there's some signs of life and

play23:22

some cool properties we're already

play23:23

seeing like I just went over um but

play23:25

we're really excited about this we think

play23:27

the things that people are going build

play23:28

on top of Ms like this are going to be

play23:30

mindblowing and really amazing and we

play23:32

can't wait to see what the world does

play23:34

with it so thanks a

play23:40

lot we have 10 minutes who goes

play23:44

first all right um so question about

play23:48

like understanding the agents or having

play23:50

the agent interact with each other with

play23:52

in the scene is that piece of

play23:54

information explicit already or is it

play23:56

just the P SS and then you have to run

play23:58

like a can now talk good question so all

play24:01

this is happening implicitly and so you

play24:03

know when we see these like Minecraft

play24:04

samples we don't have any notion of

play24:07

where it's actually modeling the player

play24:09

and where it's explicitly representing

play24:10

actions within the environment so you're

play24:12

right that if you wanted to be able to

play24:14

exactly describe what is happening or

play24:16

somehow read it off you would need some

play24:17

other system on top of Sora currently to

play24:19

be able to extract that information

play24:20

currently it's all implicit in the

play24:22

princi and emplo for that matter

play24:25

everything's implicit 3D is implicit

play24:27

everything is there's no

play24:28

anything so basically the things that

play24:30

you just describ right now is all the

play24:32

cool properties derived from the model

play24:36

like

play24:37

after cool

play24:39

that's could you talk a little bit about

play24:42

the potential for fine tuning so if you

play24:45

have a very specific character or IP I

play24:49

know for the the wave one you used an

play24:51

input image for that how do you think

play24:53

that those plugins

play24:55

or built into the process yeah great

play24:58

question so this is something we're

play25:00

really interested in in general one

play25:02

piece of feedback we've gotten from

play25:03

talking with artists is that they just

play25:05

want the model to be as controllable as

play25:06

possible to your point if they have a

play25:08

character they really love and that

play25:09

they've designed they would love to be

play25:10

able to use that across Sora generations

play25:13

it's something that's actively on our

play25:14

mind you could certainly do some kind of

play25:17

fine tuning with the model if you had a

play25:18

specific data set of your content that

play25:21

you wanted to adapt the model for um we

play25:23

don't currently we're really in like a

play25:25

stage where we're just finding out

play25:27

exactly like what people want so so this

play25:28

kind of feedback is actually great for

play25:29

us so we don't have a clear road map for

play25:31

exactly that might be possible but in

play25:33

theory it's

play25:35

probably all right on the back you okay

play25:38

so language Transformers you're like

play25:41

pying autor regressively predicting this

play25:44

like sequential manner but in

play25:45

Transformers we do like this scanline

play25:47

order maybe we do like a snake through

play25:50

the spatial domain do you see this as a

play25:52

fundamental constraint Vision

play25:53

Transformers does it matter if you do

play25:56

does the order at which you predict

play25:58

tokens

play26:00

matter yeah good question in this case

play26:03

we're actually using diffusion so it's

play26:05

not an auto regressive Transformer in

play26:07

the same way that language models are

play26:09

but we're Den noising the videos that we

play26:11

generate so we start from a video that's

play26:13

entirely noise and we iteratively run our

play26:17

model to remove the noise and when you

play26:19

do that enough times you remove all the

play26:21

noise and you end up with a sample and

play26:23

so we actually don't have this like scan

play26:26

line order for example because you can

play26:28

do the denoising across many SpaceTime

play26:32

Patches at the same time and for the

play26:34

most part we actually just do it across

play26:36

the entire video at the same time we

play26:38

also have a way and we get into this a

play26:40

bit in that technical report that if you

play26:42

want to you could first generate a

play26:44

shorter video and then extend it so

play26:46

that's also an option but it can be used

play26:48

in either way either you can generate

play26:49

the video all at once or you can

play26:51

generate a shorter video and extended if

play26:53

you

play26:56

like yeah so the internet Innovation was

play26:59

mostly driven by porn do you feel a need

play27:03

to pay that adult industry

play27:10

back I feel no need also

play27:15

yeah all

play27:21

right do you generate that at 30 frames per

play27:24

second or do you like frames frame

play27:27

generation at

play27:28

that all the four way slower

play27:31

than we generate 30

play27:35

FPS okay have you tried like colliding

play27:39

cars or like rotations and things like

play27:41

that to see if the image generation

play27:45

fits into like a physical model world

play27:47

that

play27:50

OBS we've tried a few examples like that

play27:52

I'd say rotations generally tend to be

play27:55

pretty reasonable it's by no means

play27:57

perfect I've seen it couple samples from

play27:59

Sora of colliding cars I don't think

play28:01

it's quite got three laws down

play28:08

yet so what are the IND Ed that you

play28:12

trying to fix right now with Sora that

play28:17

your

play28:18

so the engagement with people external

play28:22

right now is mainly focused on artists

play28:24

and how they would use it and what

play28:26

feedback they have for being able to to

play28:28

use it and people red teamers on safety

play28:31

so that's really the two types of

play28:33

feedback that we're looking for right

play28:34

now and as Bill mentioned a really

play28:36

valuable piece of feedback we getting

play28:37

from artists the type of control they

play28:39

want for example artists often want

play28:41

control of the camera and the path of

play28:43

the camera case also and then on the

play28:46

safety concerns it's about we want to

play28:48

make sure that if we were to give wider

play28:50

access to this that it would be

play28:52

responsible and safe and there are lots

play28:53

of potential misuses for it and

play28:55

disinformation there many concerns Focus

play29:00

possible to make videos that a user

play29:02

could actually interact with it like

play29:04

through VR or something so let's say like

play29:05

video is playing halfway through I stop

play29:07

it I change a few things around with

play29:09

video just like Chris would I be able to

play29:11

rest of the video incorporate those

play29:13

changes it's a great idea right now Sora

play29:15

is still pretty slow from the latency

play29:17

perspective what we generally said

play29:19

publicly is so it depends a lot on the

play29:21

exact parameters of the generation

play29:22

duration resolution if you're cranking

play29:25

out this thing it's going to take at

play29:26

least a couple minutes and so we're

play29:28

still I'd say a ways off from the kind

play29:30

of experience you're describing but I

play29:32

think it' be really cool

play29:34

thanks what were your stated goals in

play29:37

building this first version and what

play29:39

were some problems that you had along

play29:41

the way that you learned

play29:43

from I'd say the overarching goal was

play29:46

really always to get to 1080p at least

play29:49

30 seconds from like the early days of

play29:51

the project so we felt like video

play29:53

generation was stuck in the Rut of this

play29:56

4 second like GIF generation

play29:58

and so that was really the key focus of

play30:00

the team throughout the project along

play30:02

the way I think we discovered how

play30:04

painful it is to work with video data

play30:06

it's a lot of pixels in these videos and

play30:08

it's a lot of just very detailed boring

play30:12

engineering work that needs to get done

play30:14

to really make these systems work and I

play30:17

I think we knew going into it that it

play30:19

would involve a lot of elbow grease in

play30:20

that regard but yeah it certainly took

play30:22

some time so I don't know any other

play30:24

findings along the way yeah I mean we

play30:28

tried really hard to keep the method

play30:30

really simple and that is sometimes

play30:32

easier said than done but I think that

play30:34

was a big focus of just let's do the

play30:36

simplest thing we possibly can and

play30:39

really scale it and do the scaling

play30:43

properly did you do the prompt and see the

play30:46

output it's not good enough then you go

play30:48

train again do the same prompt and then it's

play30:51

there that's first video then you do

play30:54

more than training than the new prom and

play30:58

new video is that the process you use in

play31:00

this reling the

play31:02

videos that's a good question evaluation

play31:05

is challenging for videos we use a

play31:07

combination of things one is your actual

play31:10

loss and low loss is correlated with

play31:12

models that are better so that can help

play31:14

another is you can evaluate the quality

play31:17

of individual frames using image metrics

play31:19

so we do use standard image metrics to

play31:21

evaluate the quality frames and then we

play31:23

also did spend quite a lot of time

play31:26

generating samples and looking at them

play31:28

ourselves although in that case it's

play31:29

important that you do it across a lot of

play31:31

samples and not just individual prompts

play31:34

because sometimes this process is noisy

play31:36

so you might randomly get a good sample

play31:38

and think that you made an improvement so this

play31:40

would be like you compare Lots ofrs in

play31:42

the

play31:48

outputs uh we can't comment on that one

play31:51

last

play31:54

question thanks for a great talk so my

play31:56

question is on the training data so how

play31:58

much training data do you estimate that

play32:00

is required for us to get to AGI and do

play32:02

you think we have enough data on the

play32:06

internet yeah that's a good question I

play32:08

think we have enough data to get to

play32:10

AGI and I also think people always come

play32:14

up with creative ways to improve things

play32:16

and when we hit limitations we find

play32:19

creative ways to improve regardless so I

play32:23

think that whatever data we have will be

play32:25

enough to get to AGI wonderful okay

play32:28

that's to AI thank

play32:31

[Applause]

play32:34

you

Related Tags
AI Innovation, Video Generation, Content Creation, Sora Platform, AGI Pathway, 3D Consistency, Artistic Collaboration, Tech Advancement, Visual Intelligence, Digital Art