Google's LUMIERE AI Video Generation Has Everyone Stunned | Better than RunWay ML?

AI Unleashed - The Coming Artificial Intelligence Revolution and Race to AGI
24 Jan 2024 · 21:06

Summary

TLDR: Google has unveiled Lumiere, an advanced AI text-to-video model that translates text prompts into high-quality, coherent videos. Lumiere goes beyond text-to-video: it can animate images, perform video inpainting, and generate temporally consistent shots using a space-time architecture. Separate research suggests diffusion models may be developing an internal representation of 3D scenes despite only ever seeing 2D images. In the paper's evaluations, Lumiere outperforms other models like Pika and Gen-2 on metrics such as text alignment and video quality. This technology could empower everyday creators to make Hollywood-style films with AI, and the rapid pace of improvement makes this an exciting time for aspiring AI cinematographers.

Takeaways

  • 😲 Google unveils new AI model Lumiere for generating realistic videos from text and images
  • 🎥 Lumiere allows text-to-video, image-to-video, stylized video generation, video animation, and more
  • 🌄 Researchers optimized Lumiere for temporal consistency across video frames
  • 🔍 Study investigates whether diffusion models learn deep representations or just surface statistics
  • 💡 Evidence suggests models develop some innate sense of 3D geometry and scene composition
  • 😎 Lumiere outperforms state-of-the-art video AI models such as Imagen Video, Gen-2, and others
  • ⏰ Video quality has drastically improved over the past year thanks to AI advancements
  • 🎞 Everyday creators may soon produce Hollywood-style films using AI visuals and voices
  • 🚀 'World models' that simulate environments could be the next evolution of video AI
  • 📈 Expect rapid improvements in coherence and realism of AI-generated video in the near future

Q & A

  • What is Lumiere and what are its capabilities?

    -Lumiere is Google's latest AI text-to-video model. It can translate text prompts into video, animate existing images, generate video in the style of a reference image or painting, and fill in missing regions of a video (video inpainting).

  • How does Lumiere compare to other text-to-video AI models?

    -According to the paper, Lumiere performs better than other state-of-the-art models like Pika and Gen-2 in terms of temporal consistency, text alignment to the prompt, and user preference. (A rough sketch of how a text-alignment score can be computed appears after this Q&A list.)

  • How does the SpaceTime architecture used in Lumiere differ from previous approaches?

    -Whereas previous models generate distant keyframes and then fill in the frames between them (temporal super-resolution), Lumiere's Space-Time U-Net generates the full video duration in a single pass, improving global temporal consistency.

  • What evidence suggests AI models may be learning more than surface statistics?

    -The Beyond Surface Statistics paper (a separate study, not about Lumiere) showed that an image diffusion model develops internal representations related to scene geometry and depth despite only seeing 2D images during training.

  • What are some potential applications of Lumiere?

    -Lumiere could be used for video stylization, creating CGI and special effects, generating storyboards, converting image collections into video, and more.

  • How might Lumiere impact filmmaking and content creation?

    -Lumiere may allow everyday creators to produce Hollywood-quality video using AI, opening up new genres of AI-assisted filmmaking.

  • What are general world models, and why are they the next step for AI?

    -A general world model is an AI system that builds an internal representation of an environment and uses it to simulate future events within that environment. This could enable more realistic video generation and better robotics through a deeper understanding of physics and dynamics.

  • How has AI-generated video quality improved over the past year?

    -Video quality has improved drastically, with more temporally consistent objects and scenes. Compare today's smooth output to distorted legacy examples from a year ago.

  • What role might creative professionals play in this new era of AI-generated content?

    -Humans are still needed to provide creative vision and high-level direction. AI will assist with the technical execution, amplifying human creativity.

  • What developments are coming next for AI-generated video and imagery?

    -Higher video resolution, longer duration, more photorealism, and tools to easily control and direct the AI are all likely next steps.
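
As referenced above, here is a rough sketch of one common way a text-alignment score for generated video can be computed: embed the prompt and each frame with CLIP and average the cosine similarity over frames. This is a generic recipe using the Hugging Face transformers CLIP checkpoint openai/clip-vit-base-patch32, not necessarily the exact protocol used in the Lumiere paper, and the frame loading is left as a placeholder.

from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def text_alignment(prompt: str, frames: list[Image.Image]) -> float:
    # Embed the prompt once and every decoded frame, then average cosine similarity.
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    text_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    image_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    return (image_emb @ text_emb.T).mean().item()

# frames = [Image.open(p) for p in frame_paths]   # hypothetical list of decoded frames
# print(text_alignment("a sheep to the right of a wine glass", frames))

A higher average similarity means the frames stay closer to what the prompt describes; looking at the per-frame scores can also reveal whether a video drifts away from the prompt over time.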

Outlines

00:00

😯Introducing Lumiere, Google's new text-to-video AI

This paragraph introduces Lumiere, Google's new text-to-video AI model. It highlights Lumiere's core capability of generating video from text prompts, as well as animating existing images. Examples are shown of text-to-video, image-to-video, stylized generation, video stylization, cinemagraphs, and video inpainting. The consistency of the generated videos is noted, and an overview of the space-time diffusion model used by Lumiere is provided.

05:00

🎥Lumiere enables easy video production and new creative possibilities

This paragraph discusses the potential of Lumiere to transform video production, allowing everyday people to create Hollywood-style videos with AI-generated footage and voices. It also covers some background research into how neural nets generate images, suggesting they may be learning more than just surface statistics.

10:01

👀Debates continue about how generative AI models create content

This paragraph analyzes debates within the AI research community about whether generative models like Lumiere simply memorize pixel correlations between inputs and outputs or if they develop some deeper understanding. An experiment showing a model recreating 3D scene aspects without being explicitly trained to do so is covered.

15:03

😎Lumiere advances the state of the art in consistency and quality

This paragraph evaluates Lumiere's video quality and consistency against other leading models such as Imagen Video, Gen-2, and AnimateDiff. Quantitative analysis and side-by-side examples highlight Lumiere's stronger performance across text-to-video and image-to-video tasks.

20:05

🤖Simulating entire worlds may be the next frontier

The final paragraph examines the idea of developing general world models to move beyond isolated video clips toward AI systems that simulate fuller environments, enabling the generation of more realistic and complex videos. The rapid recent progress of AI video generation capabilities is also highlighted.


Keywords

💡Lumiere

Lumiere is described as Google's latest AI tool, a text-to-video AI model that translates text into video. This technology signifies a leap in AI capabilities, allowing for the animation of existing images, video inpainting, and the creation of specific animation sections within images. The script illustrates Lumiere's ability to produce videos from text prompts, such as animating flags, dogs with headphones, and even astronauts on Mars, showcasing its potential to revolutionize video production and creativity.

💡SpaceTime diffusion model

The space-time diffusion model underpins Lumiere's video generation capabilities. Unlike previous models that generate distant keyframes and then fill in the frames between them, its Space-Time U-Net considers the entire temporal duration of the video at once, ensuring consistency and continuity across frames. This concept is pivotal for understanding Lumiere's advance over existing AI video technologies.
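
As a rough illustration of the "whole clip at once" idea, the toy PyTorch block below downsamples a video tensor along time as well as height and width, the way a space-time U-Net stage would. It is not Lumiere's actual architecture; every name, layer, and shape here is invented for the example.

import torch
import torch.nn as nn

class SpaceTimeDownBlock(nn.Module):
    """Toy block: processes all frames as one 5-D tensor and halves T, H and W."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        # stride (2, 2, 2) downsamples the temporal axis along with the spatial axes
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size=3, stride=2, padding=1)
        self.act = nn.SiLU()

    def forward(self, video):              # video: (batch, channels, T, H, W)
        return self.act(self.conv(video))

clip = torch.randn(1, 8, 16, 64, 64)        # 16 frames of a 64x64 latent video
block = SpaceTimeDownBlock(8, 16)
print(block(clip).shape)                    # torch.Size([1, 16, 8, 32, 32])

Because the clip is handled as a single tensor rather than frame by frame, every layer can see the whole duration, which is the property the paper credits for better global temporal consistency.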

💡Temporal consistency

Temporal consistency refers to the uniformity and steadiness of visual elements throughout the duration of a video. In the context of Lumiere, it highlights the AI's ability to maintain consistent appearance and behavior of subjects across different frames, an essential factor for creating realistic videos. This contrasts with earlier models where objects could change unexpectedly from one frame to the next, detracting from realism.
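
A crude way to build intuition for temporal consistency is to measure how much pixels change between consecutive frames. Real evaluations use stronger tools (warping error, FVD, user studies); the small NumPy sketch below is only a toy proxy, and the data is a random placeholder.

import numpy as np

def frame_to_frame_change(video: np.ndarray) -> np.ndarray:
    """Mean absolute pixel change between consecutive frames.

    video: array of shape (T, H, W, C) with values in [0, 1].
    Returns an array of length T-1; large spikes suggest objects popping or
    morphing between frames, i.e. poor temporal consistency.
    """
    diffs = np.abs(video[1:] - video[:-1])      # (T-1, H, W, C)
    return diffs.mean(axis=(1, 2, 3))

fake_video = np.random.rand(16, 64, 64, 3)      # stand-in for decoded frames
print(frame_to_frame_change(fake_video).shape)  # (15,)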

💡Image to video

Image to video conversion is a feature of Lumiere that animates still images into dynamic video sequences. Examples from the script include animating a bear walking in New York or Bigfoot traversing the woods. This capability demonstrates how AI can breathe life into static images, creating engaging and visually captivating content from simple pictures.

💡Video inpainting

Video inpainting is a process where missing parts of a video frame are filled in or reconstructed using AI. Lumiere's application of video inpainting showcases its ability to intuitively guess and replicate what might occupy these voids, such as adding green leaves to a scene based on the context. This highlights the AI's advanced understanding and predictive capability regarding video content.
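
The sketch below shows the generic masked-conditioning trick behind many diffusion-based inpainting methods, not necessarily Lumiere's exact approach: at each denoising step the known pixels are re-injected with an appropriate amount of noise, so the model only has to synthesize the masked region. The denoiser here is a trivial stand-in.

import numpy as np

rng = np.random.default_rng(0)

def toy_denoise_step(x, t):
    # Stand-in for a real denoiser network; here it just damps the signal.
    return 0.9 * x

known_frame = rng.random((64, 64, 3))                  # content we want to keep
mask = np.zeros((64, 64, 3)); mask[:, 32:, :] = 1.0    # 1 = region to fill in

x = rng.standard_normal((64, 64, 3))                   # start from pure noise
num_steps = 50
for t in reversed(range(num_steps)):
    x = toy_denoise_step(x, t)
    noise_level = t / num_steps
    noised_known = known_frame + noise_level * rng.standard_normal(known_frame.shape)
    # keep generated content only where the mask is 1, known content elsewhere
    x = mask * x + (1.0 - mask) * noised_known

print(x.shape)   # (64, 64, 3): right half synthesized, left half preserved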

💡Stylized generation

Stylized generation refers to Lumiere's ability to apply specific styles or artistic effects to videos, based on a target image or painting. This allows for the creation of videos not just in realistic styles but also in artistic or abstract aesthetics, demonstrating the flexibility and creative potential of Lumiere for various applications in art and entertainment.

💡Cinemagraphs

Cinemagraphs are still photographs in which a minor and repeated movement occurs. Lumiere's mention of creating cinemagraphs, such as animating smoke from a train, illustrates its nuanced understanding of motion and ability to apply it selectively within a scene to create visually striking effects.

💡General World models

General World models, as discussed in the context of AI development, refer to sophisticated AI systems that build internal representations of environments and simulate future events within those environments. This concept is crucial for understanding the direction in which AI video generation technology, including Lumiere and other platforms like Runway ML, is heading, aiming for more realistic and dynamically coherent video content.
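
A bare-bones numeric sketch of the world-model idea Runway describes, encode an observation into a latent state, roll that state forward with a dynamics function, and decode imagined future observations, is shown below. All matrices and dimensions are made up for illustration; a real world model would learn these components from data.

import numpy as np

rng = np.random.default_rng(0)
obs_dim, latent_dim = 32, 8

W_enc = rng.standard_normal((latent_dim, obs_dim)) * 0.1    # "encoder"
W_dyn = rng.standard_normal((latent_dim, latent_dim)) * 0.1 # "dynamics"
W_dec = rng.standard_normal((obs_dim, latent_dim)) * 0.1    # "decoder"

def rollout(first_obs, horizon):
    z = W_enc @ first_obs                 # internal representation of the environment
    predicted = []
    for _ in range(horizon):
        z = np.tanh(W_dyn @ z)            # simulate one step of future events
        predicted.append(W_dec @ z)       # imagined future observation
    return np.stack(predicted)

future = rollout(rng.standard_normal(obs_dim), horizon=5)
print(future.shape)                       # (5, 32)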

💡Neural Networks

Neural networks, loosely inspired by the structure of biological brains, are a core component of AI technologies like Lumiere. The script discusses how neural networks are used to predict and generate video content, even appearing to infer 3D geometry from 2D images, which underscores the complexity and advanced capabilities of current AI video generation tools.
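
The "Beyond Surface Statistics" result is essentially a probing experiment: fit a purely linear readout from the model's internal activations to per-pixel depth and check how well it works. The toy NumPy sketch below mimics that setup with random stand-in data; the real experiment probes Stable Diffusion's intermediate activations against estimated depth maps.

import numpy as np

rng = np.random.default_rng(0)
num_pixels, feat_dim = 4096, 64

activations = rng.standard_normal((num_pixels, feat_dim))   # stand-in for model internals
depth = activations @ rng.standard_normal(feat_dim) + 0.01 * rng.standard_normal(num_pixels)

# Closed-form least-squares linear probe (no nonlinearity, by design)
w, *_ = np.linalg.lstsq(activations, depth, rcond=None)
pred = activations @ w
r2 = 1 - np.sum((depth - pred) ** 2) / np.sum((depth - depth.mean()) ** 2)
print(f"linear probe R^2: {r2:.3f}")   # high R^2 => depth is linearly decodable

If depth can be read out with nothing more than a linear map, the geometric information must already be present in the network's internal representation, which is the paper's argument that the model learns more than surface statistics.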

💡AI cinematography

AI cinematography emerges from the script as a concept where AI technologies, like Lumiere, are utilized to create films or videos. It represents a shift in content creation, where AI's role transitions from a tool to an active participant in the creative process, enabling users to produce high-quality, cinematic content without the need for extensive resources or traditional filmmaking techniques.

Highlights

Lumiere is a new AI tool from Google that generates realistic videos from text prompts.

Lumiere allows animating images and creating specific animation sections within images.

Lumiere produces smooth, temporally consistent videos compared to other models.

Lumiere performs image-to-video and text-to-video generation better than leading models.

Lumiere uses a spacetime diffusion model to generate the entire video at once.

Other models struggle with temporal consistency across frames.

Diffusion models seem to develop internal 3D representations despite only seeing 2D images.

Debate continues on whether AI models learn surface statistics or deeper understanding.

RunwayML is working on general world models to improve video generation.

World models simulate environments to create more realistic imagery and motion.

AI-generated video has improved rapidly, from incoherent to lifelike in 1-2 years.

AI tools may enable easy Hollywood-quality movie production at home soon.

AI could help creative people make movies without financial limitations.

Next steps are AI-assisted world building and story generation.

Now is an exciting time to explore AI-generated film as quality quickly improves.

Transcripts

play00:00

and just like that out of the blue

play00:02

Google drops its latest AI tool Lumiere

play00:05

Lumiere is at its core a text to video

play00:08

AI model you type in text and the AI

play00:11

neural Nets translate that into video

play00:15

but as you'll see Lumiere is a lot more

play00:17

than just text to

play00:19

video it allows you to animate existing

play00:22

images creating video and the style of

play00:25

that image or painting as well as things

play00:27

like video in painting and creating

play00:30

specific animation sections within

play00:32

images so let's look at what it can do

play00:35

the science behind it Google published a

play00:38

paper talking about what they improved

play00:40

and I'll also show you why the

play00:42

artificial AI brains that generate these

play00:46

videos are much more weird than you can

play00:50

imagine so this is Lumiere from Google

play00:52

research A Spacetime diffusion model for

play00:55

realistic video generation we'll cover

play00:57

SpaceTime diffusion model a bit later

play00:59

but right now this is what they're

play01:01

unveiling so first of all there's text

play01:03

to video these are the videos that are

play01:04

produced by various prompts like US flag

play01:07

waving on massive Sunrise clouds funny

play01:09

cute pug dog feeling good listening to

play01:11

music with big headphones and Swinging

play01:13

head Etc snowboarding Jack Russell

play01:16

Terrier so I got to say these are

play01:18

looking pretty good if these are good

play01:19

representations of the sort of style

play01:21

that we can get from this model this

play01:23

would be very interesting so for example

play01:25

take a look at this one astronaut on the

play01:27

planet Mars making a detour around his

play01:30

base this is looking very consistent

play01:33

this looks like a tablet this looks like

play01:35

a medicine tablet of some sort floating

play01:37

in space but I got to say everything is

play01:39

looking very consistent which is what

play01:42

they're promising in their research it

play01:43

looks like they found a way to create a

play01:45

more consistent shot across different

play01:47

frames temporal consistency as they call

play01:50

it here's image to video so as you can

play01:52

see that this is nightmarish but that's

play01:54

that's the scary looking one but other

play01:56

than that everything else is looking

play01:58

really good so they're taking images

play02:00

and turning them into animations little

play02:03

animations of a bear walking in New York

play02:05

for example Bigfoot walking through the

play02:08

woods so these were started with an

play02:10

image that then gets animated these are

play02:13

looking pretty good here are the Pillars

play02:15

of Creation animated right there that's

play02:17

uh pretty neat kind of a 3D structure

play02:20

they're showing stylized generation so

play02:22

using a Target image to kind of make

play02:24

something colorful or animated take a

play02:26

look at this elephant right here one

play02:28

thing that jumps out at me is it is very

play02:30

consistent there's no weirdness going on

play02:33

in a second we'll take a look at other

play02:34

leading AI models that generate video

play02:37

and I got to say this one is probably

play02:39

the smoothest looking one here's another

play02:41

one so as you can see here here's the

play02:43

style reference image so they want this

play02:45

style and then they say a bear twirling

play02:47

with delight for example right so then

play02:49

it creates a bear twirling with delight

play02:51

or a dolphin leaping out of the water in

play02:53

the style of this image here's the same

play02:55

or similar prompts with this as the

play02:58

style reference now with this as the style

play03:00

reference I got to say it captures the

play03:02

style pretty well here's kind of that

play03:04

neon phosphorus glowing thing and they

play03:07

introduce a Space-Time U-Net architecture

play03:09

and we'll look at that towards the end

play03:10

of the video but basically it sounds

play03:12

like it creates sort of the idea of the

play03:14

entire video at once so while other

play03:17

models it seems like kind of go frame by

play03:19

frame this one has sort of an idea of

play03:21

what the whole thing is going to look

play03:22

like at the very beginning and there's a

play03:24

video stylization so here's a lady

play03:26

running this is the source video and the

play03:28

various craziness that you can make her

play03:30

into the same thing with a dog and a car

play03:34

and a bear cinemagraphs is the ability

play03:36

to animate only certain portions of the

play03:38

image like the smoke coming out of this

play03:40

train this is something that Runway ml I

play03:42

believe recently released and looks like

play03:44

Google is hot on their heels creating

play03:46

basically the same ability then we have

play03:48

video inpainting so if a portion of an

play03:50

image is missing you're able to use AI

play03:52

to sort of guess at what that would look

play03:54

like I got to say so here where the hand

play03:55

comes in that is very interesting cuz

play03:57

that seems kind of advanced cuz notice

play03:59

in the beginning he throws the Green

play04:01

Leaf in the missing portion of the image

play04:03

and then you see him coming back to the

play04:06

image that we can see throwing a green

play04:07

leaf or two so it makes the assumption

play04:09

that hey the things there will also be

play04:12

green leaves interestingly enough though

play04:14

I do feel like I can spot a mistake here

play04:16

the leaves that are already on there are

play04:18

fresh looking as opposed to the cooked

play04:20

ones like they are on this side so it

play04:22

knows to put in the green leaves as the

play04:24

guy is throwing them for them to be

play04:25

fresh because it matches the fresh

play04:27

leaves here but it misses the point that

play04:28

hey these are cooked leaves and these

play04:30

are fresh but still it's very impressive

play04:33

that it's able to sort of to sort of

play04:34

guess at what's happening in that moment

play04:37

and this is where if you've been

play04:38

following some of the latest AI research

play04:40

this is where these neural Nets get a

play04:41

little bit weird well again come back to

play04:43

that at the end but how they are able to

play04:46

predict certain things like what happens

play04:47

here for example like no one codes it to

play04:50

know that this is probably a cake of

play04:52

some sort nobody tells it what this

play04:54

thing is it guesses from clues that it

play04:57

sees on screen but how does that is

play05:00

really really weird let's just say that

play05:02

this is pretty impressive so here we're

play05:04

able to change the clothes that the

play05:06

person is wearing throughout these shots

play05:07

while you know notice the hat and the

play05:09

face they kind of remain consistent

play05:10

across all the shots whereas the dress

play05:13

is changed based on a text prompt as you

play05:15

watch this think about where video

play05:18

production for movies and serial TV

play05:20

shows Etc where that's going to be in 5

play05:23

to 10 years will something like this

play05:24

allow everyday people sitting at home to

play05:27

create stunning Hollywood style movies

play05:29

with whatever characters they want

play05:31

whatever settings they want with AI-

play05:32

generated video and AI voices we can

play05:35

create a movie starring Hugh Hefner as a

play05:36

chicken for example so really fast this

play05:38

is another study called Beyond surface

play05:40

statistics out of Harvard so this has

play05:41

nothing to do with the Google project

play05:44

that we're looking at but this paper

play05:45

tries to answer the question of how do

play05:48

these models how do they create images

play05:50

how do they create videos as you can see

play05:52

here it says these models are capable of

play05:54

synthesizing high quality images but it

play05:56

remains a mystery how these networks

play05:58

transform let's say the phrase car in

play06:00

the street into a picture of a car in a

play06:02

street so in other words when we type in

play06:04

this when a human person says draw a

play06:06

picture of a car in a street or a video

play06:08

of a car in a street how does that thing

play06:10

do it how does it translate that into a

play06:12

picture do they simply memorize

play06:14

superficial correlations between pixel

play06:16

values and words or are they learning

play06:18

something deeper such as the underlying

play06:20

model of objects such as cars roads and

play06:23

how they are typically positioned and

play06:25

there's a bit of a argument going on in

play06:27

the scientific Community about this so

play06:29

some AI scientists say all it is is just

play06:32

sort of surface level statistics they're

play06:34

just memorizing where these little

play06:36

pixels go and they're able to kind of

play06:38

reproduce certain images Etc and some

play06:40

people say well no there's something

play06:42

deeper going on here something new and

play06:44

surprising that these AI models are

play06:46

doing so what they did is they created a

play06:48

model that was fed nothing but 2D images

play06:51

so images of cars and people and ships

play06:54

Etc but that model it wasn't taught

play06:57

anything about depth like depth of field

play07:00

like where the foreground of an image is

play07:01

or where the background of an image is

play07:03

it wasn't taught about what the focus of

play07:05

the image is what a car is ETC and what

play07:07

they found is so here's kind of like the

play07:09

decoded image so this is kind of how it

play07:11

makes it from step one to finally step

play07:14

15 where as you can see you can see this

play07:16

is a car so a human being would be able

play07:18

to point at this and say that's a car

play07:20

what in the image is closest to you the

play07:23

person taking the image you say well

play07:24

probably this wheel is the closest right

play07:26

this is the the kind of the foreground

play07:28

this is the main object and that's kind

play07:29

of the background that's far far away

play07:31

and this is close right but the reason

play07:33

that you are able to look at this image

play07:35

and know that is because you've seen

play07:37

these objects in the real world in the

play07:39

3D world you can probably imagine how

play07:41

this image would look if you're standing

play07:43

off the side here looking at it from

play07:45

this direction this AI model that made

play07:47

this has no idea about any of that all

play07:49

it's seeing is a bunch of these 2D

play07:51

images just pixels arranged in a screen

play07:53

and yet when we dive into try to

play07:55

understand how it's building these

play07:57

images from scratch this is what we

play07:59

start to notice so early on when it's

play08:01

building this image this is kind of what

play08:04

the the depth of the image looks like so

play08:07

very early on it knows that sort of this

play08:10

thing is in the foreground it's closer

play08:13

to us and this right here the blue

play08:15

that's the background it's far from us

play08:17

now looking at this image you can't

play08:19

possibly tell what this is going to be

play08:21

you can't tell what this is going to be

play08:22

till much much later maybe here we can

play08:25

kind of begin to start seeing some of

play08:27

the lines that are in here but that's

play08:28

about it you you see like the wheels and

play08:31

maybe you could guess of what that is

play08:33

but here in the beginning you have no

play08:34

idea and yet the model knows that

play08:35

something right here is in the

play08:36

foreground something's in the background

play08:38

and towards the end it knows that this

play08:40

is closer this is close and this is far

play08:43

this is Salient object meaning like what

play08:44

is the focus what is the main object so

play08:46

it knows that the main object is here it

play08:48

doesn't know what a car is it doesn't

play08:49

know what an object is it just knows

play08:51

like this is the the focus of the image

play08:53

again only towards much later do we

play08:55

realize that yes in fact this is the car

play08:57

and so this is the conclusion of the

play08:58

paper our experiments provide evidence

play09:00

that stable diffusion model so this is

play09:02

an image generating model AI although

play09:05

solely trained on two-dimensional images

play09:07

contain an internal linear

play09:09

representation related to scene geometry

play09:11

so in other words after seeing thousands

play09:14

or millions of 2D images inside its

play09:17

neural network inside of its brain it

play09:19

seems like and again this is a lot of

play09:22

people sort of dispute this but some of

play09:24

this research makes it seem like it's

play09:26

developing its neural net that allows it

play09:29

to create a 3D representation of that

play09:32

image even though it's never been taught

play09:34

what 3D means it uncovers a salient

play09:37

object or sort of that main Center

play09:39

object that it needs to focus on versus

play09:41

the background of the image as well as

play09:43

information related to relative depth

play09:45

and these representations emerge early

play09:47

so before it starts painting the colors

play09:50

or the little shapes or the the wheels

play09:52

and the Shadows it first starts thinking

play09:54

about the 3D space on which it's going

play09:56

to start painting that image and here

play09:58

they say these results add a Nuance to

play10:00

the ongoing debates and there are a lot

play10:02

of ongoing debates about this about

play10:04

whether generative models so these AI

play10:06

models can learn more than just surface

play10:08

statistics in other words is there some

play10:11

sort of understanding that's going on

play10:13

maybe not like human understanding but

play10:15

is it just statistics or is there

play10:18

something deeper that's happening and

play10:21

this is Runway ml so this is the other

play10:23

one of the leading sort of text-to-video

play10:26

AI models and you might have seen the

play10:29

images so as you can see here this is

play10:31

what they're offering people have made

play10:33

full movies maybe not hour long but

play10:35

maybe 10 minutes 20 minute movies that

play10:38

are entirely generated by AI so as you

play10:41

can see here it's it's similar to what

play10:44

Google is offering although I got to say

play10:47

after looking at Google's work and then

play10:49

this one Google's does seem just a

play10:51

little bit more consistent I would say

play10:53

there seems to be a little bit less

play10:54

shifting and and shapes going on it's

play10:56

just a little bit more consistent across

play10:58

time and they have a lot of the

play11:00

same thing like this stylization here

play11:01

from a reference video to this image

play11:04

that's like the style reference but the

play11:06

interesting thing here is this is in the

play11:08

last few months looks like December 2023

play11:11

Runway ML introduced something they

play11:13

call General World models and they're

play11:15

saying we believe the next major

play11:17

advancement in AI will come from systems

play11:19

that understand the visual world and its

play11:21

Dynamics they're starting a long-term

play11:23

research effort around what they call

play11:24

General World models so their whole idea

play11:27

is that instead of the video AI models

play11:30

creating little Clips here and there

play11:32

with little isolated subjects and

play11:34

movements that a better approach would

play11:36

be to actually use the neural networks

play11:39

and them building some sort of a world

play11:41

model to understand the images they're

play11:43

making and to actually utilize that to

play11:46

have it almost create like a little

play11:47

world so for example if you're creating

play11:49

a clip with multiple characters talking

play11:52

then the AI model would actually almost

play11:54

simulate that entire world with the with

play11:56

the rooms and the people and then the

play11:58

people would talk talk to each other and

play12:00

it would just take that clip but it

play12:01

would basically create much more than

play12:03

just a clip like if a bird is flying

play12:05

across the sky it would be simulating

play12:07

the wind and the physics and all that

play12:10

stuff to try to capture the movement of

play12:12

that bird to create realistic images and

play12:14

video so they're saying a world model is

play12:16

an AI system that builds an internal

play12:18

representation of an environment and it

play12:20

uses it to simulate future events within

play12:22

that environment so for example for Gen

play12:24

2 which is their model their video model

play12:27

to generate realistic short video it has

play12:29

developed some understanding of physics

play12:32

and motion but it's still very limited

play12:34

struggling with complex camera controls or

play12:36

object motions amongst other things but

play12:39

they believe and a lot of other

play12:40

researchers as well that this is sort of

play12:42

the next step for us to get better at

play12:45

creating video at teaching robots how to

play12:47

behave in the physical world like for

play12:49

example NVIDIA's foundation agent

play12:51

then we need to create bigger models

play12:53

that simulate entire worlds and then

play12:55

from those worlds they pull out what we

play12:57

need whether that's an image or text or

play12:59

a robot's ability to open doors and pick

play13:02

up objects all right but now back to

play13:04

Lumiere A Spacetime diffusion model for

play13:06

video generation so here they have a

play13:08

number of examples for the text to video

play13:11

of image to video stylized generation

play13:14

Etc and so in lumier they're trying to

play13:16

build this text video diffusion model

play13:19

that can create videos that portray

play13:20

realistic diverse and coherent motion a

play13:23

pivotal challenge in video synthesis and

play13:25

so the new thing that they introduce is

play13:26

the Space-Time U-Net architecture that

play13:29

generates entire temporal duration of

play13:31

the video at once so in other words it

play13:33

sort of thinks through how the entire

play13:36

video going to look like in the

play13:37

beginning as opposed to existing video

play13:39

models so other video models which

play13:41

synthesize distant key frames followed

play13:43

by temporal super-resolution basically

play13:45

meaning they do it one at a time so they

play13:47

start with one and then create the

play13:49

others and they're saying that makes

play13:50

Global temporal consistency difficult

play13:52

meaning that the object as as you watch

play13:54

a video of it right it looks a certain

play13:55

way on the first second of the video but

play13:58

by second five is just completely

play14:00

different and so here basically they're

play14:01

comparing these two videos so Imagen and

play14:03

ours so the Lumiere model as you can see

play14:06

here here sample a few clips and they're

play14:08

looking at the XT slice so the XT slice

play14:12

you can basically think of that as so

play14:14

for example in stocks you have you know

play14:16

the price of stock over time right so it

play14:18

kind of goes like this here the x is the

play14:22

spatial Dimension so where certain

play14:24

things are in space on the image versus

play14:26

T temporal the time so the X here is

play14:29

basically where we might be looking at

play14:30

the width of the image for example of

play14:33

any image in time and T the temporal is

play14:36

like how consistent is across time so as

play14:37

you can see hit this green line so we're

play14:38

just looking at this thing across the

play14:40

entire image and this is what that looks

play14:42

like so as you can see here this is

play14:44

going pretty well and then it kind of

play14:46

messes up and it kind of gets crazy here

play14:48

and then kind of goes back to doing okay

play14:50

whereas in Lumiere it's pretty pretty

play14:54

good I mean maybe some funkiness right

play14:56

there in one one frame but it's pretty

play14:58

good same thing here I mean this is as

play15:00

you can see here pretty good maybe you

play15:03

can say that there's a little bit of

play15:04

funkiness here but overall it's very

play15:06

good whereas in this image and video I

play15:09

mean as you can see here there's kind of

play15:11

like a lot of nonsense that's happening

play15:13

right and so here you can see like you

play15:15

can't tell how many legs it has if it's

play15:17

missing a leg Etc whereas in The Lumiere

play15:20

I mean I feel like the you know you can

play15:22

see each of the legs pretty distinctly

play15:25

and their position and it's remains

play15:27

consistent across time or at least

play15:29

consistently easy to see where they are

play15:31

but I got to say I can't wait to get my

play15:33

hands on it it looks like as of right

play15:35

now I don't see a way to access it this

play15:37

is just sort of a preview but hopefully

play15:40

they will open up for testing soon and

play15:42

we'll be able to get our hands on it and

play15:43

check it out and here interestingly

play15:45

enough they actually compare how well

play15:47

their performs against the other

play15:49

state-of-the-art models in the in the

play15:51

industry so the two that I'm familiar

play15:53

with is Pika and Gen-2 those are the

play15:56

two that I've used and they're saying

play15:57

that their video is preferred by

play15:59

users in both text to video and image to

play16:02

video generation so blue is theirs and

play16:05

the Baseline is the orange one so it

play16:07

seems like there are pretty big

play16:09

differences in every single one this

play16:11

seems like video quality I mean it beats

play16:13

out every single other one of these

play16:15

which which I believe this text

play16:17

alignment which here means probably how

play16:19

well the image how true it is to The

play16:21

Prompt right so if you type in a prompt

play16:24

how accurately it represents it so it

play16:26

looks like maybe Imagen is the closest

play16:28

one but it beats out most of the other

play16:30

ones by quite a bit and then video

play16:32

quality of image to video it seems like

play16:34

it beats them out as well with Gen 2

play16:37

probably being the next best one and

play16:39

here they provide a side-by-side

play16:41

comparison so for example the first

play16:42

prompt is a sheep to the right of a wine

play16:45

glass so this is Pika which is not

play16:48

great CU there's no wine glass here's

play16:50

Gen 2 consistently putting it on the

play16:53

left AnimateDiff which just has two

play16:55

glasses and maybe a reflection of a

play16:57

sheep Imagen Video same thing so the

play16:59

glasses on the left ZeroScope no

play17:02

glasses that I can see although they

play17:03

have sheep and of course ours so the Lumiere

play17:06

the Google one is it seems like a nail

play17:08

it in every single one the glass is on

play17:10

the right although I got to say Gen 2 is

play17:12

is great although it confused the left

play17:14

and right but other than that I mean

play17:16

same if image and video actually

play17:18

although I feel like Gen 2 the quality

play17:20

is much better of the sheep cuz that's

play17:22

you know that's a good-looking sheep I

play17:24

should probably rephrase that that's a

play17:27

well rendered sheep how about that

play17:29

versus Imagen I mean that's a weird

play17:32

looking thing there that could almost be

play17:33

a horse or a cow if you just look at the

play17:35

face and Google is again excellent

play17:38

here's teddy bear skating in Time Square

play17:41

this is Google this is Imagen again

play17:43

weirdness happening there and that's gen

play17:45

two again pretty good but I mean the the

play17:47

thing is facing away although here I

play17:49

just noticed so they they took skating

play17:51

to mean ice skates whereas here it looks

play17:53

like these are roller skates skateboard

play17:55

Etc and so it looks like in the study

play17:57

they just showed you two to things they

play17:59

say do you like the left or the right

play18:00

more based on motion and better quality

play18:03

well I got to say if you're an aspiring

play18:05

AI cinematographer then this is really

play18:08

good news consistent coherent images

play18:11

that are able to create near lifelike

play18:15

scenes at this point I mean I'm sure

play18:17

there's other people that'll complain

play18:18

about stuff but you got to realize how

play18:21

quickly the stuff is progressing just to

play18:23

give you an idea this is about a year

play18:26

ago or so this is what AI-generated

play18:29

video looked like so can you tell that

play18:33

it's improved just a little bit that's

play18:36

about a year I'm not sure exactly when

play18:37

this was done but I'm going to say a

play18:39

year year and a half ago and I mean this

play18:41

thing gets nightmarish so when I'm

play18:44

talking about weird blocky shapes things

play18:47

not being consistent across scenes like

play18:51

what are we even looking at

play18:53

here is this a mouth is this a building

play18:56

and here's kind of uh something from

play18:58

about 4 months ago from Pika Labs so as

play19:01

you can see here it's much better it's

play19:04

much more consistent right as you can

play19:06

see here humans again maybe they look a

play19:09

little bit weird but it's better it can

play19:11

put you in the moment if you're telling

play19:13

a story that's not necessarily about

play19:15

everything looking realistic something

play19:18

like this can be created pretty easily

play19:20

and since it's new it's novel people

play19:23

might be this might be a whole new

play19:25

movement a new genre of film making

play19:28

that's new exciting and never before

play19:30

seen and most importantly it's easy to

play19:33

create with a you know at home with a

play19:36

few AI tools and anybody out there with

play19:39

creative abilities with creative talent

play19:41

to tell the stories that they have in

play19:43

their mind without being limited

play19:46

financially by Capital they're going to

play19:49

be able to create AI voices they're

play19:51

going to be able to create AI footage

play19:54

maybe even have ChatGPT help them with

play19:56

some of the story writing and once more

play19:58

the sort of the next generation of

play20:00

things that we're seeing that people are

play20:01

working on is things like the simulation

play20:04

where you create the characters and then

play20:06

you sort of let them loose in a world

play20:08

they get simulated with these they get

play20:10

sort of simulated so the stories kind of

play20:12

play out in the world and then you sort

play20:14

of pick and choose what to focus on

play20:17

which scenes and which characters you

play20:19

want to bring to the front so you

play20:21

basically act as the World Builder you

play20:24

build the worlds the characters the

play20:25

narratives and AI assists you in

play20:28

creating the visuals the voices Etc and

play20:31

you can be 100% in control of it or you

play20:34

can only control the things that you

play20:36

want and the AI generates the rest so to

play20:39

me this if you're interested in movie

play20:41

making and you like these sort of styles

play20:44

that by the way quickly will become much

play20:46

more realistic I would be really looking

play20:49

at this right now because right now is

play20:51

the time that it's sort of emerging into

play20:54

the world and getting really good and

play20:57

it's going to get better by next year

play20:59

it's going to be a lot

play21:01

better well my name is Wes Roth and uh

play21:04

thank you for watching