Google's LUMIERE AI Video Generation Has Everyone Stunned | Better than RunWay ML?

AI Unleashed - The Coming Artificial Intelligence Revolution and Race to AGI
24 Jan 2024 · 21:06

Summary

TLDR Google's newest AI tool, Lumiere, is at its core a text-to-video model. Beyond generating video from text, it can animate existing images and create video in a particular style, with capabilities such as video inpainting and animating selected regions within an image. Google's research paper explains the science behind it: a space-time diffusion model that generates realistic, diverse, and coherent video. Lumiere demonstrates text-to-video, image-to-video, and stylized generation, and compared with existing video models it is preferred by users and rated higher on video quality. This progress suggests that video production may become far easier and more accessible, opening new possibilities for filmmaking and storytelling.

Takeaways

  • 🚀 Google has released its latest AI tool, Lumiere, a text-to-video model that turns text prompts into video.
  • 🎨 Beyond text-to-video, Lumiere can animate existing images and create video in the style of a reference image or painting.
  • 📈 Google's research paper describes improvements to a space-time diffusion model capable of generating realistic video.
  • 🤖 The AI-generated videos stay highly consistent in style and motion, something earlier models struggled with.
  • 🌌 Lumiere can turn still images into animations, for example a picture of a bear becoming a bear walking through New York.
  • 🎭 Lumiere uses a target style image to generate stylized video, such as a bear twirling with delight in that style.
  • 📹 Lumiere introduces a Space-Time U-Net architecture that forms a notion of the entire video from the start instead of generating it frame by frame.
  • 🎨 Lumiere also includes video stylization, changing the style of a source video or animating only selected parts of it.
  • 🧩 Lumiere supports video inpainting: when part of an image is missing, the AI infers and fills in the missing region.
  • 📈 In comparisons with other leading AI models, users preferred Lumiere for both text-to-video and image-to-video generation.
  • ⏱️ Lumiere achieves better global temporal consistency, staying coherent across the full duration of a video compared with models that generate frame by frame.

Q & A

  • What is the core capability of Google's newly released AI tool Lumiere?

    -Lumiere is at its core a text-to-video AI model: the user types in text and the neural network translates it into video. It can also animate existing images, create video in a specific style, and animate selected regions within an image.

  • How does Lumiere achieve consistency in its videos?

    -Lumiere achieves consistency through the space-time diffusion model described in the research, which produces shots that stay more consistent across frames, known as temporal consistency.

  • How does Lumiere's image-to-video feature work?

    -Image-to-video turns a still image into an animation, for example animating a picture of a bear so that it walks through New York, or animating a picture of Bigfoot walking through the woods.

  • How is Lumiere's stylized generation achieved?

    -Lumiere uses a target image to create a colorful or animated result, for example generating an animation from an elephant reference image while keeping its style consistent.

  • What is the Space-Time U-Net architecture, and what role does it play in Lumiere?

    -The Space-Time U-Net is the architecture in Lumiere that forms a notion of the entire video at once instead of generating it frame by frame as other models do, which helps keep the whole video consistent.

  • What is Lumiere's video stylization feature?

    -Video stylization lets users convert a source video into different styles, for example restyling a video of a woman running, or of a dog, a car, or a bear.

  • What are cinemagraphs, and how does Lumiere support them?

    -Cinemagraphs are images in which only a specific region is animated. Lumiere can animate just that region, for example making only the smoke coming out of a train move.

  • How does Lumiere's video inpainting work?

    -Video inpainting lets the AI infer what belongs in a missing part of the frame; for example, where a hand enters a masked-out portion of the image, Lumiere can guess and fill in that region.

  • What advantages does Lumiere have over other AI video models?

    -In user studies, Lumiere was preferred over other state-of-the-art models such as Pika and Gen-2 for both text-to-video and image-to-video generation, performing better on video quality and on alignment with the text prompt.

  • Can these AI models learn more than surface statistics?

    -According to a separate study discussed in the video, models trained only on 2D images appear to develop an internal linear representation related to scene geometry, suggesting they may be learning something deeper than surface statistics.

  • What are the General World Models proposed by Runway ML?

    -General World Models is a concept from Runway ML, based on the view that the next major advancement in AI will come from systems that understand the visual world and its dynamics, building world models of the images they generate and using them to create more realistic video.

  • What role does Lumiere's space-time diffusion model play in video generation?

    -The space-time diffusion model is designed to create videos that portray realistic, diverse, and coherent motion. Through the Space-Time U-Net architecture it generates the entire temporal duration of the video at once, addressing the global temporal consistency problem that existing video models face.

  • What is innovative about Lumiere's approach to video generation?

    -Its innovation is generating the full temporal duration of a video at once rather than frame by frame, which helps maintain global temporal consistency and avoids objects changing inconsistently over the course of a clip.

  • How does Lumiere's model perform at video generation?

    -According to Google's research, Lumiere outperforms other state-of-the-art models, producing more coherent and consistent video content.

  • What does Lumiere's release mean for video production?

    -It points to a major shift in video production: ordinary users may be able to create Hollywood-style films at home, dramatically lowering the barrier to entry and encouraging personalized, creative content.

Outlines

00:00

😲 Lumiere: Google Releases a Revolutionary Text-to-Video AI Tool

Google's latest AI tool, Lumiere, is a text-to-video model that turns text into video. It can not only create videos but also animate existing images, imitate the style of an image or painting, and even animate specific regions within an image. Google's research shows notable progress in video generation, especially in keeping frames temporally consistent. Lumiere also supports image-to-video conversion and stylized generation driven by a target image. The video shows a range of clips generated by Lumiere, including dynamic scenes produced from text prompts and still images turned into animations, demonstrating its strength in style consistency and animation.
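The paper's visual check for the temporal consistency mentioned above is an x-t slice: fix one row of pixels and stack it across frames, so flicker and shape drift show up as jagged streaks. Below is a minimal sketch of that diagnostic in Python, assuming a clip already loaded as a NumPy array of shape (frames, height, width, 3); the random array and the chosen row are placeholders, not anything taken from Lumiere itself.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder clip: in practice this would be a generated video loaded as
# (T, H, W, 3) uint8 frames; a random array keeps the sketch self-contained.
video = np.random.randint(0, 256, size=(80, 128, 256, 3), dtype=np.uint8)

row = video.shape[1] // 2          # fix one horizontal line of pixels (the "x" axis)
xt_slice = video[:, row, :, :]     # shape (T, W, 3): that line stacked over time (the "t" axis)

plt.imshow(xt_slice)
plt.xlabel("x (pixel column)")
plt.ylabel("t (frame index)")
plt.title("x-t slice: smooth streaks indicate temporal consistency")
plt.show()
```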

05:00

🤖 How AI Creates Video: Deep Learning or Surface Statistics?

This section explores how AI models turn text prompts into images and video, and whether they merely rely on surface statistics or learn something deeper. Some AI scientists argue the models simply memorize superficial correlations between pixel values and words, while others argue deeper learning is taking place. To probe the question, researchers built a model trained only on 2D images and found that, even though it was never taught anything about depth or scene geometry, it appears to build an internal linear representation related to scene geometry. The work suggests that models trained only on two-dimensional images may construct a 3D-like representation inside the neural network, adding nuance to the ongoing debate over whether generative models can learn more than surface statistics.
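The probing result described above rests on a simple idea: freeze the image generator, collect its intermediate activations while it denoises an image, and fit a linear model that tries to predict per-pixel depth from those activations; if a purely linear readout succeeds, the geometry information is plausibly encoded inside the network. Here is a minimal sketch of that idea, using random arrays in place of real activations and depth labels, so the shapes and data are illustrative assumptions rather than the study's actual setup.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Stand-ins for real data: per-pixel feature vectors taken from an intermediate
# layer of a frozen diffusion model, plus relative-depth labels for those pixels.
n_pixels, feat_dim = 5000, 256
activations = np.random.randn(n_pixels, feat_dim)    # (pixels, channels)
depth = np.random.rand(n_pixels)                     # (pixels,) relative depth in [0, 1]

X_train, X_test, y_train, y_test = train_test_split(activations, depth, test_size=0.2)

probe = Ridge(alpha=1.0)       # linear probe: no nonlinearity, so any predictive
probe.fit(X_train, y_train)    # skill must already be linearly encoded inside

print("probe R^2 on held-out pixels:", probe.score(X_test, y_test))
```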

10:01

🎬 The Future of Video Production: AI in Film and TV

As AI technology advances, video production may change dramatically. Tools such as Google's Lumiere and Runway ML can generate high-quality video from text prompts, which could let ordinary people create Hollywood-style films at home. These tools can generate video, stylize it, and even perform video inpainting. Runway ML has also proposed General World Models, a long-term research effort to understand the visual world and its dynamics by building world models, with the goal of advancing both AI video generation and teaching robots how to behave in the physical world. Such a model would simulate an entire environment and then extract whatever is needed from it, whether an image, text, or a robot behavior.

15:03

📊 Comparing Lumiere with Existing AI Video Generation Models

Lumiere's video generation is compared against existing models. It uses a new Space-Time U-Net architecture that generates the entire temporal duration of the video at once rather than synthesizing it frame by frame, which helps preserve global temporal consistency so objects stay coherent across frames. In user preference tests, Lumiere beat other models such as Pika and Gen-2 on both text-to-video and image-to-video generation, and its output was rated highly on quality, showing clear advantages in both text alignment and video quality.
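To make the architectural contrast concrete, the toy NumPy sketch below (not Lumiere's actual pipeline) compares the two strategies: the cascaded approach generates a few distant keyframes and then fills in the frames between them with temporal super-resolution, crudely approximated here by linear interpolation, while the Lumiere-style approach treats the whole clip as a single space-time volume handled in one pass. All shapes, the keyframe indices, and the interpolation step are illustrative assumptions.

```python
import numpy as np

T, H, W = 16, 32, 32                                  # toy clip: 16 frames of 32x32 pixels

# Cascaded style: distant keyframes first, then temporal "super-resolution".
keyframe_idx = np.array([0, 5, 10, 15])
keyframes = np.random.rand(len(keyframe_idx), H, W)   # stand-in for generated keyframes

cascaded = np.empty((T, H, W))
for t in range(T):
    hi = min(np.searchsorted(keyframe_idx, t), len(keyframe_idx) - 1)
    lo = max(hi - 1, 0)
    if keyframe_idx[hi] == keyframe_idx[lo]:
        cascaded[t] = keyframes[lo]                   # exactly on a keyframe
    else:
        w = (t - keyframe_idx[lo]) / (keyframe_idx[hi] - keyframe_idx[lo])
        cascaded[t] = (1 - w) * keyframes[lo] + w * keyframes[hi]   # blend neighbours

# Lumiere-style: the whole (T, H, W) volume is one tensor that a single
# space-time model would process jointly, so there is no per-frame stitching.
joint_volume = np.random.rand(T, H, W)                # stand-in for one joint pass

print(cascaded.shape, joint_volume.shape)             # both (16, 32, 32)
```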

20:05

🌟 Emerging Trends and Outlook for AI Video Generation

AI video generation is advancing quickly and opening new possibilities for filmmaking and storytelling. As the technology matures, AI is likely to play a bigger role in video production. One emerging trend is using AI to simulate whole worlds and characters and then picking specific scenes and characters out of that simulation to tell a story: the filmmaker builds the world and controls the elements they care about while AI helps generate the visuals and sound. With continued progress, AI-generated video should become noticeably more realistic and easier to produce within the next year, so for anyone interested in filmmaking, now is the time to start following and experimenting with these emerging tools.

Keywords

💡Lumiere

Lumiere is Google's newly released AI tool; at its core it turns text into video. The video notes that Lumiere goes beyond text-to-video, adding features such as animating existing images, video inpainting, and animating specific regions within an image. For example, a user can type a prompt like 'US flag waving on massive sunrise clouds' and Lumiere generates the corresponding video.

💡Spacetime diffusion model

The space-time diffusion model is one of the scientific ideas behind Lumiere. It is a model for generating realistic video that improves quality by modeling coherence across both space and time. Unlike traditional models that generate video frame by frame, Lumiere's Space-Time U-Net architecture generates the entire temporal duration of the video at once, which preserves global temporal consistency.
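As a rough sketch of what "diffusion" means here, the loop below runs a standard DDPM-style reverse process over an entire (frames, height, width) volume at once, with a placeholder in place of the Space-Time U-Net. The schedule, shapes, and the zero-returning noise predictor are toy assumptions; the point is only that each step sees the whole clip, not one frame at a time.

```python
import numpy as np

T_frames, H, W = 16, 32, 32
steps = 50
betas = np.linspace(1e-4, 0.02, steps)               # toy noise schedule
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def predict_noise(x, t):
    # Placeholder for the Space-Time U-Net: a real model would look at the
    # whole (T, H, W) volume plus the timestep; here it just returns zeros.
    return np.zeros_like(x)

x = np.random.randn(T_frames, H, W)                  # start from pure noise
for t in reversed(range(steps)):
    eps = predict_noise(x, t)
    mean = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
    noise = np.random.randn(*x.shape) if t > 0 else 0.0
    x = mean + np.sqrt(betas[t]) * noise              # one DDPM-style reverse step

print(x.shape)   # (16, 32, 32): one jointly denoised clip
```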

💡Text to video

Text-to-video is a key Lumiere capability that lets users generate video from a text prompt. The video shows several examples generated from prompts, such as 'funny cute pug dog listening to music with big headphones and swinging its head', demonstrating how Lumiere turns a text description into matching video.

💡Image to video

Image-to-video is another Lumiere capability that turns a still image into an animated video. The video mentions examples such as 'a bear walking in New York' and 'Bigfoot walking through the woods', each starting from a still image that is then animated by the AI.

💡Stylization

Stylization means Lumiere can generate video or animation in the style of a target image. The video shows how a style reference image is used to create animations in that style, such as 'a bear twirling with delight' or 'a dolphin leaping out of the water', demonstrating how Lumiere captures and applies the visual elements of the reference image.

💡Video stylization

Video stylization is a Lumiere feature that converts a source video into a specific style. The video shows source clips of a running woman, a dog, a car, and a bear being transformed into different styles, illustrating Lumiere's ability to restyle video.

💡Cinemagraphs

A cinemagraph is an animation technique in which only certain parts of a still image move. The video notes that Runway ML recently released a similar feature and that Google is developing the same capability, for example animating only the smoke coming out of a train in an image.
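As a rough illustration of the compositing idea behind a cinemagraph (not how Lumiere implements it internally), a binary mask can keep most pixels frozen at a reference frame while only the masked region plays back the animated frames. The clip, mask, and "smoke" region below are made-up placeholders.

```python
import numpy as np

# Stand-ins: an animated clip of shape (T, H, W, 3) and a mask marking the
# region that should keep moving (e.g., the smoke above a train).
clip = np.random.randint(0, 256, size=(48, 120, 160, 3), dtype=np.uint8)
mask = np.zeros((120, 160, 1), dtype=bool)
mask[10:40, 60:100] = True                     # hypothetical "smoke" region

still = clip[0]                                # freeze the first frame for the static parts
cinemagraph = np.where(mask, clip, still)      # broadcast: animate only inside the mask

print(cinemagraph.shape)                       # (48, 120, 160, 3): mostly frozen, smoke moves
```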

💡Video inpainting

Video inpainting is a Lumiere feature that uses AI to fill in missing parts of an image with plausible content. The video gives an example where a person's hand enters a masked-out portion of the frame and the AI guesses and fills in the missing region, including the motion of tossing in green leaves.

💡General World models

General World Models is a concept proposed by Runway ML: an AI system builds an internal representation of an environment and uses it to simulate future events within that environment. The video notes Runway ML's view that the next major advancement in AI will come from systems that understand the visual world and its dynamics, enabling more coherent and realistic video.

💡AI cinematography

AI cinematography refers to using AI to assist with or automate the filmmaking process. The video suggests that with tools like Lumiere, aspiring filmmakers can create lifelike scenes without being limited by budget, which could give rise to an entirely new style and genre of filmmaking.

Highlights

Google has released its latest AI tool, Lumiere, a text-to-video model that turns text into video.

Beyond text-to-video, Lumiere can animate existing images and create video in a specific style.

Google's research paper describes Lumiere's improvements and the science behind AI video generation.

Lumiere achieves temporal consistency across frames, improving video coherence.

Lumiere supports image-to-video, for example turning a still picture of a bear into an animation of it walking through New York.

Lumiere performs stylized generation, using a target image to produce colorful or animated results.

Lumiere introduces a Space-Time U-Net architecture that forms the entire video at once rather than frame by frame.

Lumiere supports video stylization, converting a source video into different styles.

Lumiere offers cinemagraphs, animating only a specific portion of an image.

Lumiere can perform video inpainting, using AI to guess what a missing part of the image should look like.

Lumiere shows markedly better consistency and coherence than AI-generated video from a year ago.

In user preference tests, Lumiere beat other state-of-the-art models at text-to-video and image-to-video generation.

Lumiere demonstrates AI's potential in video generation and could change the future of film and TV production.

Runway ML, one of the leading AI video generation companies, recently introduced the concept of General World Models.

General World Models aim to drive the next major advance in AI by building world models that understand the visual world and its dynamics.

Lumiere's space-time diffusion model is designed to create videos with realistic, diverse, and coherent motion.

Lumiere's Space-Time U-Net architecture generates the entire temporal duration of the video at once, improving global temporal consistency.

Lumiere outperforms other models on text alignment and video quality, delivering more accurate, higher-quality video generation.

Lumiere's progress points to a future of AI-driven video production and world simulation, opening up possibilities for creative talent.

Transcripts

play00:00

and just like that out of the blue

play00:02

Google drops its latest AI tool Lumiere

play00:05

Lumiere is at its core a text to video

play00:08

AI model you type in text and the AI

play00:11

neural Nets translate that into video

play00:15

but as you'll see Lumiere is a lot more

play00:17

than just text to

play00:19

video it allows you to animate existing

play00:22

images creating video and the style of

play00:25

that image or painting as well as things

play00:27

like video inpainting and creating

play00:30

specific animation sections within

play00:32

images so let's look at what it can do

play00:35

the science behind it Google published a

play00:38

paper talking about what they improved

play00:40

and I'll also show you why the

play00:42

artificial AI brains that generate these

play00:46

videos are much more weird than you can

play00:50

imagine so this is Lumiere from Google

play00:52

research A Spacetime diffusion model for

play00:55

realistic video generation we'll cover

play00:57

SpaceTime diffusion model a bit later

play00:59

but right now now this is what they're

play01:01

unveiling so first of all there's text

play01:03

to video this is the video that are

play01:04

produced by various prompts like US flag

play01:07

waving on massive Sunrise clouds funny

play01:09

cute pug dog feeling good listening to

play01:11

music with big headphones and Swinging

play01:13

head Etc snowboarding Jack Russell

play01:16

Terrier so I got to say these are

play01:18

looking pretty good if these are good

play01:19

representations of the sort of style

play01:21

that we can get from this model this

play01:23

would be very interesting so for example

play01:25

take a look at this one astronaut on the

play01:27

planet Mars making a detour around his

play01:30

base this is looking very consistent

play01:33

this looks like a tablet this looks like

play01:35

a medicine tablet of some sort floating

play01:37

in space but I got to say everything is

play01:39

looking very consistent which is what

play01:42

they're promising in their research it

play01:43

looks like they found a way to create a

play01:45

more consistent shot across different

play01:47

frames temporal consistency as they call

play01:50

it here's image to video so as you can

play01:52

see that this is nightmarish but that's

play01:54

that's the scary looking one but other

play01:56

than that everything else is looking

play01:58

really good so they're taking images

play02:00

and turning them into animations little

play02:03

animations of a bear walking in New York

play02:05

for example Bigfoot walking through the

play02:08

woods so these were started with an

play02:10

image that then gets animated these are

play02:13

looking pretty good here are the Pillars

play02:15

of Creation animated right there that's

play02:17

uh pretty neat kind of a 3D structure

play02:20

they're showing stylized generation so

play02:22

using a Target image to kind of make

play02:24

something colorful or animated take a

play02:26

look at this elephant right here one

play02:28

thing that jumps out at me is it is very

play02:30

consistent there's no weirdness going on

play02:33

in a second we'll take a look at other

play02:34

leading AI models that generate video

play02:37

and I got to say this one is probably

play02:39

the smoothest looking one here's another

play02:41

one so as you can see here here's the

play02:43

style reference image so they want this

play02:45

style and then they say a bear twirling

play02:47

with delight for example right so then

play02:49

it creates a bear twirling with delight

play02:51

or a dolphin leaping out of the water in

play02:53

the style of this image here's the same

play02:55

or similar prompts with this as the

play02:58

style reference now this as a the style

play03:00

reference I got to say it captures the

play03:02

style pretty well here's kind of that

play03:04

neon phosphorus glowing thing and they

play03:07

introduce a Space-Time U-Net architecture

play03:09

and we'll look at that towards the end

play03:10

of the video but basically it sounds

play03:12

like it creates sort of the idea of the

play03:14

entire video at once so while other

play03:17

models it seems like kind of go frame by

play03:19

frame this one has sort of an idea of

play03:21

what the whole thing is going to look

play03:22

like at the very beginning and there's a

play03:24

video stylization so here's a lady

play03:26

running this is the source video and the

play03:28

various craziness that you can make her

play03:30

into the same thing with a dog and a car

play03:34

and a bear cinemagraphs is the ability

play03:36

to animate only certain portions of the

play03:38

image like the smoke coming out of this

play03:40

train this is something that Runway ml I

play03:42

believe recently released and looks like

play03:44

Google is hot on their heels creating

play03:46

basically the same ability then we have

play03:48

video inpainting So if a portion of an

play03:50

image is missing you're able to use AI

play03:52

to sort of guess at what that would look

play03:54

like I got to say so here where the hand

play03:55

comes in that is very interesting cuz

play03:57

that seems kind of advanced cuz notice

play03:59

in the beginning he throws the Green

play04:01

Leaf in the missing portion of the image

play04:03

and then you see him coming back to the

play04:06

image that we can see throwing a green

play04:07

leaf or two so it makes the assumption

play04:09

that hey the things there will also be

play04:12

green leaves interestingly enough though

play04:14

I do feel like I can spot a mistake here

play04:16

the leaves that are already on there are

play04:18

fresh looking as opposed to the cooked

play04:20

ones like they are on this side so it

play04:22

knows to put in the green leaves as the

play04:24

guy is throwing them for them to be

play04:25

fresh because it matches the fresh

play04:27

leaves here but it misses the point that

play04:28

hey these are cooked leaves and these

play04:30

are fresh but still it's very impressive

play04:33

that it's able to sort of to sort of

play04:34

guess at what's happening in that moment

play04:37

and this is where if you've been

play04:38

following some of the latest AI research

play04:40

this is where these neural Nets get a

play04:41

little bit weird well again come back to

play04:43

that at the end but how they are able to

play04:46

predict certain things like what happens

play04:47

here for example like no one codes it to

play04:50

know that this is probably a cake of

play04:52

some sort nobody tells it what this

play04:54

thing is it guesses from clues that it

play04:57

sees on screen but how does that is

play05:00

really really weird let's just say that

play05:02

this is pretty impressive so here we're

play05:04

able to change the clothes that the

play05:06

person is wearing throughout these shots

play05:07

while you know notice the hat and the

play05:09

face they kind of remain consistent

play05:10

across all the shots whereas the dress

play05:13

is changed based on a text prompt as you

play05:15

watch this think about where video

play05:18

production for movies and serial TV

play05:20

shows Etc where that's going to be in 5

play05:23

to 10 years will something like this

play05:24

allow everyday people sitting at home to

play05:27

create stunning Hollywood style movies

play05:29

with whatever characters they want

play05:31

whatever settings they want with AI

play05:32

generated video and AI voices we can

play05:35

create a movie starring Hugh Hefner as a

play05:36

chicken for example so really fast this

play05:38

is another study called Beyond surface

play05:40

statistics out of Harvard so this has

play05:41

nothing to do with the Google project

play05:44

that we're looking at but this paper

play05:45

tries to answer the question of how do

play05:48

these models how do they create images

play05:50

how do they create videos as you can see

play05:52

here it says these models are capable of

play05:54

synthesizing high quality images but it

play05:56

remains a mystery how these networks

play05:58

transform let's say the phrase car in

play06:00

the street into a picture of a car in a

play06:02

street so in other words when we type in

play06:04

this when a human person says draw a

play06:06

picture of a car in a street or a video

play06:08

of a car in a street how does that thing

play06:10

do it how does it translate that into a

play06:12

picture do they simply memorize

play06:14

superficial correlations between pixel

play06:16

values and words or are they learning

play06:18

something deeper such as the underlying

play06:20

model of objects such as cars roads and

play06:23

how they are typically positioned and

play06:25

there's a bit of an argument going on in

play06:27

the scientific Community about this so

play06:29

some AI scientists say all it is is just

play06:32

sort of surface level statistics they're

play06:34

just memorizing where these little

play06:36

pixels go and they're able to kind of

play06:38

reproduce certain images Etc and some

play06:40

people say well no there's something

play06:42

deeper going on here something new and

play06:44

surprising that these AI models are

play06:46

doing so what they did is they created a

play06:48

model that was fed nothing but 2D images

play06:51

so images of cars and people and ships

play06:54

Etc but that model it wasn't taught

play06:57

anything about depth like depth of field

play07:00

like where the foreground of an image is

play07:01

or where the background of an image is

play07:03

it wasn't taught about what the focus of

play07:05

the image is what a car is ETC and what

play07:07

they found is so here's kind of like the

play07:09

decoded image so this is kind of how it

play07:11

makes it from step one to finally step

play07:14

15 where as you can see you can see this

play07:16

is a car so a human being would be able

play07:18

to point at this and say that's a car

play07:20

what in the image is closest to you the

play07:23

person taking the image you say well

play07:24

probably this wheel is the closest right

play07:26

this is the the kind of the foreground

play07:28

this is the main object and that's kind

play07:29

of the background that's far far away

play07:31

and this is close right but the reason

play07:33

that you are able to look at this image

play07:35

and know that is because you've seen

play07:37

these objects in the real world in the

play07:39

3D world you can probably imagine how

play07:41

this image would look if you're standing

play07:43

off the side here looking at it from

play07:45

this direction this AI model that made

play07:47

this has no idea about any of that all

play07:49

it's seeing is a bunch of these 2D

play07:51

images just pixels arranged in a screen

play07:53

and yet when we dive into try to

play07:55

understand how it's building these

play07:57

images from scratch this is what we

play07:59

start to notice so early on when it's

play08:01

building this image this is kind of what

play08:04

the the depth of the image looks like so

play08:07

very early on it knows that sort of this

play08:10

thing is in the foreground it's closer

play08:13

to us and this right here the blue

play08:15

that's the background it's far from us

play08:17

now looking at this image you can't

play08:19

possibly tell what this is going to be

play08:21

you can't tell what this is going to be

play08:22

till much much later maybe here we can

play08:25

kind of begin to start seeing some of

play08:27

the lines that are in here but that's

play08:28

about it you you see like the wheels and

play08:31

maybe you could guess of what that is

play08:33

but here in the beginning you have no

play08:34

idea and yet the model knows that

play08:35

something right here is in the

play08:36

foreground something's in the background

play08:38

and towards the end it knows that this

play08:40

is closer this is close and this is far

play08:43

this is Salient object meaning like what

play08:44

is the focus what is the main object so

play08:46

it knows that the main object is here it

play08:48

doesn't know what a car is it doesn't

play08:49

know what an object is it just knows

play08:51

like this is the the focus of the image

play08:53

again only towards much later do we

play08:55

realize that yes in fact this is the car

play08:57

and so this is the conclusion of the

play08:58

paper our experiments provide evidence

play09:00

that stable diffusion model so this is

play09:02

an image generating model AI although

play09:05

solely trained on two-dimensional images

play09:07

contain an internal linear

play09:09

representation related to scene geometry

play09:11

so in other words after seeing thousands

play09:14

or millions of 2D images inside its

play09:17

neural network inside of its brain it

play09:19

seems like and again this is a lot of

play09:22

people sort of dispute this but some of

play09:24

these research makes it seem like it's

play09:26

developing its neural net that allows it

play09:29

to create a 3D representation of that

play09:32

image even though it's never been taught

play09:34

what 3D means it uncovers a salient

play09:37

object or sort of that main Center

play09:39

object that it needs to focus on versus

play09:41

the background of the image as well as

play09:43

information related to relative depth

play09:45

and these representations emerge early

play09:47

so before it starts painting the colors

play09:50

or the little shapes or the the wheels

play09:52

and the Shadows it first starts thinking

play09:54

about the 3D space on which it's going

play09:56

to start painting that image and here

play09:58

they say these results add a Nuance to

play10:00

the ongoing debates and there are a lot

play10:02

of ongoing debates about this about

play10:04

whether generative models so these AI

play10:06

models can learn more than just surface

play10:08

statistics in other words is there some

play10:11

sort of understanding that's going on

play10:13

maybe not like human understanding but

play10:15

is it just statistics or is there

play10:18

something deeper that's happening and

play10:21

this is Runway ml so this is the other

play10:23

one of the leading sort of text-to-image

play10:26

AI models and you might have seen the

play10:29

images so as you can see here this is

play10:31

what they're offering people have made

play10:33

full movies maybe not hour long but

play10:35

maybe 10 minutes 20 minute movies that

play10:38

are entirely generated by AI so as you

play10:41

can see here it's it's similar to what

play10:44

Google is offering although I got to say

play10:47

after looking at Google's work and then

play10:49

this one Google's does seem just a

play10:51

little bit more consistent I would say

play10:53

there seems to be a little bit less

play10:54

shifting and and shapes going on it's

play10:56

just a little bit more consistent across

play10:58

time and they have a lot of the

play11:00

same thing like this stylization here

play11:01

from a reference video to this image

play11:04

that's like the style reference but the

play11:06

interesting thing here is this is in the

play11:08

last few months looks like December 2023

play11:11

Runway ML introduced something they

play11:13

call General World models and they're

play11:15

saying we believe the next major

play11:17

advancement in AI will come from systems

play11:19

that understand the visual world and its

play11:21

Dynamics they're starting a long-term

play11:23

research effort around what they call

play11:24

General World models so their whole idea

play11:27

is that instead of the video AI models

play11:30

creating little Clips here and there

play11:32

with little isolated subjects and

play11:34

movements that a better approach would

play11:36

be to actually use the neural networks

play11:39

and them building some sort of a world

play11:41

model to understand the images they're

play11:43

making and to actually utilize that to

play11:46

have it almost create like a little

play11:47

world so for example if you're creating

play11:49

a clip with multiple characters talking

play11:52

then the AI model would actually almost

play11:54

simulate that entire world with the with

play11:56

the rooms and the people and then the

play11:58

people would talk talk to each other and

play12:00

it would just take that clip but it

play12:01

would basically create much more than

play12:03

just a clip like if a bird is flying

play12:05

across the sky it would be simulating

play12:07

the wind and the physics and all that

play12:10

stuff to try to capture the movement of

play12:12

that bird to create realistic images and

play12:14

video so they're saying a world model is

play12:16

an AI system that builds an internal

play12:18

representation of an environment and it

play12:20

uses it to simulate future events within

play12:22

that environment so for example for Gen

play12:24

2 which is their model their video model

play12:27

to generate realistic short video it has

play12:29

developed some understanding of physics

play12:32

and motion but it's still very limited

play12:34

struggling with complex camera controls or

play12:36

object motions amongst other things but

play12:39

they believe and a lot of other

play12:40

researchers as well that this is sort of

play12:42

the next step for us to get better at

play12:45

creating video at teaching robots how to

play12:47

behave in the physical world like for

play12:49

example Nvidia's foundation agent

play12:51

then we need to create bigger models

play12:53

that simulate entire worlds and then

play12:55

from those worlds they pull out what we

play12:57

need whether that's an image or text or

play12:59

a robot's ability to open doors and pick

play13:02

up objects all right but now back to

play13:04

Lumiere A Spacetime diffusion model for

play13:06

video generation so here they have a

play13:08

number of examples for the text to video

play13:11

of image to video stylized generation

play13:14

Etc and so in lumier they're trying to

play13:16

build this text video diffusion model

play13:19

that can create videos that portray

play13:20

realistic diverse and coherent motion a

play13:23

pivotal challenge in video synthesis and

play13:25

so the new thing that they introduce is

play13:26

the SpaceTime U-Net architecture that

play13:29

generates entire temporal duration of

play13:31

the video at once so in other words it

play13:33

sort of thinks through how the entire

play13:36

video going to look like in the

play13:37

beginning as opposed to existing video

play13:39

models so other video models which

play13:41

synthesize distant key frames followed

play13:43

by temporal super-resolution basically

play13:45

meaning they do it one at a time so they

play13:47

start with one and then create the

play13:49

others and they're saying that makes

play13:50

Global temporal consistency difficult

play13:52

meaning that the object as as you watch

play13:54

a video of it right it looks a certain

play13:55

way on the first second of the video but

play13:58

by second five is just completely

play14:00

different and so here basically they're

play14:01

comparing these two videos so Imagen and

play14:03

ours so the Lumiere model as you can see

play14:06

here they sample a few clips and they're

play14:08

looking at the XT slice so the XT slice

play14:12

you can basically think of that as so

play14:14

for example in stocks you have you know

play14:16

the price of stock over time right so it

play14:18

kind of goes like this here the x is the

play14:22

spatial Dimension so where certain

play14:24

things are in space on the image versus

play14:26

T temporal the time so the X here is

play14:29

basically where we might be looking at

play14:30

the width of the image for example of

play14:33

any image in time and T the temporal is

play14:36

like how consistent is across time so as

play14:37

you can see hit this green line so we're

play14:38

just looking at this thing across the

play14:40

entire image and this is what that looks

play14:42

like so as you can see here this is

play14:44

going pretty well and then it kind of

play14:46

messes up and it kind of gets crazy here

play14:48

and then kind of goes back to doing okay

play14:50

whereas in Lumiere it's pretty pretty

play14:54

good I mean maybe some funkiness right

play14:56

there in one one frame but it's pretty

play14:58

good same thing here I mean this is as

play15:00

you can see here pretty good maybe you

play15:03

can say that there's a little bit of

play15:04

funkiness here but overall it's very

play15:06

good whereas in this image and video I

play15:09

mean as you can see here there's kind of

play15:11

like a lot of nonsense that's happening

play15:13

right and so here you can see like you

play15:15

can't tell how many legs it has if it's

play15:17

missing a leg Etc whereas in The Lumiere

play15:20

I mean I feel like the you know you can

play15:22

see each of the legs pretty distinctly

play15:25

and their position and it's remains

play15:27

consistent across time or at least

play15:29

consistently easy to see where they are

play15:31

but I got to say I can't wait to get my

play15:33

hands on it it looks like as of right

play15:35

now I don't see a way to access it this

play15:37

is just sort of a preview but hopefully

play15:40

they will open up for testing soon and

play15:42

we'll be able to get our hands on it and

play15:43

check it out and here interestingly

play15:45

enough they actually compare how well

play15:47

their performs against the other

play15:49

state-of-the-art models in the in the

play15:51

industry so the two that I'm familiar

play15:53

with is Pika and Gen 2 those are the

play15:56

two that I've used and they're saying

play15:57

that their video is preferred by

play15:59

users in both text to video and image to

play16:02

video generation so blue is theirs and

play16:05

the Baseline is the orange one so it

play16:07

seems like there are pretty big

play16:09

differences in every single one this

play16:11

seems like video quality I mean it beats

play16:13

out every single other one of these

play16:15

which which I believe this text

play16:17

alignment which here means probably how

play16:19

well the image how true it is to The

play16:21

Prompt right so if you type in a prompt

play16:24

how accurately it represents it so it

play16:26

looks like maybe image is the closest

play16:28

one but it beats out most of the other

play16:30

ones by quite a bit and then video

play16:32

quality of image to video it seems like

play16:34

it beats them out as well with Gen 2

play16:37

probably being the next best one and

play16:39

here they provide a side-by-side

play16:41

comparison so for example the first

play16:42

prompt is a sheep to the right of a wine

play16:45

glass so this is Pika which which not

play16:48

great CU there's no wine glass here's

play16:50

Gen 2 consistently putting it on the

play16:53

left AnimateDiff which just has two

play16:55

glasses and maybe a reflection of a

play16:57

sheep ImagenVideo same thing so the

play16:59

glasses on the left ZeroScope no

play17:02

glasses that I can see although they

play17:03

have sheep and of course ours so the Lumiere

play17:06

the Google one is it seems like a nail

play17:08

it in every single one the glass is on

play17:10

the right although I got to say Gen 2 is

play17:12

is great although it confused the left

play17:14

and right but other than that I mean

play17:16

same with ImagenVideo actually

play17:18

although I feel like Gen 2 the quality

play17:20

is much better of the sheep cuz that's

play17:22

you know that's a good-looking sheep I

play17:24

should probably rephrase that that's a

play17:27

well rendered sheep how about that

play17:29

versus Imagen I mean that's a weird

play17:32

looking thing there that could almost be

play17:33

a horse or a cow if you just look at the

play17:35

face and Google is again excellent

play17:38

here's teddy bear skating in Time Square

play17:41

this is Google this is Imagen again

play17:43

weirdness happening there and that's gen

play17:45

two again pretty good but I mean the the

play17:47

thing is facing away although here I

play17:49

just noticed so they they took skating

play17:51

to mean ice skates whereas here it looks

play17:53

like these are roller skates skateboard

play17:55

Etc and so it looks like in the study

play17:57

they just showed you two to things they

play17:59

say do you like the left or the right

play18:00

more based on motion and better quality

play18:03

well I got to say if you're an aspiring

play18:05

AI cinematographer then this is really

play18:08

good news consistent coherent images

play18:11

that are able to create near lifelike

play18:15

scenes at this point I mean I'm sure

play18:17

there's other people that'll complain

play18:18

about stuff but you got to realize how

play18:21

quickly the stuff is progressing just to

play18:23

give you an idea this is about a year

play18:26

ago or so this is what AI generated

play18:29

video looked like so can you tell that

play18:33

is improved just a little bit that's

play18:36

about a year I'm not sure exactly when

play18:37

this was done but I'm going to say a

play18:39

year year and a half ago and I mean this

play18:41

thing gets nightmarish so when I'm

play18:44

talking about weird blocky shapes things

play18:47

not being consistent across scenes like

play18:51

what are we even looking at

play18:53

here is this a mouth is this a building

play18:56

and here's kind of uh something from

play18:58

about 4 months ago from Pika Labs so as

play19:01

you can see here it's much better it's

play19:04

much more consistent right as you can

play19:06

see here humans again maybe they look a

play19:09

little bit weird but it's better it can

play19:11

put you in the moment if you're telling

play19:13

a story that's not necessarily about

play19:15

everything looking realistic something

play19:18

like this can be created pretty easily

play19:20

and since it's new it's novel people

play19:23

might be this might be a whole new

play19:25

movement a new genre of film making

play19:28

that's new exciting and never before

play19:30

seen and most importantly it's easy to

play19:33

create with a you know at home with a

play19:36

few AI tools and anybody out there with

play19:39

creative abilities with creative talent

play19:41

to tell the stories that they have in

play19:43

their mind without being limited

play19:46

financially by Capital they're going to

play19:49

be able to create AI voices they're

play19:51

going to be able to create AI footage

play19:54

maybe even have ChatGPT help them with

play19:56

some of the story writing and once more

play19:58

the sort of the next generation of

play20:00

things that we're seeing that people are

play20:01

working on is things like the simulation

play20:04

where you create the characters and then

play20:06

you sort of let them loose in a world

play20:08

they get simulated with these they get

play20:10

sort of simulated so the stories kind of

play20:12

play out in the world and then you sort

play20:14

of pick and choose what to focus on

play20:17

which scenes and which characters you

play20:19

want to bring to the front so you

play20:21

basically act as the World Builder you

play20:24

build the worlds the characters the

play20:25

narratives and AI assists you in

play20:28

creating the visuals the voices Etc and

play20:31

you can be 100% in control of it or you

play20:34

can only control the things that you

play20:36

want and the AI generates the rest so to

play20:39

me this if you're interested in movie

play20:41

making and you like these sort of styles

play20:44

that by the way quickly will become much

play20:46

more realistic I would be really looking

play20:49

at this right now because right now is

play20:51

the time that it's sort of emerging into

play20:54

the world and getting really good and

play20:57

it's going to get better by next year

play20:59

it's going to be a lot

play21:01

better well my name is Wes Roth and uh

play21:04

thank you for watching


Related tags
Google AI, text-to-video, Lumiere, neural networks, video generation, image animation, style transfer, video editing, AI research, video production, creative tools