Google's LUMIERE AI Video Generation Has Everyone Stunned | Better than RunWay ML?
Summary
TLDR: Google's latest AI tool, Lumiere, is at its core a text-to-video model. Beyond generating video from text, it can animate existing images, create videos in a specific style, and supports features such as video inpainting and animating selected regions of an image. Google's research paper explains the science behind it: a space-time diffusion model that generates realistic, diverse, and coherent video. Lumiere demonstrates its capabilities in text-to-video, image-to-video, and stylized generation, outperforming existing video models in both user preference and video quality. This technology suggests that video production may become far easier and more accessible, opening new possibilities for filmmaking and storytelling.
Takeaways
- 🚀 Google released its latest AI tool, Lumiere, a text-to-video AI model that turns text into video.
- 🎨 Lumiere can also animate existing images, producing video in the style of a given image or painting.
- 📈 Google's research paper describes its improvements to a space-time diffusion model capable of generating realistic video.
- 🤖 The AI-generated videos show a high degree of consistency in style and motion, which has been a challenge for previous models.
- 🌌 Lumiere can turn static images into animations, for example animating a picture of a bear into footage of it walking through New York.
- 🎭 Lumiere uses a target image to generate stylized video, for example a bear twirling with delight in a given artistic style.
- 📹 Lumiere introduces a Space-Time U-Net architecture that forms a concept of the entire video up front instead of generating it frame by frame.
- 🎨 Lumiere also includes video stylization, which changes the look of a source video, as well as cinemagraphs that animate only selected parts of an image.
- 🧩 Lumiere supports video inpainting: when part of an image is missing, the AI plausibly fills in the missing region.
- 📈 In comparisons with other leading AI models, users preferred Lumiere for both text-to-video and image-to-video generation.
- ⏱️ Lumiere achieves better global temporal consistency, maintaining coherence across the full duration of the video compared with models that generate frame by frame.
Q & A
What is the core capability of Google's new AI tool, Lumiere?
-Lumiere is at its core a text-to-video AI model: the user types in text and the neural network translates it into video. It can also animate existing images, create videos in a specific style, and animate particular regions within an image.
How does Lumiere achieve video consistency?
-Lumiere relies on the space-time diffusion model described in its research, which produces more consistent shots across frames, known as temporal consistency.
How does Lumiere's image-to-video feature work?
-Image-to-video turns a static image into an animation, for example animating a picture of a bear walking through New York, or of Bigfoot walking through the woods.
How is Lumiere's stylized generation implemented?
-Lumiere uses a target image as a style reference to create colorful or animated effects, for example generating an elephant animation that stays consistent with the reference style.
What is the Space-Time U-Net architecture, and what role does it play in Lumiere?
-The Space-Time U-Net is the architecture that lets Lumiere form a concept of the entire video at once rather than generating it frame by frame like other models, which helps preserve the video's overall consistency.
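The "whole video at once" idea can be illustrated with a toy sketch. This is invented for this writeup, not Lumiere's actual code: the two functions below simply contrast an operation that treats every frame independently with one that mixes information along the time axis, which is the property that makes global temporal consistency easier.

```python
import numpy as np

# Toy illustration (invented for this writeup, not Lumiere's actual code):
# a "space-time" operation mixes information across the time axis, so every
# output frame depends on its neighbors, while a frame-by-frame operation
# treats each frame independently.

def per_frame_normalize(video):
    # video: (T, H, W); each frame is processed with no access to its neighbors
    return np.stack([f - f.mean() for f in video])

def temporal_average(video, k=3):
    # A moving average along the time axis: every output frame depends on a
    # window of k input frames, coupling the frames together.
    pad = k // 2
    padded = np.pad(video, ((pad, pad), (0, 0), (0, 0)), mode="edge")
    return np.stack([padded[t:t + k].mean(axis=0)
                     for t in range(video.shape[0])])

video = np.random.rand(8, 4, 4)
coupled = temporal_average(video)        # frames influence each other
independent = per_frame_normalize(video)  # frames never interact
```

Perturbing frame 0 changes `coupled[1]` but leaves `independent[1]` untouched, which is the intuition behind why generating the full temporal extent jointly keeps objects coherent across frames.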
What is Lumiere's video stylization feature?
-Video stylization lets users transform a source video into different styles, for example restyling a video of a woman running, or videos of a dog, a car, and a bear.
What are cinemagraphs, and how does Lumiere implement them?
-Cinemagraphs are images in which only a specific region is animated. Lumiere can, for example, animate just the smoke rising from a train in an otherwise still image.
How does Lumiere's video inpainting work?
-Video inpainting lets the AI infer what a missing region of an image should contain; for example, where part of the frame is masked out as a hand enters, Lumiere can guess and fill in that content.
What advantages does Lumiere have over other AI video models?
-In user studies, Lumiere was preferred over state-of-the-art models such as Pika and Gen-2 for both text-to-video and image-to-video generation, performing better on video quality and alignment with the text prompt.
Can these AI models learn anything deeper than surface statistics?
-According to a separate study discussed in the video ("Beyond Surface Statistics," unrelated to Lumiere), diffusion models trained solely on 2D images appear to develop an internal linear representation related to scene geometry, suggesting they may be learning something deeper than surface statistics.
What are the General World Models proposed by Runway ML?
-General World Models is a concept from Runway ML, which argues that the next major advance in AI will come from systems that understand the visual world and its dynamics, building world models that understand the images they generate and using them to create more realistic video.
What role does the space-time diffusion model play in Lumiere's video generation?
-It is designed to create videos with realistic, diverse, and coherent motion. Through the Space-Time U-Net architecture, it generates the entire temporal duration of the video at once, addressing the global temporal consistency challenge faced by existing video models.
What is innovative about Lumiere's approach to video generation?
-Its innovation is generating the entire temporal duration of a video in one pass rather than frame by frame, which helps preserve global temporal consistency and prevents objects from changing inconsistently over the course of the video.
How does Lumiere perform at video generation?
-According to Google's research, Lumiere outperforms other state-of-the-art models, producing more coherent and consistent video content.
What does Lumiere's release mean for video production?
-It points to a major shift: ordinary users may soon be able to create Hollywood-style films at home, dramatically lowering the barrier to video production and encouraging personalized, creative content.
Outlines
😲 Lumiere: Google Releases a Revolutionary Text-to-Video AI Tool
Google's latest AI tool, Lumiere, is a text-to-video AI model that turns text into video. It can not only create videos but also animate existing images, mimic the style of an image or painting, and even animate specific regions within an image. Google's research shows that Lumiere makes significant progress in video generation, particularly in maintaining temporal consistency between frames. Lumiere also performs image-to-video conversion and stylized generation using a target image. The video showcases a range of Lumiere outputs, including dynamic scenes generated from text prompts and examples of static images turned into animation, demonstrating its strength in style consistency and animation.
🤖 How AI Creates Video: Deep Learning or Surface Statistics?
This section explores how AI models turn text prompts into images and video, and whether these models rely only on surface statistics or learn something deeper. Some AI scientists argue the models merely memorize superficial correlations between pixel values and words; others believe deeper learning is taking place. To probe the question, researchers trained a model on nothing but 2D images and found that, although it was never taught anything about depth or scene geometry, it appears to build an internal linear representation related to scene geometry. The research suggests that even when trained only on two-dimensional images, these models construct a 3D-like representation inside the neural network, adding nuance to the ongoing debate over whether generative models can learn more than surface statistics.
🎬 The Future of Video Production: AI in Film and TV
As AI advances, the future of video production may change dramatically. Tools such as Google's Lumiere and Runway ML can generate high-quality video from text prompts, potentially letting ordinary people create Hollywood-style films at home. These tools can generate video, stylize it, and even perform video inpainting. Runway ML has also proposed the concept of General World Models, a long-term research effort to understand the visual world and its dynamics by building world models, advancing AI both in video generation and in teaching robots to act in the physical world. Such a model could simulate an entire environment and extract from it whatever is needed, whether an image, text, or a robot behavior.
📊 Comparing Lumiere with Existing AI Video Models
Lumiere's video generation is compared against existing models. Lumiere adopts a new Space-Time U-Net architecture that generates the entire temporal duration of a video at once rather than synthesizing it frame by frame. This helps maintain global temporal consistency, so objects stay coherent across frames. In user preference tests, Lumiere outperformed other models such as Pika and Gen-2 in both text-to-video and image-to-video generation, and its videos were rated highly for quality, indicating clear advantages in both text alignment and video quality.
🌟 Emerging Trends and Outlook for AI Video Generation
AI video generation is developing rapidly, opening new possibilities for filmmaking and storytelling. As the technology matures, AI will play an ever larger role in video production. Emerging trends include using AI to simulate entire worlds and characters, then selecting specific scenes and characters to tell a story: filmmakers build their own worlds and control the elements they want, while AI helps generate the visuals and sound. Within the next year we can expect AI-generated video to become even more realistic and easier to produce. For anyone interested in filmmaking, now is the time to watch and experiment with these emerging tools.
Mindmap
Keywords
💡Lumiere
💡Spacetime diffusion model
💡Text to video
💡Image to video
💡Stylization
💡Video stylization
💡Cinemagraphs
💡Video inpainting
💡General World models
💡AI cinematography
Highlights
Google released its latest AI tool, Lumiere, a text-to-video model that turns text into video.
Lumiere supports more than text-to-video: it can animate existing images and create videos in a specific style.
Google's research paper describes Lumiere's improvements and the science behind AI-generated video.
Lumiere achieves temporal consistency across frames, improving video coherence.
Lumiere supports image-to-video conversion, for example turning a static picture of a bear into an animation of it walking in New York.
Lumiere performs stylized generation, using a target image to create colorful or animated effects.
Lumiere introduces the Space-Time U-Net architecture, generating the concept of the entire video at once rather than frame by frame.
Lumiere supports video stylization, transforming a source video into different styles.
Lumiere implements cinemagraphs, animating only specific portions of an image.
Lumiere can perform video inpainting, using AI to infer what missing parts of an image should look like.
Lumiere shows advanced consistency and coherence, a marked improvement over AI-generated video from a year ago.
In user preference tests, Lumiere outperformed other state-of-the-art models in both text-to-video and image-to-video generation.
Lumiere demonstrates AI's potential in video generation and may change the future of film and TV production.
Runway ML, one of the leading AI video-generation companies, recently published the concept of General World Models.
General World Models aim to drive the next major advance in AI by building world models that understand the visual world and its dynamics.
Lumiere's space-time diffusion model is designed to create videos with realistic, diverse, and coherent motion.
Through the Space-Time U-Net architecture, Lumiere generates the entire temporal duration of a video at once, improving global temporal consistency.
Lumiere outperforms other models on text alignment and video quality, delivering more accurate, higher-quality generation.
Lumiere's progress hints at the future of AI in video production and world simulation, opening up possibilities for creative talent.
Transcripts
and just like that out of the blue
Google drops its latest AI tool Lumiere
Lumiere is at its core a text to video
AI model you type in text and the AI
neural Nets translate that into video
but as you'll see Lumiere is a lot more
than just text to
video it allows you to animate existing
images creating video and the style of
that image or painting as well as things
like video inpainting and creating
specific animation sections within
images so let's look at what it can do
the science behind it Google published a
paper talking about what they improved
and I'll also show you why the
artificial AI brains that generate these
videos are much more weird than you can
imagine so this is Lumiere from Google
research A Spacetime diffusion model for
realistic video generation we'll cover
SpaceTime diffusion model a bit later
but right now this is what they're
unveiling so first of all there's text
to video these are the videos that are
produced by various prompts like US flag
waving on massive Sunrise clouds funny
cute pug dog feeling good listening to
music with big headphones and Swinging
head Etc snowboarding Jack Russell
Terrier so I got to say these are
looking pretty good if these are good
representations of the sort of style
that we can get from this model this
would be very interesting so for example
take a look at this one astronaut on the
planet Mars making a detour around his
base this is looking very consistent
this looks like a tablet this looks like
a medicine tablet of some sort floating
in space but I got to say everything is
looking very consistent which is what
they're promising in their research it
looks like they found a way to create a
more consistent shot across different
frames temporal consistency as they call
it here's image to video so as you can
see this is nightmarish but
that's the scary looking one but other
than that everything else is looking
really good so they're taking images
and turning them into animations little
animations of a bear walking in New York
for example Bigfoot walking through the
woods so these were started with an
image that then gets animated these are
looking pretty good here are the Pillars
of Creation animated right there that's
uh pretty neat kind of a 3D structure
they're showing stylized generation so
using a Target image to kind of make
something colorful or animated take a
look at this elephant right here one
thing that jumps out at me is it is very
consistent there's no weirdness going on
in a second we'll take a look at other
leading AI models that generate video
and I got to say this one is probably
the smoothest looking one here's another
one so as you can see here here's the
style reference image so they want this
style and then they say a bear twirling
with delight for example right so then
it creates a bear twirling with delight
or a dolphin leaping out of the water in
the style of this image here's the same
or similar prompts with this as the
style reference now with this as the style
reference I got to say it captures the
style pretty well here's kind of that
neon phosphorus glowing thing and they
introduce a Space-Time U-Net architecture
and we'll look at that towards the end
of the video but basically it sounds
like it creates sort of the idea of the
entire video at once so while other
models it seems like kind of go frame by
frame this one has sort of an idea of
what the whole thing is going to look
like at the very beginning and there's a
video stylization so here's a lady
running this is the source video and the
various craziness that you can make her
into the same thing with a dog and a car
and a bear cinemagraphs is the ability
to animate only certain portions of the
image like the smoke coming out of this
train this is something that Runway ml I
believe recently released and looks like
Google is hot on their heels creating
basically the same ability then we have
video inpainting So if a portion of an
image is missing you're able to use AI
to sort of guess at what that would look
like I got to say so here where the hand
comes in that is very interesting cuz
that seems kind of advanced cuz notice
in the beginning he throws the Green
Leaf in the missing portion of the image
and then you see him coming back to the
image that we can see throwing a green
leaf or two so it makes the assumption
that hey the things there will also be
green leaves interestingly enough though
I do feel like I can spot a mistake here
the leaves that are already on there are
fresh looking as opposed to the cooked
ones like they are on this side so it
knows to put in the green leaves as the
guy is throwing them for them to be
fresh because it matches the fresh
leaves here but it misses the point that
hey these are cooked leaves and these
are fresh but still it's very impressive
that it's able to sort of
guess at what's happening in that moment
and this is where if you've been
following some of the latest AI research
this is where these neural Nets get a
little bit weird well again come back to
that at the end but how they are able to
predict certain things like what happens
here for example like no one codes it to
know that this is probably a cake of
some sort nobody tells it what this
thing is it guesses from clues that it
sees on screen but how it does that is
really really weird let's just say that
this is pretty impressive so here we're
able to change the clothes that the
person is wearing throughout these shots
while you know notice the hat and the
face they kind of remain consistent
across all the shots whereas the dress
is changed based on a text prompt as you
watch this think about where video
production for movies and serial TV
shows Etc where that's going to be in 5
to 10 years will something like this
allow everyday people sitting at home to
create stunning Hollywood style movies
with whatever characters they want
whatever settings they want with AI
generated video and AI voices we can
create a movie starring Hugh Hefner as a
chicken for example so really fast this
is another study called Beyond surface
statistics out of Harvard so this has
nothing to do with the Google project
that we're looking at but this paper
tries to answer the question of how do
these models how do they create images
how do they create videos as you can see
here it says these models are capable of
synthesizing high quality images but it
remains a mystery how these networks
transform let's say the phrase car in
the street into a picture of a car in a
street so in other words when we type in
this when a human person says draw a
picture of a car in a street or a video
of a car in a street how does that thing
do it how does it translate that into a
picture do they simply memorize
superficial correlations between pixel
values and words or are they learning
something deeper such as the underlying
model of objects such as cars roads and
how they are typically positioned and
there's a bit of a argument going on in
the scientific Community about this so
some AI scientists say all it is is just
sort of surface level statistics they're
just memorizing where these little
pixels go and they're able to kind of
reproduce certain images Etc and some
people say well no there's something
deeper going on here something new and
surprising that these AI models are
doing so what they did is they created a
model that was fed nothing but 2D images
so images of cars and people and ships
Etc but that model it wasn't taught
anything about depth like depth of field
like where the foreground of an image is
or where the background of an image is
it wasn't taught about what the focus of
the image is what a car is ETC and what
they found is so here's kind of like the
decoded image so this is kind of how it
makes it from step one to finally step
15 where as you can see you can see this
is a car so a human being would be able
to point at this and say that's a car
what in the image is closest to you the
person taking the image you say well
probably this wheel is the closest right
this is kind of the foreground
this is the main object and that's kind
of the background that's far far away
and this is close right but the reason
that you are able to look at this image
and know that is because you've seen
these objects in the real world in the
3D world you can probably imagine how
this image would look if you're standing
off the side here looking at it from
this direction this AI model that made
this has no idea about any of that all
it's seeing is a bunch of these 2D
images just pixels arranged in a screen
and yet when we dive into try to
understand how it's building these
images from scratch this is what we
start to notice so early on when it's
building this image this is kind of what
the depth of the image looks like so
very early on it knows that sort of this
thing is in the foreground it's closer
to us and this right here the blue
that's the background it's far from us
now looking at this image you can't
possibly tell what this is going to be
you can't tell what this is going to be
till much much later maybe here we can
kind of begin to start seeing some of
the lines that are in here but that's
about it you see like the wheels and
maybe you could guess of what that is
but here in the beginning you have no
idea and yet the model knows that
something right here is in the
foreground something's in the background
and towards the end it knows that this
is closer this is close and this is far
this is Salient object meaning like what
is the focus what is the main object so
it knows that the main object is here it
doesn't know what a car is it doesn't
know what an object is it just knows
like this is the focus of the image
again only towards much later do we
realize that yes in fact this is the car
and so this is the conclusion of the
paper our experiments provide evidence
that stable diffusion model so this is
an image generating model AI although
solely trained on two-dimensional images
contain an internal linear
representation related to scene geometry
so in other words after seeing thousands
or millions of 2D images inside its
neural network inside of its brain it
seems like and again this is a lot of
people sort of dispute this but some of
this research makes it seem like it's
developing its neural net that allows it
to create a 3D representation of that
image even though it's never been taught
what 3D means it uncovers a salient
object or sort of that main Center
object that it needs to focus on versus
the background of the image as well as
information related to relative depth
and these representations emerge early
so before it starts painting the colors
or the little shapes or the wheels
and the Shadows it first starts thinking
about the 3D space on which it's going
to start painting that image and here
they say these results add a Nuance to
the ongoing debates and there are a lot
of ongoing debates about this about
whether generative models so these AI
models can learn more than just surface
statistics in other words is there some
sort of understanding that's going on
maybe not like human understanding but
is it just statistics or is there
something deeper that's happening and
this is Runway ml so this is the other
one of the leading sort of text 2 image
AI models and you might have seen the
images so as you can see here this is
what they're offering people have made
full movies maybe not hour long but
maybe 10 minutes 20 minute movies that
are entirely generated by AI so as you
can see here it's similar to what
Google is offering although I got to say
after looking at Google's work and then
this one Google's does seem just a
little bit more consistent I would say
there seems to be a little bit less
shifting and shapes going on it's
just a little bit more consistent across
time and they have a lot of the
same thing like this stylization here
from a reference video to this image
that's like the style reference but the
interesting thing here is this is in the
last few months looks like December 2023
Runway ML introduced something they
call General World models and they're
saying we believe the next major
advancement in AI will come from systems
that understand the visual world and its
Dynamics they're starting a long-term
research effort around what they call
General World models so their whole idea
is that instead of the video AI models
creating little Clips here and there
with little isolated subjects and
movements that a better approach would
be to actually use the neural networks
and them building some sort of a world
model to understand the images they're
making and to actually utilize that to
have it almost create like a little
world so for example if you're creating
a clip with multiple characters talking
then the AI model would actually almost
simulate that entire world with
the rooms and the people and then the
people would talk to each other and
it would just take that clip but it
would basically create much more than
just a clip like if a bird is flying
across the sky it would be simulating
the wind and the physics and all that
stuff to try to capture the movement of
that bird to create realistic images and
video so they're saying a world model is
an AI system that builds an internal
representation of an environment and it
uses it to simulate future events within
that environment so for example for Gen
2 which is their model their video model
to generate realistic short video it has
developed some understanding of physics
and motion but it's still very limited
struggling with complex camera controls or
object motions amongst other things but
they believe and a lot of other
researchers as well that this is sort of
the next step for us to get better at
creating video at teaching robots how to
behave in the physical world like for
example the nvidia's foundation agent
then we need to create bigger models
that simulate entire worlds and then
from those worlds they pull out what we
need whether that's an image or text or
a robot's ability to open doors and pick
up objects all right but now back to
Lumiere A Spacetime diffusion model for
video generation so here they have a
number of examples for the text to video
image to video stylized generation
Etc and so in lumier they're trying to
build this text video diffusion model
that can create videos that portray
realistic diverse and coherent motion a
pivotal challenge in video synthesis and
so the new thing that they introduces
the SpaceTime unet archit tecture that
generates entire temporal duration of
the video at once so in other words it
sort of thinks through how the entire
video is going to look like in the
beginning as opposed to existing video
models so other video models which
synthesize distant key frames followed
by temporal super-resolution basically
meaning they do it one at a time so they
start with one and then create the
others and they're saying that makes
Global temporal consistency difficult
meaning that the object as you watch
a video of it right it looks a certain
way on the first second of the video but
by second five is just completely
different and so here basically they're
comparing these two videos so Imagen Video and
theirs so the Lumiere model as you can see
here they sample a few clips and they're
looking at the XT slice so the XT slice
you can basically think of that as so
for example in stocks you have you know
the price of stock over time right so it
kind of goes like this here the x is the
spatial Dimension so where certain
things are in space on the image versus
T temporal the time so the X here is
basically where we might be looking at
the width of the image for example of
any image in time and T the temporal is
like how consistent is across time so as
you can see at this green line so we're
just looking at this thing across the
entire image and this is what that looks
like so as you can see here this is
going pretty well and then it kind of
messes up and it kind of gets crazy here
and then kind of goes back to doing okay
whereas in Lumiere it's pretty pretty
good I mean maybe some funkiness right
there in one frame but it's pretty
good same thing here I mean this is as
you can see here pretty good maybe you
can say that there's a little bit of
funkiness here but overall it's very
good whereas in this Imagen Video I
mean as you can see here there's kind of
like a lot of nonsense that's happening
right and so here you can see like you
can't tell how many legs it has if it's
missing a leg Etc whereas in The Lumiere
I mean I feel like the you know you can
see each of the legs pretty distinctly
and their position and it's remains
consistent across time or at least
consistently easy to see where they are
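The X-T slice the speaker walks through here is easy to reproduce: fix one row of pixels and stack that row across every frame, producing a 2-D image whose vertical axis is time. The snippet below is an illustrative sketch, not the paper's code; the synthetic "sweeping dot" video is invented so the slice has an obvious smooth streak.

```python
import numpy as np

# Sketch of the X-T slice visualization (illustrative, not the Lumiere
# paper's code): fix one row y of the video and stack it across all
# frames. Smooth streaks in the slice indicate temporal consistency;
# jagged breaks indicate flicker.

def xt_slice(video, y):
    # video: (T, H, W) grayscale; returns a (T, W) array
    return video[:, y, :]

T, H, W = 16, 32, 48
video = np.zeros((T, H, W))
for t in range(T):
    video[t, :, (2 * t) % W] = 1.0  # a dot sweeping smoothly rightward

s = xt_slice(video, y=10)
print(s.shape)  # (16, 48): one row of pixels per frame
```

In this toy clip the bright pixel in row `t` of the slice sits at column `2t`, so the slice shows a clean diagonal streak; a temporally inconsistent generator would scatter that streak, which is exactly the artifact the paper's XT comparisons highlight.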
but I got to say I can't wait to get my
hands on it it looks like as of right
now I don't see a way to access it this
is just sort of a preview but hopefully
they will open up for testing soon and
we'll be able to get our hands on it and
check it out and here interestingly
enough they actually compare how well
theirs performs against the other
state-of-the-art models in the in the
industry so the two that I'm familiar
with is Pika and Gen-2 those are the
two that I've used and they're saying
that their video is preferred by
users in both text to video and image to
video generation so blue is theirs and
the Baseline is the orange one so it
seems like there are pretty big
differences in every single one this
seems like video quality I mean it beats
out every single other one of these
which I believe is text
alignment which here means probably how
true the image is to the
prompt right so if you type in a prompt
how accurately it represents it so it
looks like maybe Imagen is the closest
one but it beats out most of the other
ones by quite a bit and then video
quality of image to video it seems like
it beats them out as well with Gen 2
probably being the next best one and
here they provide a side-by-side
comparison so for example the first
prompt is a sheep to the right of a wine
glass so this is Pika which is not
great cuz there's no wine glass here's
Gen 2 consistently putting it on the
left AnimateDiff which just has two
glasses and maybe a reflection of a
sheep Imagen Video same thing so the
glasses on the left ZeroScope no
glasses that I can see although they
have sheep and of course the Lumiere
the Google one it seems like nails
it in every single one the glass is on
the right although I got to say Gen 2
is great although it confused the left
and right but other than that I mean
same with Imagen Video actually
although I feel like Gen 2 the quality
is much better of the sheep cuz that's
you know that's a good-looking sheep I
should probably rephrase that that's a
well rendered sheep how about that
versus Imagen I mean that's a weird
looking thing there that could almost be
a horse or a cow if you just look at the
face and Google is again excellent
here's teddy bear skating in Time Square
this is Google this is Imagen again
weirdness happening there and that's gen
two again pretty good but I mean the the
thing is facing away although here I
just noticed so they they took skating
to mean ice skates whereas here it looks
like these are roller skates skateboard
Etc and so it looks like in the study
they just showed you two things they
say do you like the left or the right
more based on motion and better quality
well I got to say if you're an aspiring
AI cinematographer then this is really
good news consistent coherent images
that are able to create near lifelike
scenes at this point I mean I'm sure
there's other people that'll complain
about stuff but you got to realize how
quickly the stuff is progressing just to
give you an idea this is about a year
ago or so this is what AI generated
video looked like so can you tell that
is improved just a little bit that's
about a year I'm not sure exactly when
this was done but I'm going to say a
year year and a half ago and I mean this
thing gets nightmarish so when I'm
talking about weird blocky shapes things
not being consistent across scenes like
what are we even looking at
here is this a mouth is this a building
and here's kind of uh something from
about 4 months ago from Pika Labs so as
you can see here it's much better it's
much more consistent right as you can
see here humans again maybe they look a
little bit weird but it's better it can
put you in the moment if you're telling
a story that's not necessarily about
everything looking realistic something
like this can be created pretty easily
and since it's new it's novel people
might be this might be a whole new
movement a new genre of film making
that's new exciting and never before
seen and most importantly it's easy to
create with a you know at home with a
few AI tools and anybody out there with
creative abilities with creative talent
to tell the stories that they have in
their mind without being limited
financially by Capital they're going to
be able to create AI voices they're
going to be able to create AI footage
maybe even have Chad GPT help them with
some of the story writing and once more
the sort of the next generation of
things that we're seeing that people are
working on is things like the simulation
where you create the characters and then
you sort of let them loose in a world
they get
sort of simulated so the stories kind of
play out in the world and then you sort
of pick and choose what to focus on
which scenes and which characters you
want to bring to the front so you
basically act as the World Builder you
build the worlds the characters the
narratives and AI assists you in
creating the visuals the voices Etc and
you can be 100% in control of it or you
can only control the things that you
want and the AI generates the rest so to
me this if you're interested in movie
making and you like these sort of styles
that by the way quickly will become much
more realistic I would be really looking
at this right now because right now is
the time that it's sort of emerging into
the world and getting really good and
it's going to get better by next year
it's going to be a lot
better well my name is Wes Roth and uh
thank you for watching