Sora Creator “Video generation will lead to AGI by simulating everything” | AGI House Video
Summary
TLDR This video introduces Sora, a video generation model that uses deep learning and Transformer models to generate high-resolution videos up to a minute long. Sora is presented not only as a leap for content creation but also as an important step on the path to AGI. The talk shows a range of Sora-generated samples, including complex 3D scenes, object permanence, and special effects, highlighting its potential for creative expression and for simulating the real world. It also covers technical details such as training data, model scalability, and future directions, along with potential applications in art and the entertainment industry.
Takeaways
- 🌟 Advances in video generation, especially AI-generated video, are one of the key paths toward AGI (artificial general intelligence).
- 📹 The AI video generation model Sora can create HD videos up to a minute long, a major leap for the field.
- 🤖 Sora learns to understand the complexity of the physical world, including object permanence and 3D space.
- 🎨 Sora has broad application prospects, including content creation, special effects, and animation, giving creators a new tool.
- 🔄 Sora uses a Transformer model trained on a large variety of videos and images, learning to generate diverse content.
- 🧠 With more compute and data, Sora's performance and the quality of its generated videos will keep improving.
- 🎥 Sora excels at generating complex scenes and animal behavior, showing a deep understanding of the physical world.
- 🏙️ Sora can generate high-quality aerial videos with consistent 3D geometry.
- 🔄 Sora can interpolate between different subjects in videos, creating seamless visual transitions.
- 📈 The Sora project aims to raise the quality of video generation while keeping the method simple and scalable.
- 🚧 Despite significant progress, Sora still struggles with certain physical interactions and basic physics concepts.
Q & A
How does AGI House honor people like those in the audience?
-AGI House shows its respect by inviting people like those in attendance to create and experience new technology together.
What is the Sora project mentioned in the video?
-Sora is a video generation project built by the team together with many outstanding collaborators; using advanced techniques, it can create video content with complex detail and an understanding of the physical world.
What has the Sora project achieved in video generation?
-Sora generates HD videos up to a minute long with object permanence, consistent 3D geometry, and characters that persist across multiple shots.
Outlines
🎬 Introduction and the AGI House collaboration
This section introduces the collaboration with AGI House, emphasizing shared values and goals. It mentions Bill and Tim, who, together with an exceptional team, are working on technology that can transform content creation. A special video demonstrates progress in video generation, such as clarity, duration, and complexity, and the section discusses technical challenges like object permanence and consistency, and how training on video teaches the model about the physical world.
🌟 The potential and applications of video generation
This section discusses the potential of video generation, particularly its short- and long-term impact on content creation and AGI. It covers how video generation can enable entirely new forms of media and entertainment and let people without professional tools realize their creative ideas. It also shows samples artists created with the technology, highlighting its diversity and creativity, and touches on technical details such as the Transformer model and how scaling compute improves performance.
📈 Technical progress and model scaling
This section focuses on technical advances in video models, including handling videos of different resolutions and aspect ratios and performing zero-shot video editing and style transfer. It discusses how video models create new visual effects through interpolation and transformation, how scaling the generation system improves its performance and creativity, and the model's potential to simulate real-world physics and learn the rules of other worlds.
🚀 Video models' contribution to AGI
This section discusses how video models can help reach artificial general intelligence (AGI). It stresses the importance of video models understanding human behavior by simulating human interaction and physical contact, and describes how scaling improves the model's grasp of complex scenes, animal behavior, and 3D consistency. It also covers the potential of simulating other worlds (such as Minecraft) and how learning varied rules and objects improves generalization.
🛠️ Technical challenges and outlook
This section discusses the challenges video models face, such as difficulty with certain physical interactions, and how collaboration with artists and safety experts is improving the model. It mentions artists' demand for controllability and the possibility of fine-tuning for specific characters or IP. It also describes the generation process, including denoising to create videos and generating shorter videos and then extending them for efficiency. It closes with an outlook on AGI and on creatively overcoming data and technical limitations.
Keywords
💡AI
💡Video generation
💡Object permanence
💡3D geometry
💡Content creation
💡Transformer model
💡Diffusion model
💡Zero-shot learning
💡Interactive video
💡Dataset
Highlights
AGI House honors people like you; that's why we invited you here.
Together with an exceptional team of collaborators, we built Sora at OpenAI, and today we're excited to tell you a bit about it.
We'll talk about what it does at a high level, its impact on content creation, the technology behind it, and why it's an important step toward AGI.
Our generated videos are HD and up to a minute long, which was a major goal when we set out.
Complexity in the video, such as reflections and shadows, along with object permanence and consistency, is a very hard problem in video generation.
Sora doesn't just generate content; it has learned a great deal of intelligence about the physical world from training on video.
Video generation gives us revolutionary opportunities for content creation, and we're very excited about that.
Sora can generate videos in different styles, such as a paper-craft world, which is really cool.
Sora can understand full 3D space, so as the camera moves through 3D it truly understands the geometry and physical complexity.
One Sora sample is a movie trailer featuring the adventures of a 30-year-old spaceman who persists across multiple shots, all generated by Sora.
Sora can bring a lot of innovation to special effects; one of our favorite samples is an alien blending naturally into New York City.
This technology will have a big impact on effects that are usually very expensive in Hollywood's traditional CGI pipelines.
Sora can generate animated content; this is one of our favorite samples, and it has a bit of charm.
Sora can generate scenes that would be hard to shoot with traditional Hollywood infrastructure, such as the "Bloom Zoo" shop in New York City.
We work with artists to explore what's possible with Sora; this is part of our research and the best way to learn where the technology is valuable.
Sora is learning to simulate real-world physics, a key component of the project, but it can simulate other kinds of worlds too.
We think of Sora as the GPT-1 of video; we believe this technology will get much better soon and people will build amazing things on top of it.
Transcripts
Here's one thing about AGI House: we honor the people like you guys. That's why we have you here; that's why you're all here. So without further ado, here we go.
[Applause]
Awesome, what a fun crowd. I'm Tim, this is Bill, and we made Sora at OpenAI together with a team of amazing collaborators. We're excited to tell you a bit about it today. We'll talk a bit, at a high level, about what it does, some of the opportunities it has to impact content creation, some of the technology behind it, as well as why this is an important step on the path to AGI.

So without further ado, here is a Sora-generated video. This one is really special to us because it's HD and a minute long, and that was a big goal of ours when we were trying to figure out what would really make a leap for video generation: you want 1080p videos that are a minute long, and this video does that. You can see it has a lot of complexity too, like in the reflections and the shadows. One really interesting point is that blue sign: she's about to walk in front of it, and after she passes, the sign still exists afterwards. That's a really hard problem for video generation, getting this type of object permanence and consistency over a long duration.
So it can do, here we go, a number of different styles too. This is a paper-craft world that it can imagine, so that's really cool. And it can also learn about a full 3D space: here the camera moves through 3D as people are moving, but it really understands the geometry and the physical complexities of the scene. So Sora learned a lot: in addition to just being able to generate content, it's actually learned a lot of intelligence about the physical world, just from training on videos.

Now we'll talk a bit about some of the opportunities with video generation for revolutionizing creation. As alluded to, we're really excited about Sora not only because we view it as being on the critical path towards AGI, but also in the short term for what it's going to do for content.
So this is one sample we like a lot. The prompt, in the bottom left here, is "a movie trailer featuring the adventures of the 30-year-old spaceman". The hardest part of doing video, by the way, is always just getting PowerPoint to work with it. There we go.
[Music]
All right. What's cool about this sample in particular is that this astronaut is persisting across multiple shots, which are all generated by Sora. We didn't stitch this together; we didn't have to do a bunch of outtakes and then create a composite shot at the end. Sora decides where it wants to change the camera, but it does know that it's going to put the same astronaut in a bunch of different environments.

Likewise, we think there are a lot of cool implications for special effects. This is one of our favorite samples too: "an alien blending in naturally with New York City, paranoia thriller style, 35mm". Already you can see that the model is able to create these very fantastical effects, which would normally be very expensive in traditional CGI pipelines for Hollywood. There are a lot of implications here for what this technology is going to bring short term.

Of course, we can do other kinds of effects too. This is more of a sci-fi scene: "a scuba diver discovering a hidden futuristic shipwreck with cybernetic marine life and advanced alien technology".
[Music]
As someone who's seen so much incredible content from people on the internet who don't necessarily have access to tools like Sora to bring their visions to life (they come up with cool ideas and post them on Reddit or something), it's really exciting to think about what people are going to be able to do with this technology.
Of course, it can do more than just photorealistic styles; you can also do animated content. My favorite part of this one is the little otter; it has a bit of charm. And I think another example of just how cool this technology is, is when we start to think about scenes which would be very difficult to shoot with traditional Hollywood infrastructure. The prompt here is the "Bloom Zoo" shop in New York City, which is both a jewelry store and a zoo: saber-tooth tigers with diamond and gold adornments, turtles with glistening emerald shells, et cetera. What I love about this shot is that it's photorealistic, but it's something that would be incredibly hard to accomplish with the traditional tools they have in Hollywood today. This kind of shot would of course require CGI; it would be very difficult to get real-world animals into these kinds of scenes. But with Sora it's pretty trivial, and it can just do it.
So I'll hand it over to Tim to chat a bit about how we're working with artists today with Sora, to see what they're able to do.

Yeah, so we came out with this pretty recently; we've given access to a small pool of artists. And maybe to take a step back: this isn't yet a product or something that is available to a lot of people. It's not in ChatGPT or anything. This is research that we've done, and we think that the best way to figure out how this technology will be valuable, and also how to make it safe, is to engage with people external to the organization. So that's why we came out with this announcement, and when we did, we started working with small teams of red teamers, who helped with the safety work, as well as artists and people who will use this technology.

So shy kids is one of the artists that we work with, and I really like this quote from them: "as great as Sora is at generating things that appear real, what excites us is the ability to make things that are totally surreal". I think that's really cool, because when you first think about generating videos, we have all these existing uses of video that we know of in our lives, and we quickly think about those: maybe stock videos or existing films. But what's really exciting to me is what totally new things people will come up with: what completely new forms of media and entertainment, and just new experiences that we've never seen before, are going to be enabled by Sora and by future versions of video and media generation technology.

Now I want to show this fun video that shy kids made using Sora when we gave them access. Oh, okay, it has audio; unfortunately I guess we don't have that hooked up. It's this cute short about a guy with a balloon head. You should really go and check it out: we came out with this blog post, "Sora: First Impressions", and we have videos from a number of artists that we've given access to. There's this really cute monologue of this guy talking about life from a different perspective, as a guy with a balloon head, right? And this is just awesome and so creative. The other artists we've given access to have done really creative and totally different things from this too; the way each artist uses this is just so different from every other artist, which is really exciting, because it says a bit about the breadth of ways that you can use this technology. But this is just really fun, and there are so many people with such brilliant ideas, as Bill was talking about, for whom it might otherwise be really hard to make things like this, their film, or their thing that's not a film, that's totally new and different. Hopefully this technology will really democratize content creation; in the long run it enables so many more people with creative ideas to bring those to life and show them.
Now I want to talk a bit about some of the technology behind Sora, and I'll talk about it from the perspective of language models. What has made them work so well is the ability to scale, and the bitter lesson: methods that improve with scale are, in the long run, the methods that will win out as you increase compute, because over time we have more and more compute, and if methods utilize it well, then they get better and better. Language models are able to do that in part because they take all different forms of text (you take math and code and prose and whatever is out there) and turn it all into this universal language of tokens, and then you train these big Transformer models on all these different types of tokens: this kind of universal model of text data. By training on this vast array of different types of text, you learn these really generalist models of language. You can do all these things, right? You can use ChatGPT, or whatever your favorite language model is, for all different kinds of tasks, and it has such a breadth of knowledge that it's learned from the combination of this variety of data.

We want to do the same thing for visual data, and that's exactly what we did with Sora. So we take vertical videos and images and square images, low resolution, high resolution, wide aspect ratio, and we turn them into patches. A patch is a little cube in spacetime. You can imagine a stack of frames: a video is like a stack of images that are all the frames, so we have this volume of pixels, and then we take these little cubes from inside it. And you can do that on any volume of pixels, whether it's a high-resolution image or a low-resolution image, regardless of the aspect ratio, long videos, short videos. You turn all of them into these spacetime patches, and those are our equivalent of tokens. Then we train Transformers on these spacetime patches, and Transformers are really scalable, and that allows us to think of this problem the same way people think about language problems: how do we get really good at scaling, and make methods such that as we increase the compute and the data, they just get better and better.
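The talk doesn't include code, but slicing a pixel volume into spacetime patches is easy to sketch. Here is a minimal illustration in Python; the patch size `(pt, ph, pw)`, the trimming strategy, and the use of raw pixels rather than a learned latent space are all assumptions made for clarity, not Sora's actual implementation:

```python
import numpy as np

def spacetime_patches(video: np.ndarray, pt: int = 4, ph: int = 16, pw: int = 16) -> np.ndarray:
    """Slice a video volume into spacetime patches, the visual analogue of tokens.

    video: array of shape (T, H, W, C), i.e. frames x height x width x channels.
    Returns (num_patches, pt * ph * pw * C): one row per little spacetime cube,
    regardless of the clip's duration, resolution, or aspect ratio.
    """
    T, H, W, C = video.shape
    # Trim so each dimension divides evenly into patches (a simplification;
    # a real system might pad instead).
    T, H, W = T - T % pt, H - H % ph, W - W % pw
    v = video[:T, :H, :W]
    # Regroup the volume into a grid of (pt, ph, pw) cubes, then flatten each cube.
    v = v.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)   # (nT, nH, nW, pt, ph, pw, C)
    return v.reshape(-1, pt * ph * pw * C)

# A one-second 1080p clip and a single square low-res image (T=1) both become
# sequences of the same kind of token, which is the whole point.
print(spacetime_patches(np.zeros((30, 1080, 1920, 3), dtype=np.float32)).shape)
```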
Training on multiple aspect ratios also allows us to generate with multiple aspect ratios. There we go: here's the same prompt, and you can generate vertical, square, and horizontal. That's also nice: in addition to letting you use more data, which is really valuable (you want to use all the data in its native format, as it exists), it gives you more diverse ways to use the model. I actually think vertical videos are really nice; we look at content on our phones all the time, right? So it's nice to be able to generate vertical and horizontal and a variety of things.

We also have some zero-shot video-to-video capabilities. This uses a method that's commonly used with diffusion, and we can apply it because our model uses diffusion, which means that it denoises the video, starting from pure noise, in order to create the video iteratively.
So we use this method, called SDEdit, and apply it, and this allows us to change an input video. The one on the left is all generated, but it could be a real video. Then we say "rewrite the video in pixel art style", "put the video in space with a rainbow road", or "change the video to a medieval theme", and you can see that it edits the video but keeps the structure the same. In a second it will go through a tunnel, for example, and it interprets that tunnel in all these different ways. This medieval one is pretty amazing, right? Because the model is also intelligent, it's not just changing something shallow: it's medieval, we don't really have a car, so it's going to make a horse carriage.
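SDEdit itself is a published technique (Meng et al., 2021), and its core idea fits in a few lines: instead of starting the reverse diffusion from pure noise, you partially noise the input video and then denoise it under the new prompt, so coarse structure survives while appearance is rewritten. A minimal sketch, assuming a hypothetical `model.denoise_step` API and a simplified noising rule; this is not Sora's actual code:

```python
import numpy as np

def sdedit(input_video, prompt, model, noise_strength=0.6, num_steps=50):
    """Zero-shot video-to-video editing in the style of SDEdit.

    noise_strength in (0, 1]: how far toward pure noise the input is pushed.
    Small values preserve structure closely; 1.0 ignores the input entirely.
    model.denoise_step(x, t, prompt) stands in for one reverse-diffusion step
    conditioned on the text prompt (hypothetical API, simplified schedule).
    """
    start = int(num_steps * noise_strength)
    t0 = start / num_steps
    # Partially noise the input instead of starting from pure noise.
    x = np.sqrt(1 - t0**2) * input_video + t0 * np.random.randn(*input_video.shape)

    # Denoise from that intermediate point under the *new* prompt; the
    # surviving low-frequency structure (roads, tunnels, camera path)
    # anchors the edit while the prompt rewrites the appearance.
    for step in reversed(range(start)):
        x = model.denoise_step(x, t=step / num_steps, prompt=prompt)
    return x

# edited = sdedit(generated_clip, "change the video to a medieval theme", model)
```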
Another fun capability the model has is interpolating between videos. Here we have two different creatures, and the video in the middle starts with the one on the left and ends with the one on the right, and it's able to do that in this really seamless and amazing way.
[Music]
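The talk doesn't say how the interpolation is implemented; one plausible way to picture it (a sketch under that assumption, not Sora's published method) is as a blend in the model's latent or noise space whose weight sweeps from one clip to the other over time, with the denoiser left to resolve the in-between frames into something coherent:

```python
import numpy as np

def interpolate_videos(latent_a, latent_b, model, num_steps=50):
    """Generate a clip that morphs seamlessly from video A into video B.

    latent_a / latent_b: noised latents of the two source clips, shape (T, ...).
    The blend weight sweeps 0 -> 1 along the time axis, so the result starts
    as A and ends as B; denoising fills in a physically plausible transition.
    model.denoise_step is a hypothetical stand-in for one reverse step.
    """
    T = latent_a.shape[0]
    w = np.linspace(0.0, 1.0, T).reshape((T,) + (1,) * (latent_a.ndim - 1))
    x = (1 - w) * latent_a + w * latent_b   # A at the start, B at the end

    for step in reversed(range(num_steps)):
        x = model.denoise_step(x, t=step / num_steps)
    return x
```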
So I think something that the past slide and this slide really point out is that there are so many unique and creative things you can potentially do with these models. It's similar to when we first had language models: obviously people were like, "oh, you can use it for writing", right? Okay, yes, you can, but there are so many other things you can do with language models, and even now, every day, people are coming up with some creative new thing you can do with a language model. The same thing's going to happen for these visual models: there are so many creative, interesting ways to use them, and I think we're only starting to scratch the surface of what we can do with them.
Here's one I really love. There's a video of a drone on the left, and a butterfly underwater on the right, and we're going to interpolate between the two. Some of the nuance it gets right is spectacular: for example, it makes the Colosseum in the middle, as it's going, slowly start to decay. And here's one that's really cool too, because how can you possibly go from this kind of Mediterranean landscape to this gingerbread house in a way that is consistent with physics in the 3D world? It comes up with this really unique solution: the view is actually occluded by the building, and behind it you start to see this gingerbread house.
[Music]
So I encourage you, if you haven't: in addition to our main blog post, we also put out a technical report, and the technical report has these examples plus some other cool ones that we don't have in these slides. Again, I think it's really scratching the surface of what we could do with these models, but check that out if you haven't. There are some other fun things you can do, like extending videos forward or backwards. I think we have here one example where this is an image (we generated this one with DALL·E 3), and then we're going to animate this image using Sora.
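Extending a clip, or animating a still image, fits naturally into a diffusion framework: hold the frames you already have fixed and let the model denoise only the unknown ones. A minimal sketch under that assumption; the clamping scheme and `model.denoise_step` API are illustrative, not Sora's published interface:

```python
import numpy as np

def extend_video(known_frames, num_new_frames, model, num_steps=50):
    """Extend a clip forward in time (a still image is just a 1-frame clip).

    known_frames: array (T_known, H, W, C) of frames to keep fixed.
    The new frames start as pure noise; at every denoising step the known
    region is clamped back to the original pixels, so the model must invent
    a continuation consistent with them. (Clamping to *clean* frames is a
    simplification; a real sampler would noise them to the current level.)
    """
    Tk = known_frames.shape[0]
    shape = (Tk + num_new_frames,) + known_frames.shape[1:]
    x = np.random.randn(*shape)              # everything starts as noise

    for step in reversed(range(num_steps)):
        x[:Tk] = known_frames                # pin down the frames we trust
        x = model.denoise_step(x, t=step / num_steps)
    x[:Tk] = known_frames
    return x

# Extending *backwards* is the same trick with the clamp on the final frames.
```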
All right, now I'm going to pass it off to Bill to talk a bit about why this is important on the path to AGI.

All right. Of course, everyone's very bullish on the role that LLMs are going to play in getting to AGI, but we believe that video models are on the critical path to it as well. Concretely, when we look at very complex scenes Sora generates, like that snowy scene in Tokyo we saw at the very beginning, we believe Sora is already beginning to show a detailed understanding of how humans interact with one another, how they have physical contact with one another. And as we continue to scale this paradigm, we think eventually it's going to have to model how humans think, right? The only way you can generate truly realistic video, with truly realistic sequences of actions, is if you have an internal model of how all objects, humans, environments, and so on work. That's how we think Sora is going to contribute to AGI.
So of course, the name of the game here, as it is with LLMs, is scaling, and a lot of the work we put into this paradigm was, as Tim alluded to earlier, coming up with this Transformer-based framework that scales really efficiently. We have here a comparison of different Sora models where the only difference is the amount of training compute that went into the model. On the far left, you can see Sora with the base amount of compute: it doesn't really even know how dogs look; it has a rough sense that the camera should move through scenes, but that's about it. If you 4x the amount of training compute, you can see it now knows what a dog looks like; it can put a hat on it, and it can put a human in the background. And if you really crank up the compute and go to 32x base, then you begin to see very detailed textures in the environment, you see this very lifelike movement in the feet and the dog's legs as it navigates through the scene, and you can see the woman's hands beginning to interact with that knitted hat. So as we continue to scale up Sora, just as we find emergent capabilities in LLMs, we believe we're going to find emergent capabilities in video models as well. And even with the amount of compute that we put in today, at that 32x mark, we think there are already some pretty cool things happening, so I'm going to spend a bit of time talking about that.
time talking about that so the first one
is complex scenes and animals so this is
another sample for this beautiful snowy
Tokyo City M and again you see the
camera flying through the sea it's
maintaining this 3D geometry this
couple's holding hands you can see
people at the Stalls it's able to
simultaneously model very complex
environment with a lot of agents in it
so today can only do pretty basic things
like these fairly like lowle
interactions but as we continue to scale
the model we think this is indicative of
what we can expect in the future you
know more kind of conversations between
people which are actually substantive
and meaningful and more complex physical
interactions another thing that's cool
about video models compared to llms is
we can do anal got a great an here
there's a lot of intelligence Beyond
humans in this world and we can learn
from all that intelligence we're not
limited to one notion of it and you can
do animals we can do dogs we really like
this one this is a dog in barano Italy
and you can see it's wants to just go to
that other window s stumbles a little
bit but it recovers so it's beginning to
build the model not only about for
example humans and local through scenes
but how any
Another property that we're really excited about is this notion of 3D consistency. There was, I think, a lot of debate at one point within the academic community about the extent to which we need inductive biases in generative models to really make them successful. With Sora, one thing we wanted from the beginning was a really simple and scalable framework that completely eschews any kind of hard-coded human inductive biases about physics. And what we found is that this works: as long as you scale up the model enough, it can figure out 3D geometry all by itself, without us having to bake 3D consistency into the model explicitly.

So here's "an aerial view of Santorini during the blue hour, showcasing the stunning architecture of white Cycladic buildings with blue domes". All these aerial shots we found Sora to be pretty successful at; you don't have to cherry-pick too much, it really does a great job at consistently coming up with good results here. And here's an aerial view with hikers as well as a gorgeous waterfall as they do some extreme hiking.
[Music]
Another property which has been really hard for video generation systems in the past, but which Sora has mostly figured out (it's not perfect), is object permanence. We can go back to our favorite little scene of the Dalmatian in Burano, and you can see that even as a number of people pass by it, the dog is still there. So Sora not only gets these very short-term interactions right, like we saw earlier with the woman passing by the blue sign in Tokyo, but even when you have multiple levels of occlusion it can still recover.

In order to have a really awesome video generation system, by definition you need non-trivial and really interesting things to happen over time. In the old days, when we were generating four-second videos, usually all we saw were very lightly animated GIFs; that was what most video generation systems were capable of. Sora is definitely a step forward, and now we're beginning to see signs that you can actually have actions that permanently affect the world state. This is, I'd say, one of the weaker aspects of Sora today; it doesn't nail it 100% of the time, but we do see glimmers of success, so I'll share a few.

This is a watercolor painting, and you can see that as the artist leaves brush strokes, they actually stick to the canvas; the artist is able to make a meaningful change to the world, and you don't just get a blurry nothing. And this older man with hair is devouring a cheeseburger; wait for it... there we go, he actually leaves a bite in it. These are very simple kinds of interactions, but they're really essential for video generation systems to be useful, not only for content creation but also in terms of AGI and being able to model long-range dependencies. If someone does something in the distant past and we want to generate a whole movie, we need the model to remember that, and for the state to stay affected over time. So this is a step toward that with Sora.
When we think about Sora as a world simulator, of course we're excited about modeling our real world's physics, and that's been a key component of this project, but at the same time there's no real reason to stop there. There are lots of other kinds of worlds, right? Every laptop we use, every operating system we use, has its own set of physics, its own set of entities and objects and rules, and Sora can learn from everything; it doesn't just have to be a real-world physics simulator. So we're really excited about the prospect of simulating literally everything, and as a first step towards that, we tried Minecraft.

So this is Sora, and the prompt is "Minecraft with the most gorgeous high-res 8K texture pack ever", and you can see Sora already knows a lot about how Minecraft works. It's not only rendering the environment, it's also controlling the player with a reasonably intelligible policy; it's not too interesting, but it's doing something. And it can model all the objects in the scene as well. We have another sample with the same prompt, and it shows a different texture pack this time. We're really excited about this notion that one day we can have a singular model which encapsulates all the knowledge across all these worlds. One joke we like to tell is that eventually you could run ChatGPT inside the video model.
Now let me talk a bit about failure cases, because of course Sora has a long way to go. Sora still has a really hard time today with certain kinds of physical interactions that we would think of as very simple, even simpler kinds of physics than this: if you drop a glass and it shatters, and you try to do a sample like that, Sora will get it wrong almost every time. So it really has a long way to go in understanding very basic things that we take for granted; we're by no means anywhere near the end of this yet.

To wrap up, we have a bunch of samples here before we go to questions. I think overall we're really excited about where this paradigm is going. We really view this as being like the GPT-1 of video, and we think this technology is going to get a lot better very soon. There are some signs of life and some cool properties we're already seeing, like I just went over, but we're really excited about this. We think the things that people are going to build on top of models like this are going to be mind-blowing, and we can't wait to see what the world does with it. So thanks a lot.
We have 10 minutes; who goes first?

Q: A question about understanding the agents, or having the agents interact with each other within the scene: is that piece of information explicit already, or is it just the pixels, and you have to run something on top of it?

A: Good question. All of this is happening implicitly, so when we see these Minecraft samples, we don't have any notion of where it's actually modeling the player or where it's explicitly representing actions within the environment. You're right that if you wanted to exactly describe what is happening, or somehow read it off, you would currently need some other system on top of Sora to extract that information. It's all implicit in the pixels, and for that matter, everything is implicit: 3D is implicit; there's no explicit representation of anything.

Q: So basically the things you just described are all cool capabilities derived from the model after training?

A: Exactly.
Q: Could you talk a little bit about the potential for fine-tuning? If you have a very specific character or IP; I know for the wave one you used an input image for that. How do you think those plug in, or get built into the process?

A: Yeah, great question. This is something we're really interested in. In general, one piece of feedback we've gotten from talking with artists is that they just want the model to be as controllable as possible. To your point, if they have a character they really love and have designed, they'd love to be able to use it across Sora generations. It's something that's actively on our mind. You could certainly do some kind of fine-tuning with the model if you had a specific dataset of your content that you wanted to adapt the model to. We're really at a stage where we're just finding out exactly what people want, so this kind of feedback is actually great for us. We don't have a clear roadmap for exactly what might be possible, but in theory it probably is.

All right, in the back.
Q: With language Transformers, you're autoregressively predicting in this sequential manner, but with vision Transformers we do this scan-line order, or maybe a snake through the spatial domain. Do you see this as a fundamental constraint for vision Transformers? Does the order in which you predict tokens matter?

A: Yeah, good question. In this case we're actually using diffusion, so it's not an autoregressive Transformer in the way that language models are. We're denoising the videos that we generate: we start from a video that's entirely noise, and we iteratively run our model to remove the noise, and when you do that enough times you remove all the noise and end up with a sample. So we actually don't have a scan-line order, for example, because you can do the denoising across many spacetime patches at the same time, and for the most part we actually do it across the entire video at once. We also have a way (we get into this a bit in the technical report) to first generate a shorter video and then extend it if you want. So it can be used either way: you can generate the video all at once, or generate a shorter video and extend it if you like.
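That answer compresses the whole generation procedure into a couple of sentences, so here is its shape in code: a minimal sketch of diffusion sampling over a whole video at once, with a hypothetical `model.denoise_step` API and an illustrative schedule, not OpenAI's implementation:

```python
import numpy as np

def generate_video(model, prompt, shape, num_steps=50):
    """Sample a video by iterative denoising (diffusion), not token by token.

    shape: (T, H, W, C) of the video (or of its latent representation).
    Unlike an autoregressive LM there is no scan-line order: every spacetime
    patch in the volume is refined jointly at each step.
    """
    x = np.random.randn(*shape)        # start from a video of pure noise
    for step in reversed(range(num_steps)):
        # One denoising step over the *entire* video, conditioned on the text.
        x = model.denoise_step(x, t=step / num_steps, prompt=prompt)
    return x                           # all noise removed: a finished sample
```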
Q: The internet's innovation was mostly driven by the adult industry; do you feel a need to pay that industry back?

A: I feel no need, no.

All right.

Q: Do you generate at 30 frames per second, or do you do frame interpolation after generation?

A: We generate at 30 FPS.

Q: Okay. Have you tried things like colliding cars, or rotations, to see if the generation fits a physical model of the world?

A: We've tried a few examples like that. I'd say rotations generally tend to be pretty reasonable; it's by no means perfect. I've seen a couple of samples from Sora of colliding cars; I don't think it's quite got the three laws down yet.
yet so what are the IND Ed that you
trying to fix right now with Sora that
your
so the engagement with people external
right now is mainly focused on artists
and how they would use it and what
feedback they have for being able to to
use it and people red teamers on safety
so that's really the two types of
feedback that we're looking for right
now and as Bill mentioned a really
valuable piece of feedback we getting
from artists the type of control they
want for example artists often want
control of the camera and the path of
the camera case also and then on the
safety concerns it's about we want to
make sure that if we were to give wider
access to this that it would be
responsible and safe and there are lots
of potential misuses for it and
disinformation there many concerns Focus
possible to make videos that a user
could actually interact with it like
through V or something so let's say like
video is playing halfway through I stop
it I change a few things around with
video just like Chris would I be able to
rest of the video incorporate those
changes it's a great idea right now Sora
is still pretty slow from the latency
perspective what we generally said
publicly is so it depends a lot on the
exact parameters of the generation
duration resolution if you're cranking
out this thing it's going to take at
least a couple minutes and so we're
still I'd say a way is off from the kind
of experience you're describing but I
think it' be really cool
thanks what were your stated goals in
building this first version and what
were some problems that you had along
the way that you learned
from I'd say the overarching goal was
really always to get to 1080p at least
30 seconds from like the early days of
the project so we felt like video
generation was stuck in the Rut of this
4 second like J generation
and so that was really the key focus of
the team throughout the project along
the way I think we discovered how
painful it is to work with video data
it's a lot of pixels in these videos and
it's a lot of just very detailed boring
engineering work that needs to get done
to really make these systems work and I
I think we knew going into it that it
would involve a lot of elbow grease in
that regard but yeah it certainly took
some time so I don't know any other
findings along the way yeah I mean we
tried really hard to keep the method
really simple and that is sometimes
easier said than done but I think that
was a big focus of just let's do the
simplest thing we possibly can and
really scale it and do the scaling
prop did you do the prom and see the
output it's not good enough then you go
TR again do the same prom and then it's
there that's first video then you do
more than training than the new prom and
new video is that the process you use in
this reling the
videos that's a good question evaluation
is challenging for videos we use a
combination of things one is your actual
loss and low loss is correlated with
models that are better so that can help
another is you can evaluate the quality
of individual frames using image metrics
so we do use standard image metrics to
evaluate the quality frames and then we
also did spend quite a lot of time
generating samples and looking at them
ourselves although in that case it's
important that you do it across a lot of
samples and not just individual prompts
because sometimes this process is noisy
so you might randomly get a good sample
and think that you Improvement so this
would be like you compare Lots ofrs in
the
outputs uh we can't comment on that one
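The frame-metric idea in that answer is easy to make concrete. A toy sketch of the "average an image metric over many samples" logic; the `frame_metric` and `model.generate` hooks are stand-ins, since the talk doesn't name the exact metrics or APIs used:

```python
import numpy as np

def evaluate_model(model, prompts, frame_metric, samples_per_prompt=8):
    """Score a video model with a per-frame image metric, averaged widely.

    frame_metric: any image-quality score over a single frame, e.g. a
    frame-level FID or CLIP-based score (stand-ins; not named in the talk).
    Averaging across many prompts *and* many samples per prompt matters
    because sampling is noisy: one lucky video can fake an improvement.
    """
    scores = []
    for prompt in prompts:
        for _ in range(samples_per_prompt):
            video = model.generate(prompt)           # (T, H, W, C) frames
            scores.extend(frame_metric(frame) for frame in video)
    return float(np.mean(scores))
```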
One last question.

Q: Thanks for a great talk. My question is on the training data: how much training data do you estimate is required to get to AGI, and do you think we have enough data on the internet?

A: Yeah, that's a good question. I think we have enough data to get to AGI, and I also think people always come up with creative ways to improve things; when we hit limitations, we find creative ways around them. So I think whatever data we have will be enough to get to AGI.

Wonderful. Okay, that's it. Thank you.
[Applause]