Sora Creator “Video generation will lead to AGI by simulating everything” | AGI House Video
Summary
TLDR: The transcript discusses the development of a video generation model named Sora, which aims to revolutionize content creation and contribute to the path towards Artificial General Intelligence (AGI). Sora demonstrates the ability to generate high-definition, minute-long videos with complex scenes and object permanence. The model is trained on a diverse range of visual data, scaling up to improve its capabilities. The potential applications of Sora are vast, from creating realistic special effects to animating content and even simulating different worlds. The developers are engaging with artists and red teamers to refine the technology and ensure its responsible use.
Takeaways
- 🌟 At AGI House, speakers Tim and Bill introduced Sora, a new AI video generation model that can produce high-definition, minute-long videos with complex details like reflections and shadows.
- 🚀 A significant goal was achieved with the creation of videos that are 1080p and a minute long, marking a leap forward in video generation technology.
- 🎨 The technology supports various styles, including a paper craft world, and can understand and generate content in a full 3D space, capturing scene geometry and physical complexity.
- 🤖 The AI has learned intelligence about the physical world through training on videos, indicating its potential to revolutionize content creation and contribute to the path towards Artificial General Intelligence (AGI).
- 🎬 The technology can generate content with consistent character appearances across multiple shots without the need for manual editing or compositing.
- 🏙️ There are implications for special effects and Hollywood, as the AI can create fantastical effects that would typically be expensive in traditional CGI pipelines.
- 💡 The technology's potential extends beyond photorealistic content, as it can also generate animated content and scenes that would be difficult to shoot with traditional infrastructure.
- 🎨 Artists have been given access to the technology, and their feedback highlights the desire for more control over the generated content, such as camera control and character representation.
- 🔧 The technology is still in the research phase and is not yet a product available to the public, with the team focusing on artist engagement and safety considerations.
- 📈 As with language models, the key to improving the technology is scaling, with the expectation that increased compute and data will lead to better performance and more emergent capabilities.
Q & A
What was the primary goal for the Sora team in developing their video generation technology?
-The primary goal was to create high-definition, minute-long videos, marking a significant leap in video generation capabilities.
What challenges did the team face in achieving object permanence and consistency in their generated videos?
-Object permanence and consistency over long durations were challenging because the model needed to understand that an object, such as a blue sign, remains present even after a character walks in front of it and passes by.
How does the video generation technology impact content creation and special effects?
-The technology has the potential to revolutionize content creation and special effects by enabling the generation of complex scenes and fantastical effects that would normally be expensive to produce using traditional CGI pipelines in Hollywood.
What is the significance of the video generation technology in the path towards Artificial General Intelligence (AGI)?
-The video generation technology is seen as a critical step towards AGI because it not only generates content but also learns intelligence about the physical world, contributing to a more comprehensive understanding of environments and interactions.
How does the technology handle different video styles and 3D spaces?
-The technology can adapt to various video styles, such as paper craft worlds, and understand 3D spaces by comprehending geometry and physical complexities, allowing for camera movements through 3D environments with people moving within them.
What are some of the unique capabilities of the video generation model 'Sora'?
-Sora can generate videos with different aspect ratios, perform zero-shot video style transfers, interpolate between different videos, and even simulate different worlds, such as Minecraft, with a high level of detail and understanding of the environment's physics.
How does the Sora team engage with external artists and red teamers to refine the technology?
-The team provides access to a small pool of artists and red teamers to gather feedback on how the technology can be made more valuable and safe, ensuring responsible use and addressing potential misuses.
What are some examples of creative applications of the video generation technology?
-Examples include creating a movie trailer featuring a 30-year-old spaceman, an alien blending in New York City, a scuba diver discovering a futuristic shipwreck, and a variety of animated content with unique styles and themes.
How does the training process for the video generation model differ from language models?
-While language models are auto-regressive, the video generation model uses diffusion: it starts from noise and iteratively removes it to produce a video, allowing for whole-video generation as well as extension of shorter videos.
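To make that concrete, here is a minimal sketch of a diffusion sampling loop of the kind described above. Sora's actual architecture, noise schedule, and sampler are not public, so `eps_model` (a trained noise-prediction network), the linear beta schedule, and the toy video shape below are all illustrative assumptions:

```python
import torch

# Minimal DDPM-style sampler, for illustration only: Sora's actual
# internals are unpublished. `eps_model` stands in for a trained
# network that predicts the noise present in a noisy video.
T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample_video(eps_model, shape=(1, 3, 16, 64, 64)):
    """Start from pure noise and iteratively denoise it into a video.

    shape = (batch, channels, frames, height, width) -- a toy size.
    """
    x = torch.randn(shape)                        # video of pure noise
    for t in reversed(range(T)):
        eps = eps_model(x, torch.tensor([t]))     # predict the noise
        # Standard DDPM update: estimate the slightly less noisy video.
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x                                      # fully denoised sample
```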
What are the future prospects for the 'Sora' video generation technology?
-The future prospects for Sora include further refinement of the model, increased control for artists over specific elements like camera paths, and the potential for simulating a wide range of worlds and environments beyond the physical world.
Outlines
🌟 Introduction at AGI House and Video Generation Achievements
The talk opens at AGI House, whose hosts welcome the attendees, before the speakers, Tim and Bill, present their achievements in video generation. The creation of an HD, one-minute-long video, a significant goal, is highlighted, along with the video's complexity: reflections, shadows, and object permanence. The model's ability to learn about 3D space and physical complexity is also mentioned, and the speakers express excitement about the opportunities video generation presents for content creation and its role in advancing AI.
🎨 Artistic Collaborations and the Potential of Video Generation
This paragraph delves into the collaboration with artists and the research conducted to understand the value and safety of the video generation technology. It emphasizes the importance of external engagement and the role of red teamers in safety work. The artist collective shy kids is highlighted for their quote on the technology's ability to create surreal imagery, and a video they made demonstrates its potential for new and unique forms of media and entertainment. The blog post 'Sora First Impressions' and the creative works of various artists are mentioned to illustrate the technology's versatility.
🚀 Scaling Video Models for AI and Content Creation
The focus of this paragraph is on the technological aspects of scaling video models and their importance in the journey towards AGI. It discusses the methodology of using spacetime patches as a foundation for training Transformers, similar to language models with tokens. The ability to generate videos with multiple aspect ratios and zero-shot video-to-video editing are also covered. The paragraph highlights the creative potential of these models, comparing their future impact to the evolution of language models. It also touches on the technical report and additional examples that showcase the technology's capabilities.
🤖 Understanding Human Interaction and 3D Consistency
This paragraph discusses the detailed understanding of human interaction and 3D geometry that Sora is beginning to exhibit. It emphasizes the importance of scaling the model to achieve a realistic video generation, which requires an internal model of how objects, humans, and environments work. The paragraph also covers the potential for more complex interactions and the ability of video models to learn from various forms of intelligence. The concept of 3D consistency and object permanence is explored, along with the challenges that remain in perfecting these aspects. The paragraph concludes with a discussion on the potential of Sora as a world simulator, including its application in environments like Minecraft.
🛠️ Fine-Tuning, Industry Applications, and Future Directions
The final paragraph focuses on the potential for fine-tuning the model for specific characters or IPs, the importance of artist control, and the current stage of development. It addresses the use of diffusion in the video generation process, the lack of a fundamental constraint like scanline order in Vision Transformers, and the possibility of generating videos at 30 FPS. The paragraph also discusses the engagement with artists and red teamers, the focus on safety and responsible use, and the potential for interactive videos. The challenges in video data processing and the engineering work required are acknowledged, along with the goals achieved in building the first version of the technology.
🌐 Training Data and the Path to AGI
The paragraph concludes the discussion by addressing the question of training data requirements for AGI and whether the internet provides enough data. The speaker expresses confidence in the sufficiency of available data and the creativity of people in overcoming limitations. The talk ends with a positive outlook on the potential of AI and the impact of the technology discussed.
Keywords
💡AI
💡Video Generation
💡Object Permanence
💡3D Geometry
💡Content Creation
💡Transformers
💡Denoising
💡Stable Diffusion
💡Collaborators
💡Special Effects
💡Photorealistic
Highlights
AGI House opens by honoring people like the attendees, highlighting their importance in the creation process.
The introduction of an AI-generated video that is HD and a minute long, marking a significant leap in video generation technology.
Complexity in the video is noted, such as reflections and shadows, indicating advancements in video generation capabilities.
The video generation technology demonstrates object permanence and consistency over long durations, which is a challenging problem to solve.
The technology can produce various styles, such as a paper craft world, showcasing its versatility.
The AI understands the geometry and physical complexities of 3D spaces, indicating its learning capabilities about the physical world.
The potential of video generation to revolutionize content creation is discussed, with a focus on its impact on various industries.
A sample movie trailer is shown, featuring an astronaut persisting across multiple shots, demonstrating the technology's ability to create coherent narratives.
The technology's implications for special effects in the film industry are explored, highlighting its potential to reduce costs and increase creativity.
The technology can generate photorealistic content as well as animated content, expanding the range of creative possibilities.
A unique scene with a blend of animals and a jewelry store is presented, showcasing the technology's ability to create complex, fantastical environments.
The technology's potential to democratize content creation is discussed, enabling more people to bring their creative visions to life.
The process of training language models on a vast array of text data is explained, drawing a parallel to training the visual model on diverse visual data.
The concept of aspect ratios and how they can be used to generate content in various formats is discussed, highlighting the technology's adaptability.
The use of zero-shot video-to-video editing is described, allowing a single input video to be re-rendered in different styles.
The ability to interpolate between videos and create seamless transitions between different creatures or scenes is showcased.
The potential for the technology to contribute to AI by modeling how humans think and interact is discussed, emphasizing its importance on the path to AGI.
The importance of scaling the model and increasing compute is emphasized as a key factor in improving the technology.
The technology's ability to model complex scenes and interactions between agents, such as people and animals, is highlighted as a sign of future capabilities.
Transcripts
Here's one thing about AGI House: we honor people like you guys. That's why we have you here, that's why you're all here. Without further ado...
[Applause]
Awesome, what a big, fun crowd. I'm Tim, this is Bill, and we made Sora at OpenAI together with a team of amazing collaborators. We're excited to tell you a bit about it today. We'll talk at a high level about what it does, some of the opportunities it has to impact content creation, some of the technology behind it, as well as why this is an important step on the path to AGI.
So without further ado, here is a Sora-generated video. This one is really special to us because it's HD and a minute long, and that was a big goal of ours. When we were trying to figure out what would really make a leap for video generation: you want 1080p videos that are a minute long, and this video does that. You can see it has a lot of complexity too, like in the reflections and the shadows. One really interesting point is that blue sign: she's about to walk in front of it, and after she passes, the sign still exists afterwards. That's a really hard problem for video generation, to get this type of object permanence and consistency over a long duration.
It can also do a number of different styles. Here we go: this is a paper craft world that it can imagine, so that's really cool. And it can also learn about a full 3D space. Here the camera moves through 3D as people are moving, and it really understands the geometry and the physical complexities of the scene. So Sora learned a lot: in addition to just being able to generate content, it's actually learned a lot of intelligence about the physical world, just from training on videos. Now we'll talk a bit about some of the opportunities for video generation to revolutionize content creation. As alluded to, we're really excited about Sora not only because we view it as being on the critical path towards AGI, but also in the short term for what it's going to do for content.
So this is one sample we like a lot. The prompt, in the bottom left here, is a movie trailer featuring the adventures of the 30-year-old spaceman. The hardest part of doing video, by the way, is always just getting PowerPoint to work with it. There we go.
[Music]
All right. What's cool about this sample in particular is that this astronaut is persisting across multiple shots, which are all generated by Sora. We didn't stitch this together; we didn't have to do a bunch of outtakes and then create a composite shot at the end. Sora decides where it wants to change the camera, but it does know that it's going to put the same astronaut in a bunch of different environments. Likewise, we think there are a lot of cool implications for special effects. This is one of our favorite samples too: an alien blending in naturally in New York City, paranoia thriller style, 35mm. Already you can see that the model is able to create these very fantastical effects, which would normally be very expensive in traditional CGI pipelines for Hollywood. There are a lot of implications here for what this technology is going to bring in the short term. Of course, we can do other kinds of effects too. This is more of a sci-fi scene: a scuba diver discovering a hidden futuristic shipwreck, with cybernetic marine life and advanced alien technology.
[Music]
As someone who's seen so much incredible content from people on the internet who don't necessarily have access to tools like Sora to bring their visions to life (they come up with cool concepts and post them on Reddit or something), it's really exciting to think about what people are going to be able to do with this technology. Of course, it can do more than just a photorealistic style; you can also do animated content. My favorite part of this one is the spell the otter does; a little bit of charm. And another example of just how cool this technology is, is when we start to think about scenes which would be very difficult to shoot with traditional Hollywood infrastructure. The prompt here is a zoo shop in New York City that is both a jewelry store and a zoo: saber-tooth tigers with diamond and gold adornments, turtles with glistening emerald shells, et cetera. What I love about this shot is that it's photorealistic, but it's something that would be incredibly hard to accomplish with the traditional tools that they have in Hollywood today. This kind of shot would of course require CGI; it would be very difficult to get real-world animals into these kinds of scenes. But with Sora it's pretty trivial, and it can just do it.
So I'll hand it over to Tim to chat a bit about how we're working with artists today with Sora, to see what they're able to do.
Yeah, so we just came out with this pretty recently: we've given access to a small pool of artists. And maybe to take a step back: this isn't yet a product or something that is available to a lot of people. It's not in ChatGPT or anything. This is research that we've done, and we think that the best way to figure out how this technology will be valuable, and also how to make it safe, is to engage with people external to our organization. So that's why we came out with this announcement, and when we did, we started working with small teams of red teamers, who help with the safety work, as well as artists and people who will use this technology. shy kids is one of the artist collectives that we work with, and I really like this quote from them: as great as Sora is at generating things that appear real, what excites us is the ability to make things that are totally surreal. I think that's really cool, because when you immediately think about generating videos, we have all these existing uses of videos that we know of in our lives, and we quickly think about replacing those: maybe stock videos or existing films. But what's really exciting to me is: what totally new things are people going to invent? What completely new forms of media and entertainment, and just new experiences that we've never seen before, are going to be enabled by Sora and by future versions of video and media generation technology?
Now I want to show this fun video that shy kids made using Sora when we gave them access. Oh, okay, it has audio; unfortunately I guess we don't have that hooked up. It's this cute plot about this guy with the balloon head, and you should really go and check it out. We came out with this blog post, 'Sora First Impressions', and we have videos from a number of artists that we've given access to. There's this really cute monologue of this guy talking about life from his different perspective, as a guy with a balloon head, and it's just awesome and so creative. The other artists we've given access to have done really creative and totally different things from this too. The way each artist uses this is just so different from each other artist, which is really exciting, because it says a bit about the breadth of ways that you can use this technology. But this is just really fun, and there are so many people with such brilliant ideas, as Bill was talking about, for whom it would maybe be really hard to make their film, or their thing that's not a film, that's totally new and different. Hopefully this technology will really democratize content creation in the long run and enable so many more people with creative ideas to be able to bring those to life and show them.
Now I want to talk a bit about some of the technology behind Sora. I'll talk about it from the perspective of language models, and what has made them work so well: the ability to scale. The bitter lesson is that methods that improve with scale are, in the long run, the methods that will win out as you increase compute, because over time we have more and more compute, and if methods utilize that well, then they will get better and better. Language models are able to do that in part because they take all different forms of text (you take math and code and prose and whatever is out there) and you turn it all into this universal language of tokens. Then you train these big Transformer models on all these different types of tokens, this kind of universal model of text data, and by training on this vast array of different types of text, you learn these really generalist models of language. You can do all these things, right? You can use ChatGPT, or whatever your favorite language model is, to do all different kinds of tasks, and it has such a breadth of knowledge that it's learned from the combination of this variety of data.
We wanted to do the same thing for visual data, and that's exactly what we did with Sora. We take vertical videos and images, square images, low resolution, high resolution, wide aspect ratio, and we turn them into patches. A patch is this little cube in spacetime. You can imagine a stack of frames: a video is like a stack of images that are all the frames, so we have this volume of pixels, and then we take these little cubes from inside it. You can do that on any volume of pixels, whether that's a high-resolution image or a low-resolution image, regardless of the aspect ratio, long videos, short videos: you turn all of them into these spacetime patches, and those are our equivalent of tokens. Then we train Transformers on these spacetime patches, and Transformers are really scalable. That allows us to think of this problem in the same way that people think about language problems: how do we get really good at scaling, and make methods such that as we increase the compute and the data, they just get better and better?
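As a rough illustration of the "little cubes in spacetime" idea, here is one minimal way to cut a video tensor into flattened spacetime patches. The patch sizes and tensor layout are illustrative guesses, not the actual Sora implementation (which, per the technical report, also operates on a compressed latent representation rather than raw pixels):

```python
import torch

def spacetime_patches(video, pt=2, ph=16, pw=16):
    """Cut a video tensor into flattened spacetime patches.

    video: (C, T, H, W) tensor; pt/ph/pw are the patch extents in
    time, height, and width (sizes here are illustrative guesses).
    Returns (num_patches, C * pt * ph * pw): the video-model
    analogue of a token sequence.
    """
    C, T, H, W = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    x = video.reshape(C, T // pt, pt, H // ph, ph, W // pw, pw)
    # Bring the three patch-grid axes to the front, then flatten the
    # contents of each little spacetime cube into one vector.
    x = x.permute(1, 3, 5, 0, 2, 4, 6)   # (Tp, Hp, Wp, C, pt, ph, pw)
    return x.reshape(-1, C * pt * ph * pw)

# Any resolution, aspect ratio, or duration that divides evenly yields
# the same kind of patch sequence, just a different length:
patches = spacetime_patches(torch.randn(3, 16, 256, 144))  # vertical clip
print(patches.shape)                     # -> torch.Size([1152, 1536])
```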
Training on multiple aspect ratios also allows us to generate with multiple aspect ratios. There we go: here's the same prompt, and you can generate vertical, square, and horizontal. That's also nice because, in addition to letting you use more data (which is really valuable; you want to use all the data in its native format as it exists), it also gives you more diverse ways to use the model. I actually think vertical videos are really nice; we look at content all the time on our phones, right? So it's nice to actually be able to generate vertical and horizontal and a variety of things.
We can also do zero-shot video-to-video editing. This uses SDEdit, a method that's commonly used with diffusion, and we can apply it because our model uses diffusion, which means that it denoises the video, starting from noise, in order to iteratively create it. So we use this method called SDEdit and apply it, and it allows us to change an input video. The one on the left is all generated, but it could be a real video. Then we say "rewrite the video in pixel art style", "put the video in space with a rainbow road", or "change the video to a medieval theme", and you can see that it edits the video but keeps the structure the same. In a second it will go through a tunnel, for example, and it interprets that tunnel in all these different ways. This medieval one is pretty amazing, right? Because the model is also intelligent, it's not just changing something shallow about the video: it's medieval, we don't really have cars, so it's going to make a horse carriage.
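SDEdit itself is a published technique (Meng et al., 2021), and the core trick is small: instead of starting from pure noise, you partially noise the input video and then denoise it under the new prompt, so coarse structure survives while the style changes. A minimal sketch with the same toy stand-ins as before (`eps_model`, the schedule, and the prompt embedding are assumptions, not Sora's internals):

```python
import torch

@torch.no_grad()
def sdedit(eps_model, video, prompt_emb, t_start=600, T=1000):
    """Edit `video` by partially noising it, then denoising under a
    new prompt. Higher t_start = stronger edit, less structure kept.
    Schedule and model are toy stand-ins, not Sora's actual internals.
    """
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    # Jump the clean video to noise level t_start (forward process).
    noise = torch.randn_like(video)
    x = (torch.sqrt(alpha_bars[t_start]) * video +
         torch.sqrt(1.0 - alpha_bars[t_start]) * noise)

    # Denoise the rest of the way, conditioned on the *new* prompt,
    # e.g. "rewrite the video in pixel art style".
    for t in reversed(range(t_start)):
        eps = eps_model(x, torch.tensor([t]), prompt_emb)
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * z
    return x
```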
And another fun capability the model has is interpolating between videos. Here we have two different creatures, and this video in the middle starts with the left one and is going to end with the right one, and it's able to do it in this really seamless and amazing way.
[Music]
I think something that the past slide and this slide really point out is that there are so many unique and creative things you can potentially do with these models. Similar to how, when we first had language models, people said, "oh, you can use it for writing", right? Okay, yes you can, but there are so many other things you can do with language models, and even now, every day, people are coming up with some creative new cool thing you can do with a language model. The same thing is going to happen for these visual models. There are so many creative, interesting ways in which we can use them, and I think we're only starting to scratch the surface of what we can do with them.
Here's one I really love. There's a video of a drone on the left, and this is a butterfly underwater on the right, and we're going to interpolate between the two. Some of the nuance it gets is really spectacular: for example, it makes the coliseum in the middle slowly start to decay as it goes. And here's one that's really cool too, because: how can you possibly go from this kind of Mediterranean landscape to this gingerbread house in a way that is consistent with physics in the 3D world? It comes up with this really unique solution, where the view is actually occluded by the building, and behind it you start to see this gingerbread house.
[Music]
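The talk doesn't say how these interpolations are produced. One common recipe with diffusion models is to spherically interpolate between the two clips' starting noise latents (and blend the prompt conditioning) before denoising, so here is a hedged sketch under that assumption:

```python
import torch

def slerp(a, b, w):
    """Spherical interpolation between two noise tensors: unlike a
    straight lerp, it keeps the result at a typical Gaussian norm."""
    a_flat, b_flat = a.flatten(), b.flatten()
    omega = torch.acos(torch.clamp(
        torch.dot(a_flat, b_flat) / (a_flat.norm() * b_flat.norm()),
        -1.0, 1.0))
    if omega.abs() < 1e-4:               # nearly parallel: plain lerp
        return (1 - w) * a + w * b
    return (torch.sin((1 - w) * omega) * a +
            torch.sin(w * omega) * b) / torch.sin(omega)

# Hypothetical usage: noise_a / noise_b are the latents that denoise
# into the drone clip and the butterfly clip; sweeping w across the
# video's duration would make frame 0 match the left clip and the
# final frame match the right one.
noise_a = torch.randn(3, 16, 64, 64)
noise_b = torch.randn(3, 16, 64, 64)
midpoint = slerp(noise_a, noise_b, 0.5)
```

Spherical rather than linear interpolation matters here because the straight midpoint of two high-dimensional Gaussian samples has an atypically small norm, which denoisers handle poorly.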
So I encourage you, if you haven't: in addition to our main blog post, we also came out with a technical report, and it has these examples plus some other cool ones that we don't have in these slides. Again, I think we're really just scratching the surface of what we can do with these models, but check that out if you haven't. There are some other fun things you can do, like extending videos forward or backward. I think we have one example here where this is an image we generated (this one with DALL·E 3), and then we're going to animate the image using Sora.
All right, now I'm going to pass it off to Bill to talk a bit about why this is important on the path to AGI.
All right. Of course, everyone's very bullish on the role that LLMs are going to play in getting to AGI, but we believe that video models are on the critical path to it as well. Concretely, when we look at very complex scenes that Sora generates, like that snowy scene in Tokyo that we saw at the very beginning, Sora is already beginning to show a detailed understanding of how humans interact with one another, how they have physical contact with one another. And as we continue to scale this paradigm, we think eventually it's going to have to model how humans think, right? The only way you can generate truly realistic video, with truly realistic sequences of actions, is if you have an internal model of how all objects, humans, environments, et cetera, work. So we think this is how Sora is going to contribute to AGI.
Of course, the name of the game here, as it is with LLMs, is scaling, and a lot of the work we put into this paradigm to make it happen was, as Tim alluded to earlier, coming up with this Transformer-based framework that scales really efficiently. So we have here a comparison of different Sora models where the only difference is the amount of training compute that we put into the model. On the far left, you can see Sora with the base amount of compute: it doesn't really even know how dogs look; it has a rough sense that the camera should move through scenes, but that's about it. If you 4x the amount of compute that we put into that training run, then you can see it now kind of knows what a dog looks like; it can put a hat on it, and it can put a human in the background. And if you really crank up the compute and go to 32x base, then you begin to see these very detailed textures in the environment; you see this very lifelike movement with the feet and the dog's legs as it's navigating through the scene; you can see that the woman's hands are beginning to interact with that knitted hat. So as we continue to scale up Sora, just as we find emergent capabilities in LLMs, we believe we're going to find emergent capabilities in video models as well. And even with the amount of compute that we put in today, at that 32x mark, we think there are already some pretty cool things happening, so I'm going to spend a bit of time talking about that.
So the first one is complex scenes and animals. This is another sample of this beautiful snowy Tokyo city, and again you see the camera flying through the scene. It's maintaining this 3D geometry; this couple's holding hands; you can see people at the stalls. It's able to simultaneously model a very complex environment with a lot of agents in it. Today it can only do pretty basic things, like these fairly low-level interactions, but as we continue to scale the model, we think this is indicative of what we can expect in the future: conversations between people which are actually substantive and meaningful, and more complex physical interactions. Another thing that's cool about video models compared to LLMs is that we can do animals, and we've got a great example here. There's a lot of intelligence beyond humans in this world, and we can learn from all of it; we're not limited to one notion of intelligence. So you can do animals; we can do dogs. We really like this one: this is a dog in Burano, Italy, and you can see it just wants to get to that other windowsill. It stumbles a little bit, but it recovers. So it's beginning to build a model not only of how, for example, humans move through scenes, but how any animal does.
Another property that we're really excited about is this notion of 3D consistency. There was, I think, a lot of debate at one point within the academic community about the extent to which we need inductive biases in generative models to really make them successful, and with Sora, one thing that we wanted to do from the beginning was come up with a really simple and scalable framework that completely eschews any kind of hard-coded inductive biases from humans about physics. What we found is that this works: as long as you scale up the model enough, it can figure out 3D geometry all by itself, without us having to bake 3D consistency into the model explicitly. So here's an aerial view of Santorini during the blue hour, showcasing the stunning architecture of white Cycladic buildings with blue domes. We found all these aerial shots tend to be pretty successful; you don't have to cherry-pick too much. It really does a great job at consistently coming up with good results here. And an aerial view of hikers, as well as a waterfall, as they do some extreme hiking.
[Music]
Another property which has been really hard for video generation systems in the past, but which Sora has mostly figured out (it's not perfect), is object permanence. We can go back to our favorite little scene of the Dalmatian in Burano, and you can see that even as a number of people pass by it, the dog is still there. So Sora not only gets these very short-term interactions right, like we saw earlier with the woman passing by the blue sign in Tokyo; even when you have multiple levels of occlusion, it can still recover the object.
In order to have a really awesome video generation system, by definition what you need is for there to be non-trivial and really interesting things that happen over time. In the old days, when we were generating four-second videos, usually all we saw were very lightly animated GIFs; that was what most video generation systems were capable of. Sora is definitely a step forward, and now we're beginning to see signs that you can actually do actions that permanently affect the world state. This is, I'd say, one of the weaker aspects of Sora today, and it doesn't nail it 100% of the time, but we do see glimmers of success here, so I'll share a few. This is a watercolor painting, and you can see that as the artist leaves brush strokes, they actually stick to the canvas, so they're able to make a meaningful change to the world, and you don't just get a kind of blurry nothing. And this older man with gray hair is devouring a cheeseburger; wait for it... there we go, he actually leaves a bite mark in it. These are very simple kinds of interactions, but this is really essential for video generation systems to be useful, not only for content creation but also in terms of AGI and being able to model long-range dependencies. If someone does something in the distant past and we want to generate a whole movie, we need to make sure the model can remember that, and that state is affected over time. So this is a step toward that with Sora.
When we think about Sora as a world simulator, of course we're excited about modeling our real world's physics, and that's been a key component of this project. But at the same time, there's no real reason to stop there. There are lots of other kinds of worlds, right? Every laptop we use, every operating system we use, has its own set of physics; it has its own set of entities and objects and rules, and Sora can learn from everything. It doesn't just have to be a real-world physics simulator. So we're really excited about the prospect of simulating literally everything, and as a first step towards that, we tried Minecraft. This is Sora, and the prompt is Minecraft with the most gorgeous high-res 8k texture pack ever, and you can see Sora already knows a lot about how Minecraft works. It's not only rendering this environment, it's also controlling the player with a reasonably intelligible policy; it's not too interesting, but it's doing something, and it can model all the objects in the scene as well. We have another sample with the same prompt; it shows a different texture pack this time. We're really excited about this notion that one day we can just have a singular model which can encapsulate all the knowledge across all these worlds. One joke we like to make is that eventually you could run ChatGPT inside the video model.
Now let's talk a bit about failure cases. Of course, Sora has a long way to go. Sora still has a really hard time today with certain kinds of physical interactions that we would think of as being very simple, like a chair: Sora doesn't treat it as a rigid object. Even simpler kinds of physics than that: if you drop a glass and it shatters, if you try to do a sample like that, Sora will get it wrong almost every time. So it really has a long way to go in understanding very basic things that we take for granted; we're by no means anywhere near the end of this yet. To wrap up, we have a bunch of samples here before we go to questions. I think overall we're really excited about where this paradigm is going; there's a lot we don't know yet about how to extend it, but we really view this as being like the GPT-1 of video, and we think this technology is going to get a lot better very soon. There are some signs of life and some cool properties we're already seeing, like I just went over, but we're really excited about this. We think the things that people are going to build on top of models like this are going to be mind-blowing and really amazing, and we can't wait to see what the world does with it. So thanks a lot.
We have 10 minutes; who goes first?
All right, so a question about understanding the agents, or having the agents interact with each other within the scene: is that piece of information explicit already, or is it just the pixels, and then you have to run something on top of it?
Good question. All of this is happening implicitly, so when we see these Minecraft samples, we don't have any notion of where it's actually modeling the player and where it's explicitly representing actions within the environment. So you're right that if you wanted to exactly describe what is happening, or somehow read it off, you would currently need some other system on top of Sora to extract that information. Currently it's all implicit, and for that matter, everything is implicit: 3D is implicit; there's no explicit representation of anything.
So basically the things you just described are all capabilities derived from the model after training? Cool.
Could you talk a little bit about the potential for fine-tuning? If you have a very specific character or IP; I know for one of them you used an input image. How do you think those plug in, or get built into the process?
Yeah, great question. This is something we're really interested in. In general, one piece of feedback we've gotten from talking with artists is that they just want the model to be as controllable as possible. To your point, if they have a character they really love and that they've designed, they would love to be able to use that across Sora generations. It's something that's actively on our mind. You could certainly do some kind of fine-tuning with the model if you had a specific dataset of your content that you wanted to adapt the model to. We're really at a stage where we're just finding out exactly what people want, so this kind of feedback is actually great for us. We don't have a clear roadmap for exactly what might be possible, but in theory it probably is.
All right, in the back. Okay, so language Transformers are autoregressive, predicting in this sequential manner, but in vision Transformers we do this scanline order, or maybe we snake through the spatial domain. Do you see this as a fundamental constraint for vision Transformers? Does the order in which you predict tokens matter?
Yeah, good question. In this case we're actually using diffusion, so it's not an autoregressive Transformer in the same way that language models are; we're denoising the videos that we generate. We start from a video that's entirely noise, and we iteratively run our model to remove the noise, and when you do that enough times you remove all the noise and end up with a sample. So we actually don't have this scanline order, for example, because you can do the denoising across many spacetime patches at the same time, and for the most part we actually just do it across the entire video at once. We also have a way (we get into this a bit in the technical report) where, if you want, you can first generate a shorter video and then extend it, so that's also an option. It can be used either way: you can generate the video all at once, or you can generate a shorter video and extend it if you like.
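The technical report confirms that Sora can extend videos forward or backward in time, but not the exact mechanism. One standard diffusion recipe for this is inpainting-style conditioning: denoise a longer clip while, at every step, re-imposing appropriately noised copies of the frames you already have. A speculative sketch under that assumption, reusing the earlier toy schedule and `eps_model` stand-in:

```python
import torch

@torch.no_grad()
def extend_video(eps_model, known, new_frames=8, T=1000):
    """Extend `known` (C, T0, H, W) by `new_frames` frames using
    inpainting-style conditioning -- an assumed recipe, not Sora's
    confirmed method. The known region is re-imposed at every step.
    """
    betas = torch.linspace(1e-4, 0.02, T)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    C, T0, H, W = known.shape
    x = torch.randn(C, T0 + new_frames, H, W)     # full-length noise
    for t in reversed(range(T)):
        # Overwrite the prefix with the known frames, noised to level
        # t, so the model only has to invent the continuation.
        noisy_known = (torch.sqrt(alpha_bars[t]) * known +
                       torch.sqrt(1.0 - alpha_bars[t]) *
                       torch.randn_like(known))
        x[:, :T0] = noisy_known
        eps = eps_model(x, torch.tensor([t]))
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * z
    x[:, :T0] = known                             # restore exact prefix
    return x
```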
Yeah. So internet innovation was mostly driven by adult content; do you feel a need to pay that industry back?
I feel no need. All right.
Do you generate at 30 frames per second, or do you do frame generation at a lower rate?
We generate at 30 FPS.
Okay. Have you tried colliding cars, or rotations and things like that, to see if the video generation fits a physical model of the world?
We've tried a few examples like that. I'd say rotations generally tend to be pretty reasonable; it's by no means perfect. I've seen a couple of samples from Sora of colliding cars; I don't think it's quite got Newton's three laws down yet.
So what are the issues you're trying to fix right now with Sora through this feedback?
The engagement with people externally right now is mainly focused on artists (how they would use it and what feedback they have) and on red teamers for safety, so those are really the two types of feedback we're looking for. As Bill mentioned, a really valuable piece of feedback we're getting from artists is the type of control they want; for example, artists often want control of the camera and the path of the camera. Then on the safety side, we want to make sure that if we were to give wider access to this, it would be responsible and safe. There are lots of potential misuses for it, and disinformation; there are many concerns.
Is it possible to make videos that a user could actually interact with, like through VR or something? Let's say a video is playing and halfway through I stop it and change a few things around within the video; would the rest of the video incorporate those changes?
It's a great idea. Right now, Sora is still pretty slow from a latency perspective. What we've generally said publicly is that it depends a lot on the exact parameters of the generation (duration, resolution), and if you're cranking out the full thing, it's going to take at least a couple of minutes. So we're still, I'd say, a ways off from the kind of experience you're describing, but I think it'd be really cool.
Thanks. What were your stated goals in building this first version, and what were some problems you had along the way that you learned from?
I'd say the overarching goal was really always to get to 1080p, at least 30 seconds, from the early days of the project, because we felt like video generation was stuck in a rut of these four-second GIF-like generations. That was really the key focus of the team throughout the project. Along the way, I think we discovered how painful it is to work with video data: it's a lot of pixels in these videos, and it's a lot of very detailed, boring engineering work that needs to get done to really make these systems work. I think we knew going in that it would involve a lot of elbow grease in that regard, but it certainly took some time. I don't know, any other findings along the way? Yeah, we tried really hard to keep the method really simple, and that is sometimes easier said than done, but I think that was a big focus: let's do the simplest thing we possibly can, really scale it, and do the scaling properly.
Did you do a prompt and look at the output, and if it's not good enough, train again and try the same prompt, and then do more training with new prompts and new videos? Is that the process you used in evaluating the videos?
That's a good question; evaluation is challenging for videos. We use a combination of things. One is your actual loss: low loss is correlated with models that are better, so that can help. Another is that you can evaluate the quality of individual frames using image metrics, so we do use standard image metrics to evaluate frame quality. And then we also did spend quite a lot of time generating samples and looking at them ourselves, although in that case it's important that you do it across a lot of samples and not just individual prompts, because this process is sometimes noisy, so you might randomly get a good sample and think that you've made an improvement. So you'd compare lots of prompts and the outputs.
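As a concrete example of the first signal mentioned (the actual loss), here is a minimal sketch of a held-out denoising-loss evaluation for a diffusion model; the schedule and `eps_model` are the same toy stand-ins as before, not Sora's real training objective:

```python
import torch

def heldout_diffusion_loss(eps_model, videos, T=1000):
    """Held-out denoising loss: noise real clips to random levels and
    measure how well the model predicts that noise. Lower loss tracks
    better models, per the talk. Schedule and model are toy stand-ins.
    """
    betas = torch.linspace(1e-4, 0.02, T)
    alpha_bars = torch.cumprod(1.0 - betas, dim=0)
    total = 0.0
    for video in videos:                 # each clip: (C, T, H, W)
        t = torch.randint(0, T, (1,))    # random noise level
        noise = torch.randn_like(video)
        noisy = (torch.sqrt(alpha_bars[t]) * video +
                 torch.sqrt(1.0 - alpha_bars[t]) * noise)
        pred = eps_model(noisy.unsqueeze(0), t)
        total += torch.mean((pred.squeeze(0) - noise) ** 2).item()
    return total / len(videos)
```

Averaging over many clips matters for the same reason the speakers give for eyeballing samples: any single draw is noisy.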
Uh, we can't comment on that one. Last question.
Thanks for a great talk. My question is on the training data: how much training data do you estimate is required for us to get to AGI, and do you think we have enough data on the internet?
Yeah, that's a good question. I think we have enough data to get to AGI, and I also think people always come up with creative ways to improve things; when we hit limitations, we find creative ways to improve regardless. So I think whatever data we have will be enough to get to AGI.
Wonderful. Okay, that's it. Thank you.
[Applause]