No Priors Ep.61 | OpenAI's Sora Leaders Aditya Ramesh, Tim Brooks and Bill Peebles
Summary
TLDR: In this episode of the No Priors podcast, Aditya, Tim, and Bill from OpenAI's Sora team discuss Sora, a new generative video model. Sora takes a text prompt and generates high-definition, visually coherent clips up to one minute long. They explore the potential for large video models like this to become world simulators, and how applying the scalable Transformer architecture to the video domain charts a path toward artificial general intelligence (AGI). They also discuss how Sora could make creative content easier to produce and enable new paradigms for entertainment, education, and communication. Finally, they cover video-model safety, including efforts to minimize the risks of fakes and misinformation.
Takeaways
- 🚀 Sora is a new generation of video model that can generate high-definition, visually coherent video clips up to one minute long from a text prompt.
- 🤖 The team behind Sora believes it plays an important role on the path to artificial general intelligence (AGI), since it can model complex environments and worlds within the weights of a neural network.
- 🌐 In the future, Sora could function as a world simulator containing people, animals, and objects that humans can interact with.
- 🎨 By giving artists and creators access to Sora and gathering their feedback, the team is learning how to make it most useful as a tool and how to deploy it safely.
- 📈 Sora's capabilities are expected to improve as more compute and data are added, enabling better simulation and longer-duration video generation.
- 🧩 Sora is scalable thanks to its Transformer architecture and can learn the complex relationships in video data.
- 📚 Sora introduces the concept of SpaceTime patches, treating video as 3D cubes so that it can handle many types of visual data, much as language models handle diverse text.
- 🔍 Sora learns 3D information from video on its own, and future versions are expected to improve object permanence and detailed object interactions, deepening its understanding of the physical world.
- 🌟 Sora's release is a GPT-1-like moment for video models, and future versions are expected to contribute even more to creativity and AGI.
- 🤔 The team is exploring how to maximize the technology's benefits while attending to safety issues such as the risks of fake videos and misinformation.
- ✨ Sora's evolution could yield predictions more accurate than human world models, while also exploring forms of intelligence different from human intelligence.
Q & A
What makes Sora, the new generative video model, distinctive?
-Sora is a generative video model built on a scalable Transformer architecture. It takes a text prompt and generates high-definition, visually coherent clips up to one minute long, and it demonstrates the potential to model extremely complex environments and worlds entirely within the weights of a neural network.
Why does the team believe Sora is on the path to AGI (artificial general intelligence)?
-Sora is considered to be on the path to AGI because generating truly realistic video requires the ability to model complex environments: how people interact and think, and even animals and all kinds of objects.
Is there a roadmap or timeline for making Sora generally available?
-There are no immediate product plans or timelines at the moment, but by giving access to artists and red teamers and gathering feedback, the team is thinking through what impact Sora will have on the world and how it can be useful to people.
What are the safety concerns around deploying Sora?
-Deploying Sora raises new safety concerns such as the risk of fake videos and spoofs and the spread of misinformation. How responsibility should be shared among the deploying companies, social media companies, and users also needs to be considered.
What does the team expect from Sora's future evolution?
-Sora is expected to capture more complex, long-term physical interactions more accurately. Because it learns 3D information and develops a deeper understanding of the human world, the team also sees it contributing to more intelligent, general AI models.
How does the team think about Sora's creativity and visual appeal?
-Sora's language understanding lets users steer what it generates, but it is not as though Aditya's personal aesthetic is deeply embedded in the model. In the future, it is expected to become customizable to an individual's own taste.
Explain Sora's tokenization.
-Sora represents data using the concept of SpaceTime patches, which can represent image and video data however it exists. This lets Sora generate not only 720p video but also vertical and widescreen video, covering aspect ratios from 1:2 to 2:1.
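As a concrete illustration of the SpaceTime-patch idea, here is a minimal sketch (not OpenAI's implementation; the patch sizes are illustrative assumptions) of cutting a video of arbitrary shape into 3D cubes and flattening each cube into a token vector:

```python
# Minimal sketch of SpaceTime patches: a video of any duration, height,
# and width is cut into fixed-size 3D cubes, and each cube is flattened
# into one token vector for a Transformer. Patch sizes are illustrative
# assumptions, not Sora's actual hyperparameters.
import numpy as np

def spacetime_patches(video, pt=4, ph=16, pw=16):
    """video: array of shape (T, H, W, C) -> tokens of shape (N, pt*ph*pw*C)."""
    T, H, W, C = video.shape
    # Trim so each dimension divides evenly into whole cubes.
    T, H, W = T - T % pt, H - H % ph, W - W % pw
    v = video[:T, :H, :W]
    # Factor each axis into (number of cubes, cube size), then group cube axes.
    v = v.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)    # (nt, nh, nw, pt, ph, pw, C)
    return v.reshape(-1, pt * ph * pw * C)   # one row per token

clip = np.random.rand(16, 256, 144, 3)       # a short vertical clip
print(spacetime_patches(clip).shape)         # (576, 3072)
```

The same function tokenizes a single image (T=1), a vertical clip, or a widescreen one; only the number of tokens changes, which is the breadth the team describes.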
Describe the evolution of Sora's architecture.
-Rather than extending an image generator into a video generator, Sora's architecture started from the question of how to generate a minute of HD video, and evolved toward a model that breaks data down in a simple, scalable way.
Explain how Sora will influence the direction of future AI research.
-Sora learns from video data, comes to understand concepts like 3D and object permanence, and gains deeper knowledge of the human world. This is expected to make AI models more intelligent and general, with potential impact well beyond video generation.
What can we expect from future updates to Sora?
-Beyond improving the accuracy of complex, long-term physical interactions, future updates are expected to better understand individual aesthetics and styles and allow customization.
How does the team think about social responsibility and safety as Sora is deployed?
-Deploying Sora requires addressing problems such as the risk of misinformation and the spread of fake media. Companies deploying the technology must take responsibility for it, while social media companies and users also need to share that responsibility.
Outlines
🎉 Introducing Sora and the path to AGI
This episode brings on Aditya, Tim, and Bill from OpenAI's Sora team to discuss Sora, a new generative video model. Sora takes a text prompt and generates high-definition, visually coherent clips up to one minute long. By applying the scalable Transformer architecture to the video domain, Sora shows how large video models could become world simulators, and the team believes it is on a critical pathway to AGI (artificial general intelligence).
🤖 Sora's applications and creativity
The group discusses Sora's future uses. It is expected to be incorporated into short films and other media genres, but entirely new creative uses will likely be discovered as well. They also discuss how Sora could contribute to future applications such as robotics and physics-engine-style simulation. Tim explains Sora's underlying technology, called the diffusion Transformer, and his expectations for how it will evolve; a rough sketch of the sampling idea follows below.
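As an illustration of the diffusion side of that recipe, the sketch below runs a DDPM-style sampling loop: start from pure Gaussian noise and repeatedly remove predicted noise until a sample remains. The denoiser here is a stand-in placeholder and the noise schedule is an illustrative assumption; in a diffusion Transformer the denoiser would be a Transformer operating on SpaceTime-patch tokens.

```python
# Hedged sketch of diffusion sampling: begin from noise and iteratively
# denoise. Not Sora's implementation; the schedule and the toy denoiser
# are placeholders for a trained network eps_theta(x, t).
import numpy as np

rng = np.random.default_rng(0)

def predicted_noise(x, t):
    # Stand-in for a trained denoising network.
    return x / np.sqrt(1.0 + t)

def sample(shape, steps=50):
    betas = np.linspace(1e-4, 0.02, steps)   # per-step noise amounts
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x = rng.standard_normal(shape)           # start from pure noise
    for t in reversed(range(steps)):
        eps = predicted_noise(x, t)
        # Standard DDPM update: subtract the predicted noise for this step.
        x = (x - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
        if t > 0:                            # re-inject a little fresh noise
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

tokens = sample((576, 3072))                 # e.g. the token grid of a short clip
```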
🧩 Sora's architecture and tokenization
This section explains Sora's architecture and tokenization. Using the concept of SpaceTime patches, video is treated as 3D cubes and fed to a Transformer model, which increases the flexibility of video generation. This lets Sora generate many types of visual content regardless of resolution or video length. The discussion also touches on the infrastructure and systems that had to be built to develop Sora.
🌟 Sora's aesthetics and the future of creativity
The group discusses Sora's visual beauty. Sora is designed so users can control what it generates through language understanding. In the future, artists may be able to upload their portfolios so the model understands a design firm's jargon and distinctive aesthetic. They also discuss how Sora might be applied to education and communication, and look forward to the arrival of new entertainment paradigms.
🚧 Sora's safety and future challenges
The conversation turns to Sora's safety and the potential for fakes. The risks of misinformation and related safety issues need to be handled carefully. Future challenges include improving object permanence and complex physical interactions; today Sora can fail to handle complex object interactions accurately, leaving room for improvement.
🌐 Sora's research roadmap and broader significance
The team discusses Sora's research roadmap and its impact on the evolution of AI as a whole. Sora may learn 3D information from video data and develop the capacity to understand human perception and interaction with the world, which could become a key ingredient of more intelligent AI models. They also discuss whether future versions of Sora could surpass human world models.
📈 Sora's scalability and expectations
The episode closes with Sora's scalability and how it will evolve as compute increases. Sora focuses on the simple task of predicting data, which works well at scale; a small illustration of a scaling-law fit follows below. Its release is a pivotal moment akin to the arrival of the GPT models, and rapid progress is expected. The hosts and guests wrap up with the future of creativity, the road to AGI, and safety considerations.
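The scaling discussion can be made concrete with a small, hypothetical calculation: fit a power law loss(C) = a · C^b (with b < 0) to compute/loss pairs in log-log space and extrapolate. The numbers below are made-up placeholders, not measurements from Sora.

```python
# Hedged sketch of constructing a scaling law: linear regression in
# log-log space. All numbers are illustrative placeholders.
import numpy as np

compute = np.array([1e18, 1e19, 1e20, 1e21])  # hypothetical training FLOPs
loss    = np.array([2.90, 2.35, 1.95, 1.62])  # hypothetical eval losses

b, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
a = np.exp(log_a)
print(f"loss ~= {a:.1f} * C^({b:.3f})")       # b < 0: loss falls with compute

# The use of a scaling law: predict the loss at 10x more compute.
print(a * 1e22 ** b)
```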
Keywords
💡OpenAI Sora
💡World simulators
💡Scalable Transformers architecture
💡AGI (Artificial General Intelligence)
💡Diffusion Transformer
💡SpaceTime patches
💡Aesthetic
💡Misinformation
💡Object permanence
💡Physical interactions
Highlights
OpenAI Sora is a new generative video model capable of creating high-definition, visually coherent clips up to a minute long from text prompts.
Sora raises the possibility of large video models acting as world simulators by applying scalable Transformers architecture to the video domain.
The team behind Sora believes that models like Sora are on a critical pathway to achieving AGI (Artificial General Intelligence).
Sora's potential future applications include world simulation where users can interact with complex environments modeled within a neural network.
OpenAI is currently providing access to Sora to a small group of artists and red teamers to gather feedback and assess the impact of the technology.
Feedback from artists suggests a need for more control and the potential for extending the model's capabilities beyond text input.
The team is inspired by the creative ways artists are using Sora to tell stories and generate compelling content.
Sora's technology is based on a diffusion Transformer, which generates data by starting from noise and iteratively removing it.
The Transformer architecture allows Sora to scale and improve as more compute power and data are applied to training the model.
Sora introduces the concept of 'SpaceTime patches', enabling the model to represent and learn from data in various video resolutions and lengths.
The team is focused on fundamental research and improvements to Sora rather than specific downstream applications like digital avatars.
Safety considerations are a priority, especially regarding the potential for misinformation and the need for responsible deployment of the technology.
Sora's development is seen as analogous to the early stages of GPT models, with expectations of rapid improvement and increased capabilities.
The team is excited about the creative potential of Sora and its future role in entertainment, education, and communication.
Sora's ability to learn about the world from visual data is expected to contribute to more intelligent AI models that better understand and interact with the world.
The team aims to make Sora more accessible by addressing computational costs and safety concerns to democratize the technology's use.
Sora's current limitations include the accuracy of complex, long-term physical interactions, which the team anticipates will improve with further development.
The public may misunderstand the potential and development trajectory of video models like Sora, which the team compares to the early stages of language model development.
Transcripts
[Music]
hi listeners welcome to another episode
of no priors today we're excited to be
talking to the team behind open AI Sora
which is a new generative video model
that can take a text prompt and return a
clip that is high def visually
coherent and up to a minute long Sora
also raised the question of whether
these large video models are World
simulators and applied the scalable
Transformers architecture to the video
domain we're here with the team behind
it Aditya Ramesh Tim Brooks and Bill Peebles
welcome to no priors guys thanks so much
for having us to start off why don't we
just ask each of you to introduce
yourselves so our listeners know uh who
we're talking to Aditya do you mind starting
us off sure I'm Aditya I lead the Sora
team together with Tim and Bill hi I'm
Tim I also lead the Sora team I'm Bill I
also lead the Sora team simple enough um
maybe we can just start with you know
the OpenAI mission is AGI right um
greater intelligence is text to video
like on path to that mission how'd you
end up working on this yeah we
absolutely believe models like Sora are
really on the critical Pathway to AGI we
think one sample that illustrates this
kind of nicely is a scene with a bunch
of people walking through Tokyo during
the winter and in that scene there's so
much complexity so you have a camera
which is flying through the scene
there's lots of people which are
interacting with one another they're
talking they're holding hands they're
people selling items at nearby stalls
and we really think this sample
illustrates how Sora is on a pathway
towards being able to model extremely
complex environments and worlds all
within the weights of a neural network
and looking forward you know in order to
generate truly realistic video you have
to have learned some model of how people
work how they interact with others how
they think ultimately and not only
people also animals and really any kind
of object you want to model and so
looking forward as we continue to scale
up models like Sora we think we're going
to be able to build these like World
simulators where essentially you know
anybody can interact with them I as a
human can have my own simulator running
and I can go and like give a human in
that simulator work to go do and they
can come back with it after they're done
and we think this is a pathway to AGI
which is just going to happen as we
scale up Sora in the future it's been
said that we're still far away despite
massive demand for a consumer product
like what uh is is that on the road map
what do you have to work on before you
you have broader access to Sora Tim you
want to talk about it sure yeah so we
really want to engage with people
outside of OpenAI and thinking about how
Sora will impact the world how it will
be useful to people and so we don't
currently have immediate plans or even a
timeline for creating a product but what
we are doing is we're giving access to
Sora to a small group of artists as well
as to Red teamers to start learning
about what impact Sora will have and so
we're getting feedback from artists
about how we can make it most useful as
a tool for them as well as feedback from
Red teamers um about how we can make
this safe how we could introduce it
to the public and this is going to set
our road map for our future research and
inform if we do in the future end up
coming up with the product or not um
exactly what timelines that would have
Aditya can you tell us about some of the
feedback you've gotten yeah so we have
given access to Sora to like a small
handful of artists and creators just to
get early feedback um in general I think
a big thing is just control ability so
right now the model really only accepts
text as input and while that's useful
it's still pretty constraining in
terms of being able to uh specify like
precise descriptions of what you want so
we're thinking about like you know how
to extend the capabilities of the model
potentially in the future so that you
can supply inputs other than just text
do you all have a favorite thing that
you've seen artists or others use it for
or a favorite video or something that
you found really inspiring I know that
when it launched a lot of people were
really stricken by just how beautiful
some of the images were how striking how
you'd see the shadow of a cat in a pool
of water things like that but I'm just
curious what you've seen sort of
emerge as more and more people
have started using it yeah it's been
really amazing to see what the artists
do with the model because we have our
own ideas of some things to try but then
people who for their profession are
making creative content are like so
creatively brilliant and do such amazing
things so shy kids had this really cool
video that they made this short story uh
Airhead with um this character that has
a balloon and they really like made this
story and there it
was really cool to see a way that Sora
can unlock and make this story easier
for them to tell and I think there it's
even less about like a particular clip
or video that Sora made and more about
this story that the these artists want
to tell and are able to share and that
Sora can help enable that so that is
really amazing to see you you mentioned
the Tokyo scene others my personal
favorite sample that we've created is uh
the bling Zoo so I posted this on my
Twitter uh the day we launched Sora and
it's essentially a multi-shot scene of a
zoo in New York which is also a jewelry
store and so you see like saber-tooth
tigers kind of like decked out with
bling it was very surreal yeah yeah and
so I love those kinds of samples because
as someone who you know loves to
generate creative content but doesn't
really have the skills to do it it's
like so easy to go play with this model
and to just fire off a bunch of ideas
and uh get something that's compelling
like the time it took to actually
generate that in terms of iterating on
prompts was you know really like less
than an hour to like get something I
really loved um so I had so much fun
just playing with the model to get
something like that out of it and it's
great to see the artists are also
enjoying using the models and getting
great content from that what do you
think is a timeline to broader use of
these sorts of models for short films or
other things because if you look at for
example the evolution of Pixar they
really started making these Pixar shorts
and then a subset of them turned into
these longer format movies and um a lot
of it had to do with how well could they
actually World model even little things
like the movement of hair or things like
that and so it's been interesting to
watch the evolution of that prior
generation of technology which I now
think is 30 years old or something like
that do you have a prediction on when
we'll start to see actual content either
from Sora or from other models that will
be professionally produced and sort of
part of the broader media genre that's a
good question I I don't have a
prediction on the exact timeline but but
one thing related to this I'm really
interested in is what things other than
like traditional films people might use
this for I do think that yeah maybe over
the next couple years we'll see people
starting to make like more and more
films but I think people will also find
completely new ways to use these models
that are just different from the current
media that we're used to because it's a
very different Paradigm when you can
tell these models kind of what you want
them to see and they can respond in a
way and maybe there are just like new
modes of interacting with content that
like really creative artists will come
up with so I'm actually like most
excited for what totally new things
people will be doing that's just
different from what we currently have it's
really interesting because one of the
things you mentioned earlier this is
also a way to do World modeling and I
think at it you've been at open AI for
something like five years and so you've
seen a lot of the evolution of models in
the company and what you've worked on
and I remember going to the office
really early on and it was initially
things like robotic arms and it was
self-playing Games and Things or
selfplay for games and things like that
um as you think about the capabilities
of this world simulation model do you
think it'll become a physics engine for
simulation where people are you know
actually simulating like wind tunnels is
it a basis for robotics is there
is it something else I'm just sort of
curious where are some of these other
futured forward applications that could
emerge yeah I I totally think that
carrying out simulations in the video
model is is something that we're going
to be able to do um in the future at
some point um Bill actually has a lot of
thoughts about uh this sort of thing so
maybe you can yeah I mean I think you
hit the nail on the head
applications like robotics um you know
there's so much you learn from video
which you don't necessarily get from
other modalities which companies like
OpenAI have invested a lot in the past
like language you know like the minutia
of like how arms and Joints move through
space you know again getting back to
that scene in Tokyo how those legs are
moving and how they're making contact
with the ground in a physically accurate
way so you learned so much about the
physical world uh just from training on
raw video that we really believe that
it's going to be essential for uh things
like physical embodiment moving forward
and talking more about uh the model
itself there are a bunch of really
interesting Innovations here right so
not to put you on the spot Tim but can
you uh describe for a broad technical
audience what a diffusion Transformer is
totally so Sora Builds on Research from
both the DALL-E models and the GPT models
at OpenAI and diffusion is a process
that creates uh data in our case videos
by starting from noise and iteratively
removing noise many times until
eventually you've removed so much noise
that it just creates a sample and so
that is our process for generating the
videos we start from a video of noise
and we remove it incrementally but then
architecturally it's really important
that our models are scalable and that
they can learn from a lot of data and
learn these really complex and
challenging relationships in videos and
so we use an architecture that is
similar to the GPT models and that's
called a Transformer and so diffus
Transformers combining these two
concepts and the Transformer
architecture allows us to scale these
models and as we put more compute and
more data into training them they get
better and better and we even released a
technical report on Sora and we show the
results that you get from the same
prompt when you use a smaller amount of
compute an intermediate amount of
compute and more compute and by using
this method as you use more and more
compute the results get better and
better and we strongly believe this trend will
continue so that by using this really
simple methodology we'll be able to
continue improving these models by
adding more compute adding more data and
they will be able to do all these
amazing things we've been talking about
having better simulation in longer term
Generations Bill uh can we characterize
at all what the scaling laws for this
type of model look like yet good
question so as Tim alluded to you know
one of the benefits of using
Transformers is that you inherit all of
their great properties that we've seen
in other domains like language um so you
absolutely can begin to come up with
scaling laws for video as opposed to
language and this is something that you
know we're actively looking at in our
team and you know not only constructing
them but figuring out ways to make them
better so you know if I use the same
amount of training compute can I get an
even better loss uh without
fundamentally increasing the amount of
compute needed so these are a lot of the
questions that we tackle day-to-day on
the research team to make Sora and
future models as good as
possible one of the like questions about
applying you know Transformers in this
domain is um like tokenization right uh
and so by the way I don't know who came
up with this name but like latent
SpaceTime patches is like a great sci-fi
name here can you explain like what that
is and like why why it is relevant here
because you know the ability to do
minute long Generation Um and get to uh
like Visual and temporal coherence is
really amazing I don't think we came up
with it like as a name so much as like a
descriptive thing of exactly what like
that's what we call yeah even better
though so one of the critical successes
for the llm Paradigm has been this
notion of tokens so if you look at the
internet there's all kinds of Text data
on it there's books there's code there's
math and what's beautiful about language
models is that they have this singular
notion of a token which enables them to
be trained on this vast swath of like
very diverse data there's really no
analog for prior visual generative
models so you know what was very
standard in the past before Sora is that
you would train say an image generative
model or a video generative model on
just like 256x256 resolution images or
256x256 video that's exactly like 4
seconds long and this is very limiting
because it limits the types of data you
can use you have to throw away so much
of you know uh the visual data that
exists on the internet and that limits
like the generalist capabilities of the
model so with Sora we introduced this
notion of SpaceTime patches where you
can essentially just represent data
however it exists in an image and a
really long video and like a tall
vertical video by just taking out cubes
so you can essentially imagine right a
video is just like a stack a vertical
stack of uh individual images and so you
can just take these like 3D cubes out of
it and that is our notion of a token
when we ultimately feed it into the
Transformer and the result of this is
that Sora you know can do a lot more
than just generate say like 720p video
um for some like fixed duration right
you can generate vertical videos
widescreen videos you can do anything uh
between like a 1:2 aspect ratio to 2:1
it can generate images it's an image
generation model and so this is really
the first generative model of visual
content uh that has breadth in a way
that language models have breadth so
that was really why we pursued this
direction it feels just as important on
the like input and training side right
in in terms of being able to take in
different types of video absolutely and
so a huge part of this project uh was
really developing the infrastructure and
systems needed to be able to work with
this vast data um in a way that hasn't
been needed for previous image or video
generation systems a lot of the models
before Sora that were working on video
were really looking at extending image
generation models and so there was a lot
of great work on image generation and
what many people have been doing is
taking an image generator and extending
it a bit instead of doing one image you
can do a few seconds but what was really
important for Sora and was really this
difference in architecture was instead
of starting from an image generator and
trying to add on video we started from
scratch and we started with the question
of how are we going to do a minute of HD
footage and that was our goal and when
you have that goal we knew that we
couldn't just extend an image generator
we knew that in order to do a minute of HD
footage we needed something that was
scalable that broke down data into a
really simple way so that we could use
scalable models so I think that really
was the architectural Evolution from
image generators to what led us to Sora
that's a really interesting framework
because it feels like it could be
applied to all sorts of other areas
where people aren't currently applying
end to end deep learning yeah I think
that's right and it it makes sense
because in the shortest term right we
weren't the first to come out with a
video generator a a lot of people and
and a lot of people have done impressive
work on video generation but we were
like okay we'd rather pick a point
further in the future and just you know
work for a year on that um and there is
this pressure to do things fast because
AI is so fast and the fastest thing to
do is oh let's take what's working
now and let's kind of like add on
something to it and that probably is as
you're saying more General than just
image to video but other things but
sometimes it takes taking a step back
and saying like what what will the
solution to this look like in three
years let's start building that MH yeah
it seems like a very similar transition
happened in self-driving recently where
where people went from bespoke Edge case
sort of predictions and heuristics and
a little bit of DL to like end to end deep
learning yeah in some of the new models
so it's it's very exciting to see it
applied to video one of the Striking
things about Sora is just the visual
aesthetic of it and I'm a little bit
curious how did you go about either uh
tuning or crafting that aesthetic
because I know that in some of the more
traditional um image gen models uh you
both have feedback that helps impact
evolution of aesthetic over time but in
some cases people are literally tuning
the models and so I'm a little bit
curious how you thought about it in the
context of Sora yeah well to be honest
we didn't spend a ton of effort on it
for Sora the world is just beautiful
yeah oh this is a great answer yeah
I I think that's maybe the honest answer
to most of it I think sora's language
understanding definitely allows the user
to steer it uh in a way that would be
more difficult with like other models so
you can provide a lot of like hints and
visual cues that will sort of steer the
model toward the type of generations
that you want but it's not like the Aditya
aesthetic is like deeply embedded yeah
not yet um but I think moving to the
future you know I I Feel Like the Model
is kind of empowering people to sort of
um uh get it to grok your personal
sense of aesthetic is going to be
something that uh a lot of people will
look forward to uh many of the artists
and creators that we talked to they'd
love to just like upload their whole
portfolio of assets to the model and be
able to draw upon like a large body of
work when they're writing captions and
have the model understand like the
jargon of their design firm accumulated
over many decades and so on um so I
think personalization and and uh how
that will kind of work together with
Aesthetics is going to be a cool thing
to explore later on I think to the point
um Tim was making about just like a you
know new applications Beyond traditional
entertainment I work and I travel and
have young kids and so I don't know if
this is like something to be judged for
or not but one of the things I do today
is um generate what amount to like short
audio books with voice cloning um DALL-E
images and you know stories in the style
of like the Magic Treehouse or whatever
in around some topic that either
I'm interested in like ah you know hang
out with Roman Emperor X right or um
something the the girls my kids are
interested in but this is
computationally expensive and hard and
not quite possible but I imagine there's
some version of like desktop Pixar for
everyone which is like you know I think
kids are going to find this first but
I'm going to narrate a story and have
like magical visuals happen in real time
I think that's a very different
entertainment Paradigm than we have now
totally I mean are we going to get it I
yeah I think we're headed there and a
different entertainment Paradigm and
also a different educational Paradigm
and a communication Paradigm
entertainment's a big part of that but I
think there are actually many potential
applications once this really
understands our world and so much of our
world and how we experience it is Visual
and something really cool about these
models is that they're starting to
better understand our world and what we
live in and the things that we do and we
can potentially use them to entertain us
but also to educate us and like
sometimes if I'm trying to learn
something the best thing would be if I
could get a custom tailored educational
video to explain it to me or if I'm
trying to communicate something to
someone you know maybe the best
communication I could do is make a video
to explain my point so I think that
entertainment but also kind of a much
broader set of potential things that
video models could be useful for that
makes sense I mean that resonates in
that I think if you ask people under
some certain age cut off that they'd say
the the biggest driver of educational
world to YouTube today right Better or
Worse yeah have you all tried applying
this to things like digital avatars I
mean there's companies like Synthesia HeyGen
Etc they're doing interesting things in
this area but having a
true um uh something that really
encapsulates a person in a very deep and
Rich way uh seems kind of fascinating as
one potential adaptive approach to
this I'm just sort of curious if you've
tried anything along those lines yet or
if it's not really applicable given
that it's more of like text to video
prompts so we haven't we've really
focused on just the core technology
behind it so far so we haven't focused
that much on for that matter particular
applications including the idea of
avatars which makes a lot of sense and I
think that would be very cool to try I
think where we are in the trajectory of
Sora right now is like this is the gpt1
of this new paradigm of visual
models and that we're really looking at
the fundamental Research into making
these way better making it a way better
Engine That Could power all these
different things so that's so our focus
is just on this fundamental development
of the technology right now maybe more
so than specific Downstream yeah one of
the reasons I ask about the Avatar stuff
as well is it starts to open questions
around safety and so I was a little bit
curious you know how you all thought
about um safety in the context of video
models and the potential for deep fakes or
spoofs or things like that yeah I can
speak a little bit to that it's
definitely a pretty complex topic I
think a lot of the safety mitigations
could probably be ported over from DALL-E
3 um for example the way we handle like
racy images or gory images things like
that um there's definitely going to be
new safety issues to worry about for
example
misinformation um or for example like do
we allow users to generate images that
have offensive words on them and I think
one key thing to figure out here is like
how much responsibility uh do the
companies deploying this technology bear
uh how much should social media
companies do for example to inform users
that content they're seeing uh may not
be from a trusted source and how much
responsibility does the user bear for
you know using this technology to create
something in the first place um so I
Think It's Tricky and we need to think
hard about these issues to sort of uh
reach a position that that we think is
is going to be best for people that
makes sense it's also there's a lot of
precedent like people used to use
Photoshop to manipulate images and then
publish them yeah and make claims and
it's not like uh people said that
therefore the maker of Photoshop is
liable for somebody abusing technology
so it seems like there's a lot of
precedent in terms of how you can think
about some of these things as well yeah
totally like we want to release
something that people feel like they
really have the freedom to express
themselves and do what they want to do
um but at the same time sometimes that's
at odds with uh you know doing something
that is responsible and sort of
gradually um releasing the technology in
a way that people can get used to it I
guess a question for you maybe
starting with Tim is like and if you can
share this great if not understood but
uh what is the thing you're most excited
about in terms of the your product road
map or where you're heading or some of
the capabilities that you're working on
next yeah um great question I'm really
excited about the things that people
will create with this I think there are
so many brilliant creative people with
ideas of things that they want to make
and sometimes being able to make that is
really hard because it requires
resources or tools or things that you
don't have access to and there's the
potential for this technology to enable
so many people with brilliant creative
ideas to make things and I'm really
excited for what awesome things they're
going to make and that this technology
will help them make Bill maybe one one
question for you would just be if this
is um as you just mentioned like the
gpt1 uh we have a long way to go
uh this isn't something that the general
public has an opportunity to experiment
with yet can you sort of characterize
what the limitations are or the gaps are
that you want to work on besides the
obvious around like length right yeah so
I think in terms of making this
something that's more widely available
um you know there's a lot of
serving kind of considerations that have
to go in there so a big one here is
making it cheap enough for people to use
so we've said you know in the past that
in terms of generating videos it it
depends a lot on the exact parameters of
you know like the resolution and the
duration of the video You're creating uh
but you know it's not instant and you
have to wait at least like a few minutes
uh for like these really long videos
that we're generating and so we're
actively working on threads here to make
that cheaper in order to democratize
this uh more broadly uh I think there's
a lot of considerations as Aditya was
alluding to on the safety side as well
um so in order for this to really become
more broadly accessible we need to you
know make sure that especially in an
election year we're being really careful
with the potential for misinformation
and any surrounding risks we're actively
working on addressing these threads
today that's a big part of our research
road map what about just core um like uh
for lack of a better term like quality
issues yeah Are there specific things
like if it's object permanence or
certain types of interactions you're
thinking through yeah so as we look you
know forward to you know like the gpt2
or gpt3 moment uh I think we're really
excited for very complex long-term
physical interactions to become uh much
more accurate so to give a concrete
example of where Sora falls short today
you know if I have a video of someone
like playing soccer and they're kicking
around a ball at some point you know
that Ball's probably going to like
vaporize and maybe come back um so it
can do certain kinds of simpler
interactions pretty reliably you know
things like people walking for example
um but these types of more detailed
object-to-object interactions are definitely
uh you know still a feature that's in
the oven and we think it's going to get
a lot better with scale but that's
something to look forward to moving
forward there's one sample that I think
is like a glimpse of the future I mean sure
there are many but there's one I've seen uh
which is um you know a man taking a bite
of a burger and the bite being in the
burger in terms of like keeping state
which is very cool yeah we are really
excited about that one also there's
another one where uh it's like a woman
like painting with watercolors on a
canvas and it actually leaves a trail so
there's like glimmers of you know kind
of capability in the current model as
you said and we think it's going to get
much better in the future is there
anything you can say about how um the
work you've done with Sora uh sort of
affects the broader research road map
yeah so I think something here is
about the knowledge that Sora ends up
learning about the world just from
seeing all this visual data it
understands 3D which is one cool thing
because we haven't trained it to we
didn't explicitly bake 3D information
into it whatsoever we just trained it on
video data and it learned about 3D
because 3D exists in those videos and it
learned that when you take a bite out of
a hamburger that you leave a bite mark
so it's learning so much about our world
and when we interact with the world so
much of it is visual so much of what we
see and learn throughout our lives is
visual information so we really think
that just in terms of intelligence in
terms of leading toward AI models that
are more intelligent that better
understand the world like we do this
will actually be really important for
them to have this grounding of like hey
this is the world that we live in
there's so much complexity in it there's
so much about how people interact how
things happen how events in the past end
up impacting events in the future that
this will actually lead to just much
more intelligent AI models more broadly
than even generating videos it's almost
like you invented like the future visual
cortex plus some part of the uh
reasoning parts of the brain or
something sort of simultaneously yeah
and that's a cool comparison because a
lot of the intelligence that humans have
is actually about world modeling right
all the time when we're thinking about
how we're going to do things we're
playing out scenarios in our head we
have dreams where we're playing out
scenarios in our head we're thinking in
advance of doing things if I did this
this thing would happen if I did this
other thing what would happen right so
we have a world model and building Sora
as a world model is very similar to a
big part of the intelligence that humans
have um how do you guys think about the
uh sort of analogy to humans as having a
very approximate World model versus
something that is um as accurate as like
let's say a uh a physics engine in the
traditional sense right because if I you
know hold an apple and I drop it I
expect it to fall at a certain rate but
most humans do not think of that as
articulating a path with a speed as a
calculation um do you think that sort of
learning is like parallel in um large
models I think it's a a really
interesting observation
I think how we think about things is
that it's almost like a deficiency you
know in humans that it's not so high
fidelity so you know the fact that we
actually can't do very accurate
long-term prediction when you get down
to a really narrow set of physics um
it's something that we can improve upon
with some of these systems and so we're
optimistic that Sora will you know
supersede that kind of capability and
will you know in the long run enable it
to be more intelligent one day than
humans as World models um but it is you
know certainly existence proof that it's
not necessary for other types of
intelligence regardless of that it's
still something that Sora and and models
in the future will be able to improve
upon okay so it's very clear that the
trajectory prediction for like throwing
a football is going to be better in
the next versions of these models
than mine let's say if I could add
something to that this relates to the
Paradigm of scale and uh the bitter
lesson a bit about how we want methods
that as you increase compute get
better and better and something that
works really well in this Paradigm is
doing the simple but challenging task of
just predicting data and you can try
coming up with more complicated tasks
for example something that doesn't use
video explicitly but is maybe in some
like space that simulates approximate
things or something but all this
complexity actually isn't beneficial
when it comes to the scaling laws of how
methods improve as you increase scale
and what works really well as you
increase scale is just predict data and
that's what we do with text we just
predict text and that's exactly what
we're doing with visual data with Sora
which is we're not making something
complicated trying to figure out some
new thing to optimize we're saying hey
the best way to learn intelligence in a
scalable manner yeah is to just predict
data that makes sense in relating to
what you said Bill like predictions will
just get much better with no necessary
limit that approximates humans that's
right yeah is there is there
anything uh you feel like the general
public misunderstands about video models
or about Sora or you want them to
know I think
maybe the biggest update to people with
the release of Sora is that internally
we've always made an analogy as Bill and
Tim said between Sora and GPT models in
that um you know when gpt1 and gpt2 came
out it started to become increasingly
clear um to some people that simply
scaling up these models would give them
amazing capabilities
and it wasn't clear right away if like
oh scaling up next token prediction would
result in a language model that's
helpful for writing code um to us like
it's felt pretty clear that applying the
same methodology to video models is also
going to result in really amazing
capabilities um and I think Sora 1 is
kind of an existence proof that there's
one point on the scaling curve now and
we're very excited for what this is
going to lead to yeah amazing well I I
don't know why it's such a surprise to
everybody but the bitter lesson wins again
yeah yeah I would just say that as both
Tim and Aditya were alluding to we really
do feel like this is the gpt1 moment and
these models are going to get a lot
better very quickly and we're really
excited both for the incredible benefits
we think this is going to bring to the
creative world what the implications are
long-term for AGI um and at the same
time we're trying to be very mindful
about the safety considerations and
building a robust stack now to to make
sure that society's actually going to
get the benefits of this while
mitigating the downsides uh but it's
exciting times and we're looking forward
to what future models are going to be
capable of yeah congrats on such an
amazing amazing
release find us on Twitter at no priors
pod subscribe to our YouTube channel if
you want to see our faces follow the
show on Apple podcasts Spotify or
wherever you listen that way you get a
new episode every week and sign up for
emails or find transcripts for every
episode at no-priors.com