Trust Nothing - Introducing EMO: AI Making Anyone Say Anything
Summary
TLDR: Alibaba's EMO turns a single portrait image plus an audio clip into a realistic video of that person talking or singing, complete with matching expressions and head movement. The video also covers Groq's ultra-fast LPU inference engine and Jensen Huang's argument that natural language is becoming the new way to program computers.
Takeaways
- 😮 We are entering an era in which you can no longer trust that what you see online is real.
- 🤖 "EMO" makes a person in a photo appear to be singing, using only that image and an audio clip.
- 🎨 The technique synthesizes realistic facial expressions and head movements, producing highly convincing video.
- 🔍 It can be applied to many kinds of base images, from an AI-generated woman to the Mona Lisa.
- 📈 It overcomes the difficulty of tying audio input to facial motion.
- 🚀 "Groq" uses a Language Processing Unit (LPU) to enable extremely fast interaction with AI.
- 💡 Nvidia's CEO argues that in the future, problem-solving ability will matter more than programming.
- 📚 As AI advances, natural language is becoming the mainstream way to communicate with computers.
- 🌐 The "EMO" project relies on an audio-to-video diffusion model.
- 📖 To reproduce expressions and motion realistically, the model was trained on an extensive dataset of more than 250 hours of footage and over 150 million images.
Q & A
What is EMO from the Alibaba group?
-EMO is a technique that takes an uploaded image and audio (speech or a song) and makes the person in the image appear to be talking or singing.
What characterizes the videos EMO generates?
-The generated videos include not just mouth movement but also changes in facial expression and head tilt, enabling realistic performances.
What is the hard part of EMO's video generation process?
-The mapping from audio to facial expression is inherently ambiguous, and resolving that ambiguity is the core technical challenge.
What is Groq?
-Groq is a new architecture for large language models and generative AI built around the Language Processing Unit (LPU).
What distinguishes Groq?
-Groq boasts inference speeds of more than 500 tokens per second, faster than any other system benchmarked.
What does it mean for programming to "die"?
-It means that in the future, problem-solving ability and skill at directing large language models will matter more than programming skill itself.
What is NVIDIA CEO Jensen Huang's view on programming education?
-He argues that eventually everyone should be able to program using natural language, and that solving domain-specific problems will become more important.
What is the relationship between the EMO project and large language models?
-The EMO project is one example of how advancing AI makes realistic video generation easier, while large language models improve our ability to communicate in natural language.
What are the limitations of EMO's video generation?
-Stability problems during generation, and unnatural movement of body parts other than the face.
What is EMO's training data like?
-An extensive, multilingual, and diverse dataset comprising more than 250 hours of footage and over 150 million images.
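To make the interface concrete: a single image plus audio goes in, an expressive video comes out, and the output length tracks the audio. Below is a minimal sketch in Python. EMO's code is not public, so every name here is a hypothetical placeholder, not the authors' API:

```python
# Hypothetical EMO-style interface; names are illustrative placeholders,
# since Alibaba has not released code or an official API for EMO.
from dataclasses import dataclass

@dataclass
class EmoRequest:
    reference_image: str  # path to a single portrait image
    audio: str            # path to speech or song audio

def expected_frames(audio_seconds: float, fps: int = 30) -> int:
    """Output duration simply follows the audio length (no hard cap,
    unlike text-to-video systems such as Sora's 60-second limit);
    the 30 fps default is an assumption for illustration."""
    return round(audio_seconds * fps)

req = EmoRequest(reference_image="mona_lisa.png", audio="song.wav")
print(expected_frames(audio_seconds=12.5))  # -> 375
```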
Outlines
🔮 AI and the Boundary with Reality
This section describes how AI technology is blurring the line between reality and fiction. Specifically, it introduces "emo," a technique developed by the Alibaba group that can generate a video of a person apparently singing from nothing more than an image and an audio clip. The technique realistically reproduces facial expressions and even head tilts, calling into question the credibility of anything we see online. It also shows a demo in which an AI-generated woman appears to speak realistically, hinting at the social impact of advancing AI.
🤖 Technological Progress and Human Adaptation
This section covers the new architecture for large language models and generative AI, built on the Language Processing Unit (LPU), developed by the company Groq. The technology is extremely fast and can make AI agents run far faster. The section then returns to the emo project, detailing how the AI generates facial motion from audio. It focuses in particular on the technical advances that realistically reproduce expressions and head movement, and on how that affects society and the credibility of online content.
💡 The Future of Programming and AI
This section discusses the future of programming and how people will use technology to solve problems. According to NVIDIA CEO Jensen Huang, programming will eventually become unnecessary; instead, AI will understand human language and carry out tasks. It also suggests that technologies like the emo project are building a new relationship between AI and human language by making previously difficult tasks, such as video generation and game development, easy. It concludes that problem-solving ability will matter more than programming skill.
🌐 The Spread of AI and Its Social Impact
The final section discusses a future in which the spread of AI changes the concept of programming and lets anyone solve problems with technology. Large language models and advancing AI are driving innovation across many fields, including video game generation, avatar creation, and robot control, and the section considers how these technologies unlock new creativity and position human language as a new form of programming. It also touches on the value of teaching children to program, noting that it helps improve their problem-solving ability.
Keywords
💡deepfakes
💡AI generated content
💡diffusion models
💡facial generation
💡audio to video generation
💡limitations of CGI
💡programming obsolescence
💡natural language programming
💡skill shifting
💡synthetic media detection
Highlights
EMO creates realistic fake videos of people talking or singing using only a single image and an audio input
The generated videos have expressive facial expressions and head movements that match the audio, going beyond just lip sync
Emo uses a novel diffusion model approach to generate the frames rather than explicit control signals
The model was trained on a diverse dataset of over 250 hours of video footage in multiple languages
Programming as we know it may die as AI makes it easier for non-programmers to create things
With large language models, natural language becomes the universal programming language
Transcripts
what do you do when your eyes can no
longer be trusted when you think you're
looking at something real but it clearly
can't be real what if everybody is
empowered to create anything they want
let me just pause for a second and show
you something that will absolutely blow
your
mind do to get it through you I'm super
human inovative and I'm made a so that
anything you say is off of me and you
and devastating more than never
demonstrating how give them an audience
I feeling like it's levitating never
fting and I know the ha is forever
waiting for the day to think I fell off
to be celebrating cuz I know the way to
get motivated I make elevating music you
make elevator music oh he's too
mainstream well that's what they do
jealous canuse it it's not hip hop
it's pop cuz I found a h way to fuse it
an AA with an AK Mele and said it like a
play but a v Retreat like a vac Mayday
this beat is cray cray RJ ha H ha
laughing all the way to the bank I spray
Flames I cannot tame or plate the
monster you get in my way I'mma feed you
to the monster normal during day but at
night turn into a monster when the moon
shines like I trucker we
were we
were kind the dream that can be
so we were right till we W build a home
that was
burn I didn't want to leave I did a full
180
crazy thinking about the way I want did
the Heart Break Change
Me Maybe but look at where I end up I'm
all good already so moved on it's scary
I'm not
where now what you're seeing there was
obviously fake but how real did it look
this new paper called emo from the
Alibaba group promises to give everybody
the ability to Simply take an image take
a piece of audio or song and make the
person in the image sing that song or at
least look like they are and it looks
really good so let's take a step back
and look how they actually accomplished
this because all of a sudden we're not
going to be able to believe anything
that we're seeing online EMO: Emote
Portrait Alive, generating expressive
portrait videos with audio-to-video
diffusion model under weak conditions a
lot of words but all that means is you
upload an image you upload some audio
and the image will look like it's either
speaking the dialogue or singing the
song so here's an example we have audio
input right here talking speaking
singing and then we have four base
images this first one is from an old
movie I forget what it is but it's from
an old movie this one is obviously the
Mona Lisa this one is from the Sora
video so this is AI generated and that
is super cool to think about for a
second an AI generated woman who is who
walks around a virtual world looks
completely real can now look like she's
singing or talking anything that we want
then this final one I'm sorry I don't
know who this is maybe anime it says
anime right there but that's it you take
that base image and it looks like they
are talking or singing now usually when
you see something like this you just see
the lips moving but it's so much more
than that as they're singing as they're
speaking their facial expression changes
the tilt of their head changes obviously
the mouth and the lips moving match the
words super impressive now let's take a
look at this in this first example we
have the input audio but it's going to
be two things first we have a song a
very popular song and then in the second
example from the same image we have
somebody
talking I never knew you were the
someone waiting for me
we were just kids when we fell in love
not knowing what is when I was a kid I
feel like you heard the thing you heard
the term don't cry you don't need to
cry crying is the most beautiful thing
you can do I encourage people to cry I
cry all the time and I think it's the
most he and take a look at this you
remember the woman from the Tokyo
landscape the Sora generated video
completely fake not a real person all of
a sudden we have this person speaking up
here which is the vocal Source we have
the reference image and then look how
incredible the generated video is
everything looks perfect it could be her
speaking Yeah I think this is right now
an inflection point where we're sort of
you know redefining how we interact with
with uh digital information and it's
it's through you know the form of this
AI systems that we collaborate with and
uh maybe we have several of them and
maybe they all have different
competences and maybe we have a general
one that kind of follows us around
everywhere knows everything about uh you
know my context what I've been up to
today um what my goals are um sort of in
life so let's take a look at how this
actually works but first a quick message
from today's sponsor wow look at this
okay so you're probably wondering what I
am so impressed with and what I'm
actually talking about is Groq you've
probably heard of them recently because
of their insane inference speeds Groq
is the creator of the world's first LPU
inference engine LPU stands for language
processing unit and it is a brand new
architecture for large language models
and generative AI and it runs faster
than anything else I've ever seen I'm
talking about 500 plus tokens per second
check out these recent inference speed
benchmarks by Anyscale and Artificial
Analysis Groq is leading the pack so
let's check out the speed with this
quick prompt I'm going to ask it tell me
about MoE LLMs but explain it to me like
I'm 7 years old so I can easily
understand it look at how fast that is
518 tokens per second now translate your
answer to French boom French 520 tokens
per second now imagine powering AI
agents with that type of inference speed
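As a taste of what driving that kind of inference looks like, here is a minimal sketch using Groq's Python SDK, which follows the familiar OpenAI-style chat interface. The model name is only an example (check Groq's current model list), and the tokens-per-second figure is a rough wall-clock estimate including network latency, not an official benchmark:

```python
# Minimal sketch of a Groq chat completion with a rough tokens/sec
# measurement. Assumes `pip install groq` and GROQ_API_KEY set.
import os
import time

from groq import Groq

client = Groq(api_key=os.environ["GROQ_API_KEY"])

start = time.perf_counter()
response = client.chat.completions.create(
    model="mixtral-8x7b-32768",  # example model id; may have changed
    messages=[{
        "role": "user",
        "content": "Tell me about MoE LLMs, but explain it like I'm 7.",
    }],
)
elapsed = time.perf_counter() - start

print(response.choices[0].message.content)
tokens = response.usage.completion_tokens
print(f"~{tokens / elapsed:.0f} tokens/second (wall clock)")
```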
so I'm going to be covering a lot more
about Groq in the coming weeks because
I'm so excited about it but in the
meantime if you're wondering how to get
access go to this link right here I'm
also going to drop the link in the
description below so you can click it
there so go to gr. linkman API to get
access today thanks so much to the
sponsor of this video Groq and now back
to the video we have the reference image
then they add the motion frames and the
motion frames are generated during this
diffusion process and this is actually a
pretty complex process so they take the
audio the face recognition
the noisy latents the head speed
the speed encoder the wav2vec all
things that we've talked about in
previous videos put it all together and
they're able to get the generated frames
and then they pass that back layer on
the audio and then you have the final
result
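Before we read the paper, here is a toy sketch of how a loop like that could fit together. Alibaba has not released EMO's code, so everything below is an illustrative placeholder for the parts just described (audio features, prior "motion frames" for continuity, a face-region mask, and a head-speed signal steering a denoising loop), not the authors' implementation:

```python
# Toy sketch of an EMO-style generation loop; all names, shapes, and
# constants are placeholders for illustration only.
import numpy as np

rng = np.random.default_rng(0)
H = W = 8     # tiny stand-in "latent" resolution
STEPS = 10    # denoising steps per frame
FRAMES = 4    # frames generated per clip

def encode_audio(window: np.ndarray) -> float:
    # stand-in for a wav2vec-style audio feature extractor
    return float(np.tanh(window.mean()))

def generate_clip(ref_image, audio_window, motion_frames, face_mask, head_speed):
    audio_feat = encode_audio(audio_window)
    frames = []
    for _ in range(FRAMES):
        latent = rng.normal(size=(H, W))  # each frame starts from noise
        for t in range(STEPS, 0, -1):
            # pull the latent toward an audio-conditioned target; the
            # face mask confines edits to the face region, head_speed
            # paces the motion, and the last motion frame keeps the
            # new frame consistent with what came before
            target = (audio_feat * face_mask + 0.5 * ref_image
                      + 0.25 * motion_frames[-1])
            latent += (head_speed / t) * (target - latent)
        frames.append(latent)
        # feed the newest frame forward as motion context; chaining
        # clips this way is what allows arbitrary output duration
        motion_frames = motion_frames[1:] + [latent]
    return frames, motion_frames

# usage: walk across a long audio track, one clip at a time
ref = rng.normal(size=(H, W))
mask = np.ones((H, W))
motion = [np.zeros((H, W))] * 2
for chunk in np.array_split(rng.normal(size=16000), 4):
    clip, motion = generate_clip(ref, chunk, motion, mask, head_speed=0.3)
print(len(clip), clip[0].shape)  # 4 (8, 8)
```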
and here's the paper let's take a
quick look we propose EMO an expressive
audio-driven portrait video generation
framework input a single reference image
and the vocal audio talking or singing
our method can generate vocal Avatar
videos with expressive facial
expressions and various head poses
meanwhile we can generate videos with
any duration depending on the length of
the input audio that any duration piece
is really interesting because typically
when you're doing text to video you have
a pretty hard limit right now Sora is 60
seconds for example and the abstract
says we tackle the challenge of
enhancing the realism and expressiveness
in talking head video Generation by
focusing on the dynamic and nuanced
relationship between audio cues and
facial movements and that's really their
unique Innovation here they've been able
to read the audio and understand when
somebody's speaking or singing what
their head will likely look like as
they're doing those things then they
take that understanding and apply it to
the base image now here's something
really important we identify the
limitations of traditional techniques
that often fail to capture the full
spectrum of human expressions and the
uniqueness of individual facial styles
that is what I was talking about
typically when you have an avatar
software and you're giving it something
to say you only really see the mouth
moving with some very basic head
movement but this takes it to another
level they also go on to say this
approach eliminates the need for
intermediate representations or complex
pre-processing streamlining the creation
of talking head videos that exhibit a
high degree of visual and emotional
Fidelity closely aligned with the
nuances present in the audio input again
the fact that that takes no
preprocessing is absolutely insane and
their Discovery is audio signals are
rich in information related to facial
expressions and that makes sense
literally as I was just saying facial
expressions my eyebrows went up cuz I
was excited about that so it really does
translate and I had never thought of
that very cool Insight theoretically
enabling models to generate a diverse
array of expressive facial movements
however integrating audio with
diffusion models is not a straightforward
task due to the ambiguity inherent in
the mapping between audio and facial
expression and that is what they solved
how do you map the audio with how the
face looks and not only that how do you
take a static image and generate video
from that static image that represents
the different facial movements so
typically and here's what has happened
previously this issue can lead to
instability in the videos produced by
the model manifesting as facial
distortions or jittering between video
frames and in severe cases may even
result in the complete collapse of the
video and to fix that they have
Incorporated stable control mechanisms
into our model namely a speed controller
and a face region controller to enhance
stability during the generation process
and how did they actually train their
model we constructed a vast and diverse
audio video data set amassing over 250
hours of footage and just pausing for a
second 250 hours doesn't really seem
like that much for any modern model and
more than 150 million images now that's
a lot this expansive data set
encompasses a wide range of content
including speeches film and television
clips and singing performances and
covers multiple languages such as
Chinese and English
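A quick back-of-envelope check on those numbers; the frame rate below is my assumption, not a figure from the paper:

```python
# How many raw frames are in 250 hours of video? (30 fps assumed)
hours = 250
fps = 30
print(f"{hours * 3600 * fps:,} frames")  # 27,000,000 frames
```

At an assumed 30 fps, 250 hours is about 27 million frames, so the separate 150-million-image figure is more than five times the raw frame count of the footage alone.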
and here you can see
different benchmarks and these are
qualitative comparisons with several
talking head generation works so these
are the previous ones like DreamTalk
SadTalker Wav2Lip and then they
have ours and we can see in four out of
the five benchmarks the new technique
has won so what are some of the
limitations the first limitation is it
is more time-consuming compared to
methods that do not rely on diffusion
models and that makes sense if you've
ever tried to create something with
DALL-E or any of the other diffusion
models out there they take a long time
to generate and they take a lot of
processing power but they don't require
preprocessing second since we do not use
any explicit control signals to control
the character's motion it may result in
the inadvertent generation of
other body parts such as hands leading
to artifacts in the video because
they're not actually controlling for
different parts of the body and they
really just want the face and the head
they sometimes suffer from those body
parts appearing in the video let me just
play a couple more examples for you you
want to know how I got these
scars my father
was a drinker and a
fiend and one night he goes off crazier
than usual
Mommy gets the kitchen knife to defend
herself he doesn't like that not one bit
so me watching he takes the knife to her
laughing while he does it he turns to me
and he says why so
serious he comes at me with the knife
why so
serious sticks the blade in my
mouth let's put a smile on that
face
and why so
serious yes and in this manner he
was to imagine me his love his mistress
and I set him every day to woo me at which
time would I being but a moonish youth
grieve be effeminate changeable longing
and liking proud fantastical apish
shallow now one last thing I want to
talk about in this video and it may seem
unrelated at first but in fact stay with
me and you'll see that it is very
related now we have a video from Jen sen
hang and he is the CEO of Nvidia now the
third largest company in the world and
he argues that we should stop saying
kids should learn to code I've been
saying for a while that programming is
going to die and maybe I should add some
Nuance programming as we know it is
going to die and there's a couple ways
to interpret that but let me play Jensen
Huang's video and then I'll speak a
little bit more about it I'm going to
say something and it's it's going to
sound completely opposite of what people
feel over the course of the last 10
years 15 years um almost everybody who
sits on a stage like this would tell you
it is vital that your children learn
computer science um everybody should
learn how to program and in fact it's
almost exactly the opposite it is our
job to create Computing technology such
that nobody has to program and that the
programming language is human everybody
in the world is now a programmer this is
the miracle of artificial intelligence
the countries the people that understand
how to solve a domain problem in digital
biology or in education of young people
or in manufacturing or in farming those
people who understand domain expertise
now can utilize technology that is
readily available to you you now have a
computer that will do what you tell it
to do it is vital that we upskill
everyone and the upskilling process I I
believe will be delightful surprising
okay so what he's saying is something
that I've been talking about for a
little while now first I want to get
this out of the way I am still going to
teach my kids how to code simply because
it actually helps you think about
thinking it helps you think better
it helps you think systematically and
all of these skills are very important
when you're problem solving but as he's
saying problem solving is the bigger
value learning how to solve problems
learning how to get what you want from a
large language model is going to be
super important now if programming is
dead and nobody is a programmer then
technically everybody is a programmer at
that point because natural language is
the language of computers and that is
going to be the case with large language
models and artificial intelligence so
how does this relate to the emo project
well it seems faster than most people
are predicting things that were
previously really hard to do such as
creating video Sora creating video games
Genie creating avatars that look
realistic EMO controlling robots MimicGen
and creating true AI to live their
lives Voyager in Minecraft all of these
things are becoming easier and easier
and as all of these problems get solved
with large language models the language
of large language models is going to be
natural language so I still encourage
you if you don't know how to code learn
the basics you don't need to go that
deep into it anymore and I'm definitely
going to teach my kids how to code just
so they know how to think systematically
if you enjoyed this video please
consider giving a like And subscribe and
I'll see you in the next one