Trust Nothing - Introducing EMO: AI Making Anyone Say Anything

Matthew Berman
29 Feb 2024 · 16:27

Summary

TLDR Alibaba's EMO can turn a single portrait image plus an audio clip into a convincing video of that person speaking or singing, complete with realistic facial expressions and head motion. The video also covers Groq's LPU-powered inference speeds and Jensen Huang's argument that natural language, not code, is becoming how we program computers.

Takeaways

  • 😮 We are entering an era in which you can no longer trust that what you see online is real.
  • 🤖 "emo" can make the person in a single photo appear to sing, using only that image and an audio track.
  • 🎨 The technique synthesizes realistic facial expressions and head movement, producing highly convincing video.
  • 🔍 It can be applied to many kinds of base image, from an AI-generated woman to the Mona Lisa.
  • 📈 It overcomes the difficulty of tying audio input to facial motion.
  • 🚀 Groq uses a Language Processing Unit (LPU) to enable extremely fast interaction with AI.
  • 💡 Nvidia's CEO argues that problem-solving ability, not programming, will matter most in the future.
  • 📚 As AI advances, natural language is becoming the main way we communicate with computers.
  • 🌐 The "emo" project is built on an audio-to-video diffusion model.
  • 📖 To reproduce facial expressions and motion realistically, the model was trained on an extensive dataset of more than 250 hours of footage and over 150 million images.

Q & A

  • What is Emo from the Alibaba group?

    -Emo is a technique that takes an uploaded image and audio (speech or song) and makes the person in the image appear to be speaking or singing.

  • What characterizes the videos Emo generates?

    -The generated videos go beyond lip movement alone: facial expressions and head tilt also change, giving a realistic performance.

  • What is difficult about Emo's video generation process?

    -The mapping from audio to facial expression is inherently ambiguous, and resolving that ambiguity is the key technical challenge.

  • What is Groq?

    -Groq is a new architecture for large language models and generative AI built around a Language Processing Unit (LPU).

  • What is notable about Groq?

    -Groq delivers inference speeds of more than 500 tokens per second, faster than any other system shown.

  • What does it mean that programming will "die"?

    -It means that in the future, problem-solving ability and skill at directing large language models will matter more than programming skill itself.

  • What is NVIDIA CEO Jensen Huang's view on programming education?

    -He argues that eventually everyone should be able to program using natural language, and that solving problems in a specific domain will matter more than writing code.

  • How does the Emo project relate to large language models?

    -Emo is one example of how advancing AI makes realistic video generation easy, while large language models improve how well we can communicate in natural language.

  • What are the limitations of Emo's video generation?

    -Stability issues during generation and unnatural motion in body parts other than the face; a minimal pipeline sketch showing where the stability controllers fit follows this Q&A list.

  • What does Emo's training data look like?

    -An extensive, multilingual dataset of more than 250 hours of footage and over 150 million images covering diverse content.
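
To make the pipeline these answers describe concrete, here is a minimal structural sketch of an EMO-style audio-driven talking-head generator. It is not the authors' implementation: every function below is a toy stand-in for a trained network (the real system uses components such as a ReferenceNet, a wav2vec-style audio encoder, and a denoising UNet), and only the data flow mirrors the paper's description, i.e. a single reference image, per-frame audio features, motion frames carried over from the previous clip, and the face-region and speed controllers that stabilize generation.

```python
# Structural sketch of an EMO-style audio-to-video pipeline (illustrative only).
# The "encoders" are random-projection stand-ins for trained networks; what matters
# is the conditioning flow: reference image + audio -> denoised per-frame latents,
# guided by a face-region mask and a head-speed embedding for stability.
import numpy as np

rng = np.random.default_rng(0)
LATENT, FRAMES, STEPS = 64, 12, 25   # per-frame latent size, frames per clip, denoising steps

def encode_reference(image):
    """Stand-in for ReferenceNet: identity features of the portrait."""
    return image.reshape(-1)[:LATENT]

def encode_audio(audio, frames):
    """Stand-in for a wav2vec-style audio encoder: one feature row per video frame."""
    chunks = np.array_split(audio, frames)
    return np.stack([c.mean() * np.ones(LATENT) for c in chunks])

def denoise_step(latents, ref, audio_feats, face_mask, speed_embed, t):
    """Stand-in for one pass of the denoising UNet, conditioned on everything above."""
    cond = ref + audio_feats + speed_embed            # broadcasts over the frame axis
    return latents + (cond * face_mask - latents) / (t + 1)

def generate_clip(image, audio, head_speed=1.0, motion_frames=None):
    ref = encode_reference(image)
    audio_feats = encode_audio(audio, FRAMES)
    face_mask = np.ones(LATENT)                       # face-region controller (all-ones placeholder)
    speed_embed = head_speed * np.ones(LATENT)        # speed controller keeps head motion consistent
    latents = rng.normal(size=(FRAMES, LATENT))       # start from pure noise
    if motion_frames is not None:                     # last frames of the previous clip, for continuity
        latents[: len(motion_frames)] = motion_frames
    for t in reversed(range(STEPS)):
        latents = denoise_step(latents, ref, audio_feats, face_mask, speed_embed, t)
    return latents                                    # decoded to RGB frames in the real system

portrait = rng.normal(size=(128, 128))                # the single reference image
speech = rng.normal(size=16000)                       # a second of audio (placeholder waveform)
clip = generate_clip(portrait, speech)
print(clip.shape)                                     # (12, 64): one latent per generated frame
```

In the real model the final latents would be decoded into RGB video frames, and videos of any duration are produced by chaining clips, feeding the last generated frames back in as motion_frames so head motion stays continuous across clip boundaries.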

Outlines

00:00

🔮 The Blurring Line Between AI and Reality

This section describes how AI is blurring the boundary between reality and fabrication. Specifically, it introduces "emo", a technique developed by the Alibaba group that can generate a video of a person apparently singing from nothing more than a single image and an audio track. Because it reproduces facial expressions and even head tilt realistically, it calls into question the credibility of anything we see online. A demo in which an AI-generated woman appears to speak convincingly is also shown, hinting at the social impact of this kind of progress.

05:03

🤖 Technological Progress and Human Adaptation

This section covers Groq's new architecture for large language models and generative AI, built around a Language Processing Unit (LPU). Because it is extremely fast, it could let AI agents run far more quickly. The video then returns to the emo project, detailing how the model generates facial motion from audio. The focus is on the technical advances that reproduce expressions and head movement realistically, and on how they affect society and the credibility of online content.
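
For context on the 500-plus tokens-per-second figure, here is a rough sketch of how you could measure generation throughput yourself against an OpenAI-compatible chat endpoint such as Groq's. The base URL, model name, and environment variables are illustrative assumptions rather than verified values; the point is simply the arithmetic of completion tokens divided by elapsed seconds (which, measured this way, also includes network latency, so it understates the raw generation rate).

```python
# Rough throughput check against an OpenAI-compatible chat endpoint.
# Base URL and model name are placeholders; substitute whatever your provider documents.
import os
import time
from openai import OpenAI  # pip install openai (v1 client)

client = OpenAI(
    base_url=os.environ.get("LLM_BASE_URL", "https://api.groq.com/openai/v1"),  # assumed endpoint
    api_key=os.environ["LLM_API_KEY"],
)

prompt = "Tell me about MoE LLMs, but explain it like I'm 7 years old."
start = time.perf_counter()
resp = client.chat.completions.create(
    model=os.environ.get("LLM_MODEL", "mixtral-8x7b-32768"),  # assumed model name
    messages=[{"role": "user", "content": prompt}],
)
elapsed = time.perf_counter() - start

tokens = resp.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.2f}s = {tokens / elapsed:.0f} tokens/sec")
```

At roughly 500 tokens per second, a 1,000-token answer comes back in about two seconds, which is what makes chaining many model calls together inside an AI agent practical.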

10:05

💡 Programming and the Future of AI

This section discusses the future of programming and how people will use technology to solve problems. According to NVIDIA CEO Jensen Huang, programming will eventually become unnecessary: AI will understand human language and carry out tasks directly. Technologies like the emo project, which make previously difficult tasks such as video generation and game creation easy, illustrate this new relationship between AI and human language. The conclusion is that problem-solving ability will matter more than programming skill.

15:05

🌐 The Spread of AI and Its Impact on Society

The final section looks at a future in which the spread of AI changes what programming means and lets anyone solve problems with technology. Advances in large language models and AI are driving innovation in video game generation, avatar creation, robot control, and more, and the video considers how these tools unlock new creativity and position human language as a new form of programming. It also notes that teaching children to code is still worthwhile because it strengthens their problem-solving ability.

Keywords

💡deepfakes

Deepfakes refers to synthetic media where a person's likeness or voice is inserted into existing images or video to make them appear to say or do things they did not. This technology is used in the video to show examples of AI-generated content that looks real but is fabricated.

💡AI generated content

The video focuses heavily on AI's ability to create increasingly realistic fabricated audio, images, and video, collectively known as AI-generated content. Examples given include the AI-generated woman from OpenAI's Sora demo and tools like EMO that can animate faces.

💡diffusion models

Diffusion models are a type of generative AI system used to create high-fidelity synthetic media. The EMO tool relies on diffusion models to animate faces from audio with nuanced facial expressions.
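
As a refresher on what "diffusion model" means here, the toy loop below shows the generic sampling idea: start from pure noise and repeatedly remove the noise the model predicts. predict_noise is a deliberately trivial placeholder for the trained network; in an audio-driven system like EMO the conditioning would be the reference image and audio features rather than the toy target used here.

```python
# Generic diffusion sampling loop (toy illustration; `predict_noise` stands in
# for the trained denoising network and the schedule is deliberately simplified).
import numpy as np

rng = np.random.default_rng(0)

def predict_noise(x, t, conditioning):
    """Placeholder for the learned noise-prediction network."""
    return 0.1 * (x - conditioning)               # toy rule: nudge the sample toward the conditioning

def sample(conditioning, steps=50):
    x = rng.normal(size=conditioning.shape)       # start from pure Gaussian noise
    for t in reversed(range(steps)):
        eps = predict_noise(x, t, conditioning)   # estimate the remaining noise at step t
        x = x - eps                               # remove a little of it
    return x                                      # the denoised sample

target = np.linspace(0.0, 1.0, 8)                 # stand-in for "what real data looks like"
print(np.round(sample(target), 2))
```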

💡facial generation

A key aspect highlighted in EMO is the AI's ability to accurately animate facial movements and expressions from audio input, matching the emotional cadence of the voice.

💡audio to video generation

The core innovation discussed is EMO's ability to generate talking-head video from only an image and vocal audio, with no other inputs or preprocessing needed.

💡limitations of CGI

The paper explains the shortcomings of previous computer-generated video attempts, which often fail to capture nuanced facial expressions and emotion, and which EMO aims to solve.

💡programming obsolescence

A later segment argues that innovations like EMO, together with models that can be directed through plain language instead of code, may soon make programming as we know it obsolete.

💡natural language programming

With AI assistants that can parse commands in natural language, the video claims everyone can become a 'programmer' without needing to code if language models advance sufficiently.
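
To ground the claim that natural language becomes the programming interface, here is a small sketch in which the "program" is just an English sentence handed to a chat-completion model. The client setup and model name are illustrative placeholders (any OpenAI-compatible provider would work); the transformation logic lives in the instruction, not in Python.

```python
# Sketch: a plain-English instruction acting as the "program".
# The provider, model name, and environment variables are placeholders.
import json
import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["LLM_API_KEY"])   # any chat-completion provider

def run_instruction(instruction, data):
    """Send an English instruction plus data; the model does the work a script would."""
    resp = client.chat.completions.create(
        model=os.environ["LLM_MODEL"],                # whichever chat model you use
        messages=[
            {"role": "system", "content": "Follow the user's instruction and reply with JSON only."},
            {"role": "user", "content": f"{instruction}\n\nData: {json.dumps(data)}"},
        ],
    )
    return resp.choices[0].message.content

# The "code" is this sentence:
print(run_instruction(
    "Group these names by their first letter and sort each group alphabetically.",
    ["Mona", "Matthew", "Jensen", "Sora"],
))
```

The skill that stays scarce, as the video argues, is stating the problem precisely rather than writing the transformation by hand.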

💡skill shifting

Rather than coding itself, the video advises that developing problem-solving skills and learning to formulate effective instructions for language models will be the critical skills for the future.

💡synthetic media detection

While not stated directly, the subtext of advances like EMO is that as AI-generated media becomes more sophisticated, we need better tools and awareness to determine authenticity.

Highlights

Emo creates realistic fake videos of people talking or singing using only a single image and an audio clip

The generated videos have expressive facial expressions and head movements that match the audio, going beyond just lip sync

Emo uses a novel diffusion model approach to generate the frames rather than explicit control signals

The model was trained on a diverse, multilingual dataset of over 250 hours of video footage and more than 150 million images

Programming as we know it may die as AI makes it easier for non-programmers to create things

With large language models, natural language becomes the universal programming language

Transcripts

play00:00

what do you do when your eyes can no

play00:02

longer be trusted when you think you're

play00:05

looking at something real but it clearly

play00:08

can't be real what if everybody is

play00:11

empowered to create anything they want

play00:14

let me just pause for a second and show

play00:17

you something that will absolutely blow

play00:19

your

play00:23

mind do to get it through you I'm super

play00:25

human inovative and I'm made a so that

play00:26

anything you say is off of me and you

play00:28

and devastating more than never

play00:29

demonstrating how give them an audience

play00:30

I feeling like it's levitating never

play00:32

fting and I know the ha is forever

play00:33

waiting for the day to think I fell off

play00:34

to be celebrating cuz I know the way to

play00:35

get motivated I make elevating music you

play00:37

make elevator music oh he's too

play00:39

mainstream well that's what they do

play00:40

jealous canuse it it's not hip hop

play00:42

it's pop cuz I found a h way to fuse it

play00:45

an AA with an AK Mele and said it like a

play00:47

play but a v Retreat like a vac Mayday

play00:50

this beat is cray cray RJ ha H ha

play00:52

laughing all the way to the bank I spray

play00:54

Flames I cannot tame or plate the

play00:55

monster you get in my way I'mma feed you

play00:57

to the monster normal during day but at

play01:00

night turn into a monster when the moon

play01:02

shines like I trucker we

play01:06

were we

play01:08

were kind the dream that can be

play01:12

so we were right till we W build a home

play01:18

that was

play01:21

burn I didn't want to leave I did a full

play01:27

180

play01:28

crazy thinking about the way I want did

play01:32

the Heart Break Change

play01:35

Me Maybe but look at where I end up I'm

play01:41

all good already so moved on it's scary

play01:45

I'm not

play01:46

where now what you're seeing there was

play01:49

obviously fake but how real did it look

play01:53

this new paper called emo from the

play01:55

Alibaba group promises to give everybody

play01:58

the ability to Simply take an image take

play02:01

a piece of audio or song and make the

play02:04

person in the image sing that song or at

play02:06

least look like they are and it looks

play02:08

really good so let's take a step back

play02:10

and look how they actually accomplished

play02:12

this because all of a sudden we're not

play02:15

going to be able to believe anything

play02:16

that we're seeing online emo emote

play02:19

portrait alive generative expressive

play02:22

portrait videos with audio to video

play02:24

diffusion model under weak conditions a

play02:27

lot of words but all that means is you

play02:30

upload an image you upload some audio

play02:32

and the image will look like it's either

play02:35

speaking the dialogue or singing the

play02:37

song so here's an example we have audio

play02:40

input right here talking speaking

play02:42

singing and then we have four base

play02:44

images this first one is from an old

play02:47

movie I forget what it is but it's from

play02:49

an old movie this one is obviously the

play02:51

Mona Lisa this one is from the Sora

play02:54

video so this is AI generated and that

play02:57

is super cool to think about for a

play02:59

second an AI generated woman who is who

play03:03

walks around a virtual world looks

play03:05

completely real can now look like she's

play03:07

singing or talking anything that we want

play03:10

then this final one I'm sorry I don't

play03:12

know who this is maybe anime it says

play03:15

anime right there but that's it you take

play03:17

that base image and it looks like they

play03:19

are talking or singing now usually when

play03:21

you see something like this you just see

play03:24

the lips moving but it's so much more

play03:27

than that as they're singing as they're

play03:29

speaking their facial expression changes

play03:32

the tilt of their head changes obviously

play03:36

the mouth and the lips moving match the

play03:39

words super impressive now let's take a

play03:42

look at this in this first example we

play03:44

have the input audio but it's going to

play03:46

be two things first we have a song a

play03:48

very popular song and then in the second

play03:49

example from the same image we have

play03:52

somebody

play03:53

talking I never knew you were the

play03:56

someone waiting for me

play04:00

we were just kids when we fell in love

play04:05

not knowing what is when I was a kid I

play04:09

feel like you heard the thing you heard

play04:11

the term don't cry you don't need to

play04:15

cry crying is the most beautiful thing

play04:17

you can do I encourage people to cry I

play04:20

cry all the time and I think it's the

play04:22

most he and take a look at this you

play04:26

remember the woman from the Tokyo

play04:29

landscape the Sora generated video

play04:31

completely fake not a real person all of

play04:33

a sudden we have this person speaking up

play04:36

here which is the vocal Source we have

play04:38

the reference image and then look how

play04:40

incredible the generated video is

play04:43

everything looks perfect it could be her

play04:46

speaking Yeah I think this is right now

play04:48

an inflection point where we're sort of

play04:50

you know redefining how we interact with

play04:53

with uh digital information and it's

play04:57

it's through you know the form of this

play04:59

AI systems that we collaborate with and

play05:03

uh maybe we have several of them and

play05:05

maybe they all have different

play05:07

competences and maybe we have a general

play05:10

one that kind of follows us around

play05:12

everywhere knows everything about uh you

play05:15

know my context what I've been up to

play05:18

today um what my goals are um sort of in

play05:22

life so let's take a look at how this

play05:24

actually works but first a quick message

play05:26

from today's sponsor wow look at this

play05:29

okay so you're probably wondering what I

play05:31

am so impressed with and what I'm

play05:33

actually talking about is Groq you've

play05:36

probably heard of them recently because

play05:38

of their insane inference speeds Groq

play05:41

is the creator of the world's first lpu

play05:44

inference engine lpu stands for language

play05:46

processing unit and it is a brand new

play05:48

architecture for large language models

play05:50

and generative Ai and it runs faster

play05:53

than anything else I've ever seen I'm

play05:55

talking about 500 plus tokens per second

play05:58

check out these recent inference speed

play06:00

benchmarks by any scale and artificial

play06:03

analysis Groq is leading the pack so

play06:06

let's check out the speed with this

play06:07

quick prompt I'm going to ask it tell me

play06:09

about Moe llms but explain it to me like

play06:11

I'm 7 years old so I can easily

play06:13

understand it look at how fast that is

play06:15

518 tokens per second now translate your

play06:18

answer to French boom French 520 tokens

play06:21

per second now imagine powering AI

play06:23

agents with that type of inference speed

play06:25

so I'm going to be covering a lot more

play06:27

about Groq in the coming weeks because

play06:28

I'm so excited about it but in the

play06:30

meantime if you're wondering how to get

play06:31

access go to this link right here I'm

play06:34

also going to drop the link in the

play06:35

description below so you can click it

play06:37

there so go to gr. linkman API to get

play06:40

access today thanks so much to the

play06:42

sponsor of this video Groq and now back

play06:44

to the video we have the reference image

play06:47

then they add the motion frames and the

play06:50

motion frames are generated during this

play06:52

diffusion process and this is actually a

play06:55

pretty complex process so they take the

play06:57

audio the face recognition

play07:00

the noisy latent layer the head speed

play07:02

the speed encoder the wav2vec all

play07:05

things that we've talked about in

play07:06

previous videos put it all together and

play07:08

they're able to get the generated frames

play07:11

and then they pass that back layer on

play07:13

the audio and then you have the final

play07:16

result and here's the paper let's take a

play07:19

quick look we propose emo an expressive

play07:21

audio-driven portrait video generation

play07:24

framework input a single reference image

play07:26

and the vocal audio talking and singing

play07:29

our method can generate vocal Avatar

play07:32

videos with expressive facial

play07:34

expressions and various head poses

play07:37

meanwhile we can generate videos with

play07:38

any duration depending on the length of

play07:41

the input audio that any duration piece

play07:43

is really interesting because typically

play07:46

when you're doing text to video you have

play07:48

a pretty hard limit right now Sora is 60

play07:51

seconds for example and the abstract

play07:53

says we tackle the challenge of

play07:55

enhancing the realism and expressiveness

play07:57

in talking head video Generation by

play07:59

focusing on the dynamic and Nuance

play08:01

relationship between audio cues and

play08:02

facial movements and that's really their

play08:05

unique Innovation here they've been able

play08:08

to read the audio and understand when

play08:10

somebody's speaking or singing what

play08:12

their head will likely look like as

play08:15

they're doing those things then they

play08:17

take that understanding and apply it to

play08:19

the base image now here's something

play08:21

really important we identify the

play08:23

limitations of traditional techniques

play08:24

that often fail to capture the full

play08:26

spectrum of human expressions and the

play08:28

uniqueness of individual facial styles

play08:31

that is what I was talking about

play08:33

typically when you have an avatar

play08:35

software and you're giving it something

play08:36

to say you only really see the mouth

play08:39

moving with some very basic head

play08:41

movement but this takes it to another

play08:44

level they also go on to say this

play08:46

approach eliminates the need for

play08:48

intermediate representations or complex

play08:50

pre-processing streamlining the creation

play08:53

of talking head videos that exhibit a

play08:55

high degree of visual and emotional

play08:57

Fidelity closely aligned with the

play08:59

nuances present in the audio input again

play09:03

the fact that that takes no

play09:05

preprocessing is absolutely insane and

play09:08

their Discovery is audio signals are

play09:11

rich in information related to facial

play09:13

expressions and that makes sense

play09:15

literally as I was just saying facial

play09:17

expressions my eyebrows went up cuz I

play09:19

was excited about that so it really does

play09:22

translate and I had never thought of

play09:24

that very cool Insight theoretically

play09:27

enabling models to generate a diverse

play09:29

array of expressive facial movements

play09:31

however integrating audio with the

play09:33

diffusion models is not a straightforward

play09:34

task due to the ambiguity inherent in

play09:37

the mapping between audio and facial

play09:39

expression and that is what they solved

play09:41

how do you map the audio with how the

play09:44

face looks and not only that how do you

play09:47

take a static image and generate video

play09:49

from that static image that represents

play09:51

the different facial movements so

play09:53

typically and here's what has happened

play09:55

previously this issue can lead to

play09:57

instability in the videos produced by

play09:59

the model manifesting as facial

play10:01

distortions or jittering between video

play10:03

frames and in severe cases may even

play10:04

result in the complete collapse of the

play10:06

video and to fix that they have

play10:09

Incorporated stable control mechanisms

play10:11

into our model namely a speed controller

play10:14

and a face region controller to enhance

play10:17

stability during the generation process

play10:20

and how did they actually train their

play10:21

model we constructed a vast and diverse

play10:24

audio video data set amassing over 250

play10:27

hours of footage and just pausing for a

play10:30

second 250 hours doesn't really seem

play10:32

like that much for any modern model and

play10:35

more than 150 million images now that's

play10:37

a lot this expansive data set

play10:39

encompasses a wide range of content

play10:41

including speeches film and television

play10:43

clips and singing performances and

play10:46

covers multiple languages such as

play10:47

Chinese and English and here you can see

play10:50

different benchmarks and these are

play10:52

qualitative comparisons with several

play10:53

talking head generation works so these

play10:56

are the previous ones like DreamTalk

play10:57

SadTalker Wav2Lip and then they

play11:00

have ours and we can see four out of

play11:02

the five benchmarks the new technique

play11:05

has won so what are some of the

play11:07

limitations the first limitation is it

play11:08

is more time consuming compared to

play11:10

methods that do not rely on diffusion

play11:12

models and that makes sense if you've

play11:14

ever tried to create something with

play11:15

dolly or any of the other diffusion

play11:17

models out there they take a long time

play11:19

to generate and they take a lot of

play11:20

processing power but they don't require

play11:22

preprocessing second since we do not use

play11:25

any explicit control signals to control

play11:26

the character's motion it may result in

play11:28

the inadvertent generation of

play11:30

other body parts such as hands leading

play11:32

to artifacts in the video because

play11:33

they're not actually controlling for

play11:34

different parts of the body and they

play11:36

really just want the face and the head

play11:38

they sometimes suffer from those body

play11:40

parts appearing in the video let me just

play11:42

play a couple more examples for you you

play11:44

want to know how I got these

play11:47

scars my father

play11:50

was a drinker and a

play11:54

fiend and one night he goes off crazier

play11:58

than usual

play11:59

Mommy gets the kitchen knife to defend

play12:01

herself he doesn't like that not one bit

play12:08

so me watching he takes the knife to her

play12:12

laughing while he does it he turns to me

play12:16

and he says why so

play12:20

serious he comes at me with the knife

play12:24

why so

play12:27

serious sticks the blade in my

play12:30

mouth let's put a smile on that

play12:35

face

play12:39

and why so

play12:42

serious yes one and in this manner he

play12:47

was to imagine me his love his mistress

play12:51

and I set him every day to W me at which

play12:54

time would I being but a moonish youth

play12:58

grieve be a feminite changeable longing

play13:02

and liking proud Fantastical a

play13:07

shallow now one last thing I want to

play13:09

talk about in this video and it may seem

play13:12

unrelated at first but in fact stay with

play13:15

me and you'll see that it is very

play13:17

related now we have a video from Jensen

play13:19

Huang and he is the CEO of Nvidia now the

play13:23

third largest company in the world and

play13:25

he argues that we should stop saying

play13:27

kids should learn to code I've been

play13:29

saying for a while that programming is

play13:31

going to die and maybe I should add some

play13:33

Nuance programming as we know it is

play13:36

going to die and there's a couple ways

play13:38

to interpret that but let me play Jensen

play13:40

Huang's video and then I'll speak a

play13:42

little bit more about it I'm going to

play13:43

say something and it's it's going to

play13:45

sound completely opposite of what people

play13:47

feel over the course of the last 10

play13:48

years 15 years um almost everybody who

play13:51

sits on a stage like this would tell you

play13:53

it is vital that your children learn

play13:55

computer science um everybody should

play13:58

learn how to program and in fact it's

play13:59

almost exactly the opposite it is our

play14:02

job to create Computing technology such

play14:05

that nobody has to program and that the

play14:08

programming language is human everybody

play14:10

in the world is now a programmer this is

play14:13

the miracle of artificial intelligence

play14:15

the countries the people that understand

play14:17

how to solve a domain problem in digital

play14:20

biology or in education of young people

play14:24

or in manufacturing or in farming those

play14:26

people who understand domain expertise

play14:29

now can utilize technology that is

play14:32

readily available to you you now have a

play14:34

computer that will do what you tell it

play14:35

to do it is vital that we upskill

play14:37

everyone and the upskilling process I I

play14:40

believe will be delightful surprising

play14:43

okay so what he's saying is something

play14:45

that I've been talking about for a

play14:46

little while now first I want to get

play14:48

this out of the way I am still going to

play14:51

teach my kids how to code simply because

play14:54

it actually helps you think about

play14:56

thinking it helps you think think better

play14:59

it helps you think systematically and

play15:02

all of these skills are very important

play15:05

when you're problem solving but as he's

play15:07

saying problem solving is the bigger

play15:11

value learning how to solve problems

play15:13

learning how to get what you want from a

play15:15

large language model is going to be

play15:18

super important now if programming is

play15:20

dead and nobody are programmers then

play15:23

technically everybody is a programmer at

play15:25

that point because natural language is

play15:28

the language of computers and that is

play15:31

going to be the case with large language

play15:32

models and artificial intelligence so

play15:35

how does this relate to the emo project

play15:37

well it seems faster than most people

play15:39

are predicting things that were

play15:42

previously really hard to do such as

play15:45

creating video Sora creating video games

play15:49

Genie creating avatars that look

play15:52

realistic emo controlling robots mimic

play15:55

gen and creating true AI to live their

play15:58

lives Voyager and Minecraft all of these

play16:01

things are becoming easier and easier

play16:04

and as all of these problems get solved

play16:06

with large language models the language

play16:08

of large language models is going to be

play16:10

natural language so I still encourage

play16:12

you if you don't know how to code learn

play16:14

the basics you don't need to go that

play16:16

deep into it anymore and I'm definitely

play16:18

going to teach my kids how to code just

play16:20

so they know how to think systematically

play16:22

if you enjoyed this video please

play16:24

consider giving a like And subscribe and

play16:26

I'll see you in the next one
