Ilya Sutskever | This will all happen next year | I totally believe | AI is coming
Summary
TLDR: This video transcript explores the importance of multimodal learning and its influence on the development of neural networks. Multimodal learning not only strengthens a neural network's visual understanding but also lets it learn about the world more completely through images rather than text alone. The discussion notes that the number of words a human hears in a lifetime is limited, so knowledge must be enriched through vision and other information sources. It also covers the potential of AI-generated tests for training other AIs, and the possibility of future AI self-improvement. Finally, it discusses the reliability and future development of large language models, emphasizing the importance of making systems more trustworthy and better at following user intent.
Takeaways
- 📈 Multimodality is especially useful for neural networks because the world is highly visual and humans are visual animals, with roughly a third of the human cortex devoted to vision.
- 🧠 A human hears only about one billion words in a lifetime, which underscores the importance of learning from other information sources such as vision.
- 🌐 Neural networks can learn about the world from text alone; even without direct visual input, they can grasp concepts such as color.
- 🔍 Through vision we can learn the structure, physics, and dynamics of the world, and audio can give a model an additional source of information to learn from.
- 📊 On math competition problems, visual input significantly improved the neural network's problem-solving accuracy.
- 🤖 By learning from both vision and text, neural networks can reason and communicate visually, and in the future they may explain problems with images rather than words.
- 🔑 Future development of language models will focus on improving reliability and trust, ensuring the accuracy and completeness of their outputs.
- 🔄 Neural networks may train themselves on data they generate, much as humans learn through self-reflection and brain activity during sleep.
- 🚀 GPT-4 stands out in reliability, in solving math problems, and in following instructions, and notably can explain jokes and memes on the vision side.
- 🎯 The success of neural networks vindicated the early ideas about artificial neurons and learning algorithms; that they actually work is the biggest surprise of the past 20 years.
- 🌟 The compute used to train neural networks grew a millionfold over the past 10 years, an almost unbelievable achievement in computer science.
Q & A
What is the importance of multimodality in neural networks?
-Multimodality matters greatly for neural networks because it adds visual input, allowing the network to understand and interpret the world better. Humans are highly visual creatures, with roughly a third of the cortex devoted to visual processing, so multimodality can significantly increase a neural network's usefulness.
Why does learning from images give us a deeper understanding of the world?
-Learning from images adds knowledge beyond what text alone provides. For example, even without direct visual experience, text can indirectly convey that red is more similar to orange than to blue.
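The idea that visual facts leak into text can be sketched with a toy experiment. The corpus below is invented purely for illustration: it builds sentence co-occurrence vectors for color words and compares cosine similarities, mimicking (very crudely) how a text-only model can learn that red is "closer" to orange than to blue.

```python
from collections import Counter
from math import sqrt

# Tiny invented corpus; in real text, "red" and "orange" co-occur with
# similar context words (fire, sunset, warm) far more than "red" and "blue" do.
corpus = [
    "the red sunset glowed like fire",
    "an orange sunset glowed warm like fire",
    "the red fruit looked warm and ripe",
    "the orange fruit looked warm and ripe",
    "the blue ocean looked cold and deep",
    "the blue sky stretched cold and clear",
]

def context_vector(word, sentences):
    """Count the words that appear in the same sentence as `word`."""
    counts = Counter()
    for s in sentences:
        tokens = s.split()
        if word in tokens:
            counts.update(t for t in tokens if t != word)
    return counts

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = sqrt(sum(v * v for v in a.values()))
    nb = sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

red = context_vector("red", corpus)
orange = context_vector("orange", corpus)
blue = context_vector("blue", corpus)

print(cosine(red, orange) > cosine(red, blue))  # True
```

Real language models learn far richer representations than co-occurrence counts, but the mechanism is the same in spirit: similarity structure in the world leaves a statistical trace in text.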
About how many words does a person hear in a lifetime?
-About one billion. That may sound like a lot, but it is not: a billion seconds is roughly 30 years, we sleep half the time, and we hear only a few words per second.
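The back-of-the-envelope numbers can be checked directly. This is just a sketch of the interview's rough assumptions: a 60-year span, half of it asleep, and on the order of one word heard per second.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600           # ~3.15e7 seconds

# A billion seconds is roughly 30 years.
years_per_billion_seconds = 1e9 / SECONDS_PER_YEAR
print(round(years_per_billion_seconds, 1))   # ~31.7 years

# Rough lifetime word budget under the interview's assumptions.
lifetime_seconds = 60 * SECONDS_PER_YEAR     # a 60-year span, say
awake_fraction = 0.5                         # asleep half the time
words_per_second = 1                         # order-of-magnitude estimate
lifetime_words = lifetime_seconds * awake_fraction * words_per_second
print(f"{lifetime_words:.1e}")               # on the order of 1e9 words
```

The point survives any reasonable choice of constants: the budget lands within a small factor of a billion words, versus the trillions of tokens a large model trains on.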
Why can neural networks learn knowledge about the world from large amounts of text?
-Even though a neural network may never have directly seen anything, it can learn about the world from vast amounts of text, because text contains indirect information about the world, including information that is visual in origin.
Why does multimodal learning increase a neural network's usefulness?
-It lets the network learn from multiple information sources rather than text alone. Through vision, for example, a network can learn concepts such as color, shape, and the relationships between objects.
Why is audio also useful for a neural network's learning?
-Audio is another information source that helps a network understand the emotion and context of language, such as distinguishing a sarcastic tone from an enthusiastic one. It may not be as rich as images or video, but it is still a valuable supplementary source.
How do GPT-3 and GPT-4 differ in handling math problems?
-GPT-4 performs significantly better than GPT-3. For example, on the AMC 12 math competition, adding visual input raised GPT-4's success rate to about 40%, up from a text-only rate in the 2-20% range. This shows that visual information matters greatly for a network's problem-solving ability.
Why is neural-network reliability an important direction for future research?
-Reliability is the ability to be trusted to complete tasks accurately. If a neural network can reliably recognize important information and follow the user's intent, its usefulness increases greatly.
In what ways did GPT-4 show surprising skills?
-GPT-4 surprised in many ways, including solving complex math problems, writing poetry under constraints, explaining jokes and memes, and interpreting complex images and diagrams on the vision side.
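One of the instruction-following feats mentioned, poems where every word starts with the same letter, is easy to verify mechanically. A minimal checker (the sample lines are made up for illustration):

```python
def is_alliterative(text: str, letter: str) -> bool:
    """True if every word in `text` starts with `letter` (case-insensitive)."""
    words = text.split()
    return bool(words) and all(
        w.lstrip('"\'(').lower().startswith(letter.lower()) for w in words
    )

print(is_alliterative("Silver swans swim silently southward", "s"))  # True
print(is_alliterative("Silver swans fly silently southward", "s"))   # False
```

Checkers like this are exactly why constrained generation makes a good benchmark: the constraint is hard for a model to satisfy but trivial for a program to grade.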
Why might self-generated data become an important part of training future AI?
-Self-generated data would let an AI keep learning and improving without new external data. Much as humans train their brains through self-reflection and working through problems, an AI could improve itself by generating adversarial content or by solving new problems.
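The self-improvement idea can be caricatured as a generate-verify-retrain loop. The toy below is entirely hypothetical (not OpenAI's method): a "model" proposes arithmetic problems with sometimes-wrong answers, a trusted verifier checks them, and only verified pairs are kept as new training data.

```python
import random

def propose_problem(rng):
    """The 'model' generates a training example: a problem and its guessed answer."""
    a, b = rng.randint(1, 99), rng.randint(1, 99)
    # 80% of the time the guess is right; otherwise it is off by a small amount.
    guess = a + b if rng.random() < 0.8 else a + b + rng.randint(1, 5)
    return (f"{a}+{b}", guess)

def verify(problem, answer):
    """Trusted checker: for arithmetic we can simply evaluate the expression."""
    a, b = map(int, problem.split("+"))
    return a + b == answer

rng = random.Random(0)
dataset = []
for _ in range(1000):
    problem, guess = propose_problem(rng)
    if verify(problem, guess):  # keep only self-generated data that checks out
        dataset.append((problem, guess))

print(len(dataset) > 0 and all(verify(p, g) for p, g in dataset))  # True
```

The design point is the verifier: self-generated data is only safe to train on when some independent signal (a proof checker, a compiler, an adversarial game's win condition) filters out the model's own mistakes.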
In the next year or two, in which areas are large language models likely to make significant progress?
-We can expect significant progress in reliability and in understanding user intent, which will make the technology more trustworthy and applicable to many more areas.
Outlines
👀 The importance of multimodality and visual understanding
The first segment discusses the importance of multimodality for neural networks, vision in particular. Humans are visual animals, with a large share of the cortex devoted to visual processing. The speaker notes that although neural networks are useful even without vision, adding vision greatly increases their usefulness. Learning from images also gives a fuller picture of the world, though not always in obvious ways: for instance, while we may hear only about a billion words in a lifetime, vision supplies far more information. Color is used as an example to show that even without direct visual experience, a neural network can learn about colors indirectly from text.
🔊 Audio as a complementary source of information
The second segment discusses audio as a complementary information source, even though it may be less rich than images or video. The speaker notes its value for both recognition and production, and cites GPT-3's and GPT-4's performance on multimodal tests to show how much vision improves accuracy. The segment also looks ahead to neural networks understanding the world through vision and communication, perhaps explaining problems by generating diagrams. Finally, it raises the idea of AI training itself on data it generates, hinting that this may be an important direction for future AI development.
🤖 The future and reliability of language models
The third segment focuses on the future of language models, reliability in particular. The speaker stresses the importance of trustworthy output, including asking for clarification when the system does not understand, or stating clearly when it needs more information; improving on these fronts would greatly increase the systems' practical value. The speaker also shares his surprise at GPT-4's reliability, math-problem solving, and instruction following, as well as its ability to explain jokes and memes on the vision side. Finally, he reflects on 20 years of work in the field and expresses astonishment that the ideas behind neural networks actually worked.
🎉 Closing conversation and celebration of achievements
In the final segment, the interviewer praises Ilya's work, calling his account of large language models one of the best beyond-PhD explanations of the state of the art. They celebrate his pioneering work in computer vision and on the GPT models, and pay tribute to his 20-year career. The conversation ends with praise for Ilya's achievements and anticipation for the future.
Keywords
💡Multimodality
💡Neural networks
💡Visual cortex
💡Information sources
💡Color
💡Synthetic data
💡Reliability
💡Intent recognition
💡Self-learning
💡GPT-3 and GPT-4
💡Visual reasoning
Highlights
The importance of multimodality: multimodality, vision in particular, is especially useful for neural networks because the world is highly visual and humans are visual animals.
Multimodality increases usefulness: with vision, a neural network's usefulness rises substantially.
Humans learn about the world through images: beyond text, images are an important way we understand the world.
A lifetime's word budget is limited: only about one billion words, underscoring the importance of learning from other information sources.
Neural networks learn from massive text: they can process trillions of words, making it easier to learn about the world.
Color example: even without direct visual experience, a text-only neural network can learn the relationships between colors.
Visual information leaks slowly through text: large amounts of text can convey visual information, though not as quickly as direct visual learning.
The importance of multimodal learning: images and video matter as information sources beyond text.
The role of video and sound in understanding the world: through video and sound we can learn the world's structure and physics.
The value of audio as an information source: audio adds information beyond images and video.
Multimodality's contribution on GPT-3 vs. GPT-4 tests: adding vision significantly raised problem-solving success rates.
The importance of visual reasoning and communication: vision not only helps us understand the world but also supports reasoning and communication.
AI-generated tests for training AI: raises the possibility of AI training itself on data it generates.
The future of language models: predicts progress in reliability and trustworthiness.
GPT-4's reliability and problem solving: GPT-4 excels at understanding questions and solving math problems.
GPT-4's visual ability: GPT-4 can explain jokes and memes, showing advanced visual understanding.
The fundamentals of neural networks: the basic ideas turned out to be correct and have endured throughout AI's development.
Exponential growth of compute: the compute used to train neural networks grew a millionfold over the past decade.
Transcripts
So there are two dimensions to multimodality, two reasons why it is interesting. The first reason is a little bit humble: multimodality is useful. It is useful for a neural network to see, vision in particular, because the world is very visual. Human beings are very visual animals; I believe a third of the human cortex is dedicated to vision. So by not having vision, the usefulness of our neural networks, though still considerable, is not as big as it could be. It is a very simple usefulness argument: it is simply useful to see, and GPT-4 can see quite well.

There is a second reason to do vision, which is that we learn more about the world by learning from images in addition to learning from text. That is also a powerful argument, though it is not as clear-cut as it may seem, and I'll give you an example. Or rather, before giving an example, I'll make a general comment. As human beings, we get to hear about one billion words in our entire life.

Only one billion words? That's amazing.

That's not a lot.

Does that include my own words in my own head?

Make it two billion, but you see what I mean. We can see that because a billion seconds is about thirty years, we don't get to hear more than a few words a second, and we are asleep half the time. So a couple of billion words is the total we get in our entire life, and it becomes really important for us to get as many sources of information as we can; we absolutely learn a lot more from vision. The same argument holds true for our neural networks as well, except that a neural network can learn from so many more words. Things which are hard to learn about the world from text in a few billion words may become easier from trillions of words, and I'll give you an example.
Consider colors. Surely one needs to see to understand colors, and yet text-only neural networks, which have never seen a single photon in their entire life, if you ask them which colors are more similar to each other, will know that red is more similar to orange than to blue, and that blue is more similar to purple than to yellow. How does that happen? One answer is that information about the world, even visual information, slowly leaks in through text. Slowly, not as quickly, but when you have a lot of text you can still learn a lot. Of course, once you also add vision and learn about the world from vision, you will learn additional things which are not captured in text. But I would not say it is binary, that there are things which are impossible to learn from text only; I think of it more as an exchange rate. In particular, if you are like a human being and you want to learn from a billion words, or a hundred million words, then the other sources of information become far more important.

So you learn from images. Is there a sensibility that would suggest that if we wanted to understand the construction of the world as well, that the arm is connected to my shoulder, that my elbow is connected, that these things move, the animation of the world, the physics of the world, can I just watch videos and learn that?

Yes.

And if I wanted to augment all of that with sound: for example, the meaning of "great". "Great" could be sarcastic or "great" could be enthusiastic, and there are many, many words like that; "that's sick", or "I'm sick", depending on how people say it. Would audio also make a contribution to the learning of the model, and could we put that to good use soon?

Yes, I think it's definitely the case. What can we say about audio? It's useful, an additional source of information, probably not as much as images or video, but there is a case to be made for the usefulness of audio as well, both on the recognition side and on the production side.
In the context of the scores that I saw, the thing that was really interesting was the data that you published: which tests GPT-3 performed well on, and which tests GPT-4 performed substantially better on. How did multimodality contribute to those tests, do you think?

Oh, in a pretty straightforward way: any test where, to understand the problem, you need to look at a diagram. For example, in some math competitions, like the math competition for high school students called AMC 12, presumably many of the problems have a diagram. GPT-3.5 does quite badly on that test. GPT-4 with text only does, I don't remember exactly, maybe from 2% to 20% success rate. But when you add vision, it jumps to a 40% success rate, so the vision is really doing a lot of work. The vision is extremely good, and I think being able to reason visually as well, and to communicate visually, will also be very powerful; very nice things which go beyond just learning about the world. There are several things here: you can learn about the world, you can then reason about the world visually, and you can communicate visually. In some future version, if you ask your neural net, "Hey, explain this to me," then rather than just producing four paragraphs it will produce a little diagram which clearly conveys exactly what you need to know.
That's incredible. One of the things you said earlier, about an AI generating a test to train another AI: there was a paper written, and I don't completely know whether it's factual or not, saying that there is a total of somewhere between 4 trillion and something like 20 trillion useful language tokens that the world will be able to train on over some period of time, and that we are going to run out of tokens to train on. First of all, I wonder if you feel the same way. And secondly, could the AI generating its own data be used to train the AI itself? You could argue that's a little circular, but we train our brains with generated data all the time, by self-reflection, by working through a problem in our heads, or, as I guess some neuroscientists suggest, while sleeping; we do a fair amount of developing our neurons that way. How do you see this area of synthetic data generation? Is that going to be an important part of the future of training AI, the AI teaching itself?

Well, I wouldn't underestimate the data that exists out there; I think there is probably more data than people realize. As to your second question, it is certainly a possibility; it remains to be seen.
It really does seem that one of these days our AIs, when we're not using them, might generate either adversarial content to learn from, or imagined problems to solve, and then go off and improve themselves. Tell us whatever you can about where we are now, and where you think we will be in the not-too-distant future; pick your own horizon, a year or two. What do you think this whole language model area will look like, in some of the areas you are most excited about?

You know, predictions are hard, and it's a little difficult to say things which are too specific. I think it's safe to assume that progress will continue, and that we will keep seeing systems which astound us in the things they can do. The current frontiers will be centered around reliability, around whether the system can be trusted: really getting to a point where you can trust what it produces, where if it doesn't understand something it asks for a clarification, says that it doesn't know something, or says that it needs more information. Those are perhaps the areas where improvement will lead to the biggest impact on the usefulness of these systems, because right now that's really what stands in the way. You ask a neural net to summarize some long document and you get a summary; are you sure that some important detail wasn't omitted? It's still a useful summary, but it's a different story when you know that all the important points have been covered. If there is some ambiguity, that's fine, but if a point is clearly important, such that anyone who saw it would say "this is really important," then the neural network should also recognize that reliably. The same goes for the guardrails, and for its ability to clearly follow the intent of the user, of its operator. So I think we'll see a lot of that in the next two years.

That's terrific, because progress in those two areas will make this technology trusted by people and able to be applied to so many things.
I was thinking that was going to be the last question, but I did have another one; sorry about that. From ChatGPT to GPT-4: when you first started using GPT-4, what are some of the skills it demonstrated that surprised even you?

Well, there were lots of really cool and surprising things it demonstrated; it was quite good. Let me think about the best way to go about it. The short answer is that the level of its reliability was surprising. The previous neural networks, if you asked them a question, might sometimes misunderstand something in a kind of silly way; with GPT-4 that stopped happening. Its ability to solve math problems became far greater: you could really say, do the derivation, a long complicated derivation, convert the units, and that was really cool. As many people noticed, it works through proofs; it's pretty amazing. Not all proofs, naturally, but quite a few. Another example: many people noticed its ability to produce poems where every word starts with the same letter. It follows instructions really, really clearly; not perfectly still, but much better than before. And on the vision side, I really love how it can explain jokes. It can explain memes: you show it a meme and ask why it's funny, and it will tell you, and it will be correct. The vision part is also very strong, like it's really actually seeing, when you can ask follow-up questions about some complicated image with a complicated diagram and get an explanation. That's really cool.

But overall, to take a step back: I've been in this business for quite some time, actually almost exactly twenty years, and the thing I find most surprising is that it actually works. It turned out to be the same little thing all along, which is no longer little; it's a lot more serious and much more intense, but it's the same neural network, just larger, trained on maybe larger datasets, in different ways, with the same fundamental training algorithm. So it's like, wow. Whenever I take a step back, I go: how is it possible that those conceptual ideas, that the brain has neurons, so maybe artificial neurons are just as good, and so maybe we just need to train them somehow with some learning algorithm, turned out to be so incredibly correct? That would be the biggest surprise, I'd say.

In the ten years that we've known each other, the models you've trained and the amount of data you've trained on have grown about a million times, from what you did on AlexNet to now. No one in the world of computer science would have believed that the amount of computation done in that ten years' time would be a million times larger, and you dedicated your career to doing that. Your body of work is incredible, but two works are seminal: the co-invention of that early work with Alex, and now GPT at OpenAI. It is truly remarkable what you've accomplished. It's great to catch up with you again, Ilya, my good friend, and it is quite an amazing moment. Today's talk, the way you break down the problem and describe it, is one of the best beyond-PhD descriptions of the state of the art of large language models. I really appreciate it. It's great to see you. Congratulations.

Thank you so much. Thank you, I had so much fun.