Google's GEMINI 1.5 Just Surprised EVERYONE! (GPT-4 Beaten Again) Finally RELEASED!
Summary
TLDR: Google's newly released Gemini 1.5 model has drawn wide attention for its remarkable processing power. This iteration can handle up to 3 hours of video, 22 hours of audio, and as much as 7 million words (10 million tokens) of data, with accuracy of 99 to 100%. Gemini 1.5 improves markedly in text, vision, and audio processing, and shows exceptional understanding and analysis on long-context and multimodal tasks. Through examples such as analyzing the 402-page Apollo 11 mission transcript, working across a large body of code, and understanding and analyzing movie scenes, Gemini 1.5 demonstrates efficient handling of huge inputs and complex situations. Its release not only shows Google's leading position in AI but also opens up new possibilities for future AI applications.
Takeaways
- 😮 Google has released the groundbreaking Gemini 1.5 model, which can process up to 3 hours of video, 22 hours of audio, and 7 million words.
- 🤩 Gemini 1.5 performs strongly across text, vision, and audio, with retrieval accuracy of 99% to 100%, far beyond most existing models.
- 🧠 Gemini 1.5 can accurately understand and locate specific content in a video or text from just a simple picture or written description.
- 💪 Gemini 1.5's context window reaches 1 million tokens, greatly expanding what AI systems can be used for.
- 💻 Gemini 1.5 showed the striking ability to browse some 100,000 lines of code and modify and improve it to the user's needs.
- 🎥 Gemini 1.5 can also process hours of video content and pinpoint key information on a single frame.
- 🌍 Gemini 1.5 can use a dictionary and grammar book, as a human would, to translate accurately from English to Kalamang.
- 📊 Gemini 1.5 excels on benchmarks across math, science, coding, and instruction following, well ahead of existing models.
- 🏭 Google trained Gemini 1.5 on large amounts of TPU-accelerated compute, on multimodal and multilingual data.
- 🤯 Gemini 1.5's release will reshape the AI landscape and puts heavy competitive pressure on other companies.
Q & A
What model did Google just release?
- Google released version 1.5 of its new Gemini model family.
What is the maximum context length Gemini 1.5 can handle?
- Gemini 1.5 can take up to 3 hours of video, 22 hours of audio, or 7 million words (10 million tokens) of text.
How accurate is Gemini 1.5?
- When processing data at this scale, Gemini 1.5 reaches 99% to 100% accuracy.
How many tiers does the Gemini family have, and what is each for?
- Gemini Ultra is the largest model, for highly complex tasks; Gemini Pro is the mid-tier, with 1.0 and 1.5 iterations; the newly released Gemini 1.5 is built for larger, more tedious tasks that need a long context.
Where does Gemini 1.5 perform better?
- Compared with the earlier Gemini Pro 1.0, Gemini 1.5 improves markedly across text, vision, and audio. Compared with Gemini Ultra, Ultra stays slightly ahead on some vision and audio benchmarks.
Which examples of Gemini 1.5 handling large-scale data are shown?
- The video shows Gemini 1.5 working through the 402-page Apollo 11 transcript, the three.js example codebase, and a 44-minute Buster Keaton film.
What abilities does Gemini 1.5 show in the movie Q&A example?
- It can pinpoint the details of a specific moment in the 44-minute film, and it combines a drawing with text input for multimodal reasoning and time-code lookup.
How did Google evaluate Gemini 1.5's accuracy?
- Google used purpose-built "video haystack" and "text haystack" needle-retrieval tasks to test accuracy over huge inputs; Gemini 1.5 scored above 99% on them.
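The needle-in-a-haystack setup described above can be sketched as a small harness. This is a minimal illustration, not Google's evaluation code: `build_haystack`, `run_trial`, and `fake_model` are hypothetical names, and a real test would call the Gemini API where the stand-in model is used.

```python
import random
import re

def build_haystack(filler: str, needle: str, n_chunks: int, position: int) -> str:
    """Assemble a long context from repeated filler text with one needle hidden inside."""
    chunks = [filler] * n_chunks
    chunks.insert(position, needle)
    return "\n".join(chunks)

def run_trial(ask_model, filler: str, secret_word: str, n_chunks: int) -> bool:
    """Hide the secret word at a random depth, query the model, and check the answer."""
    needle = f"The magic word is {secret_word}."
    context = build_haystack(filler, needle, n_chunks, random.randrange(n_chunks))
    answer = ask_model(context + "\n\nWhat is the magic word?")
    return secret_word.lower() in answer.lower()

def fake_model(prompt: str) -> str:
    """Toy stand-in for a model call: just regex-search the prompt for the needle."""
    m = re.search(r"The magic word is (\w+)\.", prompt)
    return m.group(1) if m else "unknown"

# Run 20 trials over a 1,000-chunk haystack and report retrieval accuracy.
trials = [run_trial(fake_model, "Lorem ipsum dolor sit amet.", "needle", 1000)
          for _ in range(20)]
accuracy = sum(trials) / len(trials)
```

Varying `n_chunks` and the needle's depth is what produces the accuracy-versus-context-length grids reported in the paper.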
How does Gemini 1.5 perform overall?
- On the benchmarks shown, Gemini 1.5 surpasses previous versions in math, science reasoning, coding, instruction following, and more, with a large overall gain.
How was Gemini 1.5 trained?
- Gemini 1.5 was trained with Google's TPU v4 accelerators using distributed training across multiple data centers, on a variety of multimodal and multilingual data.
Outlines
😲 Google releases the industry-shaking Gemini 1.5 AI model
This section introduces the startling capabilities of Google's newly released Gemini 1.5 AI model. Gemini 1.5 can process up to 3 hours of video, 22 hours of audio, or 7 million words, with accuracy of 99% to 100%, far beyond other models; it may change everything and qualifies as a game changer. The section also places Gemini 1.5 in the Gemini family hierarchy: it sits in the mid-tier Pro line and substantially outperforms the current version.
🤯 Gemini 1.5 Pro's astonishing handling of million-token multimodal data
This section shares some astonishing Gemini 1.5 Pro examples showing its unmatched ability to process inputs hundreds of thousands of tokens long. It understands the 402-page Apollo 11 mission transcript with high accuracy and, from a prompt and a hand-drawn picture, pinpoints the time of the matching moment. In a 44-minute Buster Keaton film it finds secret information hidden in a single frame. These abilities are unprecedented.
🚀 Gemini 1.5 Pro demonstrates extraordinary multimodal AI abilities
This section summarizes further striking Gemini 1.5 Pro examples shared in the paper. The model can learn from materials such as a grammar book and dictionary and translate into a new language like a human learner, and it can match a simple hand-drawn picture to a complex scene in a novel. It also covers the video and text needle-in-a-haystack tests, in which 1.5 Pro found hidden key information with near-100% accuracy across millions of words and tens of thousands of frames while other models failed, plus benchmark results showing 1.5 Pro ahead of other models on core capabilities, coding, and more.
⭐ Gemini 1.5 Pro's engineering and outlook
This section notes that Gemini 1.5 Pro was trained on multiple TPU accelerator pods with distributed training, combining multimodal and multilingual data. It discusses the excited expectations around its power, especially the unprecedented video reasoning, and the hope that Google will intensify competition in the AI market; other companies will push back, but for now Google appears firmly in the lead. It closes by encouraging viewers to share what they hope to do with Gemini 1.5 Pro and to consider where Google goes next.
Keywords
💡Gemini 1.5
💡Ultra-long context window
💡Multimodal
💡Benchmarks
💡Long-context understanding
💡Enhanced coding ability
💡Video, audio, and text processing
💡Training process
💡AI competition
💡Novel applications
Highlights
Google unexpectedly released the Gemini 1.5 model, the latest iteration of its Gemini model family.
Gemini 1.5 is a giant model that can process 3 hours of video, 22 hours of audio, or 7 million words at once.
Gemini 1.5 reaches 99 to 100% accuracy, which is remarkable.
Gemini 1.5 beats Gemini Pro 1.0 in text, vision, and audio.
Gemini Ultra is slightly stronger than Gemini 1.5 in vision and audio, but overall Gemini 1.5 is the more capable release.
Gemini 1.5 accurately retrieved comedic moments from the 402-page Apollo 11 mission transcript.
Given just a simple picture, Gemini 1.5 correctly identified the moment that frame corresponds to.
Gemini 1.5 precisely picked out character-animation examples from thousands of lines of code.
Gemini 1.5 can modify code to a requested change and preview the resulting effect.
Given a 44-minute Buster Keaton film, Gemini 1.5 found a note in it and stated its exact text and time code.
From a simple drawing, Gemini 1.5 inferred the corresponding scene and time in the film.
Gemini 1.5 translates accurately from English to Kalamang at the level of a human who learned from the same materials.
Gemini 1.5 can accurately reason out the scene corresponding to a single frame of a very long film.
Gemini 1.5 reaches up to 99% accuracy on long-context reasoning and retrieval.
Gemini 1.5 outperforms GPT-4 on a range of benchmarks including math, science reasoning, and coding.
Transcripts
so Google actually did just surprise
Everyone by releasing Gemini 1.5 and
this is their latest iteration of their
family of Gemini models and this is a
rather surprising model in the fact that
it is able to do something incredible
Gemini 1.5 is the Behemoth that is able
to take up to 3 hours of video in a
single context length it's also able to
take 22 hours of audio at once it's also
able to take up to 7 million words or 10
million tokens with remarkable accuracy
as well because lots of the time when we
see new models appear many of the times
what happens is that their accuracy
rates are very very underwhelming and
Gemini is just outstanding because on
these capabilities their accuracy rate
is around 99 to 100% so that is
absolutely incredible this multimodal
model is going to change everything and
let's take a look at some of the things
you do need to know because once you see
a few videos you're going to be truly
surprised by how good this AI really is
so where is Gemini 1.5 before we dive
into some of the examples of how good
this AI system is where does it fit so
on the left hand side you can see Gemini
Ultra our most capable and largest model
for highly complex tasks and then in the
middle you can see Gemini Pro and you've
got two iterations Gemini 1.0 and Gemini
1.5 and Gemini 1.5 is the model that was
released today which is essentially for
larger more tedious tasks that require a
longer context length so how much better is
Gemini 1.5 so you can see that in text
in vision and in audio Gemini 1.5 is
better across the board however compared
to the ultra benchmarks you can see that
only on vision and audio on the right
hand side there are some areas where
Ultra is slightly better so overall this
model is substantially better than
Gemini Pro 1.0 which is currently
available and in terms of Gemini Ultra
largely it's better in text and
in vision so across the board this is a
model that is most certainly more
capable now I'm going to be showing you
guys one of these examples of Gemini Pro
1.5 reasoning across a 402 page
transcript this is a demo of long
context understanding an experimental
feature in our newest model Gemini 1.5
Pro we'll walk through screen recording
of example prompts using a 402 page PDF
of the Apollo 11 transcript which comes
out to almost 330,000
tokens we started by uploading the
Apollo PDF into Google AI
studio and asked find three comedic
moments list quotes from this transcript
and
Emoji this screen capture is sped up
this timer shows exactly how long it
took to process each prompt and keep in
mind that processing times will vary
the model responded with three quotes
like this one from Michael Collins I'll
bet you a cup of coffee on
it if we go back to the transcript we
can see the model found this exact quote
and extracted the comedic moment
accurately then we tested a multimodal
prompt we gave it this drawing of a
scene we were thinking of and asked what
moment is
this the model correctly identified it
as Neil's first steps on the moon notice
how we didn't explain what was happening
in the drawing simple drawings like this
are a good way to test if the model can
find something based on just a few
abstract details and for the last prompt
we ask the model to cite the time code
of this moment in the transcript like
all generative models responses like
this won't always be perfect they can
sometimes be a digit or two off but
let's look at the model's response
here and when we find this moment in the
transcript we can see that this time
code is
correct these are just a few examples of
what's possible with a context window of
up to 1 million multimodal tokens in
Gemini 1.5 Pro that demo right there was
rather impressive and there are a lot
more examples in the paper but let's
take a look at another example of
Gemini's massive capabilities on doing
this with coding
tasks this is a demo of long context
understanding an experimental feature in
our newest model Gemini 1.5 Pro we'll
walk through some example prompts using
the three.js example code which comes out
to over 800,000 tokens we extracted the code
for all of the three.js examples and put it
together into this text file which we
brought into Google AI Studio over here
we asked the model to find three
examples for learning about character
animation the model looked across
hundreds of examples and picked out
these three one about blending skeletal
animations one about poses and one about
morph targets for facial animations all
good choices based on our prompt in this
test the model took around 60 seconds to
respond to each of these prompts but
keep in mind that latency times might be
higher or lower as this is an
experimental feature we're
optimizing next we asked what controls
the animations on the littlest Tokyo
demo as you can see here the model was
able to find that demo and it explained
that the animations are embedded within
the gltf
model next we wanted to see if we could
customize this code for us so we asked
show me some code to add a slider to
control the speed of the animation use
that kind of GUI the other demos
have this is what it looked like before
on the original three.js site and here's the
modified version it's the same scene but
it added this little slider to speed up
slow down or even stop the animation on
the fly it used this GUI library the
other demos have set a parameter called
animation speed and wired it up to the
mixer in the scene like all generative
models responses aren't always perfect
there's actually not an init function
in this demo like there is in most of
the others however the code it gave us
did exactly what we wanted next we tried
a multimodal input by giving it a
screenshot of one of the demos we didn't
tell it anything about this screenshot
and just asked where we could find the
code for this demo seen over here as you
can see the model was able to look
through the hundreds of demos and find
the one that matched the image next we
asked the model to make a change to the
scene asking how can I modify the code
to make the terrain flatter the model
was able to zero in on one particular
function called generate height and
showed us the exact line to tweak below
the code it clearly explained how the
change works over here in the updated
version you can see that the terrain is
indeed flatter just like we
asked we tried one more code
modification task using this 3D text
demo over here we asked I'm looking at
the text geometry demo and I want to
make a few tweaks how can I change the
text to say goldfish and make the mesh
materials look really shiny and
metallic you can see the model
identified the correct demo and showed
the precise lines in it that need to be
tweaked further down it explained these
material properties metalness and
roughness and how to change them to get
a shiny
effect you can see that it definitely
pulled off the task and the text looks a
lot shinier now these are just a couple
examples of what's possible with a
context window of up to 1 million
multimodal tokens in Gemini 1.5 you just
saw Google's Gemini 1.5 Pro problem
solving across 100,000 lines of code
my oh my this is something that is truly
impressive there is no other AI system
out there that can do this with the
accuracy level of Google's Gemini but
now let's take a look at some of the
multimodal prompting which is going to
be used by a lot of standard uses this
is a demo of long context understanding
an experimental feature in our newest
model Gemini 1.5 Pro we'll walk through
a screen recording of example prompts
using a 44-minute Buster Keaton film
which comes out to over 600,000
tokens in Google AI Studio we uploaded
the video and asked find the moment when
a piece of paper is removed from the
person's pocket and tell me some key
information on it with the time
code this screen capture is sped up and
this timer shows exactly how long it
took to process each prompt and keep in
mind that processing times will vary the
model gave us this response explaining
that the piece of paper is a pawn ticket
from Goldman and Company Pawn Brokers
with the date and cost and it gave us
this time code
12:01 when we pulled up that time code we
found it was correct the model had found
the exact moment the piece of paper is
removed from the person's pocket and it
extracted the text
accurately next we gave it this drawing
of a scene we were thinking of and asked
what is the time code when this
happens this is an example of a
multimodal prompt where we combine
text and image in our
input the model returned this time code
15:34 we pulled that up and found that it
was the correct scene like all
generative models responses vary and
won't always be
perfect but notice how we didn't have to
explain what was happening in the
drawing simple drawings like this are a
good way to test if the model can find
something based on just a few abstract
details like it did
here these are just a couple examples of
what's possible with a context window of
up to 1 million multimodal tokens in
Gemini 1.5 Pro that right there goes to
show us how crazy this is I think the
only caveat to this is that it does take
a little bit of time for it to go ahead
and get the footage but looking through
a 44-minute video is absolutely
incredible and doing the reasoning
across that is not to be understated
because think about how long it would
take a human to watch through an entire
movie and find something from one frame
and whilst these demos are impressive
what's even more impressive is the paper
that they attach to this which I read
that shows a whole host of other
incredible capabilities so let's take a
look at some of these stunning examples
from the paper which is going to show
you all exactly how accurate this AI
system really is and why Google are
really leading the entire AI industry
with Gemini 1.5 Pro so there was this
example and it stated given a reference
grammar book and bilingual word list a
dictionary Gemini 1.5 is able to
translate from English to Kalamang with
similar quality to a human who learned
from the same materials this is
incredibly substantial because it means
that not only is it able to get the
entirety of this context length and a
dictionary it's able to reason and do
translation based on a new data just
like a human would there was also this
example right here that was another
stunning example from the paper and
essentially it states that with the
entire text of this really really long
novel it's able to understand exactly
what's happening just through a very
simple drawing and I'm no artist but I'm
sure you can all appreciate the fact
that this drawing here isn't a very very
artistic one and it's really really
simple so the genius here is of this
system to be able to understand the
Nuance of what's happening in the image
then extrapolate that data out and of
course reason to figure out exactly
where that is that is something that is
unheard of in our current AI systems and
that's why I stated this is truly
gamechanging stuff there was another
example in the paper and I'm pretty sure
you've already seen this one from
the video but it just goes to show how
crazy this is now in the paper some of
the stuff I was looking at was really
cool because there was this thing called
video haystack okay and I'm going to
break this down for you guys because
it's truly fascinating how accurate
this really is and how they tested it
now on
the image you can see Gemini 1.5 Pro
compared to GPT 4 with vision and
unfortunately GPT-4 with vision can
only take in 3 minutes of video in their API
whereas Gemini 1.5 can handle from 1 minute of
content all the way up to 3 hours so
essentially they set up a game the
computer which is Gemini 1.5 has to find
a secret message and the secret word is
needle but this message was sneakily
hidden in one tiny part of a very long
movie and this movie isn't just any
movie it was a three-hour-long movie made
by sticking two copies of a documentary
about the game of go together and this
makes the video really long with lots of
places that could have hidden the
message now in this demo what they did
was they put the secret message only in
one single frame of the video that's
just one picture out of the thousands and
thousands that make up the entire movie
and of course there's a picture every
single second now Gemini 1.5 Pro's job
was to watch this entire super long
movie and find that one frame with the
secret message and all they did was they
asked Gemini 1.5 what was this secret
word which is essentially like finding
a needle in a haystack and can you
guess what Gemini 1.5 was able to do it
100% of the time so that is why the
video haystack capabilities are
absolutely incredible in addition they
did the same kind of game with the
Gemini 1.5 Pro system and they did it
with 22 hours of footage and you can see
here that it was able to do it up to
100% and they compared it to Whisper and
GPT 4 Turbo with 12 minutes all the way
up to 11 hours and you can see the boxes
in red that are essentially areas where
it completely failed in addition they
also did this with a text haystack and
this is where things start to get crazy
because this was something that people
didn't really think was possible there
were certain research papers that were
stating that you know if we really
wanted to be able to get this kind of
retrieval we were going to have to use
different architectures like Mamba
but it seems like Google
managed to figure out how to do that and
you can see right here that up to 10
million tokens they were able to get the
accuracy up to around you know I think
it was 99% a ridiculous level of
accuracy and that is something that is
unheard of a 1 million context length
window is incredible and of course
compared to GPT 4 Turbo it's only a
128,000 token context length so this is a truly
game-changing thing because imagine
having 1 million tokens and then getting
an AI system to be able to reason about
the entirety of that or find certain
things and then reason on that that is
going to be a huge difference now
there were additionally some benchmarks
so we can see here the comparison
between GPT-4 Vision and Gemini 1.5 Pro
on a 1-hour video QA task and experiments are
run by sampling one video frame per
second and linearly subsampling 16 or
150 frames and you can see here that
Gemini 1.5 Pro outperforms GPT-4 with
vision substantially because not only
does it outperform on the 16 frames and the
150 frames it actually supports
the full video whereas GPT-4 with vision
currently doesn't now anyway
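The linear subsampling step mentioned here (sample 1 frame per second, then keep only 16 or 150 evenly spaced frames) can be sketched as follows; `linear_subsample` is a hypothetical helper for illustration, not the paper's evaluation code:

```python
def linear_subsample(n_frames: int, k: int) -> list[int]:
    """Return k frame indices spread evenly across n_frames, first and last included."""
    if k >= n_frames:
        return list(range(n_frames))
    if k == 1:
        return [n_frames // 2]
    step = (n_frames - 1) / (k - 1)
    return [round(i * step) for i in range(k)]

# A 1-hour video sampled at 1 frame per second yields 3600 frames;
# a 16-frame budget then keeps roughly one frame every four minutes.
indices = linear_subsample(3600, 16)
```

This is why a model that can ingest every sampled frame has an edge: a 16-frame budget over an hour of video discards almost all of the visual evidence.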
we can take a look at some of the
benchmarks to see exactly what is going
on you can see right here that the core
capabilities like math science and
reasoning and coding and instruction
following are up across the board in
this model and what's crazy is that in
terms of the actual families of model
like if we take a look at where Gemini
Pro 1.5 sits we know that Pro 1.5 sits
in the middle in terms of what the model
is going to be able to do so that leads
me to believe that potentially we could
be getting an ultra 2.0 or an ultra 1.5
but with these benchmarks we can see
that Gemini 1.5 is literally better
across the board and it has a hugely
increased context length that's going to
allow a lot more things now if you want
to take a look at some of the individual
detailed benchmarks you can see the math
ones right here you can see that 1.5 Pro
outperforms it on HellaSwag doesn't
on MMLU does on GSM8K does on
MATH doesn't on the rest of these
and does on BIG-bench so across the
board you can see that Gemini 1.5 Pro is
really taking the cake here in terms of
what is possible with an AI system and
of course in addition the detailed
benchmarks in coding we can see that you
know it's half and half in terms of
these capabilities but it is
77.7% on the Natural2Code benchmark
and one thing that I did want to find
out was of course how they train this
model and like Gemini 1.0 Ultra and
Gemini 1.0 Pro Gemini 1.5 Pro was
actually trained on multiple 4096-chip
pods of Google's TPU v4 accelerators
distributed across multiple data centers
and on a variety of multimodal and
multilingual data now with that being
said are you excited for Google's family
of models that are absolutely incredible
and are you going to be taking a look
and using this model in Google's AI and
of course with things like the video
capabilities that haven't been done by
any other AI system before are you
excited to potentially use these to
reason and figure out things in certain
videos either way I'm excited for Google
to finally beef up the competition and
make a more competitive AI space but it
will be interesting to see how other AI
companies do respond because right now
it seems that Google is well in the lead
benchmarks are here and the benchmarks
are clear and some of the AI systems
right now don't even have some of these
capabilities so with that being said if
you did enjoy this don't forget to leave
your comment below on where you think
Google is going to go
next