Google's GEMINI 1.5 Just Surprised EVERYONE! (GPT-4 Beaten Again) Finally RELEASED!

TheAIGRID
15 Feb 2024 · 18:00

Summary

TLDR: Google's newly released Gemini 1.5 model has drawn wide attention, and its processing power is astonishing. This iteration can handle up to 3 hours of video, 22 hours of audio, and as much as 7 million words or 10 million tokens of data, with accuracy of 99 to 100%. Gemini 1.5 shows marked improvements in text, vision, and audio processing, and demonstrates exceptional understanding and analysis on long-context and multimodal tasks in particular. Through examples such as analyzing the 432-page Apollo 11 mission transcript, working across a large codebase, and understanding and analyzing film scenes, Gemini 1.5 proves it can efficiently handle huge volumes of data and complex situations. Its release not only shows Google's leading position in AI but also opens new possibilities for future AI applications.

Takeaways

  • 😮 Google has launched the groundbreaking Gemini 1.5 model, which can handle up to 3 hours of video, 22 hours of audio, and 7 million words.
  • 🤩 Gemini 1.5 performs strongly across text, vision, and audio, with accuracy of 99%–100%, far surpassing most existing models.
  • 🧠 Gemini 1.5 can accurately understand and locate specific content in video or text from just a simple picture or written description.
  • 💪 Gemini 1.5's context window reaches 1 million tokens, greatly expanding the range of applications for AI systems.
  • 💻 Gemini 1.5 demonstrated the remarkable ability to browse an enormous codebase and modify and improve the code according to the user's needs.
  • 🎥 Gemini 1.5 can also process hours of video content and pinpoint key information in a single frame.
  • 🌍 Gemini 1.5 can use a dictionary and grammar book to translate accurately from English to Kalamang, much like a human would.
  • 📊 Gemini 1.5 excels across benchmarks in math, science, coding, and instruction following, leading existing models by a wide margin.
  • 🏭 Google used large amounts of TPU-accelerated compute to train Gemini 1.5 on multimodal and multilingual data.
  • 🤯 The launch of Gemini 1.5 will reshape the AI landscape and puts enormous competitive pressure on other companies.

Q & A

  • What model did Google just release?

    -Google released version 1.5 of its Gemini family of models.

  • What is the maximum context Gemini 1.5 can handle?

    -Gemini 1.5 can process up to 3 hours of video, 22 hours of audio, or 7 million words (10 million tokens) of text.

  • How accurate is Gemini 1.5?

    -When processing data at this scale, Gemini 1.5 reaches 99% to 100% accuracy.

  • How many versions are in the Gemini family, and what is each for?

    -The Gemini family has three tiers: Gemini Ultra for highly complex tasks, and Gemini Pro with two iterations, 1.0 and 1.5, where Gemini 1.5 is aimed at larger, more tedious long-context tasks.

  • In which areas does Gemini 1.5 perform better?

    -Compared with the earlier Gemini Pro 1.0, Gemini 1.5 improves markedly on text, vision, and audio, while Gemini Ultra remains slightly ahead on some vision and audio benchmarks.

  • What examples of Gemini 1.5 handling large-scale data are shown?

    -The video shows Gemini 1.5 working through the 432-page Apollo 11 transcript, the three.js example codebase, and a 44-minute Buster Keaton film.

  • In the film Q&A example, what abilities did Gemini 1.5 demonstrate?

    -It located the details of a scene at a specific point in a 44-minute film, and combined drawn input for multimodal reasoning and timestamp localization.

  • How did Google evaluate Gemini 1.5's accuracy?

    -Google used purpose-built "video haystack" and "text haystack" tasks to test accuracy on large-scale data; Gemini 1.5 scored above 99% on these tasks.

  • How does Gemini 1.5 perform overall?

    -On the benchmarks shown, Gemini 1.5 surpasses previous versions across math, science reasoning, coding, and instruction following, with substantially improved overall performance.

  • How did Google train Gemini 1.5?

    -Gemini 1.5 was trained with distributed training on Google's TPUv4 accelerators across multiple data centers, on a variety of multimodal and multilingual data.

Outlines

00:00

😲 Google releases the industry-shaking Gemini 1.5 AI model

This section introduces the astonishing capabilities of Google's newly released Gemini 1.5. It can process up to 3 hours of video, 22 hours of audio, or 7 million words, with accuracy of 99% to 100%, far beyond other models. The model could change everything and is a genuine game changer. The section also covers the tiers of the Gemini family: Gemini 1.5 sits in the mid-range Pro series and substantially surpasses the current version.

05:01

🤯 Gemini 1.5 Pro's stunning handling of million-token multimodal data

This section shares striking Gemini 1.5 Pro examples showing its unmatched ability to process data at the million-word scale. It understands the roughly 330,000-token Apollo 11 flight transcript with high accuracy and, given a prompt and a hand-drawn picture, pinpoints the time of the relevant moment. In a 44-minute Buster Keaton film, it can also find information hidden in a single frame. These capabilities are unprecedented.

10:01

🚀 Gemini 1.5 Pro demonstrates extraordinary multimodal AI capabilities

This section summarizes more remarkable Gemini 1.5 Pro examples from the paper. The model can learn from materials such as a grammar book and dictionary and translate into a new language like a human would, and it can match a simple hand-drawn image to the plot of a complex novel. It also covers the video and text needle-in-a-haystack tests, in which 1.5 Pro found hidden key information with up to 100% accuracy across millions of words and tens of thousands of frames of media where other models failed. Benchmarks further show 1.5 Pro broadly surpassing other models in core capabilities, coding, and more.

15:03

⭐ Gemini 1.5 Pro's engineering and outlook

This section notes that Gemini 1.5 Pro was trained with distributed training across multiple TPU accelerator pods, using multimodal and multilingual data. It discusses the excitement around the model's capabilities, especially its unprecedented video reasoning. People hope Google can use it to intensify competition in the AI market; despite pushback from other companies, Google currently appears firmly in the lead. The section closes by inviting viewers to share what they hope to do with Gemini 1.5 Pro and to consider where Google goes next.

Keywords

💡Gemini 1.5

Gemini 1.5 is a major iteration of Google's Gemini family of AI models. It can process up to 3 hours of video, 22 hours of audio, or 7 million words of text while maintaining 99% to 100% accuracy on such large-scale data. Described in the video as a "behemoth" of a model, Gemini 1.5 has the potential to transform the development of AI technology.

💡Massive context window

The massive context window is Gemini 1.5's most prominent advantage. The model can attend to up to 10 million tokens of context in a single pass, far beyond previous AI models. This scale gives Gemini 1.5 far greater understanding and accuracy when processing long videos, audio, or text, and the examples in the video clearly illustrate its outstanding performance on this front.

💡Multimodal

Multimodal means Gemini 1.5 can process multiple input types together, such as text, images, and video. In the video, Gemini 1.5 handles a long video at scale while correctly answering questions based on simple picture clues, which is exactly where multimodal processing shines. Multimodality brings AI understanding closer to human perception and opens up more possibilities for interaction and applications.

💡Benchmarks

Benchmarks are an important way to evaluate and compare AI models. The results shown in the video indicate that Gemini 1.5 improves markedly over previous models across the board, performing especially well on tasks such as video QA. The comparison with advanced models like GPT-4 makes its advantages even clearer, and these benchmark results reflect Google's work on strengthening the model's core capabilities and accuracy.

💡Long-context understanding

Long-context understanding means Gemini 1.5 can reason and make judgments over extremely large contexts. The video demonstrates how it finds a specific scene and its details at a particular point in a 44-minute film, and how it analyzes and answers questions about the roughly 330,000-token Apollo 11 transcript. This kind of long-context understanding goes beyond what a human could practically do and opens new possibilities for AI over massive data.

💡Enhanced coding ability

Coding ability is an important measure of an AI system. In the video, Gemini 1.5 finds specific examples and makes code modifications within a codebase hundreds of thousands of tokens long. This enhanced coding ability means Gemini 1.5 can not only understand natural language but also work at a level approaching that of professional programmers, laying the groundwork for future applications in this area.

💡Video, audio, and text processing

Gemini 1.5 can process up to 3 hours of video, 22 hours of audio, and 7 million words of text in a single pass while maintaining very high accuracy. The video repeatedly shows the model extracting valuable information from such large-scale data, demonstrating that it is not limited to text and handles audio and video effectively. This multimedia capability greatly increases the flexibility and practicality of AI systems.

💡Training process

The video reveals that Gemini 1.5, like earlier Gemini models, was trained at scale on Google's TPU accelerators across distributed data centers, drawing on a variety of multimodal and multilingual datasets. This tells us that Gemini 1.5's performance rests on major resource investment and innovative training methods.

💡AI competition

The video repeatedly notes that with Gemini 1.5, Google can take a leading position in AI. It points out that some existing AI systems simply cannot match many of Gemini 1.5's capabilities, and it anticipates that Gemini 1.5 will spur competition across the industry. Consumers want Google to keep shipping breakthrough AI products, and they also want fierce, healthy industry competition that accelerates AI progress.

💡Innovative applications

Gemini 1.5's breakthrough capabilities lay the foundation for future AI applications. Its performance on large-scale video, audio, and text will help automate content analysis, and its strong multimodal understanding could give rise to new forms of human-computer interaction. The video encourages viewers to think about how to use these new technologies to build more innovative applications across domains.

Highlights

Google unexpectedly released the Gemini 1.5 model, the latest iteration of its Gemini family.

Gemini 1.5 is a massive model that can process 3 hours of video, 22 hours of audio, or 7 million words at once.

Gemini 1.5's accuracy reaches 99–100%, which is remarkable.

Gemini 1.5 outperforms Gemini Pro 1.0 on text, vision, and audio.

Gemini Ultra is slightly stronger than Gemini 1.5 on some vision and audio benchmarks, but overall Gemini 1.5 is the more capable release.

Gemini 1.5 accurately retrieved comedic moments from the 432-page Apollo 11 mission transcript.

Given just a simple drawing, Gemini 1.5 correctly identified the time code of the corresponding moment.

Gemini 1.5 precisely found examples about character animation among thousands of lines of code.

Gemini 1.5 can modify code on request and preview the resulting changes.

Given a 44-minute Buster Keaton film, Gemini 1.5 found a piece of paper in it and stated exactly what was written on it, with the timestamp.

From a simple drawing, Gemini 1.5 inferred the corresponding scene and time code in the film.

Gemini 1.5 can translate accurately from English to Kalamang at the level of a human who learned from the same materials.

Gemini 1.5 can accurately reason about a scene from a single frame of a very long film.

Gemini 1.5 reaches 99% accuracy on long-text reasoning and retrieval.

Gemini 1.5 beats GPT-4 on many benchmarks, including math, science reasoning, and coding.

Transcripts

00:00

So Google actually did just surprise everyone by releasing Gemini 1.5, the latest iteration of their family of Gemini models, and it's a rather surprising model in that it's able to do something incredible. Gemini 1.5 is a behemoth that can take up to 3 hours of video in a single context length. It can also take 22 hours of audio at once, and up to 7 million words, or 10 million tokens, with remarkable accuracy as well, because a lot of the time when new models appear, their accuracy rates are very underwhelming, and Gemini is just outstanding: on these capabilities its accuracy rate is around 99 to 100%, which is absolutely incredible. This multimodal model is going to change everything, so let's take a look at some of the things you do need to know, because once you see a few videos you're going to be truly surprised by how good this AI really is.

01:10

So where does Gemini 1.5 fit, before we dive into some examples of how good this AI system is? On the left-hand side you can see Gemini Ultra, their most capable and largest model for highly complex tasks, and in the middle you can see Gemini Pro, with two iterations, Gemini 1.0 and Gemini 1.5. Gemini 1.5 is the model that was released today, essentially for larger, more tedious tasks that require a longer context length. So how much better is Gemini 1.5? You can see that in text, in vision, and in audio, Gemini 1.5 is better across the board; however, compared to the Ultra benchmarks, there are some areas on vision and audio where Ultra is slightly better. So overall this model is substantially better than Gemini Pro 1.0, which is currently available, and against Gemini Ultra it largely holds up on text and vision; across the board, this is a model that is most certainly more capable.
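A quick back-of-the-envelope sketch of what those figures imply. The 0.7 words-per-token ratio below is derived purely from the video's own 7-million-word / 10-million-token numbers, not from any official tokenizer statistic:

```python
# Context-budget arithmetic, using only the figures quoted above:
# 7 million words ~= 10 million tokens, i.e. about 0.7 words per token.
WORDS_PER_TOKEN = 7_000_000 / 10_000_000

def words_that_fit(context_tokens: int) -> int:
    """Approximate how many English words fit into a given token budget."""
    return int(context_tokens * WORDS_PER_TOKEN)

# The 1M-token window vs GPT-4 Turbo's quoted 128K window:
print(words_that_fit(1_000_000))  # -> 700000 words, roughly
print(words_that_fit(128_000))    # -> 89600 words, roughly
```

The same ratio is why "7 million words" and "10 million tokens" are quoted interchangeably throughout the video.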

02:14

Now I'm going to be showing you guys one of these examples of Gemini 1.5 Pro reasoning across a 432-page transcript. "This is a demo of long-context understanding, an experimental feature in our newest model, Gemini 1.5 Pro. We'll walk through a screen recording of example prompts using a 402-page PDF of the Apollo 11 transcript, which comes out to almost 330,000 tokens. We started by uploading the Apollo PDF into Google AI Studio and asked: find three comedic moments, list quotes from this transcript, and emoji. This screen capture is sped up; this timer shows exactly how long it took to process each prompt, and keep in mind that processing times will vary. The model responded with three quotes, like this one from Michael Collins: 'I'll bet you a cup of coffee on it.' If we go back to the transcript, we can see the model found this exact quote and extracted the comedic moment accurately. Then we tested a multimodal prompt: we gave it this drawing of a scene we were thinking of and asked, what moment is this? The model correctly identified it as Neil's first steps on the moon. Notice how we didn't explain what was happening in the drawing; simple drawings like this are a good way to test if the model can find something based on just a few abstract details. And for the last prompt, we asked the model to cite the time code of this moment in the transcript. Like all generative models, responses like this won't always be perfect; they can sometimes be a digit or two off. But let's look at the model's response here, and when we find this moment in the transcript, we can see that this time code is correct. These are just a few examples of what's possible with a context window of up to 1 million multimodal tokens in Gemini 1.5 Pro."

04:06

That demo right there was rather impressive, and there are a lot more examples in the paper, but let's take a look at another example of Gemini's massive capabilities on coding tasks.

04:17

"This is a demo of long-context understanding, an experimental feature in our newest model, Gemini 1.5 Pro. We'll walk through some example prompts using the three.js example code, which comes out to around 800,000 tokens. We extracted the code for all of the three.js examples and put it together into this text file, which we brought into Google AI Studio. Over here we asked the model to find three examples for learning about character animation. The model looked across hundreds of examples and picked out these three: one about blending skeletal animations, one about poses, and one about morph targets for facial animations, all good choices based on our prompt. In this test the model took around 60 seconds to respond to each of these prompts, but keep in mind that latency times might be higher or lower, as this is an experimental feature we're optimizing. Next we asked what controls the animations on the Littlest Tokyo demo. As you can see here, the model was able to find that demo, and it explained that the animations are embedded within the glTF model. Next we wanted to see if it could customize this code for us, so we asked: show me some code to add a slider to control the speed of the animation, use that kind of GUI the other demos have. This is what it looked like before on the original three.js site, and here's the modified version: it's the same scene, but it added this little slider to speed up, slow down, or even stop the animation on the fly. It used this GUI library the other demos have, set a parameter called animation speed, and wired it up to the mixer in the scene. Like all generative models, responses aren't always perfect; there's actually not an init function in this demo like there is in most of the others. However, the code it gave us did exactly what we wanted. Next we tried a multimodal input by giving it a screenshot of one of the demos. We didn't tell it anything about this screenshot and just asked where we could find the code for the demo seen over here. As you can see, the model was able to look through the hundreds of demos and find the one that matched the image. Next we asked the model to make a change to the scene, asking: how can I modify the code to make the terrain flatter? The model was able to zero in on one particular function called generateHeight and showed us the exact line to tweak; below the code, it clearly explained how the change works. Over here in the updated version, you can see that the terrain is indeed flatter, just like we asked. We tried one more code-modification task using this 3D text demo. Over here we asked: I'm looking at the text geometry demo and I want to make a few tweaks; how can I change the text to say 'goldfish' and make the mesh materials look really shiny and metallic? You can see the model identified the correct demo and showed the precise lines in it that need to be tweaked; further down, it explained these material properties, metalness and roughness, and how to change them to get a shiny effect. You can see that it definitely pulled off the task, and the text looks a lot shinier now. These are just a couple of examples of what's possible with a context window of up to 1 million multimodal tokens in Gemini 1.5."
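The terrain-flattening request boils down to rescaling a height map. The real generateHeight lives in the JavaScript three.js demo; what follows is a minimal Python sketch of the idea behind the tweak, with an illustrative toy function of my own rather than the demo's actual code:

```python
import math

def generate_height(width: int, depth: int, amplitude: float = 10.0) -> list[float]:
    """Toy height map: a bumpy sine/cosine surface. The 'make the terrain
    flatter' tweak is exactly the kind of one-line change the model pointed
    at: shrink the amplitude the heights are multiplied by."""
    return [
        math.sin(x * 0.3) * math.cos(z * 0.3) * amplitude  # <- line to tweak
        for z in range(depth)
        for x in range(width)
    ]

bumpy = generate_height(8, 8)                # original terrain
flat = generate_height(8, 8, amplitude=2.0)  # flatter: every peak scaled down 5x
print(max(flat) < max(bumpy))                # smaller peaks -> flatter terrain
```

The point of the demo is that the model located this one multiplier inside an 800,000-token codebase; the edit itself is trivial once you know where to look.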

07:22

You just saw Google's Gemini 1.5 Pro problem-solving across 100,000 lines of code. My, oh my, this is something that is truly impressive; there is no other AI system out there that can do this with the accuracy level of Google's Gemini. But now let's take a look at some of the multimodal prompting, which is going to be used by a lot of standard users.

07:46

"This is a demo of long-context understanding, an experimental feature in our newest model, Gemini 1.5 Pro. We'll walk through a screen recording of example prompts using a 44-minute Buster Keaton film, which comes out to over 600,000 tokens. In Google AI Studio we uploaded the video and asked: find the moment when a piece of paper is removed from the person's pocket, and tell me some key information on it, with the time code. This screen capture is sped up, and this timer shows exactly how long it took to process each prompt; keep in mind that processing times will vary. The model gave us this response, explaining that the piece of paper is a pawn ticket from Goldman and Company Pawn Brokers, with the date and cost, and it gave us this time code, 12:01. When we pulled up that time code, we found it was correct: the model had found the exact moment the piece of paper is removed from the person's pocket, and it extracted the text accurately. Next we gave it this drawing of a scene we were thinking of and asked, what is the time code when this happens? This is an example of a multimodal prompt, where we combine text and image in our input. The model returned this time code, 15:34; we pulled that up and found that it was the correct scene. Like all generative models, responses vary and won't always be perfect, but notice how we didn't have to explain what was happening in the drawing. Simple drawings like this are a good way to test if the model can find something based on just a few abstract details, like it did here. These are just a couple of examples of what's possible with a context window of up to 1 million multimodal tokens in Gemini 1.5 Pro."

play09:39

only caveat to this is that it does take

play09:41

a little bit of time for it to go ahead

play09:43

and get the footage but looking through

play09:45

a 44-minute video is absolutely

play09:48

incredible and doing the reasoning

play09:50

across that is not to be understated

play09:52

because think about how long it would

play09:54

take a human to watch through an entire

play09:56

movie and find something from one frame

play09:58

and whilst these demos are impressive

play10:01

what's even more impressive is the paper

play10:03

that they attach to this which I read

play10:05

that shows a whole host of other

play10:08

incredible capabilities so let's take a

play10:10

look at some of these stunning examples

play10:12

from the paper which is going to show

play10:14

you all exactly how accurate this AI

play10:17

system really is and why Google are

play10:19

really leading the entire AI industry

play10:22

with Gemini 1.5 Pro so there was this

play10:26

example and it stated given a reference

play10:28

grammar book and bilingual word list a

play10:31

dictionary Gemini 1.5 is able to

play10:33

translate from English to kamang with

play10:36

similar quality to a human who learned

play10:39

from the same materials this is

play10:41

incredibly substantial because it means

play10:44

that not only is it able to get the

play10:47

entirety of this context length and a

play10:49

dictionary it's able to reason and do

play10:52

translation based on a new data just

play10:54

like a human would there was also this

play10:57

example right here that was was another

play10:59

stunning example from the paper and

play11:02

essentially it states that with the

play11:03

entire text of this really really long

play11:06

novel it's able to understand exactly

play11:08

what's happening just through a very

play11:11

simple drawing and I'm no artist but I'm

play11:13

sure you can all appreciate the fact

play11:15

that this drawing here isn't a very very

play11:19

artistic one and it's really really

play11:21

simple so the genius here is of this

play11:24

system to be able to understand the

play11:26

Nuance of what's happening in the image

play11:28

then extrapolate that data out and of

play11:30

course reason to figure out exactly

play11:33

where that is that is something that is

play11:35

unheard of in our current AI systems and

play11:39

that's why I stated this is truly

play11:41

gamechanging stuff there was another

play11:43

example in the paper and I'm pretty sure

play11:46

you've already seen this one based in

play11:48

the video but it just goes to show how

play11:50

crazy this is now in the paper some of

play11:53

the stuff I was looking at was really

play11:54

cool because there was this thing called

play11:56

video Hast stack okay and I'm going to

play11:59

break this down for you guys because

play12:00

it's truly fascinating on how accurate

play12:03

this really is and how they tested it

play12:05

goes to show how accurate this is now on

play12:07

the image you can see Gemini 1.5 Pro

play12:10

compared to GPT 4 with vision and

play12:14

unfortunately GPT 4 with the vision can

play12:16

only take in 3 minutes in their API

play12:18

whereas Gemini 1.5 can do 1 minute of

play12:21

content all up to the way of 3 hours so

play12:24

essentially they set up a game the

play12:26

computer which is Gemini 1.5 has to find

play12:29

a secret message and the secret word is

play12:31

needle but this message was sneakily

play12:34

hidden in one tiny part of a very long

play12:37

movie and this movie isn't just any

play12:39

movie it was a thre long hour movie made

play12:43

by sticking two copies of a documentary

play12:45

about the game of go together and this

play12:48

makes the video really long with lots of

play12:50

places that could have hidden the

play12:52

message now in this demo what they did

play12:54

was they put the secret message only in

play12:57

one single frame of the video that's

play12:59

just one picture out of a thousands and

play13:02

thousands that make up the entire movie

play13:04

and of course there's a picture every

play13:06

single second now Gemini 1.5 PR's job

play13:09

was to watch this entire super long

play13:11

movie and find that one frame with the

play13:13

secret message and all they did was they

play13:15

asked Gemini 1.5 what was this secret

play13:18

word which is essentially like finding

play13:20

an needle in a hay stack and can you

play13:22

guess what Gemini 1.5 was able to do it

play13:25

100% of the time so that is why the

play13:27

video capability

play13:29

the video hstack capabilities are

play13:31

absolutely incredible in addition they

play13:33

did the same kind of game with the

play13:36

Gemini 1.5 Pro system and they did it

play13:39

with 22 hours of footage and you can see

play13:41

here that it was able to do it up to

play13:43

100% And they compared it to whisper and

play13:46

GPT 4 Turbo with 12 minutes all the way

play13:48

up to 11 hours and you can see the boxes

play13:51

in red that are essentially areas where

play13:53

it completely failed in addition they

play13:55

also did this with text Hast stack and

play13:57

this is where things start to get crazy

play13:59

because this was something that people

play14:01

didn't really think was possible there

play14:03

were certain research papers that were

play14:04

stating that you know using Mambo was

play14:06

essentially going to be possible with

play14:08

this kind of you know output that we

play14:10

really wanted if we really wanted to be

play14:12

able to get the retrieval that we wanted

play14:13

we're going to have to use different

play14:14

architectures but it seems like Google

play14:17

managed to figure out how to do that and

play14:19

you can see right here that up to 10

play14:21

million tokens they were able to get the

play14:24

accuracy up to around you know I think

play14:26

it was 99% a ridiculous level of

play14:30

accuracy and that is something that is

play14:32

unheard of a 1 million context length

play14:34

window is incredible and of course

play14:36

compared to GPT 4 Turbo it's only a

play14:39

128,000 contact l so this is a truly

play14:42

game-changing thing because imagine

play14:44

having 1 million tokens and then getting

play14:46

an AI system to be able to reason about

play14:48

the entirety of that or find certain

play14:50

things and then reason on that that is

play14:52

going to be a huge different thing now
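The needle-in-a-haystack protocol described above is straightforward to sketch: hide a known sentence at a chosen depth inside filler text, ask for it back, and score exact matches. A minimal Python harness, with a trivial substring-search stub standing in for the real long-context model call:

```python
NEEDLE = "The secret word is needle."

def build_haystack(n_filler: int, depth: float) -> str:
    """Filler text with the needle inserted at a relative depth in [0, 1]."""
    filler = [f"Sentence {i} is about the game of Go." for i in range(n_filler)]
    filler.insert(int(depth * n_filler), NEEDLE)
    return " ".join(filler)

def stub_model(context: str, question: str) -> str:
    """Stand-in for the real model call: plain substring search.
    A real evaluation would send `context` and `question` to the model API."""
    return "needle" if NEEDLE in context else "unknown"

def run_eval(sizes=(100, 1_000), depths=(0.0, 0.5, 0.99)) -> float:
    """Accuracy over a grid of haystack sizes and needle depths."""
    trials = [(n, d) for n in sizes for d in depths]
    hits = sum(
        stub_model(build_haystack(n, d), "What is the secret word?") == "needle"
        for n, d in trials
    )
    return hits / len(trials)

print(run_eval())  # the exact-search stub is perfect by construction -> 1.0
```

The interesting part of the published results is exactly the grid this harness iterates over: accuracy as a function of how long the haystack is and how deep the needle is buried.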

14:55

Now, there were additionally some benchmarks, so we can see here the comparison between GPT-4 with Vision and Gemini 1.5 Pro on a 1-hour video QA task, with experiments run by sampling one video frame per second and linearly subsampling 16 or 150 frames. You can see that Gemini 1.5 Pro outperforms GPT-4 with Vision substantially, because not only does it win at 16 frames and at 150 frames, it actually supports the full video, whereas GPT-4 with Vision currently doesn't.
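Linearly subsampling 16 or 150 frames from a one-hour video (3,600 frames at one per second) just means picking evenly spaced frame indices; a small sketch:

```python
def linear_subsample(n_frames: int, k: int) -> list[int]:
    """Pick k evenly spaced frame indices out of n_frames,
    including the first and last frame."""
    if k == 1:
        return [0]
    return [round(i * (n_frames - 1) / (k - 1)) for i in range(k)]

indices = linear_subsample(3600, 16)  # 16 frames from 1 hour sampled at 1 fps
print(len(indices), indices[0], indices[-1])  # -> 16 0 3599
```

With only 16 of 3,600 frames, roughly 99.6% of the video is discarded before the model ever sees it, which is why a model that can ingest every frame has such an advantage on this benchmark.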

15:29

Now, anyway, we can take a look at some of the benchmarks to see exactly what is going on. You can see right here that the core capabilities, like math, science and reasoning, coding, and instruction following, are up across the board in this model. And what's crazy is that, in terms of the actual families of models, we know that Gemini Pro 1.5 sits in the middle in terms of what the model is going to be able to do, so that leads me to believe that potentially we could be getting an Ultra 2.0 or an Ultra 1.5. But with these benchmarks, we can see that Gemini 1.5 is literally better across the board, and it has a hugely increased context length that's going to allow a lot more things. Now, if you want to look at some of the individual detailed benchmarks, you can see the math ones right here: 1.5 Pro outperforms on HellaSwag, doesn't on MMLU, does on GSM8K, does on MATH, doesn't on the rest of these, and does on Big-Bench. So across the board you can see that Gemini 1.5 Pro is really taking the cake in terms of what is possible with an AI system. And in the detailed coding benchmarks we can see that it's half and half in terms of these capabilities, but it does hit 77.7% on the Natural2Code benchmark.

16:49

One thing that I did want to find out was, of course, how they trained this model. Like Gemini 1.0 Ultra and Gemini 1.0 Pro, Gemini 1.5 Pro was trained on multiple 4096-chip pods of Google's TPUv4 accelerators, distributed across multiple data centers, and on a variety of multimodal and multilingual data. Now, with that being said, are you excited for Google's family of models, which are absolutely incredible, and are you going to be trying this model in Google AI Studio? And with things like the video capabilities that haven't been done by any other AI system before, are you excited to potentially use these to reason about and figure out things in certain videos? Either way, I'm excited for Google to finally beef up the competition and make a more competitive AI space, but it will be interesting to see how other AI companies respond, because right now it seems that Google is well in the lead. The benchmarks are here, the benchmarks are clear, and some of the AI systems right now don't even have some of these capabilities. So, with that being said, if you did enjoy this, don't forget to leave your comment below on where you think Google is going to go next.
