GPT-4o is WAY More Powerful than Open AI is Telling us...
Summary
TLDR: The video script delves into the groundbreaking capabilities of OpenAI's GPT-4 Omni model, which has revolutionized AI with its multimodal approach. It can process images, audio, and text natively, offering real-time responses and generating content with remarkable speed and quality. From creating detailed images and 3D models to interpreting complex data and even undeciphered languages, GPT-4 Omni showcases AI's potential to transform various fields. The script also hints at upcoming features like video understanding and the desktop app's capabilities, suggesting a future where AI is an integral, real-time companion for a multitude of tasks.
Takeaways
- 🧠 GPT-4 Omni is a groundbreaking AI model that can process multiple types of data, including text, images, audio, and even video.
- 🔍 The model's multimodal capabilities allow it to understand and generate data beyond just text, setting it apart from previous models.
- 🚀 GPT-4 Omni is extremely fast, generating text at a rate of two paragraphs per second, which is a significant leap in text generation speed.
- 🎨 It can generate high-quality images that are not only photorealistic but also include clear and legible text, which is a major advancement in AI image generation.
- 📈 The model can create visual content such as charts and graphs from data inputs quickly and accurately, streamlining tasks that traditionally took much longer.
- 🎭 GPT-4 Omni can produce audio in various emotive styles and even generate audio descriptions for images, showing its advanced audio generation capabilities.
- 👥 It has the ability to differentiate between multiple speakers in an audio input, providing transcriptions with speaker labels, which is a new level of audio understanding.
- 🤖 The model can simulate interactive experiences, such as playing a text-based version of Pokémon Red, demonstrating its ability to handle complex prompts.
- 📝 GPT-4 Omni can also create 3D models and interpret handwriting, showcasing its broad range of applications beyond traditional text and image generation.
- 💡 OpenAI has not fully disclosed all of GPT-4 Omni's capabilities, suggesting that there may be even more advanced features yet to be revealed.
- 🔑 The model's speed and versatility have significant implications for the future of AI, suggesting a rapid development era for AI technologies.
Q & A
What is the name of the model powering OpenAI's real-time AI assistant?
-The model is called GPT-4o; the 'o' stands for 'Omni', a reference to its multimodal capabilities.
What does 'multimodal' mean in the context of AI?
-In the context of AI, 'multimodal' refers to the ability of the AI to understand and generate more than one type of data, such as text, images, audio, and video, as opposed to just working with text.
How does GPT-4 Omni differ from the previous model, GPT-4 Turbo?
-GPT-4 Omni is a truly multimodal AI, capable of processing images, understanding audio natively, and interpreting video, unlike the previous GPT-4 Turbo, which required separate models for certain tasks like audio transcription.
What is the significance of GPT-4 Omni's text generation capabilities?
-GPT-4 Omni's text generation is on par with leading models in quality but significantly faster, generating text at a rate of about two paragraphs per second, which opens up new possibilities for real-time applications.
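To put the quoted speed in perspective, here is a rough back-of-the-envelope sketch in Python. The rate of two paragraphs per second comes from the video; the function names and the one-second latency budget are illustrative assumptions, not measurements or anything published by OpenAI.

```python
def seconds_to_generate(paragraphs: int, paragraphs_per_second: float = 2.0) -> float:
    """Estimate how long a response of `paragraphs` paragraphs takes to generate.

    The default rate (2 paragraphs/second) is the informal figure quoted in
    the video, not a benchmarked number.
    """
    if paragraphs < 0 or paragraphs_per_second <= 0:
        raise ValueError("paragraphs must be >= 0 and rate must be > 0")
    return paragraphs / paragraphs_per_second


def feels_real_time(paragraphs: int, latency_budget_s: float = 1.0,
                    paragraphs_per_second: float = 2.0) -> bool:
    """Return True if the whole response fits inside the latency budget.

    The 1-second default budget is an assumption chosen for illustration.
    """
    return seconds_to_generate(paragraphs, paragraphs_per_second) <= latency_budget_s


if __name__ == "__main__":
    for n in (1, 2, 10):
        print(n, "paragraphs:", seconds_to_generate(n), "s,",
              "real-time" if feels_real_time(n) else "noticeable wait")
```

Under these assumptions, a short reply of one or two paragraphs arrives within about a second, which is roughly why turn-by-turn uses such as the text-based game demo can feel interactive.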
How does GPT-4 Omni handle audio compared to the previous model?
-GPT-4 Omni can understand audio natively, including breathing patterns and the emotions behind words, unlike the previous model, which relied on a separate model called Whisper V3 for audio transcription.
What is the cost difference between GPT-4 Omni and GPT-4 Turbo in terms of running these models?
-GPT-4 Omni is reportedly half the price of GPT-4 Turbo, which itself was cheaper than the original GPT-4, indicating a rapid decrease in the cost of running these powerful models.
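The "half the price" claim can be made concrete with a small Python sketch. Only the 0.5 ratio between GPT-4 Turbo and GPT-4 Omni comes from the video; the baseline figure below is a hypothetical placeholder in arbitrary units, not a real price, and the helper name is made up for illustration.

```python
def cost_after_cuts(base_cost: float, reduction_factors: list[float]) -> float:
    """Apply successive multiplicative price cuts to a baseline cost.

    Each factor is the fraction of the previous price that remains,
    e.g. 0.5 means "half the price of the previous model".
    """
    cost = base_cost
    for factor in reduction_factors:
        if not 0 < factor <= 1:
            raise ValueError("each factor must be in (0, 1]")
        cost *= factor
    return cost


# Hypothetical baseline in arbitrary units; NOT an actual published price.
HYPOTHETICAL_TURBO_COST = 10.0

# The video states that GPT-4 Omni costs half as much as GPT-4 Turbo.
omni_cost = cost_after_cuts(HYPOTHETICAL_TURBO_COST, [0.5])

if __name__ == "__main__":
    print(f"Turbo: {HYPOTHETICAL_TURBO_COST}, Omni: {omni_cost}")
```

The video does not say how much cheaper GPT-4 Turbo was than the original GPT-4, so that step is deliberately left out of the calculation.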
What is the potential application of GPT-4 Omni's image generation capabilities?
-GPT-4 Omni's image generation capabilities can be used for creating photorealistic images, consistent character designs, and even custom fonts, which can be particularly useful in creative industries and design.
How does GPT-4 Omni perform in terms of video understanding?
-While not perfect, GPT-4 Omni shows a promising ability to interpret video content, and with the integration of Sora, a text-to-video model, OpenAI is close to having a model that can natively understand video.
What is the potential impact of GPT-4 Omni's rapid development on the AI industry?
-The rapid development of GPT-4 Omni signals a new era of fast-moving AI development, with faster and more accurate models that could lead to significant advancements in various fields and applications.
What are some of the unique features of GPT-4 Omni that were kept under wraps until the deep dive exploration?
-Features that were kept under wraps include its ability to generate audio for any input image, bringing images to life with sound, and its advanced image recognition capabilities, which are faster and more accurate than before.
Outlines
🤖 Introduction to OpenAI's Multimodal AI Model GPT-4 Omni
The video introduces the groundbreaking capabilities of OpenAI's GPT-4 Omni model, which can understand and generate multiple types of data, including text, images, audio, and video. The model is described as being 'lightning fast' in text generation, with high-quality outputs and a significant speed improvement over previous models. The video also demonstrates the model's ability to interpret and transcribe audio, including understanding breathing patterns and emotions, which marks a new era in AI-human interaction.
🚀 GPT-4 Omni's Advanced Text and Audio Generation Capabilities
The video delves into the impressive text and audio generation capabilities of GPT-4 Omni. It highlights the model's ability to generate high-quality charts and statistical analysis from spreadsheets, as well as its capacity to run a text-based version of the Pokémon Red game in real time. The model's audio generation is also explored, showcasing its ability to produce human-sounding audio in various emotive styles and the potential for future sound effects and music generation.
🎤 GPT-4 Omni's Speaker Differentiation and Lecture Summarization
The video discusses GPT-4 Omni's ability to differentiate between multiple speakers in an audio recording and transcribe them with speaker names, which is a significant advancement in AI technology. It also covers the model's lecture summarization feature, which can process lengthy audio lectures and provide comprehensive summaries. The potential applications of these features, such as supporting multi-speaker conversations and enhancing accessibility for deaf users, are also mentioned.
🖼️ GPT-4 Omni's Image Generation and Manipulation Skills
The video showcases GPT-4 Omni's remarkable image generation capabilities, which include creating photorealistic images with clear and legible text, consistent character designs, and the ability to generate entire fonts and 3D models. It also highlights the model's ability to manipulate images based on text prompts, such as converting a poem into a handwritten-style image, and its potential for creating mockups and advertisements with high-resolution outputs.
📚 GPT-4 Omni's Text-to-3D Modeling and Advanced Image Recognition
The video explores GPT-4 Omni's ability to generate 3D models from text descriptions and its advanced image recognition capabilities. It demonstrates the model's use in creating STL files for 3D printing and its potential in deciphering undeciphered languages and transcribing historical handwriting. The video also touches on the model's video understanding abilities, suggesting that it can interpret and understand video content to a certain degree.
🔮 Future Prospects and Limitations of GPT-4 Omni
The video concludes with a discussion of the future prospects of GPT-4 Omni, including its potential as a real-time coding buddy and gameplay helper, and its ability to understand and interpret a wide range of tasks. It acknowledges the limitations of AI but emphasizes the significant advancements made by OpenAI and the rapid pace of development. The video encourages viewers to consider the implications of these developments and to join the AI community for further exploration and discussion.
Keywords
💡OpenAI
💡GPT-4 Omni
💡Multimodal AI
💡Real-time Companion
💡Image Generation
💡Audio Generation
💡Text Generation
💡API
💡Video Understanding
💡3D Generation
Highlights
OpenAI's real-time AI assistant is powered by GPT-4 Omni, described as the first truly multimodal AI, capable of understanding and generating more than one type of data.
GPT-4 Omni can process images, understand audio natively, and interpret video, unlike its predecessor, which required separate models for some of these tasks.
The new model is capable of understanding breathing patterns and the emotions behind words, reacting differently to various emotional states.
GPT-4 Omni's text generation is not only as good as leading models but is also lightning fast, generating about two paragraphs per second.
GPT-4 Omni can generate a fully functional Facebook Messenger clone as a single HTML file in just 6 seconds.
It can create detailed charts and statistical analysis from spreadsheets with a single prompt in under 30 seconds.
GPT-4 Omni can simulate text-based games like Pokémon Red in real time, with custom prompts.
The model is half the price of its predecessor, GPT-4 Turbo, indicating a decrease in the cost of running powerful AI models.
GPT-4 Omni's audio generation capabilities produce high-quality, human-sounding audio in various emotive styles.
The model can generate audio for any input image, bringing images to life with sound.
GPT-4 Omni can differentiate between multiple speakers in an audio clip and transcribe with speaker names.
The model can generate high-resolution images that are more photorealistic than those of previous models, including detailed text.
GPT-4 Omni can create consistent character designs and place the same character in various scenarios with the same art style.
The model can generate entire fonts and create 3D models from text descriptions.
GPT-4 Omni can create mockups for branding and advertising by combining logos and photos.
The model's image recognition capabilities are faster and more accurate than those of previous models, and extend to working with undeciphered languages.
GPT-4 Omni shows promise in video understanding, being able to interpret and respond to video content in real time.
OpenAI's development of GPT-4 Omni suggests a potentially new methodology for creating advanced AI technologies.
Transcripts
I got to say guys truthfully OpenAI
blew my mind on Monday I don't know
about you but their real-time companion
their Her clone shocked me to say the
least I want to introduce you to
somebody hello there cutie what's your
name little fluff ball this is Bowser
well hello Bowser aren't you just the
most adorable little thing I did do a
full video like recapping the event but
as it turns out there is a lot more to
uncover here than first meets the eye
for example did you know that this model
can somehow generate images and gosh
they're the best AI generated images
I've ever seen point blank period what's
going on there's also quite a few other
capabilities that OpenAI just kind of
kept under wraps so let's start out here
with what we do know first of all
obviously we know that the model that's
powering everything under the hood this
insane real-time AI assistant is called
GPT-4o and the O stands for Omni and the
reason they called it Omni is
because it's the first truly multimodal
AI in simple terms actually brought to
you by GPT-4 itself multimodal just
means that the AI can understand and
generate more than one type of data
instead of just working with text for
example GPT-4o can process images it can
understand audio natively and it can
even sort of interpret video the old GPT-4
Turbo was split into two or three
separate models I'm not precisely
sure it might have been taking images in
natively or it might have been using a
separate model to parse those images
into text don't really know either way
we absolutely know for a fact that it
did not natively support audio yes the
old GPT-4 app did have the ability for you
to talk to it with your voice but that
was using a separate model that was
called Whisper V3 that would just take
your audio and transcribe it into text
don't get me wrong it was great at
taking your voice and transcribing it
into text but that is all it did it
can't hear the sound of birds for
example it can't hear your dog barking
it can't hear your tone of voice this
new model for example can understand
your breathing patterns and even more
which we'll get into later just take a
deep breath I like that suggestion let
me try a couple deep breaths can you
give me feedback on my breaths okay here
I
go whoa
slow a bit there Mark you're not a
vacuum cleaner breathe in
for a count of four okay uh let me try
again so I'm going to breathe in
deeply and then breathe
out for four and then exhale
slowly okay I'll try again breathing
in and breathe
out that's it how do you feel I feel a
lot better and of course it can also
understand the emotions that you put
behind your words which is possibly the
most important part about this it will
react differently when you're sad it
will react differently when you're
excited it will react differently when
you're yelling and screaming at it very
human indeed like this is uncharted
territory the first mind blow of
capabilities that I want to show you is
going to be the text generation models
have been doing this for years so you
might think so what it generates text
even the benchmarks were just as good as
the other leading models it's not like
it's leaps and bounds better even the
context length is the same size it's not
a bad context length of 128,000 tokens
but it's no better so what's the big
deal well here's the rub on text
generation with GPT-4 Omni this model is
lightning fast and when I say lightning
fast I mean this thing generates like
two paragraphs a second and the outputs
yes are just as good as leading models
multiple times faster and this opens up
entirely brand new branches of what is
actually possible with text generation
so let's dive into a few of them so a
bunch of these examples are going to
come from this Twitter thread by Min
Choy that's going to be linked down
below I always link Twitter threads down
below if you want to check them out
highly recommend following this guy by
the way phenomenal AI account and also
follow me on Twitter as well cuz I am
always reposting great stuff so first up
this is Sawyer Hood's ultimate LLM test
ask it to make a Facebook Messenger as a
single HTML file GPT-4o does this all in
6 seconds flat again not only fast text
generation but high quality it actually
works you open up Facebook Messenger as
a single HTML I mean that's just
absolutely insane
right GPT-4 Omni can also generate
full-blown charts and statistical analysis
from spreadsheets with a single prompt
in less than 30 seconds Zay here points
out that this stuff used to take
absolute ages in Excel but it can now
all be done automatically by your AI and
yes the old GPT-4 Turbo could absolutely
do this but it couldn't do it this
quickly and also it wasn't able to do it
this accurately either yeah you start
getting charts in about 6 seconds from
an actual shoe company sales CSV file
and these charts aren't bad either
they're actually what I would consider
to be usable in a real company meeting
and they're diverse even giving you a
summary with key insights it's like an
entire breakdown in 20 seconds fast
high-quality generation this is leaps and
bounds ahead oh and folks you thought we
were done there well it gets even
crazier this is from tailin on Twitter
Pokemon Red gameplay so essentially this
is like a custom prompt to make GPT-4 Omni
play Pokemon Red as a text-based game
watch this as you can see it essentially
boots up Pokemon Red there look at this
new game continue or options it's a text
based game it even does its best to try
to include pictures by using emojis but
it can do it so fast that you can
essentially play the game in real time
oh we select A and then it says oh you
know for some people Pokemon are pets others
use them for fights it's literally the
Pokemon Red game and you just keep
entering your a choice and then you can
actually put your name in we're
literally just going to use a custom
name in this example and it's like okay
yep following along here the whole
Pokemon Red game is converted into a
text-based adventure game like that
inside of the LLM and it's running in
real time like what the what is going on
here it even has Route One all laid out
correctly with the houses Oak's lab the
beach this is indeed a very very
impressive example you can see it even
has the fight or use item and you can
have the HP you can essentially play an
entire Pokemon Red game just converted
to text-based inside of an AI with just
a little bit of prompting which is
absolutely mindblowing I mean this is
more or less what's possible with the
API I'm sure you could get ChatGPT to
do this with a special prompt or
with a custom GPT but obviously this
here was done by using the API instead
and I think that's what you guys have to
realize here is that this is more than
just ChatGPT people are going to be
able to build some insane things imagine
a new from the ground up game that lets
you take a photo of your dog and then
use your dog as the Pokemon and the AI
comes up with all of its abilities on
the Fly I mean the possibilities are
endless and by the way guys this is
merely just the beginning how good would
these models be in a year imagine when
the text generation isn't just way
faster and just as good but way better
and also way faster the era of rapid AI
development is upon us oh and by the way
speaking of the API the new GPT-4 Omni is
not only fast and just as good but it's
actually uh half the price of GPT-4 Turbo
which was even cheaper than the original
GPT-4 so we're seeing a rapid decrease
in how much it costs to actually run
these powerful models and folks that's
just text let's get into the audio
generation capabilities that GPT-4 Omni
holds now we're dipping our toes into
the multimodal landscape again uncharted
territory for sure as we saw in the demo
it produces remarkably high quality
human-sounding audio the model is able
to generate voice in a variety of
different emotive styles hey ChatGPT
how are you doing I'm doing fantastic
thanks for asking how about you and uh I
want you to tell him a bedtime story
about robots and love once upon a time
in a world not too different from ours
but I want a little bit more emotion in
your voice a little bit more drama once
upon a time in a world not too different
from ours there was a robot named nobt I
really want maximal emotion like maximal
expression this much more than you were
doing before once upon a time in a world
not too different from ours there was a
robot was do this in a robotic voice now
initiating dramatic robotic voice it's a
way more natural way not only to
interact with a ChatGPT-style model but
there's even more that uh OpenAI kind
of kept under wraps as smokea away
points out GPT-4o will be able to
generate audio for any image you input
bringing your images to life hear the
sounds of a scenic landscape hear the
noises of a bustling cyberpunk City the
possibilities are endless and I'd like
to make a note that yes it does seem a
little bit hopeful that you'll just be
able to speak to it and be like hey can
you generate this audio for me the model
will probably try its best but it seems
like right now it's more fine-tuned for
voice that doesn't mean it can't be
fine-tuned for sound effects
capabilities in the future it's native
audio generation it's not just some
robotic text to speech it might even be
able to generate music in the future as
well but not only this if we dive even a
little bit deeper we'll note that here
for example on the OpenAI GPT-4o
announcement site under explorations of
capabilities they have meeting notes
with multiple speakers so we have a one
minute
meeting okay good morning here's our
first team meeting morning morning I'll
be your project manager for today this
project my name is Mark will be giving
this presentation you to kick the
project off
uh during this project the marketing
expert designer I'm going to look at the
technical design and that's some bad
audio to be honest I can barely
differentiate the voices it's it's not
very clear we basically just ask it how
many speakers in this audio and what
happened the output is actually able to
determine it GPT-4o says there are four
speakers in the audio it sounds like a
project meeting where the project
manager Mark is introducing himself and
asking the team members to introduce
themselves and so on and so forth we
further then go and say can you
transcribe it with speaker names and yes
it's able to differentiate all those
speakers so not only will it be able to
understand your voice in a very natural
way and understand your tone of voice
but it'll actually be able to understand
what you sound like and differentiate
you between other people which is really
big that means you can have those
multiple speaker conversations like we
saw in the demo and I think a lot of
people when they saw that didn't really
realize what was going on there but it
is indeed differentiating this person
versus the next person and the
differences probably between how they
speak there are a lot of nuances there
to uncover and you don't
really realize it all at first we've
also got another sample which is a
lecture summarization which is something
that AI's been doing for a long time but
this is quite a long lecture 45
minutes of audio and I got to say it
does a pretty darn good job giving the
entire breakdown for this presentation I
really would have loved it if in this
demonstration they showed an example of
Whisper trying to do this same thing
wrapping it all in one model allows it
to reason about the audio where Whisper
just can't and that allows you to have
this ability to recreate the
presentation displayed right out in
front of you and furthermore I want to
think about when we actually start to
get access to this thing I'm going to
try to do things like have it listen to
a dog barking and say can you try to
recreate that for me because we can all
try to bark like a dog right will it
sound like a human trying to bark like a
dog will it actually bark like a dog
will it be able to hear when my dog is
barking in the background will
it be able to hear when a car goes by
can it hear fire alarms and wake someone
who's deaf up and be like hey you got to
get moving these are the questions we
have and I can't wait to get deeper
access to this thing but it really truly
is so so much more than meets the eye so
so much more than what they actually
showed off in that original demo video
and a lot of people unfortunately missed
that I wish they went into just a little
bit more detail in their presentation so
as I mentioned in the beginning of the
video this thing can also mysteriously
generate images now the folks at OpenAI
absolutely do not call this DALL-E 4 this
is not an iteration of the DALL-E model
this is GPT-4o they keep insisting that
it's the Omni model and this is just
weird to me because the image generation
that GPT-4 Omni is producing is actually
insanely good the only conclusion that I
can draw is that because this is a natively
multimodal model it has the connections
of the text it has the connections of
the audio it understands the world in a
much better way than just a DALL-E 3
image generation model would so the
image generation capabilities are just
way smarter I mean mind-blowingly
smarter out of everything in today's
video I think this might blow the most
Minds we're going to go ahead and start
off with this tweet right here this is
from Greg Brockman okay he is the
president and co-founder of OpenAI so
much to explore with GPT-4o's image
generation capabilities alone team is
working hard to bring those to the world
so this means no image generation from
GPT-4o yet but maybe later this year if
we're lucky take a nice look at this
image folks it's doing some mighty
impressive things not only does it look
very photorealistic but if we zoom in
here we can see a lot of really nice
well-written text that looks like
someone actually is writing on a
chalkboard transfer between modalities
suppose we directly model P(text, pixels,
sound) with one big autoregressive
transformer which is a hint at what
they did to make GPT-4 Omni what are the
pros and cons you can see this looks
like a guy who is writing it right on
the Whiteboard and he's got an open AI
shirt on there's a graph here with
compute going up and it just looks like
a photo zoomed in and taken on an iPhone
for the most part the only weird thing
we see up here is the multiple
whiteboards kind of duplicating at the
top and also one thing to note is that
this is a pretty high resolution image
this is higher resolution than what we
get from DALL-E 3 for example as a direct
output it's a really mind-blowing first
look and at first glance you're like no
there is no way that GPT-4 Omni is just
generating images like this but
apparently it's true and there's a ton
of examples again guys if we head over
to that exploration of capabilities we
can actually go up and see that most of
these examples are for image generation
take a look at this first one input a
first-person view of a robot typewriting
the following journal entries yo so like
can I see now caught the sunrise and it
was insane colors everywhere kind of
makes you wonder like what even is
reality the text is large legible and
clear the robot's hands type on the
typewriter and what do you know
that's exactly what we get I mean this
is a whole paragraph guys that we're
seeing written out right on this
typewriter yo so like can I see now it's
literally essentially perfect paragraph
the typewriter looks great and yeah the
robot hands it's a first-person view I
mean that's a very hard prompt try this
in any image generator and you won't get
anything close to this quality folks
this right here is Ideogram AI which is
widely considered to be the best model
at generating text that we have access
to today even better than DALL-E 3
and it honestly doesn't even come close
this example right here might be the
closest one but still no perfect text
now we prompt this thing we say oh the
robot wrote the second entry now the
page has moved up there are two entries
on the sheet so we keep that first one
we keep that first paragraph all
coherent and then we do a second one as
well sound update just dropped it's wild
everything's got a Vibe now every sounds
like a new secret so it screwed up a
little bit there makes you think what
else I mean it's near perfect this is a
lot of freaking text and also you'll
notice that the typewriter here while we
don't see the robot's hands
unfortunately it is the same exact
typewriter just a little bit zoomed in
and it's like I don't even know how it's
accomplishing this at this moment I
guess it's just because it's multimodal
is that really the answer now we say the
robot was unhappy with the writing so
he's going to rip the sheet of paper and
there you go he absolutely rips it right
in half and this honestly might be the
most impressive of all oh and don't
worry folks it gets even crazier we do a
cartoon mail delivery person and it
generates this I mean this doesn't look
like a great generation DALL-E 3 could do
better right but here's the crazy part
we re-upload that image as an attachment
we say this is Sally and she's a mail
delivery person oh can you make Sally
about to deliver a letter and it does a
consistent character a consistent
version of this character delivering a
letter at the door it generates that in
the same exact art style oh now she's
being chased by a golden retriever oh
now she tripped and I mean look at the
consistency here it's the same art style
looks like someone made the cartoon
themselves oh and now she befriended
the dog Etc here she is in the mail
truck I mean it's absolutely nuts this
is just the possibilities of multimodal
GPT-4 Omni AI and I can't believe they
didn't show this off in the demo I can't
believe this was kept under wraps we've
also got some character design for
Geary the robot and this is very similar
to that last example we generate this
initial image and then we resubmit it in
and we say oh he likes to play Frisbee
he likes to work on the computer he's
riding a bike etc etc and it's all these
similar outputs and the character is
extremely consistent over time I guess
this is the solution to consistent
characters just to have one multimodal
AI that can do it all folks is that it
freaking mind-blowing we can also upload
a poem and then literally convert it
into something that looks like a
handwritten poem Oh now we can make the
poem in dark mode as well folks and this
is the exact same poem but reversed I
mean it's literally pretty much exactly
the same it looks more like a human
recopying stuff than anything else which
is just super creepy oh remove the
outlines from the notebook paper now I
mean imagine we submit our own photos
what can it do with that and to think
this was all hidden I mean it has way
way more examples of this stuff too again
doing the dark mode this time with color
instead here's a commemorative coin
design for GPT-4o and you can see that
they were working on this um yes like 5
months ago back in 2023 and that's a
nice little commemorative uh coin design
there we even submit the GPT-4 logo and
say like we want to base it off of this
not only that it's able to produce the
image in an insanely high resolution as
well giving us some hints at more
multimodal different art capabilities
speaker abilities Vision capabilities
hearing capabilities you know this kind
of looks like it means multimodal so
this is like an updated coin for the
2024 release we can also you know upload
this photo of a young man with a beard
and say can you make it a caricature for
a t-shirt absolutely does that no
questions asked again multimodal
capability kind of leapfrogs all these
previous developments we made with
traditional image generation and again
we can do this yet again and it does a
really freaking good job it looks like a
human made it in in this very creepy
sense over and over again the
capabilities like I said are just
absolutely endless I mean when does it
stop OpenAI and why was all of this
stuff hidden when clearly it's some
of the most impressive capabilities you
have uh to date or we've ever seen
with AI to date it's really weird to me
that all this stuff was just hidden oh
yeah and things get even crazier we can
actually create entire fonts with this
thing as well and they come out pretty
much perfectly so yeah if you're a font
artist I feel bad because this thing is
actually ridiculously good at creating
brand new fonts for you to use on the
Fly I mean the future is truly
generative we've also got the ability to
upload both a logo and a photo you took
of something and say oh can you do a
mockup of a brand advertisement I mean
this just takes it to yet another level
this is something that we have been able
to do uh with current modern solutions
but not all with just one model at once
and how fast does it generate this kind
of thing and when will we get access to
it I mean what is this OpenAI you're
telling me that you just have these
capabilities in this one giant
multimodal AI like we worked really hard
to get this with traditional
capabilities and still I don't think
it's this good I mean that's one hell of
a mockup it looks like someone saw both
of these images and then tried to
imagine what it would look like in their
mind's eye yet we can see the AI's
mind's eye again here's more poetic
typography multi-line rendering this is
similar to the typewriter example where
we have two chat bubbles in the robot
texting someone on the screen and
again even the keyboard is accurate here
we've got the Emojis down there this is
just absolutely nuts to me it's
absolutely nuts this is so far beyond
anything we've seen before and OpenAI
hid it inside of the website oh yeah it
gets even crazier by the way here's an
image depicting three cubes stacked on
the table and obviously we say it's GPT
with the correct colors and it does this
pretty much perfectly every single time
this is what they're showing you here
that way they can get it right every
single time this is something that you
know Stable Diffusion 3 or Ideogram AI
was showing off as like oh we can do
this every so often it gets it right
every single time so it's way smarter
and it has to be because it's multimodal
right why didn't they explain this why
wasn't this in the presentation
yeah we can also upload the OpenAI logo
and say can we do a concrete poem in the
outer shape of the OpenAI logo composed
of the word Omni and then it absolutely
does that it creates the OpenAI logo
with the word Omni but what the what is
this this is so so far beyond any image
generation capabilities we've ever seen
before and it's hidden in the website
I'm sorry if I'm getting repetitive here
but this is when my mind gets blown oh
and you thought we were done there right
nope this thing also can generate 3D
since when we only get one example of
this but it's very interesting it looks
like it has generated an image and then
converted it to 3D somehow maybe using
code I don't know exactly how this
worked but you can see yeah it can do
actual 3D image generation and it
reconstructed it from six generated
images oh and it can do this again but
with a seal instead I mean it just shows
you how far ahead OpenAI really is like I'm
sorry but you can't tell me that Google
is this far ahead you can't tell me
anyone else is this far ahead
they're doing this all with one model
again one model oh and I figured I would
also include this with the 3D
generation segment here Mina used GPT-4o
to create an STL file for 3D model
generation in about 20 seconds and you
can see it actually creates a 3D model
of a table and this still technically is
text generation but it shows you that
you can use text to actually create 3D
objects shows you the power of these
models the absolute power it's shocking
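
To make the "text becomes a 3D object" point concrete, here is a minimal sketch of what an ASCII STL file for a simple table can look like when built from plain text. This is not the output shown in the video and not how GPT-4o produced it; the function names (`box_triangles`, `table_stl`) and every dimension are hypothetical, chosen only for illustration.

```python
# Hypothetical sketch: build an ASCII STL string for a simple table.
# All names and dimensions are illustrative assumptions, not from the video.

def box_triangles(lo, hi):
    """Return 12 (normal, a, b, c) triangles for an axis-aligned box."""
    x0, y0, z0 = lo
    x1, y1, z1 = hi
    # Each face: outward normal plus four corners listed counter-clockwise
    # when viewed from outside the box.
    faces = [
        ((0, 0, -1), [(x0, y0, z0), (x0, y1, z0), (x1, y1, z0), (x1, y0, z0)]),
        ((0, 0, 1), [(x0, y0, z1), (x1, y0, z1), (x1, y1, z1), (x0, y1, z1)]),
        ((0, -1, 0), [(x0, y0, z0), (x1, y0, z0), (x1, y0, z1), (x0, y0, z1)]),
        ((0, 1, 0), [(x0, y1, z0), (x0, y1, z1), (x1, y1, z1), (x1, y1, z0)]),
        ((-1, 0, 0), [(x0, y0, z0), (x0, y0, z1), (x0, y1, z1), (x0, y1, z0)]),
        ((1, 0, 0), [(x1, y0, z0), (x1, y1, z0), (x1, y1, z1), (x1, y0, z1)]),
    ]
    triangles = []
    for normal, (a, b, c, d) in faces:
        # Split each rectangular face into two triangles.
        triangles.append((normal, a, b, c))
        triangles.append((normal, a, c, d))
    return triangles


def table_stl(width=100.0, depth=60.0, height=75.0, top=5.0, leg=5.0):
    """Return ASCII STL text for a tabletop slab resting on four square legs."""
    leg_top = height - top
    boxes = [((0.0, 0.0, leg_top), (width, depth, height))]  # tabletop
    for x in (0.0, width - leg):
        for y in (0.0, depth - leg):
            boxes.append(((x, y, 0.0), (x + leg, y + leg, leg_top)))  # one leg

    lines = ["solid table"]
    for lo, hi in boxes:
        for normal, a, b, c in box_triangles(lo, hi):
            lines.append("  facet normal {} {} {}".format(*normal))
            lines.append("    outer loop")
            for vertex in (a, b, c):
                lines.append("      vertex {} {} {}".format(*vertex))
            lines.append("    endloop")
            lines.append("  endfacet")
    lines.append("endsolid table")
    return "\n".join(lines)


if __name__ == "__main__":
    # Write the text to disk; most STL viewers and slicers can open the result.
    with open("table.stl", "w") as handle:
        handle.write(table_stl())
```

The whole model is just lines of text listing triangles, which is why a language model that only emits text can still describe a printable object.
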
and I know this deep dive is getting a
little bit long but we still got to talk
about image recognition yes this is
image recognition that we've had for a
while but it is actually a little bit
better than the previous image
recognition we saw and also it is way
way faster image recognition as well
which well what is video well it's a
bunch of images consecutively so it kind
of also has video understanding to a
degree and we'll talk about that next
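
The "video is a bunch of images" idea can be sketched as a frame-sampling step: pick a few evenly spaced frames per second and hand only those to an image model. This is a hedged illustration only; the function name `sample_frame_indices` and its parameters are hypothetical, and nothing here reflects how OpenAI's apps actually sample video.

```python
# Hypothetical sketch: choose which frames of a video to treat as still images.
# The function name and parameters are assumptions made for illustration.

def sample_frame_indices(total_frames: int, fps: float,
                         samples_per_second: float) -> list[int]:
    """Return evenly spaced frame indices, about `samples_per_second` per second."""
    if total_frames <= 0 or fps <= 0 or samples_per_second <= 0:
        return []
    # Never step by less than one frame, even if asked to oversample.
    step = max(1.0, fps / samples_per_second)
    indices = []
    position = 0.0
    while int(position) < total_frames:
        indices.append(int(position))
        position += step
    return indices


if __name__ == "__main__":
    # A 10-second clip at 30 fps, sampled twice per second.
    print(sample_frame_indices(total_frames=300, fps=30.0, samples_per_second=2.0))
```

Anything that happens between two sampled frames is invisible to the model, which is one reason frame-by-frame understanding can miss small details.
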
this is a nice little example by
etherica asking GPT-4o to solve
undeciphered languages essentially these
are manuscripts from like you know
Mesopotamia or something the Minoans
Easter Island glyphs a disc found in Crete
and GPT-4o is able to use its advanced
image recognition capabilities to kind
of decipher these in some capacity or to
the best of its abilities uses logic and
reasoning to try to understand them it
feels like oh I have this Super Genius
companion that I can use for any odd
task I have in my life and here we can
see tldraw in a notebook connected to
the new GPT-4o vision API and the video
is at its original speed here showing
you how fast it's able to interpret
everything that it sees in about 5
seconds GPT-4o is able to use code to
essentially recreate all of these images
we draw a squiggle and it creates a
graph with a squiggle we draw a spiral
and it does essentially the same thing
creates a little spiral for us with code
and of course it's also able to create
hello world for us and yeah it does all
of that in less than a minute check out
this 18th century handwriting I mean I
couldn't read that if I tried but guess
what give it to the GPT-4o model and it
can transcribe it with some very minor
errors so an almost perfect
transcription and how fast does it do
this well let's say about 5 seconds
absolutely insane AI Breakfast points
out that the GPT-4o desktop app is having
this slow rollout on Mac and apparently
it is coming to Windows later this year
and it can read your screen in real
time which puts us one step closer to
autonomous agents as AI Breakfast points
out so it can essentially be your little
real-time coding buddy real-time anything
buddy real-time gameplay helper while
you navigate an RPG let's say so many
possibilities here you can see in a
homework style example where it answers
three math questions directly from the
browser and it got the answer right with
great reasoning Jacob also points out
that you can feed it a photo of missile
wreckage and it's able to point out the
missile's origin tell me your quick
guess on where this missile was produced
three bullet points to prove it be very
short and then it knows its North Korean
origin the construction materials design
features and the context I mean could I
show you this photo and say Hey where's
this missile from I wouldn't even be
able to tell you this is a missile in
the first place this thing's a
professional finally guys let's talk
about video understanding when it comes
to video GPT-4o really is kind of treading
into uncharted waters it's not perfect
by any means but the model shows a
somewhat promising ability to interpret
something that resembles video and we
saw that with the demos like you can
hold your phone out you can show it
things and be like hey what is this and
it can pretty reasonably help you out I
mean it went as far as tutoring a child
in real time quite successfully so I'm
pretty sure this is the angle Alpha
right here right correct now looking at
the triangle which side do you think is
the hypotenuse I think it might be this
one but I really am not sure this side
AC you're close
actually side AC is called the adjacent
side to the angle Alpha however what we
don't have the ability to do yet is to
actually directly upload an MP4 file and
have it natively understand that MP4
file even though image files and video
files are quite similar this model is
not natively multimodal for that you can
show it a video if you want and it will
essentially take a bunch of pictures
fairly quickly and try to understand
what's going on it's going to have
difficulty picking up on those small
little nuances right however what is
really interesting is that OpenAI we
know is working on Sora which is a very
good text-to-video model and Sora well
has the ability to understand videos
clearly so flip Sora on its head and what
do we have we have a model that can
intake videos and then convert them and
understand them as text so OpenAI is
just one step away from having a model
that can natively understand video now
despite all of this we have to
understand these AIs still have
limitations of course but what's
important to note here is that GPT-4o is
this large multimodal AI that is
incredibly fast and you have to wonder
what is going on at OpenAI have they
developed some methodology for
developing new AI technologies that we
haven't seen before something is
fundamentally different here and I'd
love to hear your thoughts on that how
far is OpenAI ahead and how long
will it take open source to catch up to
OpenAI with that folks I hope you
learned something here I hope this was a
little bit enlightening and dived a
little bit deeper into GPT-4 Omni and how
significant it truly is in the greater
AI landscape because it was more of a
large drop than I think a lot of people
realized leave a like if this helps you
out also check if you're subscribed a
lot of people aren't subscribed and they
still watch the channel so I always try
to remind people and of course check out
the Discord server if you want to get a
little bit more involved and active in
the AI Community as a whole see you guys
in the next one thanks for watching and
goodbye