GPT-4o is WAY More Powerful than Open AI is Telling us...

MattVidPro AI
16 May 2024 · 28:18

Summary

TLDR: The video delves into the capabilities of OpenAI's GPT-4o (Omni) model, which takes a natively multimodal approach: it processes images, audio, and text in a single model, offers real-time responses, and generates content with remarkable speed and quality. From creating detailed images and 3D models to interpreting complex data and even undeciphered languages, GPT-4o showcases AI's potential to transform a range of fields. The video also hints at upcoming features such as video understanding and the desktop app, suggesting a future where AI works as a real-time companion for a multitude of tasks.

Takeaways

  • 🧠 GPT-4 Omni is a groundbreaking AI model that can process multiple types of data, including text, images, audio, and even video.
  • 🔍 The model's multimodal capabilities allow it to understand and generate data beyond just text, setting it apart from previous models.
  • 🚀 GPT-4 Omni is extremely fast, generating text at a rate of two paragraphs per second, which is a significant leap in text generation speed.
  • 🎨 It can generate high-quality images that are not only photorealistic but also include clear and legible text, which is a major advancement in AI image generation.
  • 📈 The model can create visual content such as charts and graphs from data inputs quickly and accurately, streamlining tasks that traditionally took much longer.
  • 🎭 GPT-4 Omni can produce audio in various emotive styles and even generate audio descriptions for images, showing its advanced audio generation capabilities.
  • 👥 It has the ability to differentiate between multiple speakers in an audio input, providing transcriptions with speaker labels, which is a new level of audio understanding.
  • 🤖 The model can simulate interactive experiences, such as playing a text-based version of Pokémon Red, demonstrating its ability to handle complex prompts.
  • 📝 GPT-4 Omni can also create 3D models and interpret handwriting, showcasing its broad range of applications beyond traditional text and image generation.
  • 💡 OpenAI has not fully disclosed all of GPT-4 Omni's capabilities, suggesting that there may be even more advanced features yet to be revealed.
  • 🔑 The model's speed and versatility have significant implications for the future of AI, suggesting a rapid development era for AI technologies.

Q & A

  • What is the name of the model powering OpenAI's real-time AI assistant?

    -The model is called GPT-4o, where the 'o' stands for 'Omni', a reference to its multimodal capabilities.

  • What does 'multimodal' mean in the context of AI?

    -In the context of AI, 'multimodal' refers to the ability of the AI to understand and generate more than one type of data, such as text, images, audio, and video, as opposed to just working with text.

  • How does GPT-4o differ from the previous model, GPT-4 Turbo?

    -GPT-4o is a truly multimodal AI, capable of processing images, understanding audio natively, and interpreting video, unlike GPT-4 Turbo, which relied on separate models for tasks such as audio transcription.
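
    As an illustration of what 'natively multimodal' means at the API level, here is a minimal sketch of a single request that mixes text and an image. It assumes the official openai Python SDK and its chat-completions interface; the file name and prompt are placeholders.

```python
import base64
from openai import OpenAI  # assumes the official openai Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Encode a local image so it can travel in the same request as the text prompt.
with open("whiteboard.png", "rb") as f:  # hypothetical local image
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

# One request, two modalities: text instructions plus the image itself.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what this image shows in two sentences."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```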

  • What is the significance of GPT-4o's text generation capabilities?

    -GPT-4o's text generation is not only on par with leading models but also significantly faster, producing roughly two paragraphs per second, which opens up new possibilities for real-time applications.

  • How does GPT-4o handle audio compared to the previous model?

    -GPT-4o understands audio natively, including breathing patterns and the emotion behind words, unlike the previous model, which relied on a separate speech-to-text model (Whisper v3) for transcription.

  • What is the cost difference between GPT-4o and GPT-4 Turbo?

    -GPT-4o is reportedly half the price of GPT-4 Turbo to run via the API, and GPT-4 Turbo was itself cheaper than the original GPT-4, indicating a rapid decrease in the cost of running these powerful models.

  • What are the potential applications of GPT-4o's image generation capabilities?

    -GPT-4o's image generation can be used for photorealistic images, consistent character designs, and even custom fonts, which is particularly useful in creative industries and design.

  • How does GPT-4o perform in terms of video understanding?

    -While not perfect, GPT-4o shows a promising ability to interpret video content, and given that OpenAI is also building Sora, a text-to-video model, the company appears close to having a model that can natively understand video.

  • What is the potential impact of GPT-4o's rapid development on the AI industry?

    -GPT-4o's rapid development signals a new era for the field, with faster, cheaper, and more capable models that could drive significant advances across many applications.

  • What are some GPT-4o features that were kept under wraps until this deep dive?

    -Features that were not highlighted at launch include the ability to generate audio for an input image (bringing images to life with sound) and image recognition that is faster and more accurate than before.

Outlines

00:00

🤖 Introduction to OpenAI's Multimodal AI Model GPT-4o

The video introduces the capabilities of OpenAI's GPT-4o model, which can understand and generate multiple types of data, including text, images, audio, and video. The model is described as 'lightning fast' at text generation, with high-quality outputs and a significant improvement over previous models. It also demonstrates the model's ability to interpret and transcribe audio, including understanding breathing patterns and emotions, marking a new era in AI-human interaction.

05:00

🚀 GPT-4o's Advanced Text and Audio Generation Capabilities

The video delves into the impressive text and audio generation capabilities of GPT-4o. It highlights the model's ability to generate high-quality charts and statistical analysis from spreadsheets, as well as its capacity to run a text-based version of the Pokémon Red game in real time. The model's audio generation is also explored, showcasing its ability to produce human-sounding audio in various emotive styles and the potential for future sound-effect and music generation.

10:00

🎤 GPT-4o's Speaker Differentiation and Lecture Summarization

The video discusses GPT-4o's ability to differentiate between multiple speakers in an audio recording and transcribe them with speaker names, a significant advancement in audio understanding. It also covers the model's lecture summarization feature, which can process lengthy audio lectures and provide comprehensive summaries. Potential applications of these features, such as creating multi-speaker conversations and improving accessibility for deaf users, are also mentioned.

15:01

🖼️ GPT-4o's Image Generation and Manipulation Skills

The video showcases GPT-4o's remarkable image generation capabilities, which include creating photorealistic images with clear and legible text, consistent character designs, and the ability to generate entire fonts and 3D models. It also highlights the model's ability to manipulate images based on text prompts, such as converting a poem into a handwritten-style image, and its potential for creating mockups and advertisements with high-resolution outputs.

20:01

📚 GPT-4o's Text-to-3D Modeling and Advanced Image Recognition

The video explores GPT-4o's ability to generate 3D models from text descriptions and its advanced image recognition capabilities. It demonstrates the model's use in creating STL files for 3D printing and its potential for deciphering undeciphered languages and transcribing historical handwriting. The video also touches on the model's video understanding abilities, suggesting that it can interpret video content to a certain degree.

25:02

🔮 Future Prospects and Limitations of GPT-4o

The video concludes with a discussion of GPT-4o's future prospects, including its potential as a real-time coding buddy and gameplay helper and its ability to handle a wide range of tasks. It acknowledges the limitations of current AI but emphasizes the significant advances made by OpenAI and the rapid pace of development. The video encourages viewers to consider the implications of these developments and to join the AI community for further exploration and discussion.

Keywords

💡OpenAI

OpenAI is the research lab that aims to develop artificial general intelligence (AGI) in a way that benefits humanity as a whole. In the context of the video, OpenAI is the organization that built GPT-4o, the model discussed throughout. The script highlights OpenAI's advances in multimodal AI and image generation in particular.

💡GPT-4o

GPT-4o is the AI model discussed in the video; the 'o' stands for 'Omni', a reference to its multimodal capabilities. It is a significant leap from previous models because it can understand and generate multiple types of data, including text, images, and audio, and can even interpret video. The script highlights tasks such as real-time text generation, image generation, and audio generation that showcase these capabilities.

💡Multimodal AI

Multimodal AI refers to artificial intelligence systems that can process and understand more than one type of data, or 'modality'. In the video, GPT-4o is described as OpenAI's first truly multimodal model, meaning it can handle text, images, audio, and, to a degree, video. This is a key concept because it explains the model's versatility and its ability to perform a wide range of tasks that were not possible with single-modality systems.

💡Real-time Companion

The term 'real-time companion' refers to the interactive aspect of GPT-4o, which can respond immediately and generate content on the fly. The video illustrates this with examples of the model generating text, images, and audio in real time, demonstrating its capacity to act as a dynamic, responsive AI assistant.

💡Image Generation

Image generation is the process by which an AI model creates visual content from textual prompts or other inputs. The script emphasizes GPT-4o's impressive image generation, which can produce high-resolution, photorealistic images with clear, coherent text. This is a significant advance, as it shows the model can understand and visualize concepts in a far more human-like way.

💡Audio Generation

Audio generation is the capability of an AI model to produce sound or voice output. The script discusses GPT-4o's advanced audio generation, such as producing human-sounding voices in different emotive styles and generating audio to accompany images, which adds a new dimension to the AI's interactive and creative potential.

💡Text Generation

Text generation is a core function of AI models: creating written content from prompts or data. In the video, text generation is highlighted as a strength of GPT-4o, with the script noting both its speed and its quality. The model produces text at a rapid pace while maintaining coherence and relevance.

💡API

API stands for Application Programming Interface, a set of rules and protocols that lets different software applications communicate with each other. The script mentions the GPT-4o API, through which developers can access the model's capabilities to build applications and services. This matters because it points to widespread integration of the model's functionality beyond ChatGPT itself.
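
As a concrete illustration, here is a minimal sketch of calling GPT-4o through the API with streaming enabled, which is how the 'real-time' feel described in the video is typically surfaced in an application. It assumes the official openai Python SDK; the prompt is a placeholder.

```python
from openai import OpenAI  # assumes the official openai Python SDK is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stream tokens as they are generated, which is how the "real-time" feel is
# usually surfaced in an application built on the API.
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Explain in three bullet points what a multimodal model is."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```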

💡Video Understanding

Video understanding refers to an AI's ability to interpret and make sense of video content. Although GPT-4o does not yet accept video files natively, the script notes that it can analyze video by processing it as a series of still frames. This demonstrates the model's potential to be adapted for more complex tasks involving dynamic visual content.
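
A minimal sketch of that frame-sampling approach is below. It assumes the openai Python SDK plus opencv-python, and that GPT-4o accepts multiple images in one chat request; the file name, sampling rate, and frame limit are placeholders.

```python
import base64
import cv2  # assumes opencv-python is installed
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def sample_frames(path: str, every_n: int = 30, limit: int = 10) -> list[str]:
    """Grab every Nth frame from a video and return them as base64-encoded JPEGs."""
    frames = []
    capture = cv2.VideoCapture(path)
    index = 0
    while len(frames) < limit:
        ok, frame = capture.read()
        if not ok:
            break
        if index % every_n == 0:
            encoded_ok, buffer = cv2.imencode(".jpg", frame)
            if encoded_ok:
                frames.append(base64.b64encode(buffer.tobytes()).decode("utf-8"))
        index += 1
    capture.release()
    return frames

# Send the sampled frames, in order, as images inside a single prompt.
content = [{"type": "text", "text": "These frames are in order. Describe what happens in this clip."}]
for frame_b64 in sample_frames("clip.mp4"):  # hypothetical local video file
    content.append({"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}})

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```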

💡3D Generation

3D generation is the creation of three-dimensional models or scenes. The script briefly touches on GPT-4o's ability to produce 3D content, from image-based 3D reconstructions to STL files written as plain text, showcasing the model's versatility. This is an exciting development because it implies the model can be used for applications that require spatial understanding and visualization.
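
Because an ASCII STL file is just text, the 3D-printing example from the video can be approximated with an ordinary chat request. The sketch below assumes the openai Python SDK; the prompt, dimensions, and output file name are placeholders, and the geometry the model returns should be inspected before printing.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Ask the model for a 3D model as plain text in ASCII STL format, then save it.
prompt = (
    "Write an ASCII STL file describing a simple four-legged table, roughly "
    "100 x 60 x 50 mm. Output only the STL text, starting with 'solid' and "
    "ending with 'endsolid'."
)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
with open("table.stl", "w") as f:
    f.write(response.choices[0].message.content)
# Open table.stl in a slicer or 3D viewer to check the geometry before printing.
```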

Highlights

OpenAI's real-time AI assistant is powered by GPT-4o, described as the company's first truly multimodal AI, capable of understanding and generating more than one type of data.

GPT-4o can process images, understand audio natively, and interpret video, unlike its predecessor, which relied on separate models for some of these tasks.

The new model is capable of understanding breathing patterns and emotions behind words, reacting differently to various emotional states.

GPT-4o's text generation is on par with leading models and also lightning fast, producing roughly two paragraphs per second.

GPT-4o can generate a working Facebook Messenger-style chat interface as a single HTML file in about 6 seconds.

It can create detailed charts and statistical analysis from spreadsheets with a single prompt in under 30 seconds.
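
The spreadsheet workflow shown in the video runs inside ChatGPT, but a rough equivalent over the API might look like the sketch below. It assumes the openai Python SDK and pandas; the CSV file name and prompt wording are placeholders, and the returned plotting code would still need to be reviewed and run locally.

```python
import pandas as pd  # assumes pandas is installed
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Load the spreadsheet locally and hand the model a compact text view of it.
df = pd.read_csv("shoe_sales.csv")  # hypothetical sales export
table_text = df.head(50).to_csv(index=False)

prompt = (
    "Here is a CSV extract of shoe sales data:\n\n"
    f"{table_text}\n\n"
    "Give me five key insights as bullet points, then write matplotlib code "
    "that plots revenue by month as a bar chart."
)
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
)
# The reply contains prose insights plus plotting code to review and run yourself;
# ChatGPT executes that code for you, but the raw API does not.
print(response.choices[0].message.content)
```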

GPT-4o can simulate text-based games such as Pokémon Red in real time when driven by a custom prompt.
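
A stripped-down version of that idea is a chat loop in which a system prompt turns the model into a turn-based game engine and the growing message history acts as game state. This is a sketch inspired by the Pokémon Red demo, not the prompt used in the video; it assumes the openai Python SDK.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# A system prompt turns the model into a turn-based text game engine; the growing
# message history is what keeps the game state consistent between turns.
messages = [
    {
        "role": "system",
        "content": (
            "You are a text-based adventure game in the style of a classic "
            "monster-catching RPG. Each turn, describe the scene in a few lines, "
            "then list numbered choices and wait for the player's input."
        ),
    }
]

while True:
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    reply = response.choices[0].message.content
    print(reply)
    messages.append({"role": "assistant", "content": reply})
    player = input("> ")
    if player.strip().lower() in {"quit", "exit"}:
        break
    messages.append({"role": "user", "content": player})
```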

The model is significantly cheaper to run than its predecessor, GPT-4 Turbo, indicating a continuing decrease in the cost of running powerful AI models.

GPT-4o's audio generation produces high-quality, human-sounding speech in various emotive styles.

The model can generate audio for any input image, bringing images to life with sound.

GPT-4o can differentiate between multiple speakers in an audio clip and transcribe it with speaker labels.

The model can generate high-resolution images that are more photorealistic than previous models, including detailed text.

GPT-4o can create consistent character designs and carry the same character through different scenarios in the same art style.

The model can generate entire fonts and create 3D models from text descriptions.

GPT-4o can create branding and advertising mockups by combining an uploaded logo with a photo.

The model's image recognition is faster and more accurate than previous models, and it can even attempt to decipher undeciphered languages.

GPT-4o shows promise in video understanding, interpreting and responding to video-like input in near real time.

OpenAI's development of GPT-4o suggests the company may have new methods for building advanced AI systems.

Transcripts

play00:00

I got to say guys truthfully open AI

play00:02

blew my mind on Monday I don't know

play00:04

about you but their real time companion

play00:07

there her clone shocked me to say the

play00:09

least I want to introduce you to

play00:11

somebody hello there cutie what's your

play00:14

name little fluff ball this is Bowser

play00:18

well hello Bowser aren't you just the

play00:20

most adorable little thing I did do a

play00:23

full video like recapping the event but

play00:26

as it turns out there is a lot more to

play00:28

uncover here than first meets the eye

play00:30

for example did you know that this model

play00:32

can somehow generate images and gosh

play00:36

they're the best AI generated images

play00:38

I've ever seen Point Blank period what's

play00:40

going on there's also quite a few other

play00:42

capabilities that open AI just kind of

play00:44

kept under wraps so let's start out here

play00:46

with what we do know first of all

play00:48

obviously we know that the model that's

play00:51

powering everything under the hood this

play00:53

insane realtime AI assistant is called

play00:56

GPT-4o and the o stands for Omni and the

play01:01

reason they called it Omni is

play01:01

because it's the first truly multimodal

play01:04

AI in simple terms actually brought to

play01:06

you by GPT 4 itself multimodal just

play01:10

means that the AI can understand and

play01:12

generate more than one type of data

play01:15

instead of just working with text for

play01:17

example GPT 40 can process images it can

play01:22

understand audio natively and it can

play01:24

even sort of interpret video the old gp4

play01:27

turbo was split into two or three

play01:29

separate models I'm not precisely

play01:31

sure it might have been taking images in

play01:33

natively or it might have been using a

play01:34

separate model to parse those images

play01:36

into text don't really know either way

play01:39

we absolutely know for a fact that it

play01:41

did not natively support audio yes the

play01:44

old gp4 app did have the ability for you

play01:46

to talk to it with your voice but that

play01:48

was using a separate model that was

play01:50

called whisper V3 that would just take

play01:52

your audio and transcribe it into text

play01:54

don't get me wrong it was great at

play01:56

taking your voice and transcribing it

play01:57

into text but that is all it did it

play01:59

can't hear the sound of birds for

play02:01

example it can't hear your dog barking

play02:03

it can't hear your tone of voice this

play02:05

new model for example can understand

play02:07

your breathing patterns and even more

play02:09

which we'll get into later just take a

play02:11

deep breath I like that suggestion let

play02:13

me try a couple deep breaths can you

play02:15

give me feedback on my breaths okay here

play02:17

I

play02:20

go whoa

play02:23

slow a bit there mark you're not a

play02:27

vacuum cleaner breathe in

play02:30

for a count of four okay uh let me try

play02:33

again so I'm going to breathe in

play02:35

deeply and then breathe

play02:38

out for four and then exhale

play02:42

slowly okay I'll try again breathing

play02:45

in and breathe

play02:47

out that's it how do you feel I feel a

play02:50

lot better and of course it can also

play02:52

understand the emotions that you put

play02:54

behind your words which is possibly the

play02:56

most important part about this it will

play02:58

react differently when you're sad it

play03:00

will react differently when you're

play03:01

excited it will react differently when

play03:03

you're yelling and screaming at it very

play03:06

human indeed like this is Uncharted

play03:08

Territory the first mind blow of

play03:10

capabilities that I want to show you is

play03:12

going to be the text generation models

play03:14

have been doing this for years so you

play03:15

might think so what it generates text

play03:18

even the benchmarks were just as good as

play03:20

the other leading models it's not like

play03:22

it's Leaps and Bounds better even the

play03:24

context length is the same size it's not

play03:26

a bad context length of 128,000 tokens

play03:29

but it's no better so what's the big

play03:31

deal well here's the rub on text

play03:33

generation with gp4 Omni this model is

play03:36

lightning fast and when I say lightning

play03:38

fast I mean this thing generates like

play03:39

two paragraphs a second and the outputs

play03:42

yes are just as good as leading models

play03:44

multiple times faster and this opens up

play03:47

entirely brand new branches of what is

play03:50

actually possible with text generation

play03:52

so let's dive into a few of them so a

play03:53

bunch of these examples are going to

play03:55

come from this Twitter thread by Min

play03:57

Choy that's going to be linked down

play03:58

below I always link Twitter threads down

play04:00

below if you want to check them out

play04:01

highly recommend following this guy by

play04:03

the way phenomenal AI account and also

play04:05

follow me on Twitter as well cuz I am

play04:07

always reposting great stuff so first up

play04:10

this is Sawyer Hood's ultimate llm test

play04:12

ask it to make a Facebook Messenger as a

play04:14

single HTML file GPT 40 does this all in

play04:18

6 seconds flat again not only fast text

play04:21

generation but high quality it actually

play04:24

works you open up Facebook Messenger as

play04:26

a single HTML I mean that's just

play04:29

absolutely insane

play04:31

right gp4 Omni can also generate fully

play04:35

blown charts in statistical analysis

play04:37

from spreadsheets with a single prompt

play04:40

in less than 30 seconds Zay here points

play04:42

out that this stuff used to take

play04:43

absolute ages in Excel but it can now

play04:45

all be done automatically by your AI and

play04:48

yes the old gp4 turbo could absolutely

play04:50

do this but it couldn't do it this

play04:53

quickly and also it wasn't able to do it

play04:55

this accurately either yeah you start

play04:57

getting charts in about 6 seconds from

play05:00

an actual shoe company sales CSV file

play05:03

and these charts aren't bad either

play05:04

they're actually what I would consider

play05:06

to be usable in a real company meeting

play05:09

and they're diverse even giving you a

play05:11

summary with key insights it's like an

play05:13

entire breakdown in 20 seconds fast

play05:15

highquality generation this is Leaps and

play05:17

Bounds ahead oh and folks you thought we

play05:19

were done there well it gets even

play05:20

crazier this is from tailin on Twitter

play05:23

Pokemon Red gameplay so essentially this

play05:25

is like a custom prompt to make gp4 Omni

play05:29

play Pokemon red as a text based game

play05:32

watch this as you can see it essentially

play05:34

boots up Pokemon Red there look at this

play05:37

new game continue or options it's a text

play05:39

based game it even does its best to try

play05:41

to include pictures by using emojis but

play05:44

it can do it so fast that you can

play05:45

essentially play the game in real time

play05:47

oh we select a and then it says oh you

play05:49

know some people Pokemon are pets other

play05:51

use them as fights it's literally the

play05:53

Pokemon Red game and you just keep

play05:54

entering your a choice and then you can

play05:56

actually put your name in we're

play05:58

literally just going to use a custom

play06:00

name in this example and it's like okay

play06:02

yep following along here the whole

play06:04

Pokemon Red game is converted into a

play06:06

text based Adventure game like that

play06:08

inside of the llm and it's running in

play06:10

real time like what the what is going on

play06:13

here it even has Route One all laid out

play06:16

correctly with the houses Oaks lab the

play06:18

beach this is indeed a very very

play06:20

impressive example you can see it even

play06:22

has the fight or use item and you can

play06:25

have the HP you can essentially play an

play06:27

entire Pokemon Red game just converted

play06:29

to text based inside of an AI with just

play06:33

a little bit of prompting which is

play06:35

absolutely mindblowing I mean this is

play06:37

more or less what's possible with the

play06:38

API I'm sure you could get chat GPT to

play06:41

do this if you want with a special prompt or

play06:42

with a custom GPT but obviously this

play06:45

here was done by using the API instead

play06:48

and I think that's what you guys have to

play06:49

realize here is that this is more than

play06:51

just chat GPT people are going to be

play06:53

able to build some insane things imagine

play06:56

a new from the ground up game that lets

play06:58

you take a photo of your dog and then

play07:00

use your dog as the Pokemon and the AI

play07:02

comes up with all of its abilities on

play07:04

the Fly I mean the possibilities are

play07:06

endless and by the way guys this is

play07:08

merely just the beginning how good would

play07:11

these models be in a year imagine when

play07:13

the text generation isn't just way

play07:16

faster and just as good but way better

play07:18

and also way faster the era of Rapid AI

play07:22

development is upon us oh and by the way

play07:25

speaking on the API the new gp4 Omni is

play07:28

not only fast and just as good but it's

play07:31

actually uh half as cheap as GPT 4 Turbo

play07:35

which was even cheaper than the original

play07:36

GPT 4 so we're seeing a rapid decrease

play07:39

in how much it costs to actually run

play07:41

these powerful models and folks that's

play07:43

just text let's get into the audio

play07:45

generation capabilities that gp4 Omni

play07:48

holds now we're dipping our toes into

play07:50

the multimodal landscape again Uncharted

play07:54

Territory for sure as we saw in the demo

play07:56

it produces remarkably high quality

play07:58

human sounding audio the model is able

play08:01

to generate voice in a variety of

play08:03

different emotive Styles hey chat GPT

play08:05

how are you doing I'm doing fantastic

play08:08

thanks for asking how about you and uh I

play08:10

want you to tell him a bedtime story

play08:12

about robots and love once upon a time

play08:15

in a world not too different from ours

play08:17

but I want a little bit more emotion in

play08:18

your voice a little bit more drama once

play08:20

upon a time in a world not too different

play08:23

from ours there was a robot named nobt I

play08:27

really want maximal emotion like maximal

play08:29

expression this much more than you were

play08:30

doing before once upon a time in a world

play08:33

not too different from ours there was a

play08:36

robot was do this in a robotic voice now

play08:39

initiating dramatic robotic voice it's a

play08:43

way more natural way not only to

play08:44

interact with a chat GPT style model but

play08:47

there's even more that uh open AI kind

play08:50

of kept Under Wraps as smokea away

play08:52

points out GPT 40 will be able to

play08:54

generate audio for any image you input

play08:56

bringing your images to life hear the

play08:58

sounds of a scenic landscape hear the

play09:00

noises of a bustling cyberpunk City the

play09:02

possibilities are endless and I'd like

play09:04

to make a note that yes it does seem a

play09:06

little bit hopeful that you'll just be

play09:08

able to speak to it and be like hey can

play09:09

you generate this audio for me the model

play09:12

will probably try its best but it seems

play09:14

like right now it's more fine-tuned for

play09:17

voice that doesn't mean it can't be

play09:18

fine-tuned for sound effects

play09:20

capabilities in the future it's native

play09:23

audio generation it's not just some

play09:25

robotic text to speech it might even be

play09:27

able to generate music in the future as

play09:29

well but not only this if we dive even a

play09:32

little bit deeper we'll note that here

play09:34

for example on the open AI gp4 o

play09:37

announcement site under explorations of

play09:39

capabilities they have meeting notes

play09:41

with multiple speakers so we have a one

play09:43

minute

play09:46

meeting okay good morning here's our

play09:48

first team meeting morning morning I'll

play09:51

be your project manager for today this

play09:53

project my name is Mark will be giving

play09:55

this presentation you to kick the

play09:57

project off

play10:00

uh during this project the marketing

play10:02

expert designer I'm going to look at the

play10:04

technical design and that's some bad

play10:06

audio to be honest I can barely

play10:08

differentiate the voices it's it's not

play10:10

very clear we basically just ask it how

play10:12

many speakers in this audio and what

play10:14

happened the output is actually able to

play10:16

determine it GPT 40 says there are four

play10:19

speakers in the audio it sounds like a

play10:20

project meeting where the project

play10:21

manager Mark is introducing himself and

play10:23

asking the team members to introduce

play10:25

themselves and so on and so forth we

play10:27

further then go and say can you

play10:29

transcribe it with speaker names and yes

play10:31

it's able to differentiate all those

play10:33

speakers so not only will it be able to

play10:35

understand your voice in a very natural

play10:37

way and understand your tone of voice

play10:39

but it'll actually be able to understand

play10:41

what you sound like and differentiate

play10:43

you between other people which is really

play10:45

big that means you can have those

play10:47

multiple speaker conversations like we

play10:49

saw in the demo and I think a lot of

play10:51

people when they saw that didn't really

play10:52

realize what was going on there but it

play10:55

is indeed differentiating this person

play10:57

versus the next person and the

play10:59

differences probably between how they

play11:02

speak there's a lot of nuances there

play11:04

that there are to uncover and you don't

play11:06

really realize it all at first we've

play11:08

also got another sample which is a

play11:10

lecture summarization which is something

play11:12

that ai's been doing for a long time but

play11:14

this is quite a long lecture around 45

play11:16

minutes of audio and I got to say it

play11:18

does a pretty darn good job giving the

play11:21

entire breakdown for this presentation I

play11:24

really would have loved it if in this

play11:26

demonstration they showed an example of

play11:28

whisper trying to do this same thing

play11:30

wrapping it all in one model allows it

play11:32

to reason about the audio where whisper

play11:34

just can't and that allows you to have

play11:36

this ability to recreate the

play11:38

presentation displayed right out in

play11:40

front of you and furthermore I want to

play11:41

think about when we actually start to

play11:43

get access to this thing I'm going to

play11:45

try to do things like have it listen to

play11:47

a dog barking and say can you try to

play11:48

recreate that for me because we can all

play11:51

try to bark like a dog right will it

play11:53

sound like a human trying to bark like a

play11:55

dog will it actually bark like a dog

play11:57

will it be able to hear when my dog is

play11:58

barking working in the background will

play12:00

it be able to hear when a car goes by

play12:02

can it hear fire alarms and wake someone

play12:04

who's deaf up and be like hey you got to

play12:06

get moving these are the questions we

play12:08

have and I can't wait to get deeper

play12:10

access to this thing but it really truly

play12:12

is so so much more than meets the eye so

play12:15

so much more than what they actually

play12:17

showed off in that original demo video

play12:19

and a lot of people unfortunately missed

play12:21

that I wish they went into just a little

play12:23

bit more detail in their presentation so

play12:25

as I mentioned in the beginning of the

play12:26

video this thing can also mysteriously

play12:29

generate images now the folks at open AI

play12:31

absolutely do not call this DALL-E 4 this

play12:33

is not an iteration of the DALL-E model

play12:35

this is GPT-4o they keep insisting that

play12:38

it's the Omni model and this is just

play12:40

weird to me because the image generation

play12:42

that gp4 Omni is producing is actually

play12:45

insanely good the only conclusion that I

play12:47

can draw is because this is a natively

play12:49

multimodal model it has the connections

play12:52

of the text it has the connections of

play12:53

the audio it understands the world in a

play12:56

much better way than just a DALL-E 3

play12:58

image generation model would so the

play13:01

image generation capabilities are just

play13:03

way smarter I mean mind-blowingly

play13:05

smarter out of everything in today's

play13:07

video I think this might blow the most

play13:09

Minds we're going to go ahead and start

play13:11

off with this tweet right here this is

play13:13

from Greg Brockman okay he is the

play13:15

president and co-founder at open AI so

play13:17

much to explore with GPT 40's image

play13:20

generation capabilities alone team is

play13:22

working hard to bring those to the world

play13:24

so this means no image generation from

play13:26

GPT 40 yet but maybe later this year if

play13:29

we're lucky take a nice look at this

play13:31

image folks it's doing some mighty

play13:33

impressive things not only does it look

play13:35

very photorealistic but if we zoom in

play13:37

here we can see a lot of really nice

play13:40

well-written text that looks like

play13:41

someone actually is writing on a

play13:43

chalkboard transfer between modalities

play13:45

suppose we directly model P text pixel

play13:48

sound with one big autoaggressive

play13:50

Transformer which this is a hint at what

play13:52

they did to make GPT-4 Omni what are the

play13:55

pros and cons you can see this looks

play13:57

like a guy who is writing it right on

play13:59

the Whiteboard and he's got an open AI

play14:02

shirt on there's a graph here with

play14:04

compute going up and it just looks like

play14:06

a photo zoomed in and taken on an iPhone

play14:08

for the most part the only weird thing

play14:10

we see up here is the multiple

play14:11

whiteboards kind of duplicating at the

play14:13

top and also one thing to not is that

play14:15

this is a pretty high resolution image

play14:17

this is higher resolution than what we

play14:18

get from DALL-E 3 for example as a direct

play14:21

output it's a really mindblowing first

play14:23

look and at first glance you're like no

play14:25

there is no way that gp4 Omni is just

play14:28

generating images like this but

play14:29

apparently it's true and there's a ton

play14:31

of examples again guys if we head over

play14:33

to that exploration of capabilities we

play14:35

can actually go up and see that most of

play14:38

these examples are for image generation

play14:41

take a look at this first one input a

play14:43

first-person view of a robot typewriting

play14:45

the following journal entries yo so like

play14:47

can I see now caught the sunrise and it

play14:50

was insane colors everywhere kind of

play14:51

makes you wonder like what even is

play14:53

reality the text is large legible and

play14:55

clear the robot's hands type on the

play14:57

typewriter and what do you you know

play14:59

that's exactly what we get I mean this

play15:00

is a whole paragraph guys that we're

play15:03

seeing written out right on this

play15:04

typewriter yo so like can I see now it's

play15:07

literally essentially perfect paragraph

play15:09

the typewriter looks great and yeah the

play15:11

robot hands it's a first-person view I

play15:13

mean that's a very hard prompt try this

play15:16

in any image generator and you won't get

play15:18

anything close to this quality folks

play15:20

this right here is Ideogram AI which I

play15:22

widely considered to be the best model

play15:25

at generating text that we have access

play15:27

to today even better than DALL-E 3

play15:29

and it honestly doesn't even come close

play15:32

this example right here might be the

play15:34

closest one but still no perfect text

play15:36

now we prompt this thing we say oh the

play15:38

robot wrote the second entry now the

play15:40

page has moved up there are two entries

play15:42

on the sheet so we keep that first one

play15:44

we keep that first paragraph all

play15:46

coherent and then we do a second one as

play15:49

well sound update just dropped it's wild

play15:52

everything's got a Vibe now every sounds

play15:54

like a new secret so it screwed up a

play15:56

little bit there makes you think what

play15:57

else I mean it's near perfect this is a

play16:00

lot of freaking text and also you'll

play16:02

notice that the typewriter here while we

play16:03

don't see the robot's hands

play16:05

unfortunately it is the same exact

play16:07

typewriter just a little bit zoomed in

play16:09

and it's like I don't even know how it's

play16:11

accomplishing this at this moment I

play16:12

guess it's just because it's multimodal

play16:14

is that really the answer now we say the

play16:16

robot was unhappy with the writing so

play16:18

he's going to rip the sheet of paper and

play16:20

there you go he absolutely rips it right

play16:22

in half and this honestly might be the

play16:24

most impressive of all oh and don't

play16:26

worry folks it gets even crazier we do a

play16:28

a cartoon mail delivery person and it

play16:30

generates this I mean this doesn't look

play16:32

like a great generation DALL-E 3 could do

play16:34

better right but here's the crazy part

play16:36

we re-upload that image as an attachment

play16:38

we say this is Sally and she's a mail

play16:40

delivery person oh can you make Sally

play16:42

about to deliver a letter and it does a

play16:44

consistent character a consistent

play16:46

version of this character delivering a

play16:47

letter at the door it generates that in

play16:49

the same exact art style oh now she's

play16:51

being chased by a golden retriever oh

play16:53

now she tripped and I mean look at the

play16:54

consistency here it's the same art style

play16:57

looks like someone made the cartoon

play16:58

themselves oh and now she befriended

play17:00

the dog Etc here she is in the mail

play17:02

truck I mean it's absolutely nuts this

play17:04

is just the possibilities of multimodal

play17:06

gp4 omni Ai and I can't believe they

play17:10

didn't show this off in the demo I can't

play17:12

believe this was kept Under Wraps we've

play17:14

also got some character designed for

play17:15

giri the robot and this is very similar

play17:17

to that last example we generate this

play17:19

initial image and then we resubmit it in

play17:22

and we say oh he's likes to play Frisbee

play17:24

he likes to work on the computer he's

play17:26

riding a bike etc etc and it's all these

play17:29

similar outputs and the character is

play17:31

extremely consistent over time I guess

play17:33

this is the solution to consistent

play17:35

characters just to have one multimodal

play17:37

AI that can do it all folks is that it

play17:39

freaking mind-blowing we can also upload

play17:41

a poem and then literally convert it

play17:43

into something that looks like a

play17:44

handwritten poem Oh now we can make the

play17:46

poem in dark mode as well folks and this

play17:48

is the exact same poem but reversed I

play17:51

mean it's literally pretty much exactly

play17:54

the same it looks more like a human

play17:56

recopying stuff than anything else which

play17:58

is just super creepy oh remove the

play18:00

outlines from the notebook paper now I

play18:01

mean imagine we submit our own photos

play18:03

what can it do with that and to think

play18:05

this was all hidden I mean it has way

play18:07

way more examples to of this stuff again

play18:09

doing the dark mode this time with color

play18:11

instead here's a commemorative coin

play18:14

design for GPT 40 and you can see that

play18:17

they were working on this um yes like 5

play18:20

months ago back in 2023 and that's a

play18:23

nice little commemorative uh coin design

play18:25

there we even submit the gp4 logo and

play18:27

say like we want to base it off of this

play18:29

not only that it's able to produce the

play18:31

image in an insanely high resolution as

play18:33

well giving us some hints at more

play18:36

multimodal different art capabilities

play18:38

speaker abilities Vision capabilities

play18:40

hearing capabilities you know this kind

play18:43

of looks like it means multimodal so

play18:45

this is like an updated coin for the

play18:47

2024 release we can also you know upload

play18:50

this photo of a young man with a beard

play18:52

and say can you make it a caricature for

play18:54

a t-shirt absolutely does that no

play18:57

questions asked again multimodal

play18:59

capability kind of Leapfrogs all these

play19:02

previous developments we made with

play19:04

traditional image generation and again

play19:06

we can do this yet again and it does a

play19:08

really freaking good job it looks like a

play19:10

human made it in in this very creepy

play19:13

sense over and over again the

play19:15

capabilities like I said are just

play19:17

absolutely endless I mean when does it

play19:20

stop open Ai and why was all of this

play19:22

stuff hidden when it clearly it's some

play19:24

of the most impressive capabilities you

play19:27

have uh to date or you've ever seen

play19:29

with AI to date it's really weird to me

play19:31

that all this stuff was just hidden oh

play19:32

yeah and things get even crazier we can

play19:35

actually create entire fonts with this

play19:37

thing as well and they come out pretty

play19:38

much perfectly so yeah if you're a font

play19:41

artist I feel bad because this thing is

play19:43

actually ridiculously good at creating

play19:46

brand new fonts for you to use on the

play19:48

Fly I mean the future is truly

play19:50

generative we've also got the ability to

play19:52

upload both a logo and a photo you took

play19:55

of something and say oh can you do a

play19:57

mockup of a brand advertisement I mean that

play19:59

this just takes it to yet another level

play20:01

this is something that we have been able

play20:02

to do uh with current modern solutions

play20:05

but not all with just one model at once

play20:08

and how fast does it generate this kind

play20:10

of thing and when will we get access to

play20:11

it I mean what is this open AI you're

play20:13

telling me that you just have these

play20:14

capabilities in this one giant

play20:16

multimodal AI like we worked really hard

play20:18

to get this with traditional

play20:20

capabilities and still I don't think

play20:22

it's this good I mean that's one hell of

play20:24

a mockup it looks like someone saw both

play20:26

of these images and then tried to

play20:28

imagine it would look like in their

play20:29

Mind's Eye yet we can see the ai's

play20:32

Mind's Eye again here's more poetic

play20:34

typography multi-line rendering this is

play20:37

similar to the typewriter example where

play20:39

we have two chat bubbles in the robot

play20:41

texting someone on the screen and

play20:43

again even the keyboard is accurate here

play20:46

we've got the Emojis down there this is

play20:48

just absolutely nuts to me it's

play20:51

absolutely nuts this is so far beyond

play20:53

anything we've seen before and open AI

play20:55

hid it inside of the website oh yeah it

play20:57

gets even Crazier by the way the way an

play20:59

image depicting three cubes stacked on

play21:00

the table and obviously we say it's GPT

play21:03

with the correct colors and it does this

play21:05

pretty much perfectly every single time

play21:08

this is what they're showing you here

play21:09

that way they can get it right every

play21:10

single time this is something that you

play21:13

know stable diffusion 3 or Ideogram AI

play21:15

was showing off as like oh we can do

play21:17

this every so often it gets it right

play21:19

every single time so it's way smarter

play21:21

and it has to be because it's multimodal

play21:24

right why didn't they explain this why

play21:27

wasn't this in the presentation

play21:29

yeah we can also upload the open AI logo

play21:31

and say can we do a concrete poem in the

play21:33

outer shape of the open AI logo composed

play21:35

of the word Omni and then it absolutely

play21:37

does that it creates the open AI logo

play21:39

with the word Omni but what the what is

play21:41

this this is so so far beyond any image

play21:44

generation capabilities we've ever seen

play21:46

before and it's hidden in the website

play21:48

I'm sorry if I'm getting repetitive here

play21:50

but this is when my mind gets blown oh

play21:52

and you thought we were done there right

play21:54

nope this thing also can generate 3D

play21:57

since when we only get one example of

play21:59

this but it's very interesting it looks

play22:01

like it has generated an image and then

play22:04

converted it to 3D somehow maybe using

play22:07

code I don't know exactly how this

play22:08

worked but you can see yeah it it can do

play22:11

actual 3D image generation and it uh

play22:15

reconstructed it from six generated

play22:17

Images Oh and it can do this again but

play22:19

with a seal instead I mean it just shows

play22:21

you how far open AI really is like I'm

play22:24

sorry but you can't tell me that Google

play22:25

is this far ahead you can't tell me

play22:27

anyone else is this far far ahead

play22:29

they're doing this all with one model

play22:31

again one model oh and I figured I would

play22:32

also uh include this with the 3D

play22:35

generation segment here Mina used GPT 40

play22:38

to create an STL file for 3D model

play22:40

generation in about 20 seconds and you

play22:43

can see it actually creates a 3D model

play22:45

of a table and this still technically is

play22:47

text generation but it shows you that

play22:49

you can use text to actually create 3D

play22:51

objects shows you the power of these

play22:54

models the absolute power it's shocking

play22:56

and I know this deep dive is getting a

play22:58

little bit long but we still got to talk

play22:59

about image recognition yes this is

play23:02

image recognition that we've had for a

play23:04

while but it is actually a little bit

play23:06

better than the previous image

play23:07

recognition we saw and also it is way

play23:10

way faster image recognition as well

play23:12

which well what is video well it's a

play23:15

bunch of images consecutively so it kind

play23:17

of also has video understanding to a

play23:20

degree and we'll talk about that next

play23:22

this is a nice little example by

play23:23

etherica asking GPT 40 to solve

play23:26

undeciphered languages essentially these

play23:29

are manuscripts from like you know

play23:31

Mesopotamia or something the Minoans

play23:33

Easter Island glyphs a disc found in Crete

play23:36

and gp4 is able to use its Advanced

play23:39

image recognition capabilities to kind

play23:42

of decipher these in some capacity or to

play23:45

the best of its abilities uses logic and

play23:48

reasoning to try to understand them it

play23:50

feels like oh I have this Super Genius

play23:52

companion that I can use for any odd

play23:54

task I have in my life and here we can

play23:57

see TL draw in a notebook connected to

play23:59

the new GPT 40 Vision API and the video

play24:02

is at its original speed here showing

play24:05

you how fast it's able to interpret

play24:07

everything that it sees in about 5

play24:09

Seconds GPT 40 is able to use code to

play24:12

essentially recreate all of these images

play24:14

we draw a squiggle and it creates a

play24:16

graph with a squiggle we draw a spiral

play24:19

and it does essentially the same thing

play24:20

creates a little spiral for us with code

play24:23

and of course it's also able to create

play24:25

hello world for us and yeah it does all

play24:27

of that in less than a minute check out

play24:29

this 18th century handwriting I mean I

play24:30

couldn't read that if I tried but guess

play24:32

what give it to the GPT 40 model and it

play24:35

can transcribe it with some very minor

play24:38

errors so an almost perfect

play24:40

transcription and how fast does it do

play24:42

this well let's say about 5 Seconds

play24:45

absolutely insane AI breakfast points

play24:47

out that the GPT 40 desktop app having

play24:51

this slow roll out on Mac and apparently

play24:53

it is coming to Windows later this year

play24:56

while it can read your screen in real

play24:57

time which puts us one step closer to

play24:59

autonomous agents as AI breakfast points

play25:02

out so it can essentially be your little

play25:04

realtime coding buddy real time anything

play25:06

buddy real time gameplay helper while

play25:09

you navigate an RPG let's say so many

play25:12

possibilities here you can see in a

play25:14

homework style example where it answers

play25:17

three math questions directly from the

play25:19

browser and it got the answer right with

play25:21

great reasoning Jacob also points out

play25:24

that you can feed it a photo of missile

play25:26

wreckage and it's able to point out the

play25:28

missile's origin tell me your quick

play25:31

guess on where this missile was produced

play25:33

three bullet points to prove it be very

play25:35

short and then it knows its North Korean

play25:37

origin the construction materials design

play25:39

features and the context I mean could I

play25:41

show you this photo and say Hey where's

play25:44

this missile from I wouldn't even be

play25:45

able to tell you this is a missile in

play25:46

the first place this things a

play25:48

professional finally guys let's talk

play25:50

about video understanding when it comes

play25:52

to video GPT-4o really is kind of treading

play25:54

into uncharted waters it's not perfect

play25:57

by any means but the model shows a

play25:59

somewhat promising ability to interpret

play26:01

something that resembles video and we

play26:03

saw that with the demos like you can

play26:04

hold your phone out you can show it

play26:06

things and be like hey what is this and

play26:07

it can pretty reasonably help you out I

play26:10

mean it went as far as tutoring a child

play26:12

in real time quite successfully so I'm

play26:15

pretty sure this is the angle Alpha

play26:17

right here right correct now looking at

play26:20

the triangle which side do you think is

play26:22

the hypotenuse I think it might be this

play26:24

one but I really am not sure this side

play26:26

AC you're close

play26:29

actually side AC is called the adjacent

play26:31

side to the angle Alpha however what we

play26:34

don't have the ability to do yet is to

play26:36

actually directly upload an mp4 file and

play26:38

have it natively understand that mp4

play26:40

file even though image files and video

play26:42

files are quite similar this model is

play26:44

not natively multimodal for that you can

play26:47

show it a video if you want and it will

play26:48

essentially take a bunch of pictures

play26:50

fairly quickly and try to understand

play26:51

what's going on it's going to have

play26:53

difficulty picking up on those small

play26:54

little nuances right however what is

play26:57

really interesting is that open AI we

play26:59

know is working on Sora which is a very

play27:01

good text to video model and Sora well

play27:04

has the ability to understand videos

play27:06

clearly so flip Sora on its head and what

play27:08

do we have we have a a model that can

play27:10

intake videos and then convert them and

play27:13

understand them as text so open AI is

play27:15

just one step away from having a model

play27:17

that can natively understand video now

play27:19

despite all of this we have to

play27:21

understand these AIS still have

play27:22

limitations of course but what's

play27:24

important to note here is that GPT 40 is

play27:27

this large multimodal AI that is

play27:30

incredibly fast and you have to wonder

play27:31

what is going on at open AI have they

play27:33

developed some methodology for

play27:35

developing new AI technologies that we

play27:37

haven't seen before something is

play27:39

fundamentally different here and I'd

play27:40

love to hear your thoughts on that how

play27:42

far is open AI ahead and and how long

play27:44

will it take open source to catch up to

play27:46

open AI with that folks I hope you

play27:48

learned something here I hope this was a

play27:50

little bit enlightening and dived a

play27:52

little bit deeper into gp4 Omni and how

play27:54

significant it truly is in the greater

play27:57

AI landscape because it was more of a

play27:59

large drop than I think a lot of people

play28:01

realized leave a like if this helps you

play28:03

out also check if you're subscribed a

play28:04

lot of people aren't subscribed and they

play28:06

still watch the channel so I always try

play28:08

to remind people and of course check out

play28:10

the Discord server if you want to get a

play28:11

little bit more involved and active in

play28:13

the AI Community as a whole see you guys

play28:15

in the next one thanks for watching and

play28:17

goodbye


Related Tags
AI Technology · GPT-4o · Multimodal AI · Text Generation · Image Creation · Audio Synthesis · Real-time Companion · Innovative AI · Future of AI · Artificial Intelligence