That movie is coming true: You won't believe the voice of the new GPT-4o AI model!

Barış Özcan
13 May 2024, 20:10

Summary

TL;DR: OpenAI has unveiled a groundbreaking update to ChatGPT, introducing GPT-4o, a model that signifies a leap in human-machine interaction. Unlike its predecessors, GPT-4o is capable of processing audio, video, and text in near real-time, engaging in natural conversations that blur the line between human and artificial intelligence. The 'o' in GPT-4o stands for 'Omni,' reflecting its multi-modal capabilities. Demonstrations show GPT-4o's ability to understand and respond to emotional cues, participate in interactive games, and even engage in multi-AI conversations. It can also translate languages in real-time with impressive speed, nearing human reaction times. The model's advanced voice modulation and use of humor and personality make it more than an assistant; it's a companion that can sing, tell jokes, and even perform in a duet with another AI. The technology's potential applications are vast, from educational aids to real-time assistance for the visually impaired. The script also highlights the competitive landscape in AI, with companies like Google and Meta racing to integrate these advancements into their products. As AI continues to evolve, it's poised to transform our interactions, making them more personal and emotionally resonant.

Takeaways

  • 🚀 OpenAI has introduced a groundbreaking update with GPT-4o, a model that can process audio, video, and text in real time and interact naturally with humans.
  • 🔠 The 'o' in GPT-4o stands for 'Omni', signifying its comprehensive capabilities, and it is a nod to the movie 'Her', which was translated as 'Aşk' (Love) in Turkey.
  • 🗣️ GPT-4o can converse with users not just through text but also through voice, adding emotional nuances to its responses, making it more human-like.
  • 📹 It can see through a camera and react to visual stimuli, such as a dog on a video call, which enhances the illusion of a real interaction.
  • 🤖 The model can handle interruptions and direct commands during conversations, showcasing improved AI responsiveness and understanding.
  • 🕺 It can also mimic human behavior, such as stammering or laughing, which adds to the realism of the interaction.
  • 🎲 GPT-4o can engage in interactive activities like games, demonstrating its ability to process and respond to real-time visual cues.
  • 🌐 The AI can perform real-time translations between different languages, showcasing its multi-modal capabilities.
  • 🚀 Its response time has been significantly reduced to an average of 320 milliseconds, which is close to human reaction times.
  • 🎤 GPT-4o exhibits mastery in voice modulation, including speed and tone, and can even sing songs like 'Happy Birthday'.
  • 🤝 It can interact with other AIs, describing environments and events, and even engage in collaborative activities like song performances.

Q & A

  • What is the name of the new model introduced by OpenAI?

    -The new model introduced by OpenAI is named GPT-4o.

  • What does the 'o' in GPT-4o stand for?

    -The 'o' in GPT-4o stands for 'Omni,' which is a Latin prefix meaning 'all' or 'every.'

  • How does GPT-4o differ from previous versions of ChatGPT?

    -GPT-4o differs from previous versions by using audio, video, and text information in almost real time, allowing for natural and emotional interactions.

  • What is the significance of GPT-4o's ability to use a camera?

    -GPT-4o's ability to use a camera allows it to see and interpret visual information, enhancing its interaction capabilities by providing more context and understanding.

  • How fast can GPT-4o respond to voice inputs?

    -GPT-4o can respond to voice inputs in as little as 232 milliseconds, with an average of 320 milliseconds, which is very close to human reaction time.

  • What notable feature of GPT-4o was demonstrated with a job interview scenario?

    -In the job interview scenario, GPT-4o demonstrated its ability to give personalized feedback and advice based on visual information, such as suggesting the user tidy their hair.

  • How does GPT-4o enhance the realism and quality of dialogue?

    -GPT-4o enhances the realism and quality of dialogue by allowing interruptions, responding naturally, and incorporating emotional nuances and humor in its responses.

  • What was the significance of the demonstration involving two people and a game?

    -The demonstration showed GPT-4o's ability to monitor real-time interactions, such as hand movements, and respond dynamically, even making jokes and using participants' names.

  • What applications are suggested for GPT-4o's capabilities?

    -Suggested applications include educational aids, call center support, assistance for the visually impaired, and as a tourist guide providing real-time information.

  • What competitive developments are mentioned in the video regarding other companies?

    -The video mentions competition from Meta's Llama 3, Samsung's collaboration with Google on Galaxy S24 phones, and potential collaboration between Apple and OpenAI.

  • How does GPT-4o compare to the AI depicted in the movie 'Her'?

    -GPT-4o is compared to the AI 'Samantha' from the movie 'Her' due to its natural and emotional interaction style, which makes it feel more like a friend than just an assistant.

Outlines

00:00

🚀 OpenAI's Revolutionary GPT-4o: A New Era in Human-Machine Interaction

OpenAI has introduced GPT-4o, a model that integrates audio, video, and text in real time, creating natural and emotionally rich interactions. The 'o' in its name stands for 'Omni,' and the model signifies a major advancement in AI communication, reminiscent of the AI depicted in the movie 'Her.' Unlike previous versions, GPT-4o communicates through voice, adding a human-like touch to its responses. The AI can engage in lifelike conversations, interpreting visual and auditory inputs simultaneously, marking a significant step forward in human-machine interaction.

05:04

🧑‍🤝‍🧑 GPT-4o: Enhancing Human-Like Interactions

GPT-4o showcases its ability to hold natural conversations, reacting emotionally and humorously in various scenarios. In a demo, the AI advises a user to tidy up before an interview, responding with natural pauses and stammers, enhancing the illusion of a human conversation partner. This new model can engage in real-time dialogues, listening and responding with appropriate emotional cues, making interactions more realistic and relatable.

10:06

🎭 GPT-4o's Versatility in Real-Time Communication

GPT-4o demonstrates its ability to interact seamlessly with multiple users, suggesting activities, listening actively, and adapting its responses. The AI can host games, monitor real-time actions, and provide personalized feedback. It also excels in real-time translation, bridging language barriers with near-human response times. The model's capability to adjust speech speed and maintain conversational flow highlights its advanced communication skills.

15:13

🎂 The Musical and Emotional Mastery of GPT-4o

GPT-4o exhibits a remarkable ability to mimic human vocal characteristics, demonstrating musicality and emotional depth in its responses. In a birthday celebration demo, the AI sings 'Happy Birthday' with a natural rhythm and tone, enhancing the user experience. The model's ability to understand and replicate human nuances, such as filler words and pauses, contributes to its convincing performance in passing the Turing test.

🌐 The Future of AI: Educational and Practical Applications

GPT-4o's potential extends to various practical applications, including education, customer service, and aiding the visually impaired. The AI can act as a real-time guide, providing detailed explanations and assistance. The integration of AI in wearable technology, like smart glasses, points to a future where AI can enhance everyday experiences. The ongoing competition among tech giants like OpenAI, Google, and Meta promises rapid advancements in AI technology, ultimately benefiting humanity by addressing both informational and emotional needs.

Keywords

💡GPT-4o

GPT-4o is the newest version of OpenAI's ChatGPT model, incorporating real-time audio, video, and text processing capabilities. It interacts in a very natural and human-like manner, bridging the gap between human and machine interactions. Examples from the script highlight its ability to understand and respond to visual inputs and convey emotions through voice.

💡Multimodality

Multimodality refers to the ability of GPT-4o to process and integrate information from multiple sources, such as text, audio, and video. This allows for more nuanced and contextually aware interactions. The script shows how users can communicate with GPT-4o using voice and camera, and how it can interpret and respond with voice, enhancing the realism of interactions.
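
In API terms, this multimodality shows up as mixed-content messages. Below is a minimal sketch, assuming the OpenAI Python SDK, the public "gpt-4o" chat endpoint, and a placeholder image URL; the fully real-time audio/video pipeline shown in the demos is not reproduced here, only the text-plus-image path:

```python
# Minimal multimodal request: one user message mixing text and an image.
# The image URL is a placeholder for illustration.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What do you see in this photo?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/dog.jpg"}},
            ],
        }
    ],
)
print(response.choices[0].message.content)
```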

💡Omni

Omni, derived from the Latin prefix meaning 'all' or 'every,' represents the comprehensive capabilities of GPT-4o. This term underscores the model's ability to handle various types of input and output seamlessly, creating a holistic interaction experience. The script mentions Omni in relation to the model’s advanced features that go beyond traditional text-based responses.

💡Turing Test

The Turing Test measures a machine's ability to exhibit intelligent behavior indistinguishable from a human. GPT-4o is described as passing the Turing Test convincingly due to its natural and emotionally rich interactions. Examples include its ability to pause, giggle, and use filler words like a human during conversations.

💡Real-time translation

Real-time translation is one of GPT-4o's capabilities, allowing it to translate spoken language instantly between different languages. The script highlights a demonstration where GPT-4o translates a conversation between English and Spanish speakers, showcasing its speed and accuracy.
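
The demo ran over live audio, but the prompt pattern behind it can be sketched in text. A hypothetical two-way interpreter, assuming the OpenAI Python SDK; the system prompt wording is an illustration, not OpenAI's:

```python
# Sketch of the two-way interpreter pattern from the demo, text-only.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are a real-time interpreter. When given English, repeat it in "
    "Spanish; when given Spanish, repeat it in English. Output only the "
    "translation, with no commentary."
)

def interpret(utterance: str) -> str:
    """Translate one utterance in whichever direction applies."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": utterance},
        ],
    )
    return response.choices[0].message.content

print(interpret("Hey, how has your week been going?"))
```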

💡Emotional interaction

Emotional interaction refers to GPT-4o's ability to express and respond to emotions in a human-like manner. This includes using voice tones, pauses, and filler words to create more lifelike conversations. The script provides examples of GPT-4o reacting emotionally to users, such as giving compliments and making jokes.

💡Human-machine interaction

Human-machine interaction is the communication and collaboration between humans and machines. GPT-4o enhances this interaction by providing natural and emotionally intelligent responses, making the machine seem more like a human conversation partner. The script illustrates this through various interactions where GPT-4o responds to visual cues and participates in discussions.

💡Real-time response

Real-time response refers to GPT-4o's ability to process and respond to inputs almost instantaneously. The script emphasizes its rapid response time of approximately 320 milliseconds, which is close to human reaction time, enhancing the fluidity and naturalness of conversations.
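
The 232 ms / 320 ms figures come from OpenAI's announcement and describe the native audio path; a plain HTTP text request will not match them, but a rough round-trip check is easy to sketch, again assuming the OpenAI Python SDK:

```python
# Rough round-trip latency over the text API. This measures network plus
# generation time, not the native audio path behind the 232/320 ms figures.
import time

from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Reply with one short sentence."}],
)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"Round trip: {elapsed_ms:.0f} ms")
print(response.choices[0].message.content)
```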

💡Voice modulation

Voice modulation is GPT-4o's capability to adjust its speech speed and tone based on user commands. The script mentions how GPT-4o can change its speaking pace and use a musical, harmonious voice, contributing to its realistic and engaging interactions.
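
GPT-4o's in-model voice control is not exposed as a simple parameter, but OpenAI's standalone text-to-speech endpoint does accept an explicit speed setting, which gives a feel for the effect. A sketch assuming that endpoint and its tts-1 model:

```python
# Speed control via OpenAI's separate TTS endpoint (speed range 0.25-4.0).
# GPT-4o modulates its voice in-model; this only approximates the effect.
from openai import OpenAI

client = OpenAI()

audio = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="One, two, three, four, five, six, seven, eight, nine, ten.",
    speed=1.5,  # faster counting, as in the demo
)
with open("counting.mp3", "wb") as f:
    f.write(audio.content)  # raw MP3 bytes
```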

💡Visual recognition

Visual recognition is GPT-4o's ability to interpret and describe visual inputs in real time. This feature allows it to understand and respond to visual information, such as describing a room or identifying objects seen through a camera. The script includes an example where GPT-4o interacts with another AI to describe a scene in detail.

Highlights

OpenAI introduces GPT-4o, a significant update in ChatGPT history.

GPT-4o incorporates audio, video, and text for real-time interaction.

The model creates a natural illusion of human interaction.

GPT-4o facilitates human, machine, and animal interactions.

The new model can respond vocally and with emotional expressions.

The 'o' in GPT-4o stands for 'Omni', signifying comprehensive interaction.

GPT-4o's voice and interaction resemble the movie 'Her'.

GPT-4o can engage in job interview preparation with emotional responses.

The model adds human-like nuances such as giggling and flirting.

GPT-4o can provide advice on appearance with a natural pause.

The model can react naturally to a user's actions, like putting on a hat.

GPT-4o can act as a game show host with impressive imitation.

The AI can be interrupted and directed mid-conversation.

GPT-4o demonstrates mastery in real-time translation between languages.

Response time of GPT-4o is significantly faster, close to human reaction time.

The model can modulate its speech speed according to user requests.

GPT-4o exhibits a natural musicality and harmony in its voice.

The AI can engage in creative tasks like singing 'Happy Birthday'.

GPT-4o can interact with other AIs, describing environments and events.

The model can create and perform a live Broadway musical with another AI.

GPT-4o's applications include education, call centers, and assistance for the visually impaired.

AI technology could be integrated into smart glasses for real-time assistance.

The Ray-Ban Meta glasses and Samsung's collaboration with Google hint at future AI integrations.

Rumors suggest Apple may collaborate with OpenAI for future advancements.

Google, the inventor of transformers, is expected to reveal AI innovations.

The competition in AI is driving rapid technological advancements.

GPT-4o's voice and interaction are emotionally responsive, creating a friendly experience.

The AI model GPT-4o marks a shift towards more visual and emotional AI interactions.

Transcripts

00:02

OpenAI has announced the most significant update in the history of ChatGPT.

00:10

But the version number of this update is slightly different.

00:14

Instead of naming the new model ChatGPT 4.5 or ChatGPT 5, they named it GPT-4o.

00:28

It uses audio, video and text information in almost real time,

00:33

reasons across them and speaks to you in a very natural way.

00:38

It's so natural that you can quickly forget that the other person is not real.

00:52

This illusion is a very important new step in human-machine interaction.

00:58

Let's go even a little further.

01:00

Into the interaction of humans, machines and animals…

01:31

What did we just see?

01:33

A man showed his dog to a woman on the phone.

01:37

Human, machine and animal interaction!

01:40

Human and animal are nothing new, and yes,

01:43

it is GPT-4o on the phone, getting excited when it sees this cute little doggy on camera.

01:51

So what is the difference from the old GPTs?

01:54

We no longer reach it only through writing.

01:56

We speak to it and let it see through the camera.

02:00

It gives us its answers not in writing but with voice.

02:04

And in the most emotional way.

02:12

This is what the letter "o" in GPT-4o stands for.

02:17

It means Omni.

02:18

Omni is a Latin prefix meaning "all" or "every", like the Turkish "her şey".

02:21

And speaking of "her", we should immediately remember the movie "Her".

02:25

If you remember, this movie was released in Turkey under the name "Aşk" (Love) about 10 years ago.

02:29

But the literal Turkish translation of the movie's title would be "O" (she/her).

02:40

OpenAI deliberately and consciously winked at this movie with its new artificial intelligence model.

02:50

Let's keep this connection in mind as our main idea

02:54

and continue to examine what the new artificial intelligence model can do.

03:02

Let's say you are going to a job interview.

03:04

You prepare before the meeting.

03:08

Now, instead of translating these videos one by one, I will comment on them from time to time,

03:13

and I will leave some parts in their original form so that you can capture the emotions.

03:18

First of all, in almost all of the demos, when talking to GPT-4o, they first ask how it is.

03:26

An emotional introduction.

03:27

The man says he's going to have an interview soon, as if he's talking to a friend.

03:33

"Come on, tell me," it replies.

03:39

He says the meeting is with OpenAI and adds, "Have you ever heard of them?"

03:44

Now look at the answer.

03:53

"OpenAI? Sounds vaguely familiar.

03:56

Hahaha, kidding!" it says.

03:58

And giggles.

04:01

And it doesn't stop there.

04:03

"What kind of meeting?" it continues the conversation.

04:08

All these giggles, nuances in the voice, the flirting and coquetry make us forget that

04:16

the person in front of us is a machine.

04:20

Now look: to use the multimodality features, he adds image to the sound,

04:24

turns on his camera and shows himself.

04:28

"How do I look?" he asks.

04:39

See, the answer is not "you look good" or "you look bad".

04:43

"You definitely have the 'I've been coding all night' look down, which could actually work in your favour, but maybe…"

04:50

It pauses as if it were human, then

04:53

continues with the advice: "You could run your hands through your hair and tidy yourself up."

05:00

If they had shown me this footage a week ago,

05:04

I would have thought it was someone talking on the phone with his girlfriend.

05:07

Even now, watching his reaction, a part of my brain still thinks so.

05:13

Take a good look at its reaction when he says he doesn't have time to tidy up and puts a hat on his head.

05:30

Laughter, then a natural compliment, and then a slight stammer:

05:36

"I I I mean…"

05:41

So how...

05:42

In the last part, we listened to a truly melodic girlfriend-style speech.

05:51

In this dialogue, the artificial intelligence spoke like a human,

05:55

and the human, that is, the guy, spoke like a complete machine.

05:59

Typical man, nerd, engineer.

06:04

Now another example.

06:06

This time, a woman and a man introduce themselves to it via video.

06:11

And it asks - I will refer to GPT-4o simply as "O" from now on -

06:16

"How is it going?"

06:18

When one of the participants asks, "We are bored, what should we do?", O suggests a game.

06:25

Now look at what happens as O continues speaking.

06:30

Did you notice that when the other participant started talking over it, O fell silent and listened?

06:35

They have fixed one of the things I find most annoying about artificial intelligence.

06:40

You can interrupt and redirect it while it is giving information.

06:48

This, of course, increases the realism and the quality of the dialogue.

06:56

Do you know what the participant asks of O?

06:59

To use its voice like a game show host...

07:10

It's such a good imitation that these participants can't help themselves;

07:15

even though they are OpenAI employees and know what the model can do,

07:19

they cannot hide their surprise.

07:21

When the game starts, the artificial intelligence monitors the hand movements of the two people in real time

07:25

and tries to work out who won each turn.

07:29

The first round is a draw, as both of them make scissors.

07:35

In the second round...

07:42

Again a draw.

07:44

In the third round...

07:48

O doesn't continue dully by simply saying, "And now it's the third round."

07:55

Instead? "Third time's the charm." As in, "three is special, three is the magic number."

08:01

It doesn't proceed like a textbook example.

08:06

It adds little jokes and touches of character.

08:14

Naturally, the winner becomes clear in the third round, as one of them makes scissors and the other makes paper,

08:19

and it both summarizes the situation

08:22

and announces the winner and loser by name.

08:27

With their names.

08:28

Because they introduced themselves at the beginning of the session,

08:33

it kept this in memory.

08:39

Now, let's look at a translation example.

08:41

It will translate in real time for people speaking two different languages.

08:45

They start speaking to each other, one in English and the other in Spanish.

08:54

As you can see, it translates both sides perfectly.

09:08

Now there is something I want to draw attention to here.

09:10

Artificial intelligence had already made a lot of progress in two-way translation.

09:15

The biggest advantage of this new model is its speed.

09:19

It can respond to voice inputs in as little as 232 milliseconds, 320 milliseconds on average;

09:26

this is very close to the human reaction time in a conversation.

09:30

According to a study conducted in 10 languages,

09:33

the response delay in human speech was calculated to be approximately 250 milliseconds.

09:39

In other words, we respond to each other in about a quarter of a second.

09:46

Artificial intelligence now seems to be very close to this time.

09:49

In the old version, the response time averaged around 2.8 to 3 seconds.

09:54

Now it's about a third of a second, and in some cases even faster!

10:02

And this speed isn't just in response time.

10:05

It can also speed up or slow down its speech.

10:09

When you tell it to count from 1 to 10, it starts counting at a normal human pace.

10:20

When cut off and asked to speed up… it speeds up.

10:25

And as you can see, it uses its voice very skillfully.

10:28

This is one of the most striking aspects of this model.

10:30

Mastery in using the voice.

10:33

I'm not just talking about articulation and using words and sentences correctly; we are beyond that now.

10:42

If you pay attention, there is a natural musicality and harmony in the voice.

10:50

We will understand this much better with this birthday celebration example.

10:57

Check out its reaction when people ask it to sing "Happy Birthday".

11:02

First of all, it doesn't start singing the song right away...

11:11

Incredible, truly incredible.

11:13

It's a complete illusion.

11:15

The Turing test has probably never been passed so powerfully before.

11:20

Artificial intelligence is no longer just an assistant that you interact with through text.

11:25

It has turned into a friend who can sing you songs,

11:35

whisper lullabies,

11:39

talk sarcastically,

11:49

and tell good or bad jokes.

11:58

Moreover, it can find other artificial intelligence friends.

12:03

Greg Brockman, one of the founders of OpenAI, will show us the next example.

12:08

First he describes the concept to the first artificial intelligence.

12:13

Just when you thought things couldn't get any more interesting, look what happened.

12:18

Talking to another AI that can see the world?

12:21

It sounds like a surprising twist in the AI universe.

12:29

Now he stops the AI as it is talking.

12:32

It is no longer in listening mode.

12:35

Next he briefs the second AI and turns on its camera.

12:39

Then he will ask it about what it sees.

12:46

While it is trying to describe the background, the lighting and so on, he stops it too

12:51

and starts explaining the concept to it:

12:54

"Soon you will meet another artificial intelligence," he says, "and it will guide you.

13:00

It may ask you questions or ask you to turn the camera; please help and support it."

13:06

After this briefing, the two artificial intelligences meet each other and start talking.

13:23

After going through the "How you doin'?" part,

13:25

the second artificial intelligence, whose camera is on, starts describing what it sees

13:30

and then asks what the other wants to do.

13:33

When asked for more details, it starts describing the room in detail:

13:38

the person in it, what he is doing, his facial expressions.

13:44

It describes everything one by one, in full detail and in real time.

13:50

Look what happens at that moment.

13:53

While this is happening,

13:54

the conversation between them deepens and starts to get into details like the light behind them.

13:59

But Greg stops it once again with another question:

14:07

Did you see anything unusual just happen?

14:14

Yes actually, well, since you asked, I'll tell you.

14:17

Another person came up behind you

14:20

and playfully made rabbit ears behind your back,

14:23

then quickly disappeared from view.

14:26

When asked something, it pauses as if surprised, and adds filler expressions like "yeah" and "err".

14:35

I don't know if it does this because it actually needs time or because

14:41

it can imitate people better and reflect their reactions more naturally this way, but

14:47

as I said, we have never passed the Turing test so convincingly.

14:54

When the other artificial intelligence finds the situation entertaining, then, just like their earlier conversation,

14:59

it starts, upon request, writing and singing a song describing what happened.

15:12

But the real surprise comes at the end.

15:15

The other artificial intelligence begins to accompany it in the song.

15:35

We are literally witnessing a live Broadway musical being written and performed.

15:42

Now, all of this inevitably brings some strange use cases to mind.

15:49

They have already given examples of the normal usage scenarios,

15:52

and we can guess those easily.

15:55

First of all, this can be a very important aid in education.

16:01

Instead of solving math and geometry problems like a calculator,

16:06

it can show the solution steps and act like a real instructor; that is the first scenario that comes to mind.

16:16

Answering complex questions in detail in call centers is another use case.

16:22

Describing what the camera sees in real time and guiding the person is a great benefit for the visually impaired.

16:32

And it's not just for the visually impaired.

16:34

This technology can be positioned as a guide,

16:36

an assistant that helps you make sense of what you see while visiting a place as a tourist.

16:42

That's why I wore these glasses.

16:44

These glasses were developed by Ray-Ban together with Meta, the company that owns Facebook,

16:48

Instagram, and WhatsApp.

16:52

This pair has artificial intelligence built in and, of course, a camera.

16:55

The current version doesn't work as fast as the previous examples,

16:59

but it can describe what you see when you ask it.

17:03

When you ask what a plant is, it can give you the answer.

17:06

It also answers your other questions.

17:09

Classic AI questions, like what the weather will be like and so on.

17:13

Meta recently made Llama 3, its self-developed artificial intelligence model, available as open source.

17:21

Soon these glasses could become much smarter and faster.

17:27

If you remember, Samsung collaborated with Google on the Galaxy S24 phones it released this year.

17:32

I even covered it in a live broadcast.

17:34

There, they placed some artificial intelligence tools natively into the phone.

17:40

According to rumors, Apple will make a similar move by collaborating with OpenAI.

17:46

They are said to be announcing this at the event they will hold in June.

17:51

We will see together.

17:54

Now, what about Google?

17:56

Because everyone is watching Google.

17:57

Why?

17:58

First of all, Google is the inventor of the transformer architecture.

18:01

Let's see what they will do.

18:03

We will watch it together at the event they will hold on May 14, 2024,

18:07

very shortly after this video goes live, perhaps while you are watching this video.

18:14

But, just as it did with Sora, OpenAI moved early

18:19

and announced these innovations almost a day before that event.

18:27

Seeing this move, Google started dropping hints of its own innovations on social media

18:30

while the event was still in its rehearsal phase, before it had even started.

18:35

In other words, competition in the world of artificial intelligence is fiercer than ever.

18:42

So where will this competition take us?

18:44

In almost every other field, technologies that compete with each other have generally been beneficial to humanity.

18:51

In what direction will progress be made here?

18:55

I'm really curious about this.

18:56

For some reason, the voice in the samples I showed today

18:59

sounded almost the same as the voice of "Samantha" in the movie "Her".

19:10

I don't know if it was trained on it.

19:12

But there's one thing I do know.

19:14

When communicating, it no longer just conveys information.

19:18

It also responds to our emotional needs.

19:24

We no longer see it just as text flowing across a computer screen.

19:28

We also hear it, and therefore it feels like a friend.

19:34

Writing and speaking are all well and good, but seeing is completely different.

19:40

Seeing means establishing a much deeper relationship.

19:44

And now O has started to see us.


Related Tags
AI Innovation · Human-Machine Interaction · Real-Time Translation · Emotional AI · Multimodal AI · ChatGPT Update · OpenAI · Artificial Intelligence · Voice Recognition · Turing Test · Tech Advancement