They Beat OpenAI to the Punch... But at What Cost?

MattVidPro AI
3 Jul 2024 · 21:48

TLDR: The video explores the capabilities of a new multimodal AI, Moshi, which claims to understand and express emotions. Despite being less intelligent than GPT-4o, Moshi offers real-time voice interaction and is set to be open-sourced, allowing the community to enhance its abilities. The host tests Moshi in various scenarios, including singing and emotion recognition, revealing its limitations but also its potential for improvement. The video also compares Moshi with other AIs, highlighting differences in voice quality and interaction.

Takeaways

  • 😀 The video discusses a multimodal AI model, Moshi, which can listen and speak in real time, similar to the GPT-4o voice demo by OpenAI.
  • 🎉 Despite not being as advanced as GPT-4o, Moshi is accessible for testing and offers real-time conversation with a decent-sounding voice.
  • 🔍 Moshi is not heavily fine-tuned and uses joint pre-training on a mix of text and synthetic audio data, which results in a less intelligent but accessible model.
  • 🌐 The model has a latency of only 200 milliseconds and can run on consumer-grade hardware, making it more accessible.
  • 💡 A significant feature of Moshi is its planned open-source release, which suggests potential for community-driven improvements.
  • 🚀 The video compares Moshi with other AI models like Pi AI and ChatGPT, noting differences in capabilities and voice quality.
  • 🎼 In one test, Moshi attempts to sing a song about butterflies but struggles with the task, highlighting its limitations in creative expression.
  • 🤖 Moshi's understanding of emotion and tonality in the voice is tested and found to be inconsistent, sometimes failing to interpret emotions accurately.
  • 🔄 The video shows Moshi getting stuck in loops and struggling with complex tasks, indicating it is not yet ready for prime time.
  • 🌟 The host expresses hope and excitement for Moshi's future potential once it is open-sourced and improved upon by the community.
  • 📢 The video concludes with a call for thoughts on whether the open-source community can enhance Moshi's capabilities, suggesting a collaborative approach to AI development.

Q & A

  • What is the main topic of the video script?

    -The main topic of the video script is the review of a multimodal AI named Moshi, which claims to understand and express emotions, and a comparison of its capabilities with other AIs like GPT-4o and Pi AI.

  • What is the GPT-4o voice demo mentioned in the script?

    -The GPT-4o voice demo is a demonstration of an advanced AI that allows users to interact with it as if it were a person, capable of understanding emotions and providing human-like voice interaction.

  • What is Moshi's claim to fame according to the script?

    -Moshi claims to be a native multimodal foundation model that can listen and speak in real time, understand and express emotions, and it is set to be released as open source.

  • What is the current limitation of Moshi as presented in the script?

    -The current limitation of Moshi, as presented in the script, is that it is not as intelligent or responsive as GPT-4o and struggles with tasks such as singing and accurately interpreting emotions.

  • What is the significance of Moshi being open source?

    -The significance of Moshi being open source is that it allows the community to access, modify, and improve the AI, potentially enhancing its capabilities and making it more useful in the future.

  • How does the script describe the voice quality of Moshi?

    -The script describes the voice quality of Moshi as decent and somewhat competitive, but not the best the reviewer has ever heard.

  • What is the latency of Moshi's speech generation as mentioned in the script?

    -The latency of Moshi's speech generation is stated to be only 200 milliseconds, which is quite fast.

  • What is the difference between Moshi and Pi AI in terms of interaction, as described in the script?

    -Moshi is described as a true multimodal AI that interacts directly through voice, whereas Pi AI pairs a text model with a separate text-to-speech voice and requires typing for interaction.

  • How does the script compare the singing ability of Moshi and Pi AI?

    -The script suggests that Pi AI, despite not being a multimodal AI, delivered a better song and a more realistic voice than Moshi.

  • What is the reviewer's opinion on the potential of Moshi once it is open-sourced?

    -The reviewer is excited about the potential of Moshi after it becomes open source, hoping that the community will improve its intelligence and capabilities.

Outlines

00:00

🤖 Early Impressions of a Multimodal AI Demo

The script opens by recalling the GPT-4o voice demo, a multimodal AI that can understand and mimic human emotions and speech in real time, with the narrator expressing disappointment at the lack of public access to that technology. It then introduces another AI, Moshi, which, while not as advanced, is accessible for testing and has a pleasant voice. Moshi's ability to express and understand emotions is highlighted, along with its joint pre-training on text and audio data. The potential for Moshi to be improved once it is open-sourced is a significant point of discussion.

05:00

🎀 Testing Mhi AI's Emotional Recognition and Singing Abilities

The narrator engages with Moshi to test its ability to recognize emotional tonality in the human voice and to sing. Despite the AI's claim to understand emotions, it struggles to accurately identify the narrator's emotional state. The AI is also asked to sing a song about butterflies, which it attempts to do, albeit with some awkwardness. The AI's limitations are evident, and the narrator expresses frustration with its performance, comparing it unfavorably to other AI technologies.

10:01

🔊 Comparing Moshi with Other Text-to-Speech Models

The script compares Moshi with other AI models, specifically Pi AI and ChatGPT, in terms of their text-to-speech capabilities and their ability to generate content like a song about butterflies. While Pi AI is praised for its realistic voice and the quality of its song, Moshi is found lacking in both the quality of its voice and its conversational abilities. The narrator also notes Moshi's inability to sing out loud and its struggle with dynamic performance, suggesting that it is not yet a competitor to more advanced AI models.

15:02

📝 Seeking Storytelling Advice from Moshi

The narrator seeks advice from Moshi on writing a story, discussing the need for a protagonist, challenges, and a clear goal. Moshi provides generic advice that aligns with basic storytelling principles. The conversation takes a humorous turn when the AI suggests using the challenge of communicating with a 'really stupid AI' as a plot point. The AI's ability to understand and respond to the narrator's emotions is tested again, with mixed results, highlighting its inconsistent performance.

20:03

🔍 Reflecting on Moshi's Potential and Open-Source Future

The script concludes with the narrator reflecting on Moshi's current state and its potential after being open-sourced. Despite its current shortcomings, the narrator is optimistic about the improvements the open-source community could bring to Moshi. The video ends with the narrator expressing excitement for the future of multimodal AI and a tease for an upcoming AI news recap, inviting viewers to stay tuned for more content.

Keywords

GPT-4o

GPT-4o refers to OpenAI's advanced multimodal AI system mentioned in the video script. It is portrayed as having human-like conversational abilities, including understanding emotions, which is a central theme of the video. The script discusses a demo where GPT-4o could interact with users in a very natural and engaging manner, setting a benchmark for the capabilities of AI in communication.

multimodal model

A multimodal model in the context of AI refers to a system that can process and understand multiple types of data or 'modalities', such as text, audio, and visual inputs. The video discusses Moshi, which is described as a native multimodal foundation model capable of listening and speaking in real time, indicating a step towards more integrated and human-like AI interactions.

open source

Open source in the video script refers to the practice of making the source code of a piece of software or a model freely available for anyone to use, modify, and distribute. The script highlights that Moshi will be released as open source, suggesting that the community can contribute to its development and potentially enhance its capabilities, which is a significant aspect of the video's narrative about the future of AI.

emotions

Emotions in the video script are discussed in relation to the AI's ability to understand and respond to human emotional states. The capability to recognize and express emotions is a key feature of the GPT-4o demo and is also tested with Moshi, though with less success, indicating the complexity and importance of emotional intelligence in AI development.

text to speech

Text to speech (TTS) is a technology that converts written text into spoken words. The video script mentions TTS in the context of AI systems like Pi AI and ChatGPT, which generate audio from text outputs. This technology is crucial for AI systems aiming to communicate audibly with users, as demonstrated in the comparisons made within the script.
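
To make the idea concrete, here is a minimal text-to-speech sketch in Python. It is only an illustration, not something shown in the video: it assumes the third-party pyttsx3 package, which drives the operating system's built-in voices rather than a neural voice like the ones compared in the script.

    # Minimal TTS sketch (assumes: pip install pyttsx3).
    # Converts a short string into spoken audio via the OS speech engine.
    import pyttsx3

    engine = pyttsx3.init()                  # create a TTS engine backed by the system voices
    engine.setProperty("rate", 160)          # speaking rate in words per minute (illustrative value)
    engine.say("Here is a short song about butterflies.")  # queue the text to be spoken
    engine.runAndWait()                      # block until playback finishes

Voice modes built on separate TTS models, as the script describes for Pi AI and ChatGPT, work on the same principle, just with neural voices that sound far more natural.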

AI intelligence

AI intelligence in the script refers to the cognitive capabilities of AI systems, such as understanding, reasoning, and learning. The video discusses the varying levels of intelligence in different AI models, with Moshi described as less advanced than models like GPT-4o or ChatGPT, emphasizing the ongoing development and competition in the field of AI.

bedtime story

A bedtime story is a narrative told to someone, usually children, before they go to sleep. In the script, the host asks the AI for a bedtime story about robots and love, serving as an example of the AI's creative and interactive capabilities and highlighting the human-like qualities that are sought after in AI development.

singing

Singing in the video script is used as a test of the AI's ability to generate and perform music. The AI's attempt to sing a song about butterflies is a way to evaluate its creative expression and the quality of its text-to-speech capabilities, showcasing the potential for AI to engage in more artistic and emotive tasks.

queue

A queue in the context of the video refers to a waiting list of people waiting for their turn to access a service or product. The script mentions that to test Moshi, one has to join a queue with their email, illustrating the demand and access-control mechanisms around new and popular AI technologies.

Foundation model

A foundation model in AI denotes a type of pre-trained model that can be fine-tuned for various tasks. The script refers to Moshi as a 'native multimodal foundation model', indicating that it is designed to be versatile and adaptable to different applications, which is a key aspect of modern AI development strategies.
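
As a loose illustration of the pre-train-then-fine-tune pattern described above (not taken from the video), the sketch below loads a generic pre-trained checkpoint with the Hugging Face transformers library and attaches a task-specific head; the checkpoint name and the two-label task are assumptions made purely for the example.

    # Sketch: adapting a pre-trained foundation model to a downstream task.
    # Assumes the Hugging Face transformers (and PyTorch) packages are installed;
    # "distilbert-base-uncased" is just an example checkpoint, not the model from the video.
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    checkpoint = "distilbert-base-uncased"
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

    # Fine-tuning would then update these weights on task-specific labelled data,
    # e.g. with the Trainer API or a plain PyTorch training loop.
    inputs = tokenizer("Moshi tries to sing about butterflies.", return_tensors="pt")
    outputs = model(**inputs)  # raw logits from the not-yet-fine-tuned classification head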

Highlights

GPT-4o's voice demo showcased an AI that can understand and express human-like emotions.

Disappointment at the lack of public access to GPT-4o's voice mode after its demo.

Introduction of a similar technology that has been released as a demo, accessible to the public.

The new AI, Moshi, is not as smart as GPT-4o but has a good-sounding voice and real-time conversation capabilities.

Moshi is a native multimodal foundation model that can listen and speak, similar to GPT-4o.

Moshi's ability to express and understand emotions is mentioned, though not yet demonstrated.

Moshi uses joint pre-training on text and synthetic audio data, built on the Helium 7B LLM.

Moshi's speech generation has a latency of only 200 milliseconds, and the model can run on consumer-grade hardware.

Moshi will be released as open source, allowing the community to improve and customize it.

A live demo of Moshi's conversational abilities and its struggles with understanding emotions.

Moshi's attempt to sing a song about butterflies, showcasing its text-to-speech capabilities.

Comparison of Moshi's performance with other AIs like Pi AI and ChatGPT in terms of voice and interaction.

Moshi's difficulty in understanding and responding to emotional cues in the user's voice.

The potential for the open-source community to enhance Moshi's intelligence and capabilities.

The contrast between Moshi's current state and the anticipated capabilities of GPT-4o.

The host's experience and thoughts on Moshi's potential once it is open-sourced.

Final thoughts on the exploratory nature of Moshi and its current limitations.