How To Build & Test AI Voice Agents (Vapi x Make x GPT-4o)

Jonas Cohrs
5 Jun 2024 · 23:07

TLDR: In this video, Jonas from Agentic Ventures demonstrates how to set up an AI voice assistant to automate customer service for a sushi restaurant. He discusses the current capabilities of AI voice systems and recent improvements, particularly the new GPT-4o model, which is faster and more cost-effective than its predecessor. Jonas outlines the process of creating an AI assistant using Vapi, Make, and Google Sheets, emphasizing the importance of effective prompting and testing. The video includes a practical example: he tests the assistant with multiple customer orders and analyzes the results to calculate an accuracy score. Jonas concludes by highlighting the potential of AI voice assistants and the upcoming advancements he expects to reshape the industry.


  • 😀 Voice AI is becoming more powerful and is improving week by week, making widespread use more realistic.
  • 🤖 The newest AI model, GPT-4o, has increased voice and vision capabilities and is faster and cheaper than its predecessor.
  • 🔍 GPT-4o can respond in an average of 320 milliseconds, similar to human responses, and has superior speech recognition and translation performance.
  • 🎙️ AI voice systems can now act on information given to them, executing functions and acting autonomously.
  • 📋 The script provides examples of functions AI voice assistants can perform, such as scheduling appointments and controlling smart home devices.
  • 🍣 The example of automating inbound phone calls for a sushi restaurant is used to illustrate the setup and testing of an AI voice assistant.
  • 👥 Challenges faced by restaurants, such as high labor costs and language barriers, can be addressed with an AI voice assistant.
  • 🛠️ Prompting is crucial for creating an accurate and reliable AI voice assistant, with the script offering a guide for structuring prompts.
  • 📈 Systematic testing is necessary to ensure the reliability and accuracy of AI assistants, with the script detailing a method for testing and validation.
  • 📊 The script demonstrates how to use platforms like Vapi, Make, and Google Sheets to build, test, and analyze an AI voice assistant.
  • 📝 The importance of tracking and comparing information from calls to ensure accurate execution of functions is highlighted.

Q & A

  • What is the main purpose of the video by Jonas from Agentic Ventures?

    -The main purpose of the video is to demonstrate how to set up an AI Voice Assistant to automate inbound phone calls, using the example of a sushi restaurant, and to show the results of testing this system multiple times.

  • What challenges are restaurants typically facing that an AI Voice Assistant could help solve?

    -Restaurants often face challenges such as high labor costs, staff shortages, disruptions due to phone calls, difficulties handling customer inquiries during peak times, handling phone calls outside of business hours, language barriers, and the need for additional staff to handle phone calls due to language issues.

  • What was the newest model released by OpenAI that Jonas mentioned in the video?

    -The newest model released by OpenAI mentioned in the video is GPT-4o, which has increased voice and vision capabilities and is twice as fast as the previous model, GPT-4 Turbo, while also being 50% cheaper for both input and output tokens.

  • What are the key features of the upcoming single end-to-end model that includes text, vision, and audio?

    -The upcoming single end-to-end model will include capabilities for text, vision, and audio in one model, allowing for more natural conversation responses with appropriate tone, emotion, expressiveness, and the ability to converse, sing, and handle interruptions.

  • What is the average response time of the new model mentioned in the video, and how does it compare to human responses?

    -The new model mentioned in the video has an average response time of 320 milliseconds, which is similar to human responses in natural conversation.

  • What is the significance of using a pipeline of three separate models for setting up an AI Voice Assistant?

    -Using a pipeline of three separate models (speech to text, language model, and text to speech) allows for the conversion of audio to text, processing of the text for a response, and then converting the response back to audio. However, this can result in the loss of information between models and limit the assistant's capabilities.

  • What are some of the functions that AI voice systems can perform autonomously based on the information given to them?

    -AI voice systems can perform functions such as scheduling appointments, tracking orders, collecting feedback, checking product inventory status, controlling smart home devices, and conducting user interviews, among other use cases.

  • What is the importance of prompting in the context of creating an AI Voice Assistant?

    -Prompting is key in creating an AI Voice Assistant because it helps structure the interaction, guiding the model to understand its identity, maintain a professional and polite tone, and accomplish specific tasks during the conversation.

  • How does the video script describe the process of testing the AI Voice Assistant for a sushi restaurant?

    -The process involves creating example customer orders, storing data from executed functions and conversation scripts in a Google spreadsheet, and using a systematic approach to test the assistant multiple times to ensure reliability and accuracy.

  • What is the role of the webhook module in integrating the AI Voice Assistant with Google Sheets?

    -The webhook module is used to save the data from the AI Voice Assistant, such as order details and conversation scripts, into a Google spreadsheet for analysis and validation.

  • What is the accuracy score achieved by the AI Voice Assistant in the video, and what does it indicate?

    -The AI Voice Assistant achieved an accuracy score of 70% in the video, indicating that in 7 out of 10 cases, the assistant correctly identified order items and address information. This score provides a measure of the assistant's performance and reliability.



🤖 Introduction to AI Voice Assistants

Jonas, the founder of Agentic Ventures, introduces the concept of AI voice assistants and their growing capabilities. He discusses the current state of AI, noting its continuous improvement and the recent release of the GPT-4o ("omni") model by OpenAI, which offers enhanced voice and vision capabilities at a lower cost. Jonas also mentions the upcoming release of a single model that integrates text, vision, and audio, which will likely improve response times and conversational abilities. The video aims to demonstrate setting up an AI voice assistant for a sushi restaurant to automate inbound calls, with an emphasis on the importance of testing and tuning AI systems for reliability.


🔧 Setting Up AI Voice Assistants for Businesses

The script details the process of setting up an AI voice assistant using a developer platform like Vapi, which allows for the integration of new models and data storage via webhooks. Jonas explains the importance of prompting to guide the AI's responses and behavior, providing examples of how to structure prompts for a helpful virtual assistant. He also discusses the need for systematic testing with multiple scenarios to ensure the assistant's reliability and accuracy, using a webhook to record conversations in Google Sheets for analysis. The video includes a demonstration of the assistant setup in the Vapi dashboard, highlighting the selection of models, voices, and configurations.


📝 Automating Inbound Calls for a Sushi Restaurant

The script presents an example of automating inbound calls for a sushi restaurant using an AI voice assistant. It outlines the challenges faced by restaurants, such as high labor costs, staff shortages, and inefficiencies caused by phone calls. Jonas demonstrates how an AI assistant can address these issues by automating tasks like scheduling appointments, tracking orders, and controlling smart home devices. The video also covers the creation of a function within the assistant to send order information to a webhook and validate addresses using a Google Sheets automation pipeline.
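
A function like this is usually described to the model as a JSON-schema-style definition. The Python sketch below is an illustration only: the function name `send_order`, its fields, and the `missing_fields` helper are hypothetical and are not taken from the video or from Vapi's documentation.

```python
# Hypothetical function definition in the JSON-schema style commonly used
# for LLM function calling. Names and fields are illustrative assumptions.
SEND_ORDER_FUNCTION = {
    "name": "send_order",
    "description": "Send the customer's confirmed order to the restaurant.",
    "parameters": {
        "type": "object",
        "properties": {
            "customer_name": {"type": "string"},
            "address": {"type": "string"},
            "items": {
                "type": "array",
                "items": {"type": "string"},
                "description": "Ordered menu items, one entry per item.",
            },
        },
        "required": ["customer_name", "address", "items"],
    },
}


def missing_fields(arguments: dict) -> list[str]:
    """Return required fields that are absent or empty in a function call."""
    required = SEND_ORDER_FUNCTION["parameters"]["required"]
    return [name for name in required if not arguments.get(name)]
```

Checking for missing fields before forwarding the call to the webhook is one way to catch incomplete orders early; whether the video's setup does this is not stated.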


📉 Testing and Analyzing AI Assistant Performance

Jonas describes the testing process for the AI assistant, using 10 example customer orders to evaluate its performance. He explains the use of a webhook to store data from the executed functions and conversation scripts in Google Sheets for analysis. The video shows how to create validation checks and calculate an accuracy score to manage client expectations. Jonas also shares the results of the testing, highlighting both successful and challenging interactions, and discusses the need for patience and further testing in different environments.
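
In the video the validation checks are built in Google Sheets. As a rough equivalent, the Python sketch below compares one captured order against the expected test order. The field names (`items`, `address`) and the normalization rule are assumptions made for this example.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences don't fail a check."""
    return " ".join(text.lower().split())


def validate_call(expected: dict, captured: dict) -> dict:
    """Compare one captured order against the expected test order, field by field."""
    expected_items = sorted(normalize(item) for item in expected["items"])
    captured_items = sorted(normalize(item) for item in captured.get("items", []))
    items_ok = expected_items == captured_items
    address_ok = normalize(expected["address"]) == normalize(captured.get("address", ""))
    return {"items_ok": items_ok, "address_ok": address_ok, "passed": items_ok and address_ok}


# Example data (made up for illustration, not taken from the video's test set).
expected_order = {"items": ["Salmon Nigiri", "Miso Soup"], "address": "12 Main Street"}
captured_order = {"items": ["miso soup", "salmon nigiri"], "address": "12  main street"}
check = validate_call(expected_order, captured_order)
```

Exact string matching after normalization is deliberately strict. A real test setup might need fuzzier matching for addresses, which transcription often spells differently.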


🚀 Future of AI Voice Assistants and Conclusion

The script concludes with a discussion on the future of AI voice assistants, anticipating the release of the single end-to-end model for GPT-4o, which is expected to significantly improve AI capabilities. Jonas reflects on the testing process and the current limitations of AI assistants, suggesting that they are best suited for simpler tasks at this stage. He encourages viewers to start building AI voice assistant applications to be ready for new advancements and thanks them for watching, inviting feedback and suggestions for future content.



💡AI Voice Assistant

An AI Voice Assistant is a software application that uses artificial intelligence to understand and respond to spoken language, enabling a hands-free or screen-free interaction. In the video, the AI Voice Assistant is central to automating customer service tasks such as handling inbound phone calls for a sushi restaurant. It demonstrates the capability of modern AI to interpret and respond to natural language, facilitating tasks like taking orders and managing customer inquiries.


💡Automation

Automation refers to the process of making an action or process operate automatically. In the context of the video, automation is used to enhance efficiency in customer service, marketing, and sales by leveraging AI agents. The goal is to reduce the need for human intervention in routine tasks, such as answering phone calls and processing orders, thereby saving time and resources.

💡Speech to Text Model

A Speech to Text Model is a type of AI that converts spoken language into written text. In the video, this model is essential for the AI Voice Assistant to understand and process the audio input from customers. It's part of the pipeline that enables the AI to engage in a conversation, capturing the details of the customer's order accurately.

💡LLM (Large Language Model)

A Large Language Model (LLM) is a complex AI system designed to process and generate human-like text based on the input it receives. In the video, the LLM is used to generate responses to customer inquiries after the speech-to-text model has converted the audio to text. It's a key component in creating a natural and helpful conversational experience with the AI Voice Assistant.

💡Text to Speech Model

A Text to Speech Model is an AI system that converts written text into spoken language. In the video, this model is used to allow the AI Voice Assistant to communicate responses back to the customer in a natural-sounding voice. It's the final stage of the pipeline that enables two-way communication with the AI.


💡Pipeline

In the context of AI and the video, a pipeline refers to a sequence of processes or models working together to achieve a task. The AI Voice Assistant uses a pipeline of three models (Speech to Text, LLM, and Text to Speech) to handle customer interactions, from understanding the spoken words to responding verbally.
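
The three-stage flow can be sketched in Python as follows. The stage functions here are stubs invented for illustration; they do no real speech processing and stand in for whatever speech-to-text, LLM, and text-to-speech services an actual assistant would call.

```python
from typing import Callable

# Each stage is a plain function so real services could be swapped in later.
SpeechToText = Callable[[bytes], str]
LanguageModel = Callable[[str], str]
TextToSpeech = Callable[[str], bytes]


def run_turn(audio_in: bytes, stt: SpeechToText, llm: LanguageModel, tts: TextToSpeech) -> bytes:
    """Process one conversational turn: audio -> text -> reply text -> audio."""
    transcript = stt(audio_in)    # only the words survive this step, not the tone
    reply_text = llm(transcript)  # the language model sees text alone
    return tts(reply_text)        # the reply is converted back to audio


# Stub stages for illustration only.
def fake_stt(audio: bytes) -> str:
    return audio.decode("utf-8")


def fake_llm(text: str) -> str:
    return f"You said: {text}"


def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")


reply_audio = run_turn(b"two salmon rolls", fake_stt, fake_llm, fake_tts)
```

Because each stage only passes text to the next, anything not captured in the transcript is unavailable downstream, which is the information loss the video attributes to separate models.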


💡Prompting

Prompting in AI refers to the technique of providing the model with specific information or questions to guide its responses. In the video, prompting is crucial for structuring the conversation with the AI Voice Assistant, ensuring it understands its role as a virtual assistant and maintains a professional and polite tone throughout the interaction.
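
One way to keep a prompt organized along these lines is to assemble it from separate identity, style, and task sections. The Python sketch below is an assumption about how such a prompt could be structured; the section labels and wording are illustrative and are not the prompt used in the video.

```python
def build_system_prompt(identity: str, style: str, tasks: list[str]) -> str:
    """Assemble a system prompt from identity, tone, and task sections."""
    task_lines = "\n".join(f"{n}. {task}" for n, task in enumerate(tasks, start=1))
    return (
        f"Identity:\n{identity}\n\n"
        f"Style:\n{style}\n\n"
        f"Tasks:\n{task_lines}"
    )


# Illustrative content for a sushi restaurant assistant (hypothetical wording).
PROMPT = build_system_prompt(
    identity="You are a virtual assistant answering calls for a sushi restaurant.",
    style="Be professional and polite. Keep answers short.",
    tasks=[
        "Take the customer's order.",
        "Ask for the delivery address.",
        "Confirm the order before ending the call.",
    ],
)
```

Keeping the sections separate makes it easier to change one part, such as the task list, between test runs without touching the rest.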


💡GPT-4o

GPT-4o, as mentioned in the video, is the newest OpenAI model in the GPT (Generative Pre-trained Transformer) series, known for its advanced capabilities in language understanding and generation. The video discusses the upcoming single end-to-end version of GPT-4o that handles voice, vision, and text in one model, which would be a significant advancement for AI Voice Assistants.

💡Webhook Module

A Webhook Module is a way for an application to provide other applications with real-time information by sending data directly to a specified URL. In the video, the Webhook Module is used to save data from the AI Voice Assistant's interactions, such as order details, into a Google spreadsheet, allowing for further analysis and record-keeping.
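
In the video this step is handled by Make without code. For readers who want to see the idea spelled out, the Python sketch below flattens a JSON webhook body into a single spreadsheet row. The payload fields and column order are hypothetical and chosen only for this example.

```python
import json

# Column order of the hypothetical results sheet.
COLUMNS = ["call_id", "customer_name", "address", "items", "transcript"]


def payload_to_row(raw_body: str) -> list[str]:
    """Flatten a JSON webhook body into one spreadsheet row, in COLUMNS order."""
    payload = json.loads(raw_body)
    row = []
    for column in COLUMNS:
        value = payload.get(column, "")
        # Lists (such as order items) are joined so they fit in a single cell.
        if isinstance(value, list):
            value = ", ".join(str(v) for v in value)
        row.append(str(value))
    return row


# Example body (made up for illustration); the transcript field is missing on purpose.
example_body = json.dumps({
    "call_id": "call-001",
    "customer_name": "Ana",
    "address": "12 Main Street",
    "items": ["Salmon Nigiri", "Miso Soup"],
})
example_row = payload_to_row(example_body)
```

Missing fields become empty cells rather than errors, so an incomplete call still produces a row that can be inspected later.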

💡Accuracy Score

The Accuracy Score in the video refers to a measure of how correctly the AI Voice Assistant captures and processes the information during its interactions. It's calculated based on the successful identification of order items, names, and addresses during testing. The video demonstrates the importance of this metric in evaluating and improving the performance of the AI system.
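
The score itself is simple arithmetic: the number of test calls that passed every check divided by the total number of test calls. A minimal Python sketch, assuming each call has already been reduced to a pass/fail result:

```python
def accuracy_score(results: list[bool]) -> float:
    """Share of test calls that passed all validation checks, as a percentage."""
    if not results:
        raise ValueError("at least one test result is required")
    return 100 * sum(results) / len(results)


# Seven passing calls out of ten, matching the 70% reported in the video.
results = [True] * 7 + [False] * 3
score = accuracy_score(results)  # 70.0
```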

💡Testing and Tuning

Testing and Tuning are iterative processes used to evaluate and refine the performance of an AI system. In the video, these processes are essential for ensuring the AI Voice Assistant accurately captures and responds to customer orders. The video emphasizes the need for multiple tests and adjustments to improve reliability and effectiveness.


Introduction to setting up AI Voice Assistants for automating inbound phone calls with an example for a sushi restaurant.

Current state of AI voice systems and their continuous improvement with the release of GPT-4o.

GPT-4o's enhanced voice and vision capabilities, being twice as fast and 50% cheaper than the previous model.

The upcoming release of a single model encompassing text, vision, and audio for AI voice systems.

AI voice systems' ability to respond with natural conversation timing and improved speech recognition.

The necessity of a pipeline involving three models for current AI voice assistants: speech to text, LLM, and text to speech.

Potential loss of information when using separate models for AI voice systems and the limitations it presents.

AI voice systems' capability to act on information and execute functions autonomously.

Examples of functions AI voice assistants can perform, such as scheduling appointments and controlling smart home devices.

Challenges faced by restaurants that an AI Voice Assistant could help solve, like high labor costs and language barriers.

The importance of prompting in creating an accurate and reliable AI voice assistant.

The process of testing AI voice assistants systematically using a set of example customer orders.

Using a developer platform like Vapi to set up an AI system with GPT-4o and integrating with Google Sheets.

Details on setting up the AI voice assistant in the Vapi dashboard, including model selection and configuration.

The creation of a function within the AI system to handle orders and the use of webhooks for data processing.

The use of Google Sheets for storing and analyzing the results of AI voice assistant interactions.

Observations from testing the AI voice assistant with 10 example customer orders and the performance evaluation.

The potential of the upcoming single end-to-end model for GPT-4o and its expected impact on AI voice assistant applications.

Encouragement for viewers to start building AI voice assistant applications now to adapt quickly to new AI technology.