OpenAI DevDay in 5 Minutes: 4 Major API Updates

Developers Digest

1 Oct 202405:06

Summary

TLDRThe video introduces four major updates from OpenAI's Dev Day: real-time API, vision fine-tuning, prompt caching, and model distillation. The real-time API allows developers to build applications with speech-to-speech and audio interactions via persistent websocket connections. Function calling enhances dynamic app interaction. Vision fine-tuning enables image customization, while prompt caching reduces costs by storing repeated inputs. Model distillation improves smaller models using outputs from larger ones, offering cost-effective and efficient solutions. These updates are designed to improve application performance and scalability.

Takeaways

🎤 The new real-time API allows developers to build applications that support natural speech-to-speech conversations, similar to ChatGPT's voice mode.
🔄 Real-time API supports text, audio, or both for input, leveraging a persistent websocket connection for seamless communication.
🖥️ Function calling is now supported, allowing the model to trigger actions in applications, enhancing the interactivity of web and mobile apps.
🎧 The new real-time API improves over the previous approach, eliminating the need for separate Whisper integration and reducing latency and loss of emotion.
💸 Pricing for the real-time API starts at $5 per million tokens for input and $20 per million tokens for output, with audio pricing being higher at $100 per million tokens for input and $200 for output.
📉 API prices are expected to decrease over time, similar to trends seen with previous OpenAI model releases.
🖼️ Fine-tuning for the image API is now available, allowing developers to build agents that can handle specific image-based tasks within various devices.
⚡ Prompt caching is introduced, reducing costs by allowing reuse of frequently passed context, priced at half the cost of normal inputs and outputs.
🧠 Model distillation enables fine-tuning smaller, cost-efficient models using outputs from larger models like GPT-4, optimizing performance for specific use cases.
📂 OpenAI has released an open-source repository to demonstrate real-time API usage and function calling, including server and client streaming examples.

Q & A

What is the real-time API introduced in the update?
-The real-time API allows developers to build applications that support natural speech-to-speech conversations with real-time audio input and output. It uses a persistent WebSocket connection for seamless communication between users and OpenAI models, enabling more advanced interactions similar to ChatGPT's voice mode.
How does the new real-time API differ from the previous approach?
-Previously, speech had to be passed into Whisper for transcription, and then the resulting text was fed into the model for inference, which led to loss of emotion and emphasis, along with latency. The new API removes these steps by allowing direct speech input and output, reducing latency and preserving emotion and emphasis in conversations.
What are the key pricing details for the real-time API?
-For text, the pricing is $5 per million tokens for input and $20 per million tokens for output. For audio, the cost is $100 per million tokens for input and $200 per million tokens for output, which equates to about 6 cents per minute of input and 24 cents per minute of output.
How does the real-time API support function calling?
-The real-time API supports function calling, meaning that when speaking to the model, the system can detect when a function needs to be invoked. This enables developers to create dynamic applications that can trigger actions like changing the UI or performing tasks in real-time.
What is prompt caching, and how does it benefit developers?
-Prompt caching allows developers to store and reuse the same context repeatedly instead of sending it to the API each time, reducing costs. It is priced at half the cost of inputs and outputs, which makes applications more efficient when handling repeated context.
What is model distillation, and how does it work?
-Model distillation involves fine-tuning smaller, cost-efficient models using the outputs of larger, more capable models. This allows developers to use smaller models for specific use cases where larger models might be too costly or slow, while still maintaining strong performance.
What are the advantages of fine-tuning the image API?
-By allowing developers to fine-tune the image API, OpenAI enables more personalized and specific use cases, such as agents on browsers, laptops, or mobile devices. This capability will likely expand the range of applications involving image recognition or generation.
What is the cost structure for prompt caching?
-Prompt caching is priced at half the cost of both input and output tokens. This makes it more economical for developers who frequently need to send the same context to the API, thereby reducing overall usage costs.
How can developers use the real-time API with WebSockets?
-Developers can use the real-time API through a persistent WebSocket connection, allowing for continuous streaming of audio data between the client and the server. This ensures low-latency communication and smooth real-time interaction for applications.
What examples are available to developers for using the real-time API?
-OpenAI has released an open-source repository with examples on how to use the real-time API, including its streaming capabilities and function-calling features. These examples can help developers understand how to integrate the API into their applications.