Building a Speech Transcription App Using Flask and OpenAI
Summary
TLDR: This video shows how to build a simple app with Flask and OpenAI that lets users record their voice, convert it to text, and display the transcription in the browser. The process covers using the browser's audio recording API, sending the audio to OpenAI for transcription, and displaying the result. The creator also mentions a coaching program that offers one-on-one project help.
Takeaways
- 😀 The video shows how to build an app with Flask and OpenAI that records the user's voice and transcribes it.
- 🎙️ The app demonstrates how the 'Record' button captures the user's speech and sends it to OpenAI.
- 📝 OpenAI converts the recorded speech into text, which is then displayed in the browser.
- 🛠️ The developer also offers one-on-one coaching for Python, Flask, or Django projects.
- 🔗 Interested viewers can find more information and sign up for coaching at 'prettyprint.com coaching'.
- 📱 The browser API 'getUserMedia' is used to record speech directly in the browser.
- 👍 A check is implemented to verify that 'getUserMedia' is available in the browser, showing an error message if it is not.
- 🔴 The 'Record' button changes color to indicate whether recording is active.
- 🔁 The recorded audio stream is split into chunks, which are later combined into a Blob before being sent to OpenAI.
- 📤 The app uses JavaScript's 'fetch' method to send the audio data to the Flask app, which forwards it to OpenAI.
- 📥 OpenAI's API converts the received audio data into a transcription, which the Flask app sends back to the browser.
- 🖥️ The Flask server handles the request, including using the OpenAI model 'whisper-1' for the transcription.
- 📝 The final transcription is displayed in the browser, and the user can record again to update it.
Q & A
What does the author show in the video?
-The author shows how to build a simple app with Flask and OpenAI that lets users record their voice, convert it to text, and display the text in the browser.
What is the goal of the project presented in the video?
-The goal of the project is to demonstrate the capabilities of the OpenAI API, in particular its speech recognition and speech-to-text features.
Which technologies are used in the video?
-The video uses Flask, OpenAI, JavaScript, and browser APIs to capture and transcribe speech.
What is the purpose of 'getUserMedia' in this context?
-'getUserMedia' is a browser API used to ask the user for permission to record audio and to select a recording device such as a microphone.
How is the recording stopped and sent to the OpenAI API?
-The recording is stopped by clicking the record button again. The recorded audio chunks are combined into a Blob and then sent to the OpenAI API for speech recognition.
What role does JavaScript play in the project?
-JavaScript controls the user interface, manages the recording, collects the audio chunks, and sends them to the Flask app.
What is the purpose of the 'mediaRecorder' variable in the code?
-The 'mediaRecorder' variable holds a MediaRecorder instance used to record the user's voice and deliver the recorded data as audio chunks.
How is the transcription of the speech achieved?
-The transcription is produced by calling the OpenAI API with the 'whisper-1' model, which receives the recorded audio data and converts it to text.
What happens if the user denies microphone permission?
-If the user denies permission, the app cannot use the microphone to record speech, and the user receives an error message.
How is the transcription displayed in the browser?
-The transcription is displayed by inserting it into the HTML document once it is returned by the OpenAI API.
What is the purpose of the 'coaching program' mentioned in the video?
-The coaching program is a service in which the author advises and supports individuals with Python, Flask, or Django projects by working through their code and solving problems together.
Outlines
😀 Introduction to building the app with Flask and OpenAI
The first section introduces the topic of the video: building a simple app with Flask and OpenAI that records a user's voice, transcribes it, and displays the text in the browser. The creator demonstrates the finished app and mentions a coaching program offering one-on-one help with Python, Flask, or Django projects. A basic Flask app is then shown, with one route, a record button, and an area for the transcription output.
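The starting point can be sketched as a single-route Flask app. The inline HTML, the element ids (`record`, `output`), and the Bootstrap class are assumptions based on the description, since the video's template is not shown in full.

```python
# Minimal sketch of the starting Flask app: one route, a record button,
# and an area for the transcription output. Not the author's exact code.
from flask import Flask

app = Flask(__name__)

PAGE = """
<!doctype html>
<html>
  <body>
    <button id="record" class="btn btn-primary">Record</button>
    <div id="output"></div>
    <script src="/static/app.js"></script>
  </body>
</html>
"""


@app.route("/")
def index():
    # Serve the page containing the record button and the output area.
    return PAGE
```

Running `app.run()` and visiting the page shows only the record button; the JavaScript in `app.js` gives it behavior.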
🔧 Building the recording feature with JavaScript and the browser API
The second section walks through implementing the recording feature in JavaScript: checking that 'getUserMedia' is available in the browser, defining callback functions for the success and failure cases, requesting recording permission from the user, and creating a 'MediaRecorder' object. It also shows how to toggle the record button's color with Bootstrap classes depending on whether recording has started or stopped.
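A minimal sketch of that setup follows. The callback names and Bootstrap classes follow the video; the helpers `hasGetUserMedia` and `toggleRecording` are names introduced here so the checks are plain, testable functions.

```javascript
// Check for navigator.mediaDevices.getUserMedia before using it.
function hasGetUserMedia(nav) {
  return !!(nav && nav.mediaDevices && nav.mediaDevices.getUserMedia);
}

function onMediaSetupSuccess(stream) {
  // The stream is the selected microphone; build a MediaRecorder from it.
  return new MediaRecorder(stream);
}

function onMediaSetupFailure(error) {
  alert(error);
}

// Toggle recording and swap Bootstrap classes so the button is red
// (btn-danger) while recording and blue (btn-primary) when idle.
function toggleRecording(mediaRecorder, button) {
  if (mediaRecorder.state === "recording") {
    mediaRecorder.stop();
    button.classList.remove("btn-danger");
    button.classList.add("btn-primary");
  } else {
    mediaRecorder.start();
    button.classList.remove("btn-primary");
    button.classList.add("btn-danger");
  }
}

if (typeof navigator !== "undefined" && hasGetUserMedia(navigator)) {
  // Ask for audio only; this triggers the browser's permission popup.
  navigator.mediaDevices
    .getUserMedia({ audio: true })
    .then(onMediaSetupSuccess, onMediaSetupFailure);
} else if (typeof alert !== "undefined") {
  alert("getUserMedia not supported in your browser");
}
```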
🎙️ Managing the audio recording and transcription
The third section focuses on managing the recording: the captured audio arrives in chunks, which are pushed into an array and combined into a 'Blob' once the recording stops. 'FormData' is then used to package the audio for the Flask app, and JavaScript's 'fetch' function sends a POST request to the server's 'transcribe' route.
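The chunk handling described above can be sketched as plain functions. `onDataAvailable` would be assigned to `mediaRecorder.ondataavailable` and `buildAudioBlob` called from `mediaRecorder.onstop`; the `/transcribe` endpoint and the `audio` field name follow the video, while the function names are introduced here.

```javascript
let chunks = [];

function onDataAvailable(e) {
  // Each piece of recorded audio is pushed onto the array.
  chunks.push(e.data);
}

function buildAudioBlob() {
  // Combine all chunks into one Blob; audio/webm is the default type
  // when recording like this.
  const blob = new Blob(chunks, { type: "audio/webm" });
  chunks = []; // clear so another recording can start cleanly
  return blob;
}

function sendToFlask(blob) {
  // Attach the blob as a form field named "audio" and POST it
  // to the Flask app's /transcribe route.
  const formData = new FormData();
  formData.append("audio", blob);
  return fetch("/transcribe", { method: "POST", body: formData })
    .then((response) => response.json());
}
```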
📡 Integrating the OpenAI API for speech recognition
The fourth section covers integrating the OpenAI API into the Flask app to turn the audio blob into a text transcription: importing 'request' from Flask to read the uploaded file, converting the audio into a 'BytesIO' object, sending it to the OpenAI API, and receiving the transcription response. It also notes that the OpenAI client must be configured with an API key.
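A sketch of that backend route follows, assuming the form field is named `audio` and `OPENAI_API_KEY` is set in the environment. The `whisper-1` model and the `output` JSON key follow the video; `make_named_buffer` is a helper name introduced here.

```python
# Sketch of the /transcribe route described above, not the author's exact code.
import io

from flask import Flask, jsonify, request

app = Flask(__name__)


def make_named_buffer(data: bytes, name: str = "audio.webm") -> io.BytesIO:
    # The OpenAI API rejects a nameless buffer, so attach a filename.
    buffer = io.BytesIO(data)
    buffer.name = name
    return buffer


@app.route("/transcribe", methods=["POST"])
def transcribe():
    # Read the uploaded blob from the form field named "audio".
    file = request.files["audio"]
    buffer = make_named_buffer(file.read())

    # Imported lazily here (an assumption of this sketch) so the module
    # loads even without an API key; expects OPENAI_API_KEY to be set.
    from openai import OpenAI

    client = OpenAI()
    transcript = client.audio.transcriptions.create(model="whisper-1", file=buffer)
    return jsonify(output=transcript.text)
```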
🖥️ Displaying the transcription in the browser and wrapping up
The fifth and final section shows how the returned transcription is displayed in the web app by inserting it into an HTML element. The finished project is then demonstrated: the transcription appears as soon as the recording stops, and recording again replaces the previous text. The creator closes by inviting viewers to ask questions, like the video, and subscribe to the channel.
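The display step can be sketched as a small helper (a name introduced here): it writes the `output` key of the returned JSON into the output element, e.g. in the last `.then` of the fetch chain.

```javascript
// Insert the transcription returned by the Flask route into the page.
// data.output matches the JSON key returned by /transcribe in the video.
function showTranscription(outputEl, data) {
  outputEl.innerHTML = data.output;
}
```

In the browser this would be called as `.then((data) => showTranscription(document.getElementById("output"), data))`.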
Keywords
💡Flask
💡OpenAI
💡Voice Recording
💡Transcription
💡getUserMedia
💡Media Recorder API
💡JavaScript
💡Fetch API
💡WebM
💡API Key
💡Whisper
Highlights
Today we show how to build an app with Flask and OpenAI that records a user's voice, transcribes it, and displays the text.
The app uses the browser API 'getUserMedia' to record the user's voice.
A JavaScript file 'app.js' is created to implement the recording functionality.
A check is implemented to verify that the 'getUserMedia' function is available in the browser.
The app asks the user for permission to use the microphone and to select a recording device.
Recording is started and stopped by pressing the 'Record' button.
The recorded audio data is stored in chunks and later combined into a 'Blob'.
The app uses 'fetch' to send the audio data to the Flask app.
A 'transcribe' route is created in the Flask app to receive the audio data.
The audio data is converted into a 'BytesIO' object so it can be sent to the OpenAI API.
The OpenAI API is used to transcribe the audio data into text.
The transcription output is extracted from the OpenAI API response and displayed in the app.
The app lets the user see the transcription immediately after stopping the recording.
The app supports multiple recordings, with the transcriptions updated in the browser.
Instructions are given for installing and configuring the OpenAI package.
The project demonstrates using the OpenAI API for speech recognition and transcription.
A coaching program is offered for help with Python, Flask, or Django projects.
The final app shows how easy it is to integrate the OpenAI API into a web application.
Transcripts
hey everyone in today's video I'll show
you how to build a simple app with flask
and open AI that allows you to record a
user's voice and transcribe that voice
and then display the text so as an
example I'll hit record here and I'll
just start talking so everything that
I'm saying here as the record button is
red will be sent over to open AI and it
will determine what I said converted to
text and then I will display it here in
the browser so let me hit the button
again and wait just a second and we see
here
I have the transcription of everything
that I just said so I've transcribed the
uh audio that I just sent and we have it
displayed here so that's what I'm going
to demonstrate in this video it's a a
simple little project with flask and
open AI just to show you what you can do
with the open AI API uh but before we
get into that I just want to say that if
you need one-on-one help with any of
your projects so your python projects
flask Django whatever I do have something
called the coaching program available
where I work with people one-on-one we
have zoom calls where I can look at your
code and work through any problems that
you have or help you just write code for
your app so if you're interested in that
just go to prettyprint.com coaching and
I will uh be available to help you with
that just read the instructions on how
to fill out the form and I'll get
in contact with you so with that said
let's get into building this project in
flask and open AI and also
there's a bit of JavaScript involved as
well okay so to get started with this
this is what I have to begin with it's
just a very simple flask app that has one
route so far inside that one route it
has a record button and an area where
it's going to display the output of the
transcription so this is what it looks
like right now you just see the record
button but when I press it it doesn't do
anything I'm also adding this app.js
file here so this is where I'll write
all the JavaScript for this little
project so I'll start writing it now
because I want to set up the
functionality where I can actually
record my voice first and then once I
can record my voice I'll then take the
audio and pass it to open AI to get the
transcription and then I can insert it
into the page so here I have these two
constants record and output uh those are
just the elements on the page so I can
uh know when the record button was
pressed and I know where to put the uh
output transcription so the first thing
I need to do is I need to set up the
ability to record so this is is going to
be using a browser API so there's no
custom code that needs to be written to
record your voice you just need to use
the API that comes with your browser one
of the many APIs that come with your
browser so uh to do that uh the first
thing I need to do is I need to make
sure that the user actually has this
available and pretty much everyone in
2024 and going forward should have this
on their browsers but just in case they
don't uh you want to put a check in your
code to make sure that they actually
have the functionality so to do that uh
what you do is you see if the function
that you need to use first so this
function is called get user media you
see if it exists in the browser if it
does then you can go forward with
actually setting everything up if it
doesn't exist then you can give them
like some error message so the API is
going to be under
navigator.mediaDevices.getUserMedia right so if this
exists if this function exists that
means that uh you can record audio from
their browser if it doesn't exist then
you can't and you just have to give them
an error message so uh what I'll do is
I'll put a comment here and
then after the brackets here I'll put an
else and I'll just uh
alert um you know get user media uh not
supported in your browser okay so let's
say it's not supported just like that so
if this function doesn't exist like I
said then we can't really do anything um
with this particular project or you just
need an alternate way of getting their
voice but you can't use this but if they
do then you can continue with the code
here uh so the way that this works is
once you call this function getUserMedia
it's going to display the uh
permissions popup in the browser asking
the user to give permission uh to record
audio and also they'll have to select a
recording device as well so if they have
like multiple microphones connected to
their computer like I do right now then
they'll have to select the one that they
want to use to uh record so that's what
happens when you call this function and
when you call that function uh that
would either be successful or a failure
so if they give you permission and they
select a device it's going to succeed
and then uh you can get the sound from
that device and if they say no or
there's just some other error then um
you know you can't do anything because
they're not giving you permission to uh
use their
microphone so for that I need to Define
these two callback functions so let me
Define those here first uh I'm going to
use the approach of defining functions
like this um so it's going to be a
callback function so it's going to start
with on that's just like a little
convention um but let's say on media
setup or on media start how about that
uh on media start I don't even like that
name um on media setup success okay this
is a good name so on media setup success
I want to call this function when um the
user like gives permission so um like I
said I'm using this style of functions
simply because um the API that I'm using
uh I need to Define some functions for
things that happen when like um you know
you
you stop the media recording or like
sound comes in and this is the easiest
syntax so I just want to keep it
consistent so I'm going to use it for
the callbacks as well so uh this is
going to take in some stream this is the
basically the microphone the source of
the audio so I'll just put that there
and you know what I'll put an alert so
um you know everything is working just
so we can
see and then I'll call this on media
setup failure right and this function
doesn't have to take anything but you
can have an error in here an error
message I'll just put error like that
and then I'll just alert whatever the
error is okay so this should be enough
for the Callback functions for now and
then once I have those callback
functions I want to actually use them um
in getUserMedia so I'm going to call
navigator.mediaDevices.getUserMedia
and then I want to pass in a
dictionary or since this is Javascript
um an object that just tells the browser
what kind of media I'm interested in
getting so in this case I just want
audio I don't want like video or
anything so I just put audio true here
and then after this I can just call then
and I pass in the two callbacks so on
media
success and on media setup failure okay
so it's on media success not on media
setup success there we go so that should
be enough so let me go back over to the
browser and refresh and now we see as
soon as I refresh like this popup comes
up where I can select an audio source so
I have two microphones set up and I can
either select allow or block I'll select
allow and we see everything is working
and then we see the icon here that just
shows that I've granted permission to
this website and every time I refresh
like uh it doesn't have to ask for
permission because it still remembers
that on this particular website I've
given permission so uh if I close my
browser reopen it and go back to the
same page then I'll have to Grant
permission again so okay so now I have
this ready and what I want to do next is
I want to set up the ability to actually
take my voice record my voice and then
send it somewhere so in the on media
setup success function that's where I'll
put like all the code for this uh so the
first thing I need to do is I need to
grab the actual um stream and create a
media recorder out of it so I'll do cons
media
recorder and this is going to be equal
to new media recorder so I'm
instantiating the media recorder so this
is from the JavaScript API the browser
API
and let me make that a capital
R and this should be const okay so I'm
instantiating this and then I'll use
this media recorder for everything so
now what I want to consider is what
happens when I start recording so I have
this um record element here this value
and I'll just do onclick like I said I
don't usually use this style of writing
functions but because I'm using the
media recorder I think it makes more
sense so record. on click
um we have the function here and when I
do this I want to check the status of
the media recorder so if the media
recorder is currently recording I'm
basically going to stop the media
recorder and if it isn't recording then
I want to start it so I can check the
state of the media recorder by doing
media recorder. State and this will be
equal to recording so if it's recording
then I'm going to stop it so I'll do
media recorder. stop
and then I have some bootstrap stuff um
basically I want to change the color of
the record button so I can do record.
classList.remove I want to remove the
button danger right so when I start the
recording I want to turn the button to
be red and when I stop I want to turn it
back to the original blue that it is
here so for that I have to remove the
danger class danger is the red and I
need to add primary so classList uh .add
and then button primary and bootstrap
and then this will be the opposite uh in
this else here so this else represents
um the state that is not recording
meaning that you haven't started
recording yet so here I'm going to start
the media recorder so media recorder.
start not starter but start and then I
want to add and remove the correct
classes so I'm going to remove button
primary this time remove the blue and
then I'm going to add the button danger
so here when the user clicks on the
record button if it's already recording
it stops changes the colors if it isn't
recording it starts the recording and it
changes the colors as well okay so the
idea is when you are recording audio
from this um the user's Voice or
whatever sound is coming is it's going
to be coming in in chunks right so um
it's not like going to be a continuous
thing that has the audio instead it's
going to be broken up into little pieces
so you can take those little pieces and
put them into an array and then you can
take that entire array and take that as
the recording once the user hits stop so
the way you can do this is you can first
uh initialize like a list so we'll say
uh we'll call this chunks or initialize
an array in JavaScript and then down
here I have a media
recorder and then I need to define the on
data available function right so I'm
adding a function
here and E is going to be you know the
stuff that is available and I simply
want to push the data from E like the
event of data being available into the
chunk so I'll do chunks. push e
data right so let's say like each chunk
is 1 second so if I record for 10
seconds then I'll have 10 items in this
array when I'm done when I hit the the
stop button or when I hit the record
button again to stop it so once I have
all the chunks after I stop it I need to
do something with it so first I need to
know when I've stopped so I can set up
another function here so media recorder
on
stop so when I stop I can do something
and what I want to do is I want to
create a new object object that holds
all of the information from the chunks
so like I said I set up the chunks
because the data is going to come into
chunks but once I stop the recording I
want to take all those chunks and create
like one thing and in JavaScript that's
called a
blob so what I'll do is I'll say uh let
blob so this will be a variable uh equal
new blob so I'm instantiating a blob and
I'm just going to pass the chunks so
chunks here and I need to give it a file
type right so the type type here is
going to be
audio audio webm which is the default
type for audio when you're recording
like this and it's completely fine for
our purposes if you wanted to change
this you could but audio webm is
completely
fine so once I do that I can like clear
out the chunks so I'll just say chunks
equals that so I can record multiple
times and then once I have this blob I
can go ahead and send it over to my
flask app so flask is about to come into
play again uh so what I want to do is I
want to create some form data here and
I'm going to add it to the form right so
it's not like a regular HTML form but
for me this is the most convenient way
of getting the blob over to flask so I'm
going to say form data append and I'll
add a form element called audio and I'm
passing the blob to that so now that I
have this form data and I've attached
The Blob to it I need to send that over
to my flas app so I can use fetch for
that
and we'll create an endpoint called
transcribe and then here we'll set up
the things here so the method is going
to be
post and then the body is going to be
the form
data right so post requests using the
form data and the form data only has the
blob in it when this returns properly I
want to call this then and then let's
say
response and I want to convert that to
Json so I'm going to return my uh
transcription in Json so I'm just
converting it to Json and then I'll add
another
then and this will have the
data that's actually in my Json and for
now I'll just console log it so console
log data or really I can just alert it I
think that would be fine as well all
right
so just to recap um I set up the
functionality where I can change the
color of the button and stop or start
the media recorder I have this function
here that will be called every time
there's data available from the stream
so it just gets pushed onto this array
called chunks and then when I stop the
recording it will create a blob using
the chunks it will add that blob to some
form data and then I'll pass that form
data to fetch so I can send it over to
my flask app so now let me go over to my
flask app and work on that so I
have a new route here called transcribe
that I want to build so app route
transcribe I remember
the methods will just be post for this
and I'll call it
transcribe and what I want to do is I
want to import requests from flask
because I need to get the data from the
requests so I'll I'll just call this
file and I'll say requests. files and
then audio so audio is just the audio
that I put here in form data next I need
to convert that audio to something that
I can send to the open AI API so what
I'm going to do is I'm going to
import io and I'm going to use
that to create a bytes IO object and
I'll just call this buffer so buffer
equals io.BytesIO and I'll take the
file and read it in right so it's going
to read in the file data and create a
new BytesIO object from that and I'm
calling it buffer reason why I'm doing
that is because I need something I can
send to open AI I can't send the file
itself directly from the form I need to
convert it to a form that uh I can send
to open AI so that's why I'm doing it
here and then I need to give it a name
the name doesn't really matter but I
noticed in testing that the open AI API
doesn't like when you send a buffer with
no name so I'm just going to call this
audio.webm right that should be
fine so next I need to use the open AI
API so first I can pip install it so pip
install open
Ai and then I can go ahead and import
the main class so from open AI import
open Ai and I can set up the client so
client equals open Ai and I have my API
key in my environment so it's there you
don't see it but you have to set that up
before so you can either put it here as
like API key equals whatever or you can
put it in the environment like me under
an environment variable called
OPENAI_API_KEY so once you have that client
then you can use it for stuff so here um
I'm going to create a variable called
transcript so this is going to be the uh
output from open Ai and I'm going to use
the client that I just instantiated and
then I'll use the API call for
transcription so it's
client.audio.transcriptions.create and then you
need to give it a model uh the model
that I'll use is whisper-1 and then
the file so the file is just going to be
buffer right and from this I want to
return a JSON object uh we'll just say
output and I'll take the transcript and
grab the text so let's see if this works
let's go ahead and start the
app and let's go over and check it out
so I'll refresh the page and remember
the permission is still there so I'll
hit the record button and I'll just
start talking so record and I'm still
talking I'm using the microphone that
I'm recording the video with so
everything I'm saying now should be
recorded we see that the button is red
so now I'll stop it by clicking it again
and we see this popup this alert it has
object object the reason why it has
object object is because I'm trying to
console log or not console log but alert
um an object here so instead of just
data I'll do data.output because that's
what I put it under and I'll try it
again so let me just
refresh and I'll record again so now I'm
recording and now when I hit the stop
button or the record button again it's
going to alert the actual text so let's
see if that
happens so just wait a moment and we see
here the text that I was just saying as
I was reading so I'll record again and
blah blah blah you just heard it okay so
the last thing I want to do is I want to
Simply put that here so the user can see
it and that's very simple to do so
instead of alerting data. output I'll
just take the output element
here so
output.innerHTML and I'll do data.output just
like that so now let's try one more time
I'll refresh I'll hit the record button
here and I'll just talk a little bit and
we should see that when I hit the stop
button all the text is going to appear
here when everything is done so I'll hit
stop and we see the text appear here
this is everything that I just said and
of course if I wanted to record again
because I'm clearing out the chunks I
can so now when I uh hit the record
button again and it stops it should just
uh replace everything here with
everything that I said after I hit the
button again so I'll stop
it and we see that the text has just
changed so that's it for this little
project that I want to show you it's
just a simple way of using the the open
AI API but a lot of people are
interested in this API nowadays so um I
thought I'd make more videos this is a
second video with open AI so I'll
probably make more videos in the future
but I just want to show you like um a
little example of what you can do with
the uh open AI API so that's it for this
video if you have any questions about
anything that I've done here feel free
to leave a comment down below if you
like this video please give me a thumbs
up and if you haven't subscribed to my
channel already please subscribe so
thank you for watching and I will talk
to you next time