Speech To Text using ESP32

techiesms
8 Jul 2023, 13:56

Summary

TL;DR: This tutorial video shows how to convert speech to text on an ESP32 board using Google's Speech-to-Text API, as a step toward a standalone voice assistant. It covers setting up a Google Cloud account, obtaining an API key, and adapting Arduino code to convert speech to text. The video also covers using a MEMS microphone with the ESP32 for audio input, and previews an upcoming project that combines the speech-to-text functionality with the ChatGPT API for a complete voice-controlled assistant.

Takeaways

  • 😀 Earlier videos in the ESP32 series demonstrated running ChatGPT on an ESP32 board through the ChatGPT API and listening to its responses through a speaker.
  • 🔍 The audience requested a standalone voice assistant based on ChatGPT, which can take questions through a microphone and provide answers via a speaker.
  • 🛠️ The presenter's team is working on creating a voice assistant and the first step is learning to convert speech to text using Google Cloud services.
  • 📈 The video is sponsored by Altium, promoting its product Altium 365, an electronics product design platform that unites PCB design, MCAD data management, and teamwork.
  • 🔑 To convert speech to text, one must first obtain an API key from Google Cloud for the Speech-to-Text service, which involves creating a Google Cloud account and enabling the API.
  • 💳 The Google Cloud account creation process includes providing business details and credit card information for verification, although no charges are made initially.
  • 📝 The video provides a step-by-step guide on how to generate an API key, restrict its usage to the Speech-to-Text API, and integrate it into an Arduino code.
  • 🔍 The code for speech-to-text conversion is explained, including modifications needed for WiFi credentials and API key, and the original code is available for reference.
  • 🎙️ The hardware setup involves an ESP32 board and a MEMS microphone, with instructions provided for connecting these components.
  • 📝 The code logic is divided into microphone and Google Cloud parts, with the microphone capturing audio in a digital format and Google Cloud processing it to return text.
  • 🔄 The video mentions a limitation in the code regarding the maximum recording time for speech-to-text conversion, which is around 2.5 to 3 seconds, and seeks solutions from the audience.
  • 🔗 The presenter encourages the audience to subscribe for the upcoming video that will demonstrate the creation of a complete voice assistant based on ChatGPT.

Q & A

  • What is the main topic of the video?

    -The main topic of the video is converting speech to text on an ESP32 board using Google Cloud services, as groundwork for a standalone voice assistant based on ChatGPT.

  • What is the purpose of creating a standalone voice assistant?

    -The purpose is to enable users to ask questions directly to the device and listen to the answers through a speaker, without the need for manual text input.

  • Which platform is used for the speech to text conversion in the video?

    -Google Cloud services are used for the speech to text conversion.

  • What is the initial credit provided for a new Google Cloud account?

    -A new Google Cloud account is provided with an initial credit of up to 300 US dollars.

  • How long is the free trial period for the Google Cloud account?

    -The free trial period for the Google Cloud account is 90 days from the day the account is created.

  • What is the name of the product sponsored in the video?

    -The sponsored product is Altium 365, an electronics product design platform.

  • What does Altium 365 offer for PCB designing and project collaboration?

    -Altium 365 offers PCB designing, project sharing for review, centralized cloud storage, component management, real-time supply chain data, and the ability to send designs to manufacturing units.

  • What is the limitation of the speech to text conversion code presented in the video?

    -The limitation is the time constraint, where the speech needs to be completed within approximately 2.5 to 3 seconds for accurate conversion.

  • How can one access the free trial version of Altium 365 mentioned in the video?

    -The free trial version of Altium 365 can be accessed through the link provided in the description of the video.

  • What is the next step after learning speech to text conversion in the video series?

    -The next step is to create a complete voice assistant based on ChatGPT, which will involve filtering the text, sending it to the ChatGPT API, and converting the answers into speech using a TTS service.

  • Where can the code for the speech to text conversion be found?

    -The code can be found on the presenter's GitHub repository, the link to which is provided in the video description.

Outlines

00:00

🤖 Building a Standalone Voice Assistant with ESP32

The video introduces a project to create a standalone voice assistant using the ESP32, following earlier videos that ran ChatGPT on the board. The tutorial guides viewers through converting speech to text with Google Cloud services, which the voice assistant project depends on. The video is sponsored by Altium, promoting Altium 365, an electronics product design platform. The speaker provides a step-by-step guide to setting up a Google Cloud account, enabling the Speech-to-Text API, and generating the API key that the project's code requires.

05:01

🔧 Coding and Hardware Setup for Speech-to-Text Conversion

This section covers the code and hardware for speech-to-text conversion with the ESP32 and a MEMS microphone. Before moving to the code, the speaker restricts the API key so that it can be used for the Speech-to-Text API only. The speaker then shares the complete code, modified for the project, and links to the original code it is based on. The hardware consists of the ESP32 and the microphone, wired according to a connection diagram shown in the video; both parts are sold on the speaker's website. The code is explained on a whiteboard in two parts: the microphone part, which stores the audio as 16-bit LINEAR16 data, and the Google Cloud part, which sends that data in a JSON request and receives the result as JSON.
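The step between the two parts can be sketched in code. Google's REST APIs carry binary fields as base64 text inside JSON, so the recorded bytes have to be encoded before they go into the request. The encoder below is a minimal, host-compilable sketch of standard base64 (RFC 4648); the function name `base64Encode` is an illustrative assumption and is not taken from the video's code.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Standard base64 (RFC 4648) with '=' padding.
// Hypothetical helper name; not from the video's repository.
std::string base64Encode(const std::vector<uint8_t>& data) {
    static const char kAlphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    std::string out;
    out.reserve(((data.size() + 2) / 3) * 4);
    std::size_t i = 0;
    // Every full group of 3 bytes becomes 4 output characters.
    for (; i + 2 < data.size(); i += 3) {
        uint32_t n = (uint32_t(data[i]) << 16) |
                     (uint32_t(data[i + 1]) << 8) |
                     uint32_t(data[i + 2]);
        out += kAlphabet[(n >> 18) & 63];
        out += kAlphabet[(n >> 12) & 63];
        out += kAlphabet[(n >> 6) & 63];
        out += kAlphabet[n & 63];
    }
    // The last 1 or 2 bytes are padded with '=' to a 4-character group.
    std::size_t rem = data.size() - i;
    if (rem == 1) {
        uint32_t n = uint32_t(data[i]) << 16;
        out += kAlphabet[(n >> 18) & 63];
        out += kAlphabet[(n >> 12) & 63];
        out += "==";
    } else if (rem == 2) {
        uint32_t n = (uint32_t(data[i]) << 16) | (uint32_t(data[i + 1]) << 8);
        out += kAlphabet[(n >> 18) & 63];
        out += kAlphabet[(n >> 12) & 63];
        out += kAlphabet[(n >> 6) & 63];
        out += '=';
    }
    return out;
}
```

On a real ESP32 the audio would probably be encoded in chunks while it is sent, rather than held in memory twice; that is a design choice this sketch leaves open.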

10:03

📈 Demonstrating Speech-to-Text Conversion and Future Plans

The speaker demonstrates the speech-to-text conversion by recording speech and showing the results in the serial monitor. The video then discusses the limitation of the current code, a recording window of roughly 2.5 to 3 seconds, and invites viewers who know a fix to share it with the community. The video concludes with an invitation to subscribe for the upcoming project: a complete voice assistant based on ChatGPT, which will involve filtering the text, sending it to the ChatGPT API, and converting the answer to speech for output through a speaker.
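Some arithmetic shows how quickly the recording buffer grows. The sketch below assumes the values given in the video (16,000 Hz, 16-bit samples, about 3 seconds) and that the audio is base64-encoded before sending. Whether memory is the actual cause of the time limit is not stated in the video, so treat this as one possible explanation to check, not a diagnosis. The function names are illustrative.

```cpp
#include <cstddef>

// Raw LINEAR16 size: samples per second * bytes per sample * duration.
// Duration is in milliseconds so the arithmetic stays in integers.
constexpr std::size_t pcmBytes(std::size_t sampleRateHz,
                               std::size_t bytesPerSample,
                               std::size_t durationMs) {
    return sampleRateHz * bytesPerSample * durationMs / 1000;
}

// base64 turns every 3 input bytes (rounded up) into 4 characters.
constexpr std::size_t base64Chars(std::size_t rawBytes) {
    return ((rawBytes + 2) / 3) * 4;
}

// 3 seconds at 16 kHz, 2 bytes per sample.
static_assert(pcmBytes(16000, 2, 3000) == 96000, "raw bytes for 3 s");
static_assert(base64Chars(96000) == 128000, "base64 characters for 3 s");
```

Three seconds of audio is 96,000 raw bytes, or 128,000 characters once base64-encoded, which is a large share of the ESP32's 520 KB of on-chip SRAM. Each extra second adds another 32,000 raw bytes.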

Keywords

💡ESP32

ESP32 is a microcontroller with integrated Wi-Fi and Bluetooth capabilities. In the video, it is used as the central processing unit for the standalone voice assistant project, demonstrating its ability to handle tasks such as speech-to-text conversion and interfacing with microphones and speakers.

💡ChatGPT

ChatGPT is OpenAI's conversational AI assistant, built on GPT (Generative Pre-trained Transformer) language models. The video series uses the ChatGPT API to generate the voice assistant's answers, relying on it for natural language processing and response generation.

💡Speech-to-Text Conversion

This refers to the process of converting spoken language into written text. The video's main theme revolves around teaching viewers how to implement this functionality using Google Cloud services, which is essential for creating a voice assistant that can understand and process human speech.

💡Google Cloud Services

Google Cloud Services are a suite of cloud computing services offered by Google. The script discusses using Google's speech-to-text API, which is part of these services, to perform the conversion of speech into written text, highlighting its role in enabling advanced voice recognition capabilities.
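A minimal sketch of the request body this API expects, assuming the v1 REST `speech:recognize` method with its documented `config` and `audio` fields. The helper name `buildRecognizeBody` is an illustrative assumption; the video's own code may assemble the body differently.

```cpp
#include <string>

// Builds the JSON body for a speech:recognize request.
// base64Audio must already be base64 text, so it needs no JSON escaping.
// languageCode is assumed to be a plain BCP-47 tag such as "en-IN".
std::string buildRecognizeBody(const std::string& base64Audio,
                               const std::string& languageCode,
                               int sampleRateHertz) {
    std::string body;
    body += "{\"config\":{";
    body += "\"encoding\":\"LINEAR16\",";
    body += "\"sampleRateHertz\":" + std::to_string(sampleRateHertz) + ",";
    body += "\"languageCode\":\"" + languageCode + "\"},";
    body += "\"audio\":{\"content\":\"" + base64Audio + "\"}}";
    return body;
}
```

Calling `buildRecognizeBody("TWFu", "en-IN", 16000)` yields `{"config":{"encoding":"LINEAR16","sampleRateHertz":16000,"languageCode":"en-IN"},"audio":{"content":"TWFu"}}`. The three `config` values match the ones named in the video: LINEAR16 encoding, 16,000 Hz, and Indian English.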

💡API Key

An API key is a unique code used to authenticate requests to an API (Application Programming Interface). The script provides a step-by-step guide on how to generate an API key for Google's speech-to-text service, which is necessary for accessing and using the service within the project.
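As a rough sketch of how the key travels with a request: Google's REST APIs accept an API key as a `key` query parameter, and that is the form assumed here. The video says the key is attached to the request but this page does not show exactly how its code does it. `buildRecognizeRequest` is a hypothetical helper, and the request would have to be sent over TLS (port 443) to `speech.googleapis.com`.

```cpp
#include <string>

// Builds a raw HTTP/1.1 POST for the v1 recognize endpoint.
// apiKey is a placeholder for the key generated in the Google Cloud console.
std::string buildRecognizeRequest(const std::string& apiKey,
                                  const std::string& jsonBody) {
    std::string req;
    req += "POST /v1/speech:recognize?key=" + apiKey + " HTTP/1.1\r\n";
    req += "Host: speech.googleapis.com\r\n";
    req += "Content-Type: application/json\r\n";
    req += "Content-Length: " + std::to_string(jsonBody.size()) + "\r\n";
    req += "Connection: close\r\n\r\n";  // blank line ends the headers
    req += jsonBody;
    return req;
}
```

Because the key appears in the request itself, restricting it to the Speech-to-Text API, as the video recommends, limits the damage if it leaks.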

💡Altium 365

Altium 365 is an electronics product design platform that unites PCB design, MCAD data management, and teamwork. Altium sponsors the video, which highlights the platform's features for designing and manufacturing electronic products.

💡Microphone

A microphone is a device used to convert sound waves into electrical signals. In the context of the video, a microphone is essential hardware for capturing speech input for the speech-to-text conversion process, which is then used by the ESP32 to interact with the voice assistant.
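The video stores the captured audio as LINEAR16, which Google defines as uncompressed 16-bit signed little-endian PCM. The sketch below shows only the byte layout. How the MEMS microphone delivers its samples to the ESP32 is not described on this page, so the input here is simply assumed to be a vector of 16-bit samples, and `packLinear16` is a hypothetical name.

```cpp
#include <cstdint>
#include <vector>

// Packs signed 16-bit samples into LINEAR16 bytes (little-endian).
std::vector<uint8_t> packLinear16(const std::vector<int16_t>& samples) {
    std::vector<uint8_t> bytes;
    bytes.reserve(samples.size() * 2);
    for (int16_t s : samples) {
        uint16_t u = static_cast<uint16_t>(s);  // keep the two's-complement bits
        bytes.push_back(static_cast<uint8_t>(u & 0xFF));         // low byte first
        bytes.push_back(static_cast<uint8_t>((u >> 8) & 0xFF));  // then high byte
    }
    return bytes;
}
```

For example, the sample value -2 (0xFFFE) is stored as the bytes 0xFE, 0xFF. Unlike MP3, this format applies no compression, so every sample costs exactly two bytes.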

💡Arduino IDE

Arduino IDE is an integrated development environment used for writing and uploading code to Arduino-compatible hardware like the ESP32. The script mentions using the Arduino IDE to write and upload the code necessary for the speech-to-text functionality, demonstrating the development process for the voice assistant.

💡JSON

JSON (JavaScript Object Notation) is a lightweight data interchange format that is easy for humans to read and write, and for machines to parse and generate. The script discusses the use of JSON format for the body of the API request and response in the speech-to-text conversion process, illustrating how data is structured and exchanged.
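On the response side, the v1 API returns the text and its confidence nested under `results` and `alternatives`. The sketch below pulls out the first `transcript` and `confidence` with plain string searches. It is deliberately naive (it does not handle escaped quotes, for instance) and the function names are assumptions; a real project would more likely use a JSON library.

```cpp
#include <cstddef>
#include <cstdlib>
#include <string>

// Returns the first "transcript" string value, or "" if there is none.
// Naive: assumes the transcript contains no escaped quotes.
std::string firstTranscript(const std::string& json) {
    const std::string key = "\"transcript\"";
    std::size_t pos = json.find(key);
    if (pos == std::string::npos) return "";
    pos = json.find(':', pos + key.size());
    if (pos == std::string::npos) return "";
    std::size_t start = json.find('"', pos + 1);
    if (start == std::string::npos) return "";
    std::size_t end = json.find('"', start + 1);
    if (end == std::string::npos) return "";
    return json.substr(start + 1, end - start - 1);
}

// Returns the first "confidence" number, or -1.0 if there is none.
double firstConfidence(const std::string& json) {
    const std::string key = "\"confidence\"";
    std::size_t pos = json.find(key);
    if (pos == std::string::npos) return -1.0;
    pos = json.find(':', pos + key.size());
    if (pos == std::string::npos) return -1.0;
    return std::strtod(json.c_str() + pos + 1, nullptr);  // skips leading spaces
}
```

Given a hand-written response modeled on the demo, `{"results":[{"alternatives":[{"transcript":"hello my name is Sachin","confidence":0.92}]}]}`, `firstTranscript` returns the sentence and `firstConfidence` returns a value of about 0.92.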

💡TTS (Text-to-Speech)

TTS refers to the technology that converts written text into audible speech. Although not explicitly detailed in the script, the mention of converting answers into speech suggests the use of TTS services to enable the voice assistant to 'speak' responses back to the user.

💡Language Code

A language code is a unique identifier used to specify languages in computing and internet-related protocols. The script explains the importance of setting the correct language code in the speech-to-text API request to ensure accurate speech recognition, with 'Indian English' being used as an example in the video.
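Language codes for this API are BCP-47 tags; Indian English, the one used in the video, is `en-IN`. The small lookup below is only an illustration of swapping the code: the table and the `en-US` fallback are assumptions, and Google's supported-languages list is the place to confirm any other tag.

```cpp
#include <string>

// Maps a few human-readable names to BCP-47 tags.
// Hypothetical helper; unknown names fall back to "en-US".
std::string languageCodeFor(const std::string& name) {
    if (name == "Indian English") return "en-IN";
    if (name == "US English") return "en-US";
    if (name == "British English") return "en-GB";
    if (name == "Hindi") return "hi-IN";
    return "en-US";
}
```

In the video's code the language code lives in the request body, so changing the recognition language means changing that one string.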

Highlights

Successfully running ChatGPT on an ESP32 board using the ChatGPT API.

Listening to answers from ChatGPT through a speaker attached to the ESP32.

Introduction of a project to create a standalone Voice Assistant based on ChatGPT.

Attaching a microphone to the ESP32 for direct voice interaction with the Voice Assistant.

Learning to convert speech to text using Google Cloud services for voice input in various projects.

Sponsorship by Altium and introduction of their product, Altium 365, an electronics product design platform.

Guide on creating a Google Cloud account and enabling the API key for speech to text conversion.

Details on getting a free account with a $300 credit for Google Cloud services.

Instructions on restricting the API key usage to only the speech to text API.

Arduino IDE setup for writing code to convert speech to text.

Hardware requirements for the project: ESP32 DevKit V1 and a MEMS microphone.

Code modification according to personal needs and original code availability.

Explanation of the code's working mechanism using a whiteboard.

Demonstration of the speech to text conversion process with live examples.

Accuracy of the speech recognition and the confidence level of the results.

Limitation of the recording time to 2.5 to 3 seconds for the current code setup.

Invitation for community input on extending the recording time limit for better user experience.

Sharing of the code on GitHub for community access and further development.

Upcoming project preview: creating a complete Voice Assistant based on ChatGPT with text-to-speech capabilities.

Transcripts

play00:00

so till now in the esp32 ChatGPT series

play00:02

we were successfully able to run chat

play00:04

GPT on our esp32 board using the chat

play00:06

GPT apis and then we were also

play00:08

successfully able to listen to the

play00:10

answers coming from ChatGPT using

play00:13

the speaker attached to the esp32 so

play00:15

after those two videos many of you

play00:16

people asked me to make a standalone

play00:19

Voice Assistant based on ChatGPT in

play00:21

which we can attach a microphone with

play00:23

the help of which we can ask the

play00:24

questions directly to it and we can

play00:26

listen to the answers with the help of

play00:28

the speaker attached to it well

play00:30

definitely this is a very interesting

play00:31

project to be made hence my team started

play00:34

working on it and now before I teach you

play00:36

about how to make that Standalone Voice

play00:39

Assistant based on ChatGPT you first

play00:41

need to learn one more thing before

play00:43

moving on to the last project which is

play00:45

converting our speech to text so in this

play00:48

video I'll be guiding you completely

play00:50

about how to convert your speech to text

play00:53

using the Google cloud services and this

play00:55

will not only help you to make that chat

play00:57

GPT project but this learning will help

play01:01

you in multiple of your projects where

play01:03

you want the voice input the speech

play01:06

input converted into text for further

play01:08

processing so this is a very useful

play01:11

topic to be learned so stick around with

play01:13

this video as I'll be covering

play01:15

everything about it let's get started

play01:18

this video is sponsored by Altium and they

play01:21

came up with an amazing product called

play01:23

as Altium 365. so Altium 365 is an Electronics

play01:26

product design platform that unites PCB

play01:29

design mcad data management and teamwork

play01:33

so with Altium 365 you can do the PCB

play01:36

designing task you can share your

play01:38

projects over web for review purposes it

play01:41

do cover sharing your PCB file to

play01:43

Mechanical team so that they can create

play01:45

the mechanical product package based on

play01:47

your PCB then it also provides the

play01:49

centralized cloud storage so you don't

play01:51

need to rely on one single computer for

play01:53

your files it also helps you with

play01:55

managing your components and get

play01:57

real-time supply chain data for your

play01:59

components

play02:00

it also allows multiple people to work

play02:03

on single project and in the end it also

play02:06

helps you with sending your design to

play02:07

final manufacturing units so Altium 365

play02:11

takes care of all other tasks so you put

play02:13

more time and effort in making something

play02:15

creative and useful and the good part is

play02:17

you can try out it free version as well

play02:20

I'll leave its free trial version link

play02:22

down in the description of this video so

play02:23

do check that out and now let's start

play02:26

with this video

play02:27

so now the first step for converting

play02:29

speech to text is to get the API key for

play02:32

Google Cloud so now let me guide you how

play02:34

to make the Google Cloud account and how

play02:36

to enable the API key for speech to text

play02:39

conversion so now to get the Google API

play02:41

for speech to text you first need to go

play02:43

to cloud.google.com forward slash speech

play02:46

to text I'll be linking with this link

play02:48

down in the description of this video

play02:49

and here you need to log in with your

play02:51

Google account after that you can click

play02:52

on start free button so initially it

play02:55

will be a free account in which you'll

play02:56

be getting a credit up to 300 US dollars

play03:00

so here on the screen you can see 300

play03:02

credit for free and it will work for 90

play03:04

days from uh the day you make the

play03:07

account like it is from today okay so

play03:09

I'll select my country so here you need

play03:11

to select what kind of organization we

play03:12

have or what kind of uh like what's the

play03:14

need of this Google API so I'll select

play03:16

other here click on terms and services

play03:19

as Okay click on continue it's asking

play03:21

for business name I'll write this write

play03:23

as techiesms so here it is asking for

play03:25

the card number so you need to provide

play03:27

your card details so it won't be

play03:29

charging any of the amount initially

play03:31

okay as we are getting 300 for free but

play03:34

after that when you use their services

play03:35

they will start charging uh it according

play03:38

to the use cases okay so let me type out

play03:40

team card number it is asking for the

play03:42

CVV number I'll provide that click on

play03:44

continue once again so yeah it will be

play03:46

charging a little amount of rupees 2

play03:49

which will be credited once uh they verify

play03:52

your account okay so I'll wait for the

play03:54

OTP so here is the OTP and yeah uh they

play03:57

verified my account and uh great so what

play03:59

brought you to the Google Cloud so how

play04:02

we came to know about this Google Cloud

play04:03

that is what it is asking about so let's

play04:07

select uh learn more explore click on

play04:09

next what are you what you're interested

play04:11

in doing with Google clouds okay so

play04:13

you're interested in artificial uh

play04:14

intelligence machine learning we are

play04:16

interested in the apis and that's it

play04:18

click on next what best describes your

play04:20

role so I am an educator if okay I am an

play04:24

educator so I'll select this and click

play04:25

on done with this we have successfully

play04:26

created a account for on Google Cloud

play04:30

for using the text or speech to text

play04:32

Services okay but we are not done yet we

play04:34

need to create create an API and for

play04:36

creating the API you can click on this

play04:37

convert speech to text API option and

play04:40

here you can click on enable this API

play04:43

and now when you go to the credentials

play04:45

section you'll be landing up on this

play04:47

page now here you can create your new

play04:48

API key for that you just need to click

play04:50

on create credential click on API key it

play04:53

is creating your unique API key so

play04:55

here's the API key I'll copy it because

play04:57

we need to paste it in our Arduino

play04:59

code Okay click on the close button and

play05:01

first you need to go to this API key

play05:03

option and here we need to uh provide

play05:05

that we will be restricting this key to

play05:08

just be utilized for speech to text API

play05:12

only okay so you'll be using this just

play05:13

for speech to text and nothing else

play05:15

click on the save button

play05:18

and that's it we successfully generated

play05:20

the API key and now we are ready to

play05:22

provide this key inside our code so after

play05:24

learning about how to generate the API

play05:26

key now let's jump on to the Arduino IDE

play05:28

and let's understand uh how to write the

play05:30

code for converting speech to text so

play05:32

here is the complete code for converting

play05:33

your speech to text so I modified the

play05:35

code according to my need while the

play05:37

original codes link I'll attach in the

play05:39

description of this video okay and if I

play05:40

talk about the hardware part of this

play05:42

then I'm using the esp32 DevKit

play05:44

V1 and a MEMS microphone which are

play05:46

connected according to this connection

play05:47

diagram

play05:49

well both the Hardwares are available on

play05:51

our website whose purchase link is down

play05:52

in the description of this video so now

play05:54

if I tell you what changes I you need to

play05:56

make in this code then you just need to

play05:57

go to network underscore parameter dot

play05:59

h header file and you need to provide

play06:01

SSID name and password of your WiFi

play06:03

router after that this will remain as it

play06:05

is no change in this no change in this

play06:07

as well then you need to just change

play06:09

this API key and how to generate the API

play06:12

key that we already discussed in the

play06:14

previous part okay so that's the only

play06:16

change you to do in this code and the

play06:18

rest of the code will remain as it is

play06:20

now explaining this code will be really

play06:22

very difficult uh for me and it will be

play06:25

very confusing for you to understand as

play06:27

well so let me explain the working of

play06:30

this code uh using the Whiteboard so now

play06:33

the code is divided into two part one is

play06:35

the microphone part and another is the

play06:36

Google Cloud part so in the microphone

play06:38

part we do have only one microphone one

play06:40

esp32 in which we are giving the audio

play06:43

signal to the microphone which is given

play06:45

to the esp32 which is stored in a linear

play06:48

16 format okay so it's a linear 16.

play06:52

it's the encoding format just like dot

play06:54

MP3 that we are using okay so this is a

play06:57

16-bit data so here basically what we

play06:59

are doing is we are giving the analog

play07:01

signal and storing it in a 16-bit format

play07:03

like a digital format after doing that

play07:05

the microphone part is cleared now comes

play07:07

the part of the Google Cloud so in the

play07:09

Google Cloud we are using the API or

play07:11

whose API key we already generated okay

play07:13

so the API will be requested onto this

play07:16

host which is speech.googleapis.com and

play07:19

we'll be attaching the API key in our

play07:21

headers and the main part here is the

play07:23

body of the API so this is the body of

play07:26

the API which we are sending in the code

play07:28

itself so the body is in the Json format

play07:31

of course and we are getting the

play07:32

response as well in the Json format so

play07:35

what is inside the body let's just uh

play07:36

discuss okay so first we have the audio

play07:39

key value pair in which we are providing

play07:41

the content which is nothing but the

play07:43

audio file now this audio file is this

play07:46

Digital Data that we have stored okay so

play07:48

we are sending the digital format or Digital

play07:51

Data which is nothing but our own audio

play07:54

file into this content key value pair

play07:57

after that inside the config part what

play08:00

we are doing is we are providing the

play08:01

configuration of this audio file like

play08:03

what is its encoding method so it is

play08:05

linear 16 what it is sample Hertz so it

play08:07

is 16 000 Hertz so it is sampled at 16

play08:11

000 Hertz and then the language code so

play08:13

which language is used in this speech so

play08:16

in my case I have used the Indian

play08:18

English language code you can change the

play08:21

language code in case you are using any

play08:23

other language for this speech to text

play08:24

conversion you can find this kind of

play08:26

language code in the Google itself okay

play08:29

so this much data along with the API key

play08:32

we are sending it to this host and once

play08:36

we send it the Google Cloud will you

play08:38

know analyze that particular data which

play08:40

is this and give us back the text format

play08:43

of the speech and how accurate it is

play08:46

okay so this kind of data will be

play08:48

getting in response so this logic is

play08:51

embedded inside that code I hope now you

play08:54

understood the logic

play08:56

so that's how the speech to text code

play08:58

works and in case you want to change

play09:00

couple of parameters discussed in the

play09:01

Whiteboard then here is that complete

play09:04

body of that HTTP request under Cloud

play09:07

speech client dot C plus plus file okay

play09:09

so here's a linear 16 encoding file the

play09:11

sample Hertz and the language so in case

play09:13

you want to change the language for

play09:15

speech to text you can change the

play09:16

language code here and rest of the code

play09:19

will remain as it is so uh now to upload

play09:22

this code there is one single change

play09:24

you need to do which is first you need

play09:26

to go to your tools then to boards then

play09:28

into boards manager and here the search

play09:30

for esp32 as of now the current version

play09:33

I have installed is 2.0.9 but this code

play09:36

will work only in esp32 boards package

play09:39

version

play09:41

1.0.6 so you need to downgrade this

play09:44

package so I'll select 1.0.6 and click

play09:46

on the install button if you don't do

play09:48

that it will show a couple of compiling

play09:51

errors so make sure you downgrade it

play09:53

before compiling and uploading Okay so

play09:56

successfully install 1.0.6 version I'll

play09:58

click on the close button and here I'll

play10:00

select the right board which is ESP32 DevKit

play10:02

V1 right com port and I'll straight

play10:04

away hit the upload button

play10:07

okay so the code is successfully

play10:08

uploaded I'll open the serial monitor

play10:10

and

play10:12

okay so it says recording completed now

play10:14

processing no problem so I'll press the

play10:15

reset button and as soon as I press the

play10:17

reset button the serial monitor will say

play10:19

record start and after that we need to

play10:21

you know speak anything that we want to

play10:24

convert in text format let me show you a

play10:26

quick demo so I'll press the reset

play10:27

button and

play10:29

hello my name is Sachin

play10:32

okay says recording completed now

play10:34

processing let's just wait for the

play10:35

result

play10:40

okay so here is the Json formatted data

play10:43

that we got and here is the text format

play10:45

of my speech which is hello my name is

play10:48

Sachin and it recognized the word

play10:51

because Sachin is an Indian name okay

play10:54

so as I put the uh language as Indian

play10:56

English it recognized my name completely

play10:59

right correct and the confidence level

play11:01

is 0.92 okay so it is 92.99 accurate

play11:07

okay so it is a very good number and we

play11:09

got the result very accurate as well so

play11:11

this is what I spoke let's just try it

play11:13

once again so I'll press the reset

play11:15

button and

play11:17

hello this is Sachin how are you

play11:21

let's wait

play11:28

and once again hello this is Sachin how

play11:30

are you okay so we got the exact text

play11:33

format uh of our speech now here the

play11:36

recording time as of now and now is

play11:38

around 2.5 to 3 seconds so within three

play11:41

seconds you need to complete your

play11:42

statement now I try to increase this

play11:45

time but uh when I increase it I was

play11:48

getting a lot of uh like errors in the

play11:50

code okay it is it was getting uploaded

play11:52

but uh it was not working okay so

play11:55

maximum I got off around 2.5 to 3

play11:58

seconds and in this you need to say

play12:00

whatever you want to like say or you can

play12:02

give commands in three second quite easy

play12:04

like turn on the light turn off the

play12:05

light any command that you want to give

play12:07

three second is more than enough time

play12:09

for that okay so yeah this is how you

play12:11

can convert any speech into text using

play12:14

this code now there is only one single

play12:15

issue in this code which is the time

play12:17

boundation so any one of you watching

play12:19

this video have any experience in like

play12:21

converting speech to text using the

play12:24

esp32 and if you know the solution about

play12:26

how I can increase the time limit

play12:28

then do reach out to us via the comment

play12:31

section and that will help me and also

play12:33

to the community as I will be sharing

play12:35

that code with all of you and as usual I

play12:37

am sharing this code as well through my

play12:39

GitHub repository whose link you can find

play12:41

in the description of this video and

play12:43

yeah this was that last thing which you

play12:45

need to learn before making our own

play12:48

voice assistant based on chat GPT and

play12:50

now we are left with the last video of

play12:53

this whole series in which we'll be uh

play12:55

filtering out the text coming from this

play12:57

code will be giving that text data to

play13:00

chat GPT API and then we'll be getting

play13:02

the answers in the text format which

play13:04

will convert into speech using TTS

play13:07

service and in the end you'll be able to

play13:09

listen the answer to the speakers so in

play13:11

the last video we'll be making that

play13:12

complete Voice Assistant based on

play13:14

ChatGPT so do hit the Subscribe button

play13:15

if you don't want to miss out that

play13:17

Amazing Project which will coming soon

play13:19

on our channel so yeah that was it about

play13:22

this video I hope you find this speech

play13:24

to text thing interesting and useful and

play13:27

if it's a so well do hit the like button

play13:30

which will tell the YouTube algorithm that

play13:31

this video was worth watching and it

play13:33

will share with other viewers as well

play13:36

and yeah that being said I am just

play13:39

ending this video here and now just wait

play13:41

for my next video another explore learn

play13:43

share with techiesms


Related Tags
ESP32, Voice Assistant, Speech-to-Text, Google Cloud, Arduino Code, Microphone Input, TTS Service, DIY Project, Tech Tutorial, IoT Development