RIP ELEVENLABS! Create BEST TTS AI Voices LOCALLY For FREE!

Aitrepreneur
9 May 2024 · 17:45

Summary

TLDR: The video is a guide to creating custom text-to-speech (TTS) voices with AI on a local computer. The presenter, K, walks through several methods, from a quick voice-cloning technique that needs only about 10 seconds of audio to fine-tuning a higher-quality XTTS model on as little as 2 minutes of audio. The video shows how to install the necessary software, use different web UIs for voice cloning and fine-tuning, and pass the generated TTS audio through RVC (Retrieval-based Voice Conversion) to improve voice quality. The goal is to let users produce high-fidelity, personalized TTS without paying for third-party services.

Takeaways

  • 🎉 You can create custom text-to-speech (TTS) AI voices on your local computer without paying high fees for pre-made AI voices.
  • 🔧 There are various methods available, ranging from quick 10-second voice cloning to more sophisticated techniques for higher quality TTS.
  • 📈 The process starts with installing the required software, such as FFmpeg and Python, either through one-click installers (available to Patreon supporters) or manually.
  • 📊 A graphic is provided to visualize the different methods for creating TTS voices, catering to different user needs and skill levels.
  • ⏱ With just 10 seconds of audio, you can clone a voice using the XTTS web UI, which is the easiest and quickest method demonstrated.
  • 📚 For better quality, you can train your own XTTS model using only 2 minutes of audio, which captures the nuances of the speaker's voice.
  • 🔗 The training process is straightforward and does not require a powerful GPU, making it accessible for most users.
  • 🤖 By using RVC (Retrieval-based Voice Conversion), you can further improve the TTS audio to closely resemble the original voice, even of public figures.
  • 🌐 There's an XTTS-RVC UI that automates the process of generating TTS audio and then converting it with RVC, simplifying the workflow.
  • 📝 The final 'Uber' method combines fine-tuned XTTS models with RVC for the highest quality and authenticity in TTS voice generation.
  • 💾 Once you have your custom TTS model, you can use it without limitations, making it a cost-effective solution for voice generation needs.
  • 📁 The script provides guidance on how to install and use the different tools, including tips for patrons and manual installation steps for all users.

Q & A

  • What is the purpose of the video?

    -The video aims to guide viewers on how to create custom text-to-speech AI voices on their local computer using various methods, from quick cloning with a short audio clip to training a more sophisticated model for higher quality results.

  • What are the two ways to install the required software for creating custom AI voices?

    -The two ways to install the required software are the one-click installer available to Patreon supporters, which automatically installs FFmpeg and adds it to the PATH, and the manual way, which requires Python, Git for Windows, FFmpeg, and the C++ build tools to be installed beforehand.

  • How long of an audio clip is needed for the simplest voice cloning method?

    -For the simplest voice cloning method, only 10 seconds of an audio clip is needed.

  • What is the minimum duration of audio required for training a custom text-to-speech model in the medium method?

    -In the medium method, a minimum of 2 minutes of audio is required for training a custom text-to-speech model.

  • How does the RVC software contribute to the final output of the text-to-speech process?

    -RVC (Retrieval-based Voice Conversion) is used to further refine the generated text-to-speech audio. It is a voice-to-voice converter: it takes the TTS audio as input and converts it with a trained RVC voice model, which significantly improves how closely the output matches the target voice.

  • What is the 'Uber text to speech method' and how does it differ from the medium method?

    -The 'Uber text to speech method' is a combination approach: a fine-tuned XTTS model generates the audio, which is then passed through RVC for further enhancement. The medium method stops at the fine-tuned XTTS model; the Uber method adds the RVC conversion step on top of it, which gives a more personalized and higher quality voice replication.

  • How can one obtain the PDF guide for remembering the steps to create custom AI voices?

    -The PDF guide can be obtained for free on the creator's Patreon page, which is linked in the video description.

  • What is the advantage of using the XTTS fine-tune web UI for training a custom model?

    -The XTTS fine-tune web UI allows users to train their own text-to-speech model using a relatively short audio clip. This enables the model to learn the specific accent, speaking style, and voice characteristics of the speaker, leading to a more personalized and accurate voice output.

  • What is the role of FFMpeg in the process of creating custom AI voices?

    -FFmpeg is a multimedia framework that must be installed before the web UIs. It handles audio and other multimedia files and is essential for the proper functioning of the text-to-speech software.

  • How does the XTTS RVC UI automate the process of creating custom AI voices?

    -The XTTS RVC UI automates the process by integrating both the text-to-speech generation and the voice conversion steps into a single interface. Users can input text, select an RVC voice model, and upload a reference voice sample to automatically generate and convert the voice.

  • What is the significance of using a longer audio clip for training the text-to-speech model?

    -Using a longer audio clip, ideally around 10 minutes, provides the model with more data to learn from, which can result in a more accurate and higher quality replication of the speaker's voice.

  • How does the video help those who are tired of paying high fees for AI voice services?

    -The video provides a comprehensive guide on how to create custom AI voices without the need for expensive third-party services, empowering users to generate high-quality voice outputs at a fraction of the cost.

Outlines

00:00

😀 Introduction to Custom Text-to-Speech AI Voices

The video begins with an introduction to creating custom text-to-speech AI voices. The host, K, outlines the methods that will be covered, ranging from quick 10-second voice cloning to more advanced techniques for creating high-quality AI voices, and shows a graphic to help viewers visualize them. The host then explains how to install the necessary software, either with a one-click installer for Patreon supporters or manually. The first method demonstrated is the quick cloning technique, which needs just 10 seconds of audio.
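Before a manual install, it can help to confirm that the command-line prerequisites are reachable. The following is a minimal sketch, not something shown in the video: the executable names (`python`, `git`, `ffmpeg`) are assumptions about a typical setup, and the C++ build tools are not covered because they are not a single executable on the PATH.

```python
import shutil

# Tools the manual install expects on PATH, as named in the video.
# The executable names are assumptions about a typical setup.
REQUIRED_TOOLS = ("python", "git", "ffmpeg")


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [name for name in tools if which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing from PATH:", ", ".join(missing))
    else:
        print("All required tools were found on PATH.")
```

The `which` parameter exists only so the lookup can be swapped out in tests; in normal use the default `shutil.which` is enough.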

05:02

🤖 Training Your Own Text-to-Speech Model

The second section covers training a custom text-to-speech model using only 2 minutes of audio. The host guides viewers through the XTTS fine-tune web UI, emphasizing that even a short audio clip can be used for training. The process involves creating a dataset, training the model with default settings, and optimizing the trained model so that the output files are smaller and easier to use. The host also recommends version 2.0.2 of the base model over 2.0.3 and demonstrates the improved voice quality achieved through this method. Fine-tuning lets the model replicate the speaker's accent, speech patterns, and other unique vocal characteristics.
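In the video, the host reaches the 2-minute mark by pasting a roughly 37-second clip several times in Audacity. The same trick can be sketched with Python's standard `wave` module. This is an illustration only: the function name and the 120-second default are assumptions, and it handles uncompressed WAV files only.

```python
import math
import wave


def loop_wav_to_duration(src_path, dst_path, target_seconds=120.0):
    """Repeat a WAV clip end to end until it is at least target_seconds long.

    Returns the duration of the written file in seconds.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        frames = src.readframes(src.getnframes())

    if params.nframes == 0:
        raise ValueError("source clip contains no audio frames")

    clip_seconds = params.nframes / params.framerate
    repeats = max(1, math.ceil(target_seconds / clip_seconds))

    with wave.open(dst_path, "wb") as dst:
        dst.setparams(params)  # same channels, sample width, and rate
        for _ in range(repeats):
            dst.writeframes(frames)

    return repeats * clip_seconds
```

Repeating a clip adds length but no new information about the voice, which is why the host still recommends recording around 10 minutes of real audio when possible.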

10:04

🎤 Advanced Voice Cloning with RVC

The third section introduces RVC, a powerful voice-cloning tool used to further refine the text-to-speech output. The host explains that RVC is a voice-to-voice converter, so it needs an initial audio file to convert, but it can produce highly accurate voice clones. The process involves generating text-to-speech audio and then converting it with RVC. The host also discusses a semi-automated method using the XTTS-RVC UI, which streamlines the process by automatically converting the text-to-speech audio with RVC. This method offers less manual control but is quicker and easier to use.
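The two-stage flow described here (generate speech first, then convert the voice) can be sketched as a small pipeline function. Both stages are hypothetical callables supplied by the caller; nothing below is the actual API of the XTTS web UI, RVC, or the XTTS-RVC UI.

```python
from typing import Callable

# Both stage types are hypothetical stand-ins: in practice each would wrap
# whatever TTS and RVC tooling is installed locally.
TextToSpeech = Callable[[str, str], str]  # (text, reference_wav) -> tts_wav path
VoiceConvert = Callable[[str, str], str]  # (input_wav, rvc_model) -> converted_wav path


def tts_then_rvc(text: str, reference_wav: str, rvc_model: str,
                 synthesize: TextToSpeech, convert: VoiceConvert) -> str:
    """Generate speech from text, then pass that audio through voice conversion."""
    if not text.strip():
        raise ValueError("text must not be empty")
    tts_wav = synthesize(text, reference_wav)  # stage 1: XTTS-style generation
    return convert(tts_wav, rvc_model)         # stage 2: RVC-style conversion
```

Doing the two stages by hand corresponds to the manual method, where each tool's settings can be tuned separately; wrapping them in one call mirrors the one-click behaviour of the automated UI, at the cost of that control.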

15:06

🚀 The Ultimate Text-to-Speech Method

The final section describes the 'Uber text-to-speech' method, which combines the previous techniques to create an even higher quality voice output. The host demonstrates how to use a fine-tuned XTTS model within the XTTS web UI to generate audio, which is then further enhanced using RVC. The host also explains how to use the fine-tuned model with the XTTS-RVC UI for fully automated processing. The video concludes with a reminder that all the methods shown allow for the creation of high-quality, authentic-sounding AI voices without the need for expensive third-party software. The host encourages viewers to try out the methods and offers support through Patreon, where a PDF guide will also be made available.

Keywords

💡Text to Speech AI

Text to Speech AI refers to artificial intelligence systems that can convert written text into audible speech. In the video, it is the central theme as the host discusses various methods to create custom voices for this technology without incurring high fees.

💡Voice Cloning

Voice cloning is the process of replicating a person's unique voice using AI. The video demonstrates how to clone a voice with just 10 seconds of audio, which is a significant aspect of creating custom text to speech AI voices.
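The video does the quick cloning through a web UI. For readers who prefer a script, the sketch below assumes the open-source Coqui TTS Python package and its `xtts_v2` model; that package choice, the helper names, and the file paths are assumptions, not something the video shows.

```python
import os

# Model identifier used by the Coqui TTS package for XTTS v2.
MODEL_NAME = "tts_models/multilingual/multi-dataset/xtts_v2"


def check_clone_inputs(text, speaker_wav, language="en"):
    """Validate inputs before loading the (large) model."""
    if not text.strip():
        raise ValueError("text must not be empty")
    if not os.path.isfile(speaker_wav):
        raise FileNotFoundError(speaker_wav)
    return {"text": text, "speaker_wav": speaker_wav, "language": language}


def clone_voice(text, speaker_wav, out_path="output.wav", language="en"):
    """Speak `text` in the voice of the reference clip `speaker_wav`."""
    request = check_clone_inputs(text, speaker_wav, language)
    # Imported lazily: requires the Coqui TTS package (an assumption here;
    # the video itself uses a web UI rather than this Python API).
    from TTS.api import TTS

    tts = TTS(MODEL_NAME)
    tts.tts_to_file(file_path=out_path, **request)
    return out_path
```

A call such as `clone_voice("hey guys what's up", "reference.wav")` would write `output.wav`, with `reference.wav` standing in for the 5 to 10 second clip the host mentions. The first run downloads the model weights.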

💡XTTS

XTTS is the text-to-speech model used throughout the video. It can clone a voice from a short reference clip and can also be fine-tuned on a speaker's audio, which makes it the key component in the methods discussed for generating AI voices.

💡FFmpeg

FFmpeg is a free and open-source software project that handles multimedia data. In the context of the video, it is used as a part of the setup process for the text-to-speech systems discussed.
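As an illustration of the kind of work FFmpeg does in this setup, the sketch below builds a command that converts a source recording to a mono WAV file, a convenient format for reference clips. The 22050 Hz default and the function names are assumptions; the video does not state which settings the web UIs use internally.

```python
import subprocess


def ffmpeg_to_wav_command(src, dst, sample_rate=22050, mono=True):
    """Build an ffmpeg command that converts `src` to a WAV file at `dst`."""
    # -y overwrites dst if it exists, -i names the input, -ar sets the sample rate.
    cmd = ["ffmpeg", "-y", "-i", src, "-ar", str(sample_rate)]
    if mono:
        cmd += ["-ac", "1"]  # -ac sets the number of audio channels
    cmd.append(dst)
    return cmd


def convert_to_wav(src, dst, **options):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(ffmpeg_to_wav_command(src, dst, **options), check=True)
```

Building the command as a list, rather than a single string, avoids shell quoting problems with file names that contain spaces.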

💡Python

Python is a high-level programming language that is widely used for various types of software development. In the video, it is mentioned as a prerequisite for manually installing the text-to-speech software.

💡Training a Model

Training a model in the context of the video refers to the process of teaching a machine learning algorithm to generate speech by providing it with sample audio data. This is a crucial step in creating a personalized text-to-speech voice.

💡RVC (Retrieval-based Voice Conversion)

RVC is voice-to-voice conversion software that can clone voices with high accuracy. It is used in the video to further refine the generated AI voice, making it sound more like the original speaker.

💡Web UI

Web UI stands for Web User Interface, which in the video refers to the graphical interfaces used to interact with the text-to-speech software without needing to use command-line instructions. They simplify the process of voice generation and model training.

💡Fine-tuning

Fine-tuning is a machine learning technique where a pre-trained model is further trained on a specific task to improve its performance. In the video, the host explains how to fine-tune an XTTS model using just 2 minutes of audio.
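The fine-tuning settings the host mentions (6 epochs by default, base model version 2.0.2, at least 2 minutes of audio) can be written down as a small settings object. The class, its field names, and the version string format are hypothetical and do not come from the fine-tune web UI.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneSettings:
    """Hypothetical container for the settings discussed in the video."""
    language: str = "en"
    epochs: int = 6                  # default; the host suggests 10 to 12 as an option
    base_model: str = "v2.0.2"       # version the host recommends over 2.0.3
    min_audio_seconds: float = 120.0  # "only 2 minutes required"

    def validate(self, audio_seconds: float) -> None:
        """Raise ValueError if the settings or the audio length look unusable."""
        if self.epochs < 1:
            raise ValueError("epochs must be at least 1")
        if audio_seconds < self.min_audio_seconds:
            raise ValueError(
                f"need at least {self.min_audio_seconds:.0f}s of audio, "
                f"got {audio_seconds:.0f}s"
            )
```

For a first run the host recommends leaving everything at its default, which corresponds to `FineTuneSettings()` here.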

💡Batch File

A batch file is a script file in DOS, OS/2 and Windows that contains a series of commands to be executed by the command-line interpreter. In the video, batch files are used to automate the installation process of the text-to-speech software.
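For the third web UI, the video walks through the manual steps instead of a batch file: clone the repository, create a Python environment, install torch and torchaudio, then install the requirements. The sketch below only assembles those steps as command lists. The exact commands shown on screen are not reproduced in the summary, so the virtual-environment layout, the plain `pip install torch torchaudio` line, and the function name are all assumptions.

```python
import os
import sys


def manual_install_commands(repo_url, folder, python=sys.executable):
    """Return the manual install steps as a list of argument lists."""
    venv_dir = os.path.join(folder, "venv")
    # Virtual environments keep their executables in Scripts on Windows, bin elsewhere.
    bin_dir = "Scripts" if os.name == "nt" else "bin"
    venv_python = os.path.join(venv_dir, bin_dir, "python")
    return [
        ["git", "clone", repo_url, folder],
        [python, "-m", "venv", venv_dir],
        [venv_python, "-m", "pip", "install", "torch", "torchaudio"],
        [venv_python, "-m", "pip", "install", "-r",
         os.path.join(folder, "requirements.txt")],
    ]
```

Each entry can be passed to `subprocess.run(cmd, check=True)` in order. A GPU build of torch usually needs a platform-specific install command, so the third step should be treated as a placeholder.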

💡Deep Learning

Deep learning is a subset of machine learning that uses neural networks with many layers (hence 'deep') to analyze various factors of data. The video implies the use of deep learning in training the AI to accurately replicate human speech.

Highlights

Introduction to creating custom text-to-speech AI voices on your local computer without high fees.

Demonstration of various methods ranging from quick 10-second voice cloning to the ultimate text-to-speech voice.

Explanation on installing necessary software using one-click installer for Patreon supporters or manual installation.

Guidance on installing FFmpeg and setting up the environment for text-to-speech voice generation.

How to clone a voice with just 10 seconds of audio using the XTTS web UI.

A very high, effectively unlimited character limit for text input in the simple text-to-voice tab of the web UI.

Fine-tuning your own XTTS text-to-speech model using only 2 minutes of audio.

Using Audacity to extend a short audio clip into a longer one for training purposes.

Details on the training process and the importance of using a longer audio clip for better results.

Optimizing the model for faster and more efficient use after training.

Replicating the accent, speech patterns, and unique quirks of a speaker in the generated audio.

Combining text-to-speech with RVC (Retrieval-based Voice Conversion) for higher quality voice cloning.

Automatic conversion of text-to-speech audio using the XTTS-RVC UI for ease of use.

Creating a custom Obama text-to-speech model and using it within the XTTS web UI.

Achieving a highly authentic and quality text-to-speech output by fine-tuning and converting with RVC.

Availability of a PDF guide on Patreon for free to help remember the process.

Offer of priority support to Patreon supporters for any questions regarding the process.

Encouragement for viewers to try out the methods and have fun creating their own text-to-speech AI voices.

Transcripts

play00:00

are you tired of the same old robotic

play00:01

text to speech AI voices are you sick of

play00:04

paying exorbitant fees for these AI

play00:06

voices do you dream of creating your own

play00:08

custom text to speech AI voices on your

play00:11

own computer well today is your day

play00:13

because this is the ultimate local text

play00:16

to speech AI video ever so that you can

play00:18

replace yourself with the clickable

play00:20

button hello humans my name is K and oh

play00:23

boy sit tight because today I'm going to

play00:25

show you the ultimate way to get the

play00:28

best text to speech AI voices on your local

play00:30

computer and I'll be showing you a range

play00:32

of methods from the super lazy 10

play00:34

seconds voice cloning to the ultimate

play00:36

Uber absolute best text to speech voice

play00:39

possible I even made a little graphic to

play00:41

help you visualize because of how many

play00:43

things I'm going to show you today so

play00:44

that no matter who you are and what your

play00:46

goals are you will get the best results

play00:48

possible to suit your needs so sit back

play00:51

relax and let's begin okay so let's

play00:53

start by installing all of these

play00:54

softwares first and then I'll explain

play00:56

all the different methods and to install

play00:58

these you have two ways the first is of

play01:00

course by using the one click installer

play01:02

that is available for my Patreon supporters

play01:04

just download the two links of to your

play01:05

computer then before the install you

play01:07

need to actually launch the FFmpeg

play01:09

install as admin this is the only file

play01:11

that you need to run as administrator

play01:13

because it will install FFmpeg and will

play01:15

automatically add it to path and the

play01:16

second is the ultimate text to speech auto

play01:18

installer so just double click it and

play01:20

then it will ask you which web UI you

play01:22

want to install now for this video I'm

play01:23

going to choose four because I'm going

play01:25

to install all of them but if you want

play01:27

to install a particular web UI for your

play01:29

particular needs you can do it as well

play01:31

and then you're going to press enter and

play01:32

then it will automatically begin the

play01:34

installation oh and also after each

play01:35

install is complete do not forget to

play01:37

close the window press no if it asks you

play01:40

so that the next installation can start

play01:41

so that in the end all three web UIs

play01:43

are installed automatically simple as

play01:45

that you don't need to do anything and

play01:47

the second way to install these is of

play01:48

course the manual way so for this make

play01:50

sure that you have Python, Git for Windows,

play01:53

FFmpeg and the C++ build tools

play01:55

installed onto your computer before

play01:57

doing the installation so then first

play01:59

we're going to install the XTTS web

play02:01

UI so you're going to click the link in the

play02:02

description down below you're going to

play02:03

arrive on this page you're going to

play02:04

click on this code button then on this

play02:06

icon to copy this entire line then

play02:08

inside the folder you're going to click

play02:09

on the folder path type CMD press enter

play02:12

and here we're going to clone the

play02:13

repository so you're going to type git

play02:15

clone and then paste the URL of the

play02:17

repository then press enter then we're

play02:19

going to go inside the folder and then

play02:20

we're going to simply launch the

play02:21

install.bat file and this will automatically

play02:23

start the installation so after this is

play02:25

done we're going to install xtts fine

play02:27

tune webui which is very similar to the

play02:29

first web UI so once again click the second

play02:31

link in the description down below then

play02:33

click on code copy this entire URL then

play02:35

once again clone the repository like we

play02:37

did before then we go inside the folder

play02:39

and then just like the other one we're

play02:40

going to run the install.bat file so

play02:42

then finally we're going to install the

play02:44

third web UI so once again we're going

play02:45

to copy the URL then clone the repository

play02:48

then we go inside the folder then

play02:49

we're going to create a new python

play02:50

environment with this command then we're

play02:52

going to activate the environment then

play02:54

we're going to install torch and torch

play02:55

audio with this command then we need to

play02:57

install the requirements with this

play02:59

command and there you go so now that we

play03:01

have all of our web UIs installed where

play03:03

do we actually begin well let's start

play03:06

from the very beginning from the very

play03:08

easiest that requires the least amount

play03:10

of effort and work and that is the

play03:11

simple quick cloning with 10 seconds of

play03:14

audio that's right you only need 10

play03:16

seconds of an audio clip to be able to

play03:18

clone that voice and use it as much as

play03:20

you want and we're going to do that

play03:21

inside the XTTS web UI so to do this

play03:24

you're going to go inside the xtts webui

play03:26

folder and then launch the start xtts

play03:28

webui.bat file it will then download a

play03:30

bunch of stuff and then the UI will

play03:32

launch automatically so once you are

play03:33

inside the web UI you will see a bunch of

play03:36

stuff but I can tell you right now

play03:37

actually um this web UI is a little

play03:40

broken meaning that the only real tab

play03:42

that you can use is the simple text to

play03:44

voice tab but don't worry it's

play03:46

absolutely fine this is the only thing

play03:47

that we really need anyway so then how

play03:49

exactly do we use this well first it's

play03:51

very simple all you need to do is just

play03:53

input a text right here in put the text

play03:55

that you want your voice to read so

play03:57

something like I don't know hey guys

play03:58

what's up or anything that you want

play04:00

there is absolutely no character limit

play04:02

but this is just like an example so then

play04:04

you're going to scroll down you're going

play04:05

to choose your language you do have a

play04:07

choice of a bunch of different languages

play04:09

which is really really cool then you're

play04:10

going to scroll down here in this little

play04:12

section you're going to upload your 10

play04:14

seconds of voice clip where you can also

play04:16

do less if you have like a voice clip

play04:17

between 5 and 10 seconds this is more

play04:19

than enough so like for example I

play04:21

uploaded this clip from like an Obama

play04:23

interview that is around like 38 seconds

play04:25

long and in here without changing

play04:27

anything you can just click and generate

play04:29

and in only only a few seconds like it

play04:30

took me like 2 seconds to generate we

play04:32

get something like this hey guys

play04:34

what's up and well there you go this is

play04:37

it this is basically like the easiest

play04:39

and laziest way to generate text to speech

play04:41

from an existing audio clip and what's

play04:43

really cool is that there's basically

play04:44

like no limit of character I mean there

play04:46

is probably a limit but it is very very

play04:48

high so you can just like copy and paste

play04:50

like this whole paragraph like the

play04:51

script from the Bee Movie and then just

play04:53

simply click generate and in only a few

play04:55

seconds it should very quickly generate

play04:57

your audio file ready to be used

play04:59

according to all known laws of Aviation

play05:01

there is no way a bee should be able to

play05:04

fly so yeah there you go now obviously

play05:06

this is not perfect but you will see

play05:08

that very later on we will make it much

play05:11

much better but as I said previously this

play05:13

is the laziest way to create text to

play05:15

speech on your local computer using the

play05:18

quick cloning technique with 10 seconds

play05:20

of audio now don't worry we're going to

play05:21

come back to this web UI later but

play05:23

for now this is the end of the first

play05:25

method but how do we make it even better

play05:28

well how about instead of using the

play05:31

default text to speech xtts model we

play05:34

train our own xtts model that's right

play05:38

we're going to train our own text to

play05:40

speech model from scratch which is why

play05:42

this is called the medium text to speech

play05:44

method we're going to fine-tune our xtts

play05:47

model using only 2 minutes of audio

play05:50

that's right not 10 minutes not 20

play05:52

minutes only 2 minutes required and it's

play05:55

actually really really easy and super

play05:57

fast so for this we need to use the xtts

play05:59

fine-tune web UI so you're going to go

play06:01

inside the folder and then launch the

play06:02

start.bat file and it will then give you

play06:04

a local URL that you can just hold

play06:07

control and then left click to open in

play06:09

your browser so from this all you really

play06:11

need as I said previously is only an

play06:13

audio file with 2 minutes of audio and

play06:16

that you're going to upload right here

play06:17

now I'm actually going to give you a

play06:18

little trick because although this is

play06:20

only a 2-minute voice file because I was

play06:23

extremely lazy all I did is basically

play06:26

took the 37 seconds of audio that I used

play06:28

previously and all I did is put this

play06:30

file inside audacity then I selected

play06:32

this entire audio copied it and then

play06:34

paste it multiple times one after the

play06:37

other again and again and again until I

play06:39

had a 2 minutes of audio that's right

play06:42

that's all I really did so basically you

play06:44

don't even need 2 minutes of continuous

play06:46

audio to do a decent training now

play06:48

obviously it is better if you do don't

play06:50

be lazy like me if you really want a

play06:52

good results you should definitely

play06:54

strive for something like 10 minutes of

play06:55

audio especially because you'll be able

play06:57

to use this later on in the video but

play06:59

once again if you're really really lazy

play07:01

you can do it like me so basically once

play07:03

you have inputted the audio you're going

play07:05

to leave everything by default do not

play07:06

forget of course to choose your language

play07:08

in my case it is English then you're

play07:10

going to click on this little button

play07:11

step one create data set so depending on

play07:14

how long the audio file is the longer

play07:16

the formatting will take and the

play07:17

training doesn't even use a lot of vram

play07:19

so pretty much anyone can use this so

play07:21

you don't even need a very powerful GPU

play07:23

to do this so that's really cool so

play07:25

there you go for me it took less than 1

play07:26

minute so now we can click on the second

play07:28

tab where you're basically going to

play07:30

leave everything by default the only

play07:32

thing that you can change if you really

play07:33

want to is the number of epochs now six

play07:35

for the number of epochs is really like the

play07:37

minimum one so you might want to

play07:39

increase this something like 10 or maybe

play07:41

12 but if this is the first time that

play07:43

you train an XTTS model I definitely

play07:45

recommend leaving everything by default

play07:47

because all of these are super optimized

play07:49

already and they work really really well

play07:51

also make sure that you always use the

play07:52

2.0.2 version this is by far the best

play07:55

version because the 2.0.3 is really not

play07:58

as good then you going to click load

play08:00

parameters from output folder so once

play08:02

again just leave everything by default

play08:04

and then simply click run the training

play08:06

and then the training will start now I'm

play08:07

not going to do it because I've already

play08:09

done it before so I'm actually going to

play08:11

you know stop the training but basically

play08:13

once the training is finished it's just

play08:14

going to say well training is done

play08:17

you're then going to click on this

play08:18

little optimize the model button that

play08:20

will basically make the final files much

play08:22

much smaller and easier to use and now

play08:24

if we want to use this you're going to

play08:26

click on the third tab called

play08:27

inference click load parameters for

play08:29

TTS from output folder it will say that

play08:31

the parameters for TTS were loaded then

play08:34

you going to click on this button to

play08:35

load the model and then to test it out

play08:37

basically just input the text and then

play08:39

click inference and after a few

play08:41

seconds we get something like this uh

play08:43

this model sounds really good and above

play08:46

all it's reasonably fast so yeah as you

play08:48

heard this is much much better this is

play08:51

very very close to the reference audio

play08:52

that we used for training uh but

play08:55

ultimately I had so much confidence now

play08:57

the reason why we do this training is

play08:59

that fine-tuning the XTTS model allows you to

play09:02

train on the accent of the speaker the

play09:04

way the person speak the speed as well

play09:06

as a few quirks so like for example if

play09:08

you listen to the reference audio uh but

play09:11

ultimately I had so much confidence in

play09:14

as you can see there is a lot of like

play09:15

pauses like uh uh etc and all of these

play09:18

pauses and quirks and sounds are

play09:20

replicated inside the generated audio so

play09:23

like for example listen to this hello uh

play09:25

humans my name is k your AI Overlord uh

play09:28

uh oh so yeah there you go as you heard

play09:30

in the very beginning there is a lot of

play09:32

these uh uh sounds that were present in

play09:34

the reference audio that are now

play09:36

replicated inside the generated audio so

play09:39

doing this training doing this fine

play09:41

tuning even with the small portion of

play09:43

audio that we had it was enough for the

play09:45

model to be able to replicate that voice

play09:47

and now that you have this model you can

play09:48

use it as much as you want there is

play09:50

absolutely no limitation so yeah there

play09:52

you go with only 2 minutes of audio

play09:54

which is actually only 30 seconds of

play09:56

audio we still managed to train a pretty

play09:58

cool text to speech model but how do we

play10:01

make it even better that's right because

play10:04

we're still not done if you remember

play10:06

this was only the medium text to speech

play10:08

method because now we can finally start

play10:11

the ultimate text to speech combination

play10:14

and inside that method there is even

play10:17

three different methods that you can use

play10:19

but essentially what we're doing is

play10:21

taking the generated audio from text to

play10:23

speech and putting it inside RVC to make

play10:26

it even better now if you don't know

play10:28

what RVC is I definitely recommend

play10:31

you to watch this video first otherwise

play10:33

you're not going to understand anything

play10:35

from this point onward watching this

play10:37

video is essential for the comprehension

play10:40

of the rest of the video because RVC is

play10:42

absolutely fantastic this is seriously

play10:44

one of my favorite softwares of all time

play10:47

because RVC allows you to basically

play10:49

clone a voice to a basically perfect

play10:52

level but RVC is only a voice to voice

play10:55

conversion meaning that you need an

play10:56

initial audio file before doing the

play10:58

conversion and this is what we're going

play11:00

to do inside method a which is a simple

play11:03

conversion so inside the xtts web UI

play11:06

you're going to input your text then put

play11:07

the reference audio file and click

play11:09

generate so that in the end we get

play11:11

something like this hello everyone it's

play11:13

me Barack Obama don't forget to

play11:15

subscribe to my channel and leave a like

play11:17

so once again not perfect but it is a

play11:20

quick cloning so then you're going to

play11:21

download the file so then you're going

play11:23

to launch RVC once again if you are one

play11:25

of my Patreon supporters just use the one click

play11:27

installer to install this and then from

play11:29

this point on either you're going to

play11:31

train your own voice from scratch as I

play11:34

showed in my RVC video or download an

play11:37

existing Voice already made by the

play11:38

community so that you can then select it

play11:41

inside the inferencing voice then you're

play11:42

going to copy the path of the file and

play11:44

paste it right here and then simply

play11:46

click convert and after a few seconds we

play11:49

get something like this hello everyone

play11:51

it's me Barack Obama don't forget to

play11:54

subscribe to my channel and leave a like

play11:56

so yeah as you heard now the voice is

play11:59

much much better I mean this is pretty

play12:01

much the voice of Barack Obama and that

play12:03

is probably because RVC is the most

play12:06

powerful tool when it comes to cloning

play12:08

voices so once you have created an audio

play12:10

file you can easily convert it to any

play12:12

voice that you want and you can of

play12:14

course then use that file for anything

play12:15

you want but all of that is a lot of

play12:17

work I mean coming from xtts webui then

play12:20

downloading the file putting it inside

play12:22

RVC I mean it's nice and all but if only

play12:25

there was a way to do it automatically

play12:27

hey hey boy well there actually is and

play12:30

that is actually our next method using

play12:32

the third web UI that I haven't talked

play12:34

about yet called XTTS RVC UI which

play12:37

basically does everything that we just

play12:39

did automatically so to use this you're

play12:41

going to go inside the XTTS RVC UI folder

play12:43

then inside rvcs you're going to put

play12:46

the RVC voice model which is basically

play12:48

the .pth file and the index file then

play12:50

you're going to go back inside voices

play12:52

you're going to input the reference

play12:54

voice which is basically like the 10

play12:56

seconds of audio that we use for cloning

play12:58

and then you're going to go back and launch

play13:00

this .bat file and it will then give you

play13:02

a local URL then you can hold control

play13:04

then left click and then it will open the

play13:05

web UI and then from here you're going

play13:07

to choose the language the RVC model

play13:09

then the voice sample then here you're

play13:11

going to input your text you can leave

play13:13

everything by default and then click

play13:14

submit and in the end you should get

play13:16

something like this hey guys what's up

play13:18

it's me Barack don't forget to subscribe

play13:20

to my channel and leave a like so yeah

play13:23

basically as you see it first generated

play13:24

an xtts audio and then automatically

play13:27

converted it with RVC so basically

play13:29

everything that we did previously where

play13:30

we first generated with xtts webui and then

play13:33

put it inside RVC everything was done

play13:35

automatically in one single click now it

play13:37

is a nice web UI to use although compared

play13:40

to the previous method it has a little

play13:41

bit less functionality and parameters to

play13:44

create the final audio that you want

play13:46

however once again it is like a

play13:47

compromise it requires less work so that

play13:50

in the end you get some easy good

play13:52

results and that's really pretty cool
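The file placement for this method (RVC model files into one folder, the reference clip into another) can also be scripted. Here is a minimal Python sketch, assuming the folder names `rvcs` and `voices` as heard in the video. The function name and all paths are hypothetical, so check the names against your own install.

```python
import shutil
from pathlib import Path


def stage_xtts_rvc_inputs(ui_root: Path, pth_file: Path, index_file: Path,
                          reference_wav: Path) -> list[Path]:
    """Copy an RVC model and a reference clip into the UI's input folders."""
    ui_root = Path(ui_root)
    # Folder names as heard in the video; verify them in your own install.
    rvcs_dir = ui_root / "rvcs"
    voices_dir = ui_root / "voices"
    rvcs_dir.mkdir(parents=True, exist_ok=True)
    voices_dir.mkdir(parents=True, exist_ok=True)

    plan = (
        (pth_file, rvcs_dir),        # RVC voice model weights
        (index_file, rvcs_dir),      # RVC index file
        (reference_wav, voices_dir), # short reference clip for cloning
    )
    copied = []
    for src, dest_dir in plan:
        src = Path(src)
        if not src.is_file():
            raise FileNotFoundError(f"Missing input file: {src}")
        copied.append(Path(shutil.copy2(src, dest_dir / src.name)))
    return copied
```

Copying rather than moving leaves the original model files where they were, so the same RVC voice can still be used from the standalone RVC install.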

play13:54

but how do we make it even better that's

play13:58

right we are still not done well let's

play14:01

actually take a look at our guide and

play14:03

let's take a look at our Uber text to

play14:05

speech method and what is the Uber text to

play14:08

speech method well basically we're going

play14:10

to take everything that we did

play14:11

previously and put them all together so

play14:14

basically we're going to take the fine-tuned

play14:16

xtts model to generate an audio inside

play14:19

the xtts web UI and then import that

play14:22

inside RVC because yes don't forget that

play14:24

we fine-tuned a model from scratch and we

play14:27

can use this whenever we want so for

play14:30

this it's very simple we're going to go

play14:31

inside the xtts-finetune-webui folder go

play14:34

inside finetune_models inside ready and

play14:37

this what you see right here is your

play14:39

fine-tuned model these files are your

play14:42

fine-tuned XTTS models so what you're going

play14:44

to do you're going to select them

play14:46

control X to cut them then we're going

play14:48

to go inside the xtts web UI inside

play14:51

models now you're going to create a new

play14:52

folder that you're going to name the

play14:54

name of your speaker so in my case it is

play14:56

Obama then we go inside and we paste

play14:59

our files right here and now if you

play15:01

go back and we launch the webui now if

play15:03

you click on select xtts model version

play15:06

you will see our Obama XTTS model that we

play15:10

can now use inside the XTTS web UI as

play15:13

much as we want simple as that
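The step of moving the fine-tuned files into a speaker-named folder under the web UI's models directory can be sketched in Python as follows. The function name and all directory arguments are hypothetical. Unlike the cut-and-paste shown in the video, this sketch copies the files so the originals stay in place.

```python
import shutil
from pathlib import Path


def install_finetuned_model(ready_dir: Path, webui_models_dir: Path,
                            speaker: str) -> Path:
    """Copy every fine-tuned model file into models/<speaker>/."""
    ready_dir = Path(ready_dir)
    files = sorted(p for p in ready_dir.iterdir() if p.is_file())
    if not files:
        raise FileNotFoundError(f"No fine-tuned files found in {ready_dir}")

    # One folder per speaker, named after the speaker as in the video.
    target = Path(webui_models_dir) / speaker
    target.mkdir(parents=True, exist_ok=True)
    for src in files:
        shutil.copy2(src, target / src.name)
    return target
```

A call such as `install_finetuned_model(ready, models, "Obama")` would leave the fine-tuned files under `models/Obama`, which is the layout the video describes before relaunching the web UI.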

play15:16

again just input your text then input

play15:18

the reference audio and now you can just

play15:20

click generate and now we have our final

play15:22

audio generated with our custom Obama

play15:25

model inside the xtts web UI to give us

play15:28

something like this hey guys so what do

play15:30

you think of this pretty cool video am I

play15:32

right subscribe right now come on which

play15:34

is already very very close to the

play15:37

original Obama voice but of course we

play15:39

can make it even better by downloading

play15:41

the file then launching RVC selecting

play15:43

the inferencing voice model inputting the

play15:45

path of the file just like last time and

play15:48

now we can click convert and finally

play15:50

after a few seconds we get our final

play15:52

ultimate Uber text to speech audio file

play15:55

listen to this hey guys so what do you

play15:57

think of this pretty cool video am I

play15:59

right subscribe right now come on I mean

play16:02

come on this is really really good this

play16:03

is pretty much the closest and most

play16:05

powerful text to speech that you can do

play16:07

on your local computer right now with

play16:10

the highest level of quality the highest

play16:12

level of authenticity and all of that

play16:14

running on your local computer oh and

play16:16

also since once again it is the same

play16:18

model you can use your fine-tuned model

play16:20

inside the XTTS RVC UI so if you go inside

play16:23

models xtts copy those files and put

play16:26

it like somewhere else in a separate

play16:28

folder just in case and then if you use

play16:30

like the fine-tuned models once again just

play16:32

copy those files inside the models xtts

play16:35

folder Ctrl+V replace all the files in

play16:37

the destination and then it is ready to

play16:39

be used automatically inside the xtts

play16:41

RVC UI simple as that so yeah there you

play16:44

go these were all the methods that you

play16:46

can use to have the best text to speech

play16:48

AI models running on your computer and

play16:50

if you want to have like this little

play16:51

graph so that you always remember what

play16:53

to do I will make the PDF available for

play16:55

free on my patreon in the description

play16:57

down below and talking of Patreon do

play16:59

not forget that I provide priority

play17:01

support for my Patreon supporters so if

play17:03

you have any questions whatsoever do not

play17:05

hesitate to send me a DM and I will try

play17:07

to answer your question as soon as

play17:08

possible the link for my patreon will be

play17:10

in the description down below so yeah

play17:12

there you go now you know pretty much

play17:14

everything there is to know on how to

play17:15

run the best text to speech models on

play17:18

your local computer so that no matter

play17:20

who you are and what your project is

play17:22

you can do so without paying some

play17:23

exorbitant fees for some third party

play17:26

software so definitely try this out

play17:27

yourself and have some fun and there we

play17:29

have it folks thank you guys so much for

play17:31

watching don't forget to subscribe and

play17:32

smash the like button for the YouTube

play17:34

algorithm thank you also so much to

play17:36

my Patreon supporters for supporting my videos you

play17:37

guys are absolutely awesome you people

play17:39

are literally the reason why I'm able to

play17:41

make these videos so thank you so much

play17:42

and I'll see you guys next time bye-bye
