RIP ELEVENLABS! Create BEST TTS AI Voices LOCALLY For FREE!
Summary
TLDR: The video script introduces viewers to a comprehensive guide on creating custom text-to-speech (TTS) voices using AI on a local computer. The presenter, SK, outlines various methods ranging from a quick 10-second voice cloning technique to a more sophisticated, high-quality TTS model training process that requires only 2 minutes of audio. The video demonstrates how to install necessary software, use different web UIs for voice cloning and fine-tuning, and integrate the generated TTS audio with RVC (Retrieval-based Voice Conversion) for enhanced voice quality. The ultimate goal is to enable users to produce high-fidelity TTS without incurring hefty fees for third-party services, offering a cost-effective solution for personalized voice generation.
Takeaways
- 🎉 You can create custom text-to-speech (TTS) AI voices on your local computer without paying high fees for pre-made AI voices.
- 🔧 There are various methods available, ranging from quick 10-second voice cloning to more sophisticated techniques for higher quality TTS.
- 📈 The process starts with installing necessary software like FFmpeg and Python, and can be done via one-click installers for patrons or manually.
- 📊 A graphic is provided to visualize the different methods for creating TTS voices, catering to different user needs and skill levels.
- ⏱ With just 10 seconds of audio, you can clone a voice using the XTTS web UI, which is the easiest and quickest method demonstrated.
- 📚 For better quality, you can train your own XTTS model using only 2 minutes of audio, which captures the nuances of the speaker's voice.
- 🔗 The training process is straightforward and does not require a powerful GPU, making it accessible for most users.
- 🤖 By using RVC (Retrieval-based Voice Conversion), you can further improve the TTS audio to closely resemble the original voice, even of public figures.
- 🌐 There's an XTTS-RVC UI that automates the process of generating TTS audio and then converting it with RVC, simplifying the workflow.
- 📝 The final 'Uber' method combines fine-tuned XTTS models with RVC for the highest quality and authenticity in TTS voice generation.
- 💾 Once you have your custom TTS model, you can use it without limitations, making it a cost-effective solution for voice generation needs.
- 📁 The script provides guidance on how to install and use the different tools, including tips for patrons and manual installation steps for all users.
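The manual installation mentioned in the takeaways depends on a few command-line tools already being installed. The snippet below is a minimal sketch, not part of the video, of how one might check for them from Python. The executable names in `REQUIRED_TOOLS` are assumptions about a typical setup, and the C++ build tools cannot be detected this way, so they are left out.

```python
import shutil
from typing import Callable, Iterable, Optional

# Assumed executable names for the prerequisites named in the video.
# The C++ build tools are not a single executable on PATH, so they are
# not covered by this check.
REQUIRED_TOOLS = ("python", "git", "ffmpeg")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Install these before continuing:", ", ".join(missing))
    else:
        print("All command-line prerequisites were found on PATH.")
```

The `which` parameter exists only so the lookup can be swapped out when testing; in normal use the default `shutil.which` is enough.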
Q & A
What is the purpose of the video?
-The video aims to guide viewers on how to create custom text-to-speech AI voices on their local computer using various methods, from quick cloning with a short audio clip to training a more sophisticated model for higher quality results.
What are the two ways to install the required software for creating custom AI voices?
-The two ways to install the required software are the one-click installer available to Patreon supporters, which includes an FFmpeg installer that adds FFmpeg to the PATH, and the manual way, which requires having Python, Git for Windows, FFmpeg, and the C++ build tools installed beforehand.
How long of an audio clip is needed for the simplest voice cloning method?
-For the simplest voice cloning method, only 10 seconds of an audio clip is needed.
What is the minimum duration of audio required for training a custom text-to-speech model in the medium method?
-In the medium method, a minimum of 2 minutes of audio is required for training a custom text-to-speech model.
How does the RVC software contribute to the final output of the text-to-speech process?
-RVC (Retrieval-based Voice Conversion) is used to further refine the generated text-to-speech audio by converting it with an RVC voice model of the target speaker, significantly enhancing the quality and authenticity of the output.
What is the 'Uber text to speech method' and how does it differ from the medium method?
-The 'Uber text to speech method' combines the earlier steps: a fine-tuned XTTS model generates the audio, which is then passed through RVC for further enhancement. It differs from the medium method, which stops after fine-tuning, by adding the RVC conversion step on top of the fine-tuned model's output, giving more personalized and higher quality voice replication.
How can one obtain the PDF guide for remembering the steps to create custom AI voices?
-The PDF guide can be obtained for free on the creator's Patreon page, which is linked in the video description.
What is the advantage of using the XTTS fine-tune web UI for training a custom model?
-The XTTS fine-tune web UI allows users to train their own text-to-speech model using a relatively short audio clip. This enables the model to learn the specific accent, speaking style, and voice characteristics of the speaker, leading to a more personalized and accurate voice output.
What is the role of FFMpeg in the process of creating custom AI voices?
-FFmpeg is a multimedia framework that is required for the installation process. It is used for handling various multimedia files and is essential for the proper functioning of the text-to-speech software.
How does the XTTS RVC UI automate the process of creating custom AI voices?
-The XTTS RVC UI automates the process by integrating both the text-to-speech generation and the voice conversion steps into a single interface. Users can input text, select an RVC voice model, and upload a reference voice sample to automatically generate and convert the voice.
What is the significance of using a longer audio clip for training the text-to-speech model?
-Using a longer audio clip, ideally around 10 minutes, provides the model with more data to learn from, which can result in a more accurate and higher quality replication of the speaker's voice.
How does the video help those who are tired of paying high fees for AI voice services?
-The video provides a comprehensive guide on how to create custom AI voices without the need for expensive third-party services, empowering users to generate high-quality voice outputs at a fraction of the cost.
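The answers above note that the simplest method needs only about 10 seconds of audio and that FFmpeg handles the media files. As an illustration that goes beyond what the video shows, the sketch below builds an FFmpeg command that cuts the first few seconds of a recording into a short reference clip. The file names are hypothetical, and converting to mono (`-ac 1`) is an assumption, not a requirement stated in the video.

```python
import subprocess


def trim_command(src: str, dst: str, seconds: float = 10.0) -> list[str]:
    """Build an FFmpeg command that keeps the first `seconds` of `src`."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", src,                # input recording
        "-t", f"{seconds:g}",     # keep only this many seconds
        "-ac", "1",               # assumption: downmix to a single channel
        dst,
    ]


if __name__ == "__main__":
    # Hypothetical file names; FFmpeg must be installed and on PATH.
    subprocess.run(trim_command("interview.mp3", "reference.wav"), check=True)
```

Building the argument list separately from running it makes the command easy to inspect before anything touches the audio files.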
Outlines
😀 Introduction to Custom Text-to-Speech AI Voices
The video begins with an introduction to the process of creating custom text-to-speech AI voices. The host, SK, expresses enthusiasm about the various methods that will be covered, ranging from quick 10-second voice cloning to more advanced techniques for creating high-quality AI voices. The video offers a graphic to help viewers visualize the different methods that will be discussed. The host also provides instructions for installing necessary software, with options for both one-click installation for patrons and a manual installation process for others. The first method demonstrated is the quick cloning technique using just 10 seconds of audio.
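The manual installation described above follows the same pattern for the first two web UIs: clone the repository, enter the new folder, and run its install.bat file. The following is a rough sketch of that pattern as a list of commands. The repository URL is a placeholder, not the real project address, and running the batch file through `cmd /c` is an assumption about a Windows setup.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def repo_folder(repo_url: str) -> str:
    """Return the folder name that `git clone` creates for a repository URL."""
    name = PurePosixPath(urlparse(repo_url).path).name
    return name[:-4] if name.endswith(".git") else name


def manual_install_steps(repo_url: str) -> list[tuple[str, list[str]]]:
    """Return (working directory, command) pairs for a manual install."""
    folder = repo_folder(repo_url)
    return [
        (".", ["git", "clone", repo_url]),          # clone into the current folder
        (folder, ["cmd", "/c", "install.bat"]),     # run the installer inside the clone
    ]


if __name__ == "__main__":
    # Placeholder URL for illustration only.
    for cwd, command in manual_install_steps("https://example.com/someone/xtts-webui.git"):
        print(f"[{cwd}] {' '.join(command)}")
```

The third web UI is installed differently in the video (a Python environment plus separate install commands), so it is not covered by this sketch.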
🤖 Training Your Own Text-to-Speech Model
The second paragraph delves into training a custom text-to-speech model using only 2 minutes of audio. The host guides viewers through using the xtts fine-tune web UI, emphasizing that even a short audio clip can be used for training. The process involves creating a dataset, training the model with default settings, and optimizing the model for easier use. The host also mentions the importance of using the correct version of the software and provides a demonstration of the improved voice quality achieved through this method. The training allows for the replication of the speaker's accent, speech patterns, and other unique vocal characteristics.
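In the video, the host reaches the two-minute mark by pasting a clip of roughly 37 seconds back to back in Audacity. The sketch below shows the same idea with Python's standard `wave` module, assuming the clip is an uncompressed WAV file. The function names and file paths are illustrative and do not come from any of the tools in the video.

```python
import math
import wave


def repeats_needed(clip_seconds: float, target_seconds: float) -> int:
    """How many back-to-back copies of a clip reach the target length."""
    if clip_seconds <= 0:
        raise ValueError("clip must have a positive duration")
    return max(1, math.ceil(target_seconds / clip_seconds))


def repeat_wav(src: str, dst: str, target_seconds: float = 120.0) -> int:
    """Write `dst` as `src` repeated until it lasts at least `target_seconds`."""
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        frames = reader.readframes(reader.getnframes())
        clip_seconds = reader.getnframes() / reader.getframerate()

    count = repeats_needed(clip_seconds, target_seconds)
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)      # same channels, sample width and rate
        for _ in range(count):
            writer.writeframes(frames)
    return count


if __name__ == "__main__":
    # A 37-second clip needs four copies to pass two minutes.
    print(repeats_needed(37, 120))
```

As the host points out, repeating a short clip is a shortcut; several minutes of genuinely different speech should give the model more to learn from.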
🎤 Advanced Voice Cloning with RVC
The third paragraph introduces RVC, a powerful tool for voice cloning, which is used to further refine the text-to-speech output. The host explains that while RVC requires an initial audio file for conversion, it can produce highly accurate voice clones. The process involves generating text-to-speech audio and then using RVC to enhance it. The host also discusses a semi-automated method using the XTTS RVC UI, which streamlines the process by automatically converting the text-to-speech audio with RVC. This method offers less manual control but is quicker and easier to use.
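The two-stage flow described above (generate speech first, then convert the voice) can be pictured as a small pipeline. This is a conceptual sketch only: `synthesize` and `convert` are hypothetical stand-ins for the XTTS and RVC stages and do not correspond to real functions in either project.

```python
from typing import Callable

# Hypothetical stage types: text in, audio bytes out, then audio in, audio out.
Synthesizer = Callable[[str], bytes]
Converter = Callable[[bytes], bytes]


def tts_then_rvc(text: str, synthesize: Synthesizer, convert: Converter) -> bytes:
    """Run text through a TTS stage, then pass the audio to a conversion stage."""
    if not text.strip():
        raise ValueError("text must not be empty")
    draft_audio = synthesize(text)   # stage 1: quick clone or fine-tuned TTS
    return convert(draft_audio)      # stage 2: voice-to-voice conversion


if __name__ == "__main__":
    # Stub stages so the flow can be run without any audio tools installed.
    fake_tts = lambda text: text.encode("utf-8")
    fake_rvc = lambda audio: audio.upper()
    print(tts_then_rvc("hey guys what's up", fake_tts, fake_rvc))
```

Keeping the stages as separate callables mirrors the trade-off in the video: the manual route lets each stage be tuned on its own, while the combined UI hides both behind a single button.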
🚀 The Ultimate Text-to-Speech Method
The final paragraph describes the 'Uber text-to-speech' method, which combines the previous techniques to create an even higher quality voice output. The host demonstrates how to use a fine-tuned XTTS model within the XTTS web UI to generate audio, which is then further enhanced using RVC. The host also explains how to use the fine-tuned model with the XTTS RVC UI for fully automated processing. The video concludes with a reminder that all the methods shown allow for the creation of high-quality, authentic-sounding AI voices without the need for expensive third-party software. The host encourages viewers to try out the methods and offers support through Patreon, where a PDF guide will also be made available.
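Using the fine-tuned model in the XTTS web UI comes down to placing the trained files in a folder named after the speaker inside the web UI's models folder. Below is a minimal sketch of that file shuffle. The directory layout passed in is an assumption based on the spoken description, so the paths should be checked against an actual install, and the files are copied here instead of cut so the originals stay in place.

```python
import shutil
from pathlib import Path


def install_finetuned_model(ready_dir: Path, models_dir: Path, speaker: str) -> Path:
    """Copy fine-tuned model files into a per-speaker folder under `models_dir`."""
    files = [path for path in ready_dir.iterdir() if path.is_file()]
    if not files:
        raise FileNotFoundError(f"no model files found in {ready_dir}")

    target = models_dir / speaker
    target.mkdir(parents=True, exist_ok=True)
    for path in files:
        shutil.copy2(path, target / path.name)   # keep the originals as a backup
    return target


if __name__ == "__main__":
    # Hypothetical paths; adjust them to where the web UIs were installed.
    destination = install_finetuned_model(
        Path("xtts-finetune-webui/finetune_models/ready"),
        Path("xtts-webui/models"),
        "obama",
    )
    print("Model files copied to", destination)
```

After the copy, the new speaker folder should show up in the web UI's model selection the next time it is launched, as described in the video.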
Keywords
💡Text to Speech AI
💡Voice Cloning
💡XTTS (Coqui's text-to-speech model)
💡FFmpeg
💡Python
💡Training a Model
💡RVC (Retrieval-based Voice Conversion)
💡Web UI
💡Fine-tuning
💡Batch File
💡Deep Learning
Highlights
Introduction to creating custom text-to-speech AI voices on your local computer without high fees.
Demonstration of various methods ranging from quick 10-second voice cloning to the ultimate text-to-speech voice.
Explanation on installing necessary software using one-click installer for Patreon supporters or manual installation.
Guidance on using FFmpeg and setting up the environment for text-to-speech voice generation.
How to clone a voice with just 10 seconds of audio using the XTTS web UI.
No character limit for the text input in the simple text-to-voice tab of the web UI.
Training your own text-to-speech model from scratch using only 2 minutes of audio.
Using Audacity to extend a short audio clip into a longer one for training purposes.
Details on the training process and the importance of using a longer audio clip for better results.
Optimizing the model for faster and more efficient use after training.
Replicating the accent, speech patterns, and unique quirks of a speaker in the generated audio.
Combining text-to-speech with RVC (Retrieval-based Voice Conversion) for higher quality voice cloning.
Automatic conversion of text-to-speech audio using the XTTS-RVC UI for ease of use.
Creating a custom Obama text-to-speech model and using it within the XTTS web UI.
Achieving a highly authentic and quality text-to-speech output by fine-tuning and converting with RVC.
Availability of a PDF guide on Patreon for free to help remember the process.
Offer of priority support to Patreon supporters for any questions regarding the process.
Encouragement for viewers to try out the methods and have fun creating their own text-to-speech AI voices.
Transcripts
are you tired of the same old robotic
text to speech AI voices are you sick of
paying exorbitant fees for these AI
voices do you dream of creating your own
custom text to speech AI voices on your
own computer well today is your day
because this is the ultimate local text
to speech AI video ever so that you can
replace yourself with the clickable
button hello humans my name is SK and oh
boy sit tight because today I'm going to
show you the ultimate way to get the
best text to speech AI voices on your local
computer and I'll be showing you a range
of methods from the super lazy 10
seconds voice cloning to the ultimate
Uber absolute best text to speech voice
possible I even made a little graphic to
help you visualize because of how many
things I'm going to show you today so
that no matter who you are and what your
goals are you will get the best results
possible to suit your needs so sit back
relax and let's begin okay so let's
start by installing all of these
softwares first and then I'll explain
all the different methods and to install
these you have two ways the first is of
course by using the one click installer
that is available for my Patreon supporters
just download the two links onto your
computer then before the install you
need to actually launch the FFmpeg
install as admin this is the only file
that you need to run as administrator
because it will install FFmpeg and will
automatically add it to path and the
second is the ultimate text to speech auto
installer so just double click it and
then it will ask you which web UI you
want to install now for this video I'm
going to choose four because I'm going
to install all of them but if you want
to install a particular web UI for your
particular needs you can do it as well
and then you're going to press enter and
then it will automatically begin the
installation oh and also after each
install is complete do not forget to
close the window press no if it asks you
so that the next installation can start
so that in the end all the three web UI
are installed automatically simple as
that you don't need to do anything and
the second way to install these is of
course the manual way so for this make
sure that you have python git for windows
FFmpeg and the C++ build tools
installed onto your computer before
doing the installation so then first
we're going to install the xtts web
UI so you're going to click the
description down below you're going to
arrive on this page you're going to
click on this code button then on this
icon to copy this entire line then
inside the folder you're going to click
on the folder path type CMD press enter
and here we're going to clone the
repository so you're going to type git
clone and then paste the URL of the
repository then press enter then we're
going to go inside the folder and then
we're going to simply launch the install
.bat file and this will automatically
start the installation so after this is
done we're going to install xtts fine
tune webui which is very similar to the
first web UI so once again click the second
link in the description down below then
click on code copy this entire URL then
once again clone the repository like we
did before then we go inside the folder
and then just like the other one we're
going to run the install.bat file so
then finally we're going to install the
third web UI so once again we're going
to copy the URL then clone the repository
then we go inside the folder then
we're going to create a new python
environment with this command then we're
going to activate the environment then
we're going to install torch and torchaudio
with this command then we need to
install the requirements with this
command and there you go so now that we
have all of our web UIs installed where
do we actually begin well let's start
from the very beginning from the very
easiest that requires the least amount
of effort and work and that is the
simple quick cloning with 10 seconds of
audio that's right you only need 10
seconds of an audio clip to be able to
clone that voice and use it as much as
you want and we're going to do that
inside the xtts web UI so to do this
you're going to go inside the xtts web UI
folder and then launch the start xtts
web UI .bat file it will then download a
bunch of stuff and then the UI will
launch automatically so once you are
inside the web UI you will see a bunch of
stuff but I can tell you right now
actually um this web UI is a little
broken meaning that the only real tab
that you can use is the simple text to
voice tab but don't worry it's
absolutely fine this is the only thing
that we really need anyway so then how
exactly do we use this well first it's
very simple all you need to do is just
input a text right here input the text
that you want your voice to read so
something like I don't know hey guys
what's up or anything that you want
there is absolutely no character limit
but this is just like an example so then
you're going to scroll down you're going
to choose your language you do have a
choice of a bunch of different languages
which is really really cool then you're
going to scroll down here in this little
section you're going to upload your 10
seconds of voice clip well you can also
do less if you have like a voice clip
between 5 and 10 seconds this is more
than enough so like for example I
uploaded this clip from like an Obama
interview that is around like 38 seconds
long and in here without changing
anything you can just click on generate
and in only a few seconds like it
took me like 2 seconds to generate we
get something like this hey guys
what's up and well there you go this is
it this is basically like the easiest
and laziest way to generate text to speech
from an existing audio clip and what's
really cool is that there's basically
like no limit of character I mean there
is probably a limit but it is very very
high so you can just like copy and paste
like this whole paragraph like the
script from the B Movie and then just
simply click generate and in only a few
seconds it should very quickly generate
your audio file ready to be used
according to all known laws of Aviation
there is no way a bee should be able to
fly so yeah there you go now obviously
this is not perfect but you will see
that later on we will make it much
much better but as I said previously this
is the laziest way to create text to
speech on your local computer using the
quick cloning technique with 10 seconds
of audio now don't worry we're going to
come back to this web UI later but
for now this is the end of the first
method but how do we make it even better
well how about instead of using the
default text to speech xtts model we
train our own xtts model that's right
we're going to train our own text to
speech model from scratch which is why
this is called the medium text to speech
method we're going to fine-tune our xtts
model using only 2 minutes of audio
that's right not 10 minutes not 20
minutes only 2 minutes required and it's
actually really really easy and super
fast so for this we need to use the xtts
fine-tune web UI so you're going to go
inside the folder and then launch the
start.bat file and it will then give you
a local URL that you can just hold
control and then left click to open in
your browser so from this all you really
need as I said previously is only an
audio file with 2 minutes of audio and
that you're going to upload right here
now I'm actually going to give you a
little trick because although this is
only a 2 minute voice file because I was
extremely lazy all I did is basically
took the 37 seconds of audio that I used
previously and all I did is put this
file inside Audacity then I selected
this entire audio copied it and then
pasted it multiple times one after the
other again and again and again until I
had a 2 minutes of audio that's right
that's all I really did so basically you
don't even need 2 minutes of continuous
audio to do a decent training now
obviously it is better if you do don't
be lazy like me if you really want a
good results you should definitely
strive for something like 10 minutes of
audio especially because you'll be able
to use this later on in the video but
once again if you're really really lazy
you can do it like me so basically once
you have inputted the audio you're going
to leave everything by default do not
forget of course to choose your language
in my case it is English then you're
going to click on this little button
step one create data set so depending on
how long the audio file is the longer
the formatting will take and the
training doesn't even use a lot of vram
so pretty much anyone can use this so
you don't even need a very powerful GPU
to do this so that's really cool so
there you go for me it took less than 1
minute so now we can click on the second
tab where you're basically going to
leave everything by default the only
thing that you can change if you really
want to is the number of epochs now six
for the number of epochs is really like the
minimum one so you might want to
increase this to something like 10 or maybe
12 but if this is the first time that
you train an xtts model I definitely
recommend leaving everything by default
because all of these are super optimized
already and they work really really well
also make sure that you always use the
2.0.2 version this is by far the best
version because the 2.0.3 is really not
as good then you're going to click load
parameters from output folder so once
again just leave everything by default
and then simply click run the training
and then the training will start now I'm
not going to do it because I've already
done it before so I'm actually going to
you know stop the training but basically
once the training is finished it's just
going to say well training is done
you're then going to click on this
little optimize the model button that
will basically make the final files much
much smaller and easier to use and now
if we want to use this you're going to
click on the third tab called
inference click load parameters for
TTS from output folder it will say that
the parameters for TTS were loaded then
you're going to click on this button to
load the model and then to test it out
basically just input the text and then
click inference and after a few
seconds we get something like this uh
this model sounds really good and above
all it's reasonably fast so yeah as you
heard this is much much better this is
very very close to the reference audio
that we used for training uh but
ultimately I had so much confidence now
the reason why we do this training is
that fine-tuning the xtts model allows you to
train on the accent of the speaker the
way the person speaks the speed as well
as a few quirks so like for example if
you listen to the reference audio uh but
ultimately I had so much confidence in
as you can see there is a lot of like
pauses like uh uh etc and all of these
pauses and quirks and sounds are
replicated inside the generated audio so
like for example listen to this hello uh
humans my name is k your AI Overlord uh
uh oh so yeah there you go as you heard
in the very beginning there is a lot of
these uh uh sounds that were present in
the reference audio that are now
replicated inside the generated audio so
doing this training doing this fine
tuning even with the small portion of
audio that we had it was enough for the
model to be able to replicate that voice
and now that you have this model you can
use it as much as you want there is
absolutely no limitation so yeah there
you go with only 2 minutes of audio
which is actually only 30 seconds of
audio we still managed to train a pretty
cool text to speech model but how do we
make it even better that's right because
we're still not done if you remember
this was only the medium text to speech
method because now we can finally start
the ultimate text to speech combination
and inside that method there is even
three different methods that you can use
but essentially what we're doing is
taking the generated audio from text to
speech and putting it inside RVC to make
it even better now if you don't know
what RVC is I definitely recommend
you to watch this video first otherwise
you're not going to understand anything
from this point onward watching this
video is essential for the comprehension
of the rest of the video because RVC is
absolutely fantastic this is seriously
one of my favorite softwares of all time
because RVC allows you to basically
clone a voice to a basically perfect
level but RVC is only a voice to voice
conversion meaning that you need an
initial audio file before doing the
conversion and this is what we're going
to do inside method a which is a simple
conversion so inside the xtts web UI
you're going to input your text then put
the reference audio file and click
generate so that in the end we get
something like this hello everyone it's
me Barack Obama don't forget to
subscribe to my channel and leave a like
so once again not perfect but it is a
quick cloning so then you're going to
download the file so then you're going
to launch RVC once again if you are one
of my Patreon supporters just use the one click
installer to install this and then from
this point on either you're going to
train your own voice from scratch as I
showed in my RVC video or download an
existing Voice already made by the
community so that you can then select it
inside the reference voice then you're
going to copy the path of the file and
paste it right here and then simply
click convert and after a few seconds we
get something like this hello everyone
it's me Barack Obama don't forget to
subscribe to my channel and leave a like
so yeah as you heard now the voice is
much much better I mean this is pretty
much the voice of Barack Obama and that
is probably because RVC is the most
powerful tool when it comes to cloning
voices so once you have created an audio
file you can easily convert it to any
voice that you want and you can of
course then use that file for anything
you want but all of that is a lot of
work I mean coming from xtts web UI then
downloading the file putting it inside
RVC I mean it's nice and all but if only
there was a way to do it automatically
he hey boy well there actually is and
that is actually our next method using
the third web UI that I haven't talked
about yet called xtts rvc UI which
basically does everything that we just
did automatically so to use this you're
going to go inside the xtts rvc UI folder
then inside rvcs you're going to input
the RVC voice model which is basically
the .pth file and the index file then
you're going to go back inside voices
you're going to input the reference
voice which is basically like the 10
seconds of audio that we use for cloning
and then you're going to go back and launch
this .bat file and it will then give you
a local URL then you can hold control
then left click and then it will open the
web UI and then from here you're going
to choose the language the RVC model
then the voice sample then here you're
going to input your text you can leave
everything by default and then click
submit and in the end you should get
something like this hey guys what's up
it's me Barack don't forget to subscribe
to my channel and leave a like so yeah
basically as you see it first generated
an xtts audio and then automatically
converted it with RVC so basically
everything that we did previously where
we first generated with xtts webui and then
put it inside RVC everything was done
automatically in one single click now it
is a nice web UI to use although compared
to the previous method it has a little
bit less functionality and parameters to
create the final audio that you want
however once again it is like a
compromise it requires less work so that
in the end you get some easy good
results and that's really pretty cool
but how do we make it even better that's
right we are still not done well let's
actually take a look at our guide and
let's take a look at our Uber text to
speech method and what is the Uber text to
speech method well basically we're going
to take everything that we did
previously and put them all together so
basically we're going to take the fine-tuned
xtts model to generate an audio inside
the xtts web UI and then import that
inside RVC because yes don't forget that
we fine-tuned a model from scratch and we
can use this whenever we want so for
this it's very simple we're going to go
inside the xtts fine-tune webui folder go
inside finetune models inside ready and
this what you see right here is your
fine-tuned model these files are your
fine-tuned xtts models so what you're going
to do you're going to select them
control X to cut them then we're going
to go inside the xtts web UI inside
models now you're going to create a new
folder that you're going to name the
name of your speaker so in my case it is
Obama then we go inside and we paste
our files right here and now if you
go back and we launch the webui now if
you click on select xtts model version
you will see our Obama xtts model that we
can now use inside the xtts web UI as
much as we want simple as that so once
again just input your text then input
the reference audio and now you can just
click generate and now we have our final
audio generated with our custom Obama
model inside the xtts web UI to give us
something like this hey guys so what do
you think of this pretty cool video am I
right subscribe right now come on which
is already very very close to the
original Obama voice but of course we
can make it even better by downloading
the file then launching RVC selecting
the reference voice model inputting the
path of the file just like last time and
now we can click convert and finally
after a few seconds we get our final
ultimate Uber text to speech audio file
listen to this hey guys so what do you
think of this pretty cool video am I
right subscribe right now come on I mean
come on this is really really good this
is pretty much the closest and most
powerful text to speech that you can do
on your local computer right now with
the highest level of quality the highest
level of authenticity and all of that
running on your local computer oh and
also since once again it is the same
model you can use your fine-tuned model
inside the xtts rvc UI so if you go inside
models xtts copy those files and put
it like somewhere else in a separate
folder just in case and then if you use
like the fine-tuned models once again just
copy those files inside the models xtts
folder control V replace all the files in
the destination and then it is ready to
be used automatically inside the xtts
RVC UI simple as that so yeah there you
go these were all the methods that you
can use to have the best text to speech
AI models running on your computer and
if you want to have like this little
graph so that you always remember what
to do I will make the PDF available for
free on my patreon in the description
down below and talking of patreon do
not forget that I provide priority
support for my patreon supporters so if
you have any questions whatsoever do not
hesitate to send me a DM and I will try
to answer your question as soon as
possible the link for my patreon will be
in the description down below so yeah
there you go now you know pretty much
everything there is to know on how to
run the best text to speech models running on
your local computer so that no matter
who you are and what your project is
you can do so without paying some
exorbitant fees for some third party
software so definitely try this out
yourself and have some fun and there you
have it folks thank you guys so much for
watching don't forget to subscribe and
smash the like button for the YouTube
algorithm thank you also so much to
supporters for supporting my videos you
guys are absolutely awesome you people
are literally the reason why I'm able to
make these videos so thank you so much
and I'll see you guys next time bye-bye