RIP ELEVENLABS! Create BEST TTS AI Voices LOCALLY For FREE!

Aitrepreneur
9 May 2024, 17:45

TLDR: In this video, the host SK shows viewers how to create high-quality, custom text-to-speech (TTS) AI voices on their local computers for free. The video covers a range of methods, from a quick voice cloning technique that needs only a 10-second clip to a fine-tuned model that requires only 2 minutes of audio. The host also introduces RVC (Retrieval-based Voice Conversion) for further voice enhancement. The video concludes with a step-by-step guide to combining these methods into an 'Uber' TTS method that produces a highly authentic, personalized AI voice. The host provides resources for software installation and encourages viewers to try these methods to avoid costly third-party services.

Takeaways

  • 📢 The video is about creating high-quality, custom text-to-speech (TTS) AI voices on your local computer for free.
  • 💻 Two installation methods are provided: a one-click installer for supporters and a manual installation process.
  • 🔍 Before running the one-click installer, the FFmpeg install must be launched as an administrator so that FFmpeg is added to the system path.
  • 📝 For manual installation, you need Python for Windows, FFmpeg, and C++ build tools pre-installed.
  • 🔗 The video provides links in the description to clone repositories for different TTS models and web UIs.
  • ⏱️ A simple quick cloning method is demonstrated using just 10 seconds of an audio clip.
  • 🎓 The 'medium text-to-speech' method involves training your own TTS model with only 2 minutes of audio.
  • 📈 The 'Uber text-to-speech' method combines the fine-tuned TTS model with RVC (Retrieval-based Voice Conversion) for higher quality.
  • 🤖 RVC is a powerful tool for cloning voices but requires an initial audio file for conversion.
  • 🌐 An automatic method using the XTTS-RVC UI is introduced, which removes the manual steps between TTS generation and RVC conversion.
  • 📈 The final method uses a fine-tuned model within the XTTS-RVC UI for the best results.
  • 🎉 The presenter encourages viewers to try these methods out for themselves and offers support through Patreon.

Q & A

  • What is the main topic of the video?

    -The main topic of the video is about creating custom text-to-speech (TTS) AI voices locally on your computer for free.

  • What are the different methods introduced in the video for creating TTS AI voices?

    -The video introduces methods ranging from a quick 10-second voice cloning to training an ultimate, high-quality TTS voice model using only 2 minutes of audio.

  • What software is mentioned for installing necessary tools for TTS?

    -FFmpeg, Python, and C++ build tools are mentioned as prerequisites for installing the tools used to create TTS AI voices.

  • How much audio is needed for the simplest voice cloning method?

    -For the simplest voice cloning method, only 10 seconds of an audio clip is needed.

  • What is the benefit of using the one-click installer for supporters?

    -The one-click installer simplifies the installation process by automatically adding FFmpeg to the system path and installing the required web UIs with minimal user interaction.

  • How long does it take to generate a TTS voice using the quick cloning method?

    -It takes only a few seconds to generate a TTS voice using the quick cloning method.

  • What is the medium text-to-speech method?

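    -The medium text-to-speech method involves fine-tuning your own XTTS model on just 2 minutes of audio. The host describes this as training a model 'from scratch', but the process adapts an existing pretrained model rather than starting from nothing.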

  • What is RVC and how is it used in the ultimate text-to-speech method?

    -RVC (Retrieval-based Voice Conversion) is voice conversion software that can clone a voice to a near-perfect level. It works on existing audio rather than on text, so in the ultimate text-to-speech method it is used to further refine the audio generated by the TTS model and make it sound even more authentic.

  • What is the 'XTTS-RVC UI' and how does it simplify the process?

    -The 'XTTS-RVC UI' is a web user interface that automates the process of generating TTS audio and then converting it with RVC. It does everything with a single click, so there is no need to move files manually between the two programs.

  • What does the 'Uber text-to-speech method' involve?

    -The 'Uber text-to-speech method' is a combination of using a fine-tuned TTS model to generate audio and then importing that audio into RVC for further enhancement, resulting in a highly authentic and high-quality voice output.

  • How can the user ensure they have the best version of the TTS model for use?

    -The user should use version 2.0.2 of the TTS model, which the video describes as the best version, and avoid 2.0.3, which it says is not as good.

  • What additional resource is offered to help remember the steps for creating TTS voices?

    -A PDF graph is offered as an additional resource to help remember the steps for creating TTS voices. It is available for free on the creator's Patreon.

Outlines

00:00

🎙️ Introduction to Custom Text-to-Speech AI Voices

The video introduces a comprehensive guide to creating custom text-to-speech AI voices on one's own computer. The host, SK, promises to cover a range of methods, from quick voice cloning to more sophisticated techniques. The video also provides two installation options: a one-click installer for patrons and a manual installation process that requires Python, FFmpeg, and C++ build tools. The manual method involves cloning repositories and installing the necessary files via the command line.
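
Before attempting the manual installation, it can help to confirm that the prerequisites are actually reachable from the command line. The following is a minimal sketch, not part of the video: it assumes the tools are expected on the system path, that Git is needed because the manual method clones repositories, and that Python 3.10 is a reasonable minimum (the video does not state a version). The C++ build tools are not checked here.

```python
import shutil
import sys

# FFmpeg is named in the video; Git is assumed because repositories are cloned.
REQUIRED_TOOLS = ("ffmpeg", "git")
MIN_PYTHON = (3, 10)  # assumption: the video does not state a minimum version


def check_prerequisites(tools=REQUIRED_TOOLS, min_python=MIN_PYTHON):
    """Return a dict mapping each prerequisite name to True (found) or False."""
    report = {"python": sys.version_info[:2] >= min_python}
    for tool in tools:
        # shutil.which returns None when the executable is not on the path.
        report[tool] = shutil.which(tool) is not None
    return report


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

If FFmpeg shows as missing even after installation, the likely cause is that it was not added to the system path, which is what launching the FFmpeg install as an administrator is meant to take care of.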

05:02

🚀 Quick Cloning with 10 Seconds of Audio

The first method demonstrated is a quick voice cloning technique requiring only 10 seconds of an audio clip. Using the XTTS web UI, viewers are shown how to input text, select a language, upload a voice clip, and generate a text-to-speech output. The process is efficient and allows for a high character limit, enabling users to generate lengthy audio scripts from existing audio clips.
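
The video performs this step in the web UI. For readers who prefer a script, the same idea can be sketched with the Coqui TTS Python package, which provides the XTTS model. This is an illustration under assumptions, not the host's workflow: it assumes the `TTS` package is installed, the function names `clip_duration_seconds` and `clone_voice` are hypothetical, and the 10-second minimum simply mirrors the clip length used in the video rather than a library requirement.

```python
import wave


def clip_duration_seconds(wav_path):
    """Length of an uncompressed WAV file in seconds (standard library only)."""
    with wave.open(wav_path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def clone_voice(text, reference_wav, out_path, language="en", min_seconds=10.0):
    """Speak `text` in the voice heard in `reference_wav` using XTTS."""
    # The threshold mirrors the video's 10-second clip; it is not a hard limit.
    if clip_duration_seconds(reference_wav) < min_seconds:
        raise ValueError(f"reference clip should be at least {min_seconds} s long")

    # Imported lazily so the duration check works without the heavy dependency.
    from TTS.api import TTS

    tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")
    tts.tts_to_file(
        text=text,
        speaker_wav=reference_wav,
        language=language,
        file_path=out_path,
    )
    return out_path
```

A call such as `clone_voice("Hello there.", "reference.wav", "output.wav")` would download the model on first use and write the generated speech to `output.wav`.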

10:04

🤖 Training Your Own XTTS Model

The second method involves training a personalized text-to-speech model using just 2 minutes of audio. The process begins with uploading an audio file to the XTTS fine-tune web UI and creating a dataset. The model is then trained with default settings, which are optimized for best results. The fine-tuned model captures the nuances of the speaker's voice, including accent, speech patterns, and filler words, allowing for a highly personalized text-to-speech output.
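
The dataset step is handled automatically by the fine-tune web UI, but it may be useful to see roughly what such a dataset looks like. The sketch below is an assumption-laden illustration: it supposes the audio has already been cut into short clips with transcripts, and that the trainer reads pipe-delimited metadata with `audio_file`, `text`, and `speaker_name` columns. Those column names, the 15% evaluation share, and the helper names are assumptions and are not taken from the video.

```python
import csv
import random


def split_dataset(rows, eval_fraction=0.15, seed=0):
    """Shuffle rows deterministically and split them into (train, eval) lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    # Keep at least one evaluation row whenever there is more than one clip.
    n_eval = max(1, int(len(rows) * eval_fraction)) if len(rows) > 1 else 0
    return rows[n_eval:], rows[:n_eval]


def write_metadata(rows, path):
    """Write (audio_file, text, speaker_name) rows as pipe-delimited text."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="|")
        writer.writerow(["audio_file", "text", "speaker_name"])  # assumed header
        writer.writerows(rows)
```

With 2 minutes of audio cut into clips of a few seconds each, this yields a few dozen rows, most of which go to training and the rest to evaluation.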

15:06

🎧 The Ultimate Text-to-Speech Method

The final method, referred to as the 'Uber text-to-speech method,' combines text-to-speech generation with RVC (Retrieval-based Voice Conversion) for enhanced voice quality. The process involves using a previously fine-tuned XTTS model to generate an audio file, which is then imported into RVC to create an even more authentic voice clone. The video also mentions the XTTS-RVC UI, which automates this process. The result is a highly realistic and customizable text-to-speech AI voice that can be used without incurring high fees from third-party software.
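
The two-stage structure described above can be sketched as a small pipeline. This is only an outline of the idea: `uber_tts` and the file names are hypothetical, and the `synthesize` and `convert` callables stand in for the fine-tuned XTTS model and for RVC, whose actual interfaces are not shown in this summary.

```python
from pathlib import Path
from typing import Callable


def uber_tts(
    text: str,
    synthesize: Callable[[str, Path], Path],
    convert: Callable[[Path, Path], Path],
    work_dir: Path,
) -> Path:
    """Generate speech from text, then pass the result through voice conversion."""
    work_dir.mkdir(parents=True, exist_ok=True)
    # Stage 1: the fine-tuned TTS model turns text into a raw audio file.
    raw_audio = synthesize(text, work_dir / "tts_raw.wav")
    # Stage 2: the voice conversion step (RVC in the video) refines that audio.
    return convert(raw_audio, work_dir / "rvc_final.wav")
```

Passing the stages in as functions keeps the intermediate file transfer in one place, which is the manual step that an automated UI removes.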

Keywords

💡Text to Speech (TTS)

Text to Speech (TTS) is a technology that converts written text into audible speech. In the video, TTS is the central theme, as the host discusses various methods to create high-quality AI voices for TTS using different software and techniques. An example from the script is when the host mentions 'text to speech AI voices' and 'custom text to speech AI voices,' indicating the process of synthesizing speech from text.

💡Voice Cloning

Voice cloning refers to the process of replicating a person's voice using AI and machine learning. The video demonstrates how to clone a voice with just 10 seconds of audio, which is a significant part of the content. The host uses voice cloning to create a TTS model that mimics a specific speaker's voice, as shown when they use a clip from an Obama interview to create a cloned voice.

💡Local Text-to-Speech AI

Local text-to-speech AI means running AI voice generation directly on a user's computer rather than relying on cloud-based services. The video emphasizes creating these AI voices locally to avoid fees and maintain control, as mentioned when the host says 'the best text speech AI voices on your local computer.'

💡FFmpeg

FFmpeg is a popular open-source multimedia framework for decoding, encoding, and converting audio and video files. In the context of the video, FFmpeg is the tool that handles audio processing behind the scenes. The host instructs viewers to install FFmpeg to support the TTS processes, specifically mentioning 'launch the FFmpeg install as admin.'
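
As an illustration of the kind of audio processing FFmpeg handles, the sketch below converts an arbitrary audio file into a mono WAV file from Python. It is not taken from the video: the function names are hypothetical, the 22050 Hz sample rate is an assumed default, and the code expects `ffmpeg` to be on the system path.

```python
import subprocess


def ffmpeg_to_wav_cmd(src, dst, sample_rate=22050, mono=True):
    """Build the ffmpeg argument list that converts `src` to a WAV file at `dst`."""
    # -y overwrites the output, -i names the input, -ar sets the sample rate.
    cmd = ["ffmpeg", "-y", "-i", str(src), "-ar", str(sample_rate)]
    if mono:
        cmd += ["-ac", "1"]  # -ac sets the number of audio channels
    cmd.append(str(dst))
    return cmd


def convert_to_wav(src, dst, **kwargs):
    """Run the conversion; raises CalledProcessError if ffmpeg reports failure."""
    subprocess.run(ffmpeg_to_wav_cmd(src, dst, **kwargs), check=True)
    return dst
```

Building the argument list separately from running it makes the command easy to inspect or log before any file is touched.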

💡Python

Python is a widely-used high-level programming language that is essential for developing and running the TTS models discussed in the video. The host mentions the need for Python when detailing the manual installation process, stating 'make sure that you have python G for Windows installed onto your computer.'

💡XTTS Web UI

XTTS Web UI refers to the graphical user interface for XTTS, Coqui's text-to-speech model, which is used to generate AI voices. The video script describes using the XTTS Web UI for tasks such as voice cloning and generating TTS audio, as in 'then launch the start xtts webui .bat file' and 'you're going to go inside the xtts webui folder.'

💡Training a Model

Training a model in the context of the video means teaching a machine learning algorithm to generate speech that resembles a specific voice using a dataset of audio recordings. The host explains how to train an XTTS model using just 2 minutes of audio, as referenced when they say 'we're going to train our own xtts model that's right, we're going to train our own text to speech model from scratch.'

💡RVC (Retrieval-based Voice Conversion)

RVC, short for Retrieval-based Voice Conversion, is a voice conversion tool that transforms existing speech audio so that it sounds like a target voice. The video showcases RVC for refining the generated TTS audio to make it sound even more like the original voice, as indicated by 'taking the generated audio from text to speech and putting it inside RVC to make it even better.'

💡Fine-tuning

Fine-tuning in machine learning means continuing to train a pretrained model on a smaller dataset so that it adapts to a specific task. In the video, the host discusses fine-tuning an XTTS model using a short audio clip to improve the voice's accuracy, as seen in 'we're going to fine-tune our XTTS model using only 2 minutes of audio.'

💡One-click Installer

A one-click installer is a software installation method that automates the setup process with a single user action. The host mentions a one-click installer for Patreon supporters, which simplifies the installation of necessary software for TTS, as stated in 'that is available for my Patreon supporters, just download the two links to your computer, then before the install you need to actually launch the FFmpeg install as admin.'

💡Patreon

Patreon is a membership platform that allows creators to receive financial support from their audience through monthly subscriptions. The host refers to Patreon as a way to support the channel and gain access to additional resources like a PDF guide, as mentioned in 'I will make the PDF available for free on my Patreon in the description down below.'

Highlights

Create custom text-to-speech (TTS) AI voices on your local computer for free.

Multiple methods available from quick 10-second voice cloning to the ultimate TTS voice.

Use one-click installer for Patreon supporters or manual installation for custom setups.

Install FFmpeg and C++ build tools for a complete TTS environment.

Clone a voice with just 10 seconds of audio using the XTTS web UI.

No character limit for input text, making it versatile for various applications.

Train your own TTS model with only 2 minutes of audio using the XTTS fine-tune web UI.

Fine-tuning allows capturing the speaker's accent, speech patterns, and unique vocal quirks.

RVC software enhances the cloned voice to near perfection.

The XTTS-RVC UI automates the process of voice cloning and conversion.

Uber text-to-speech method combines the best of TTS and RVC for high-quality voice generation.

No limitations on the usage of your fine-tuned model post-training.

The process is cost-effective, eliminating the need for expensive third-party software.

Support and resources available for Patreon supporters, including priority assistance.

A PDF guide will be available for free on Patreon for easy reference.

The video provides a comprehensive guide to setting up and using the best TTS models locally.

Subscribe and support the channel for more informative and practical tech guides.