Voice Cloning with OpenVoice 2

January 22nd, 2026

As part of my personal goals in 2025, I investigated voice cloning and text-to-speech technology. Specifically, I focused on OpenVoice 2’s instant voice cloning for a personal project centered on audiobook narration and production. This document records my findings in the form of a technical walkthrough.


The availability of AI-driven Text-to-Speech (TTS) and voice cloning technologies presents creative opportunities in the audio production industry. This guide focuses on using OpenVoice 2, an advanced open source TTS and voice cloning tool, to produce high-fidelity output. The aim is to serve as a technical roadmap and a first step toward essential post-production tools for audiobook narrators, enabling seamless error correction and autogeneration of in-character dialog. The guide also details how voice cloning works in OpenVoice via the decoupling of tone, style, and content, and includes step-by-step instructions for installation, environment setup, and executing both basic TTS and instant voice cloning workflows. Finally, it touches on challenges encountered during the guide's development.

Context & Motivation

While this guide is ultimately motivated by the long-term goal of supporting professional audiobook narrators, the technical focus is strictly on demonstrating high-fidelity instant voice cloning using OpenVoice 2. The guide breaks down the technology’s architecture, justifies the selection of OpenVoice, and provides all necessary instructions for installation, set up, and execution of cloning workflows.

Scope & Audience

This guide is structured as a practical technical roadmap and assumes a moderate level of familiarity with tools like the command line, Python scripting, and virtual environments.

Terminology & Understanding TTS

This section provides the foundational terms and concepts referenced throughout the guide.

Terms

Artifacts (Audio Artifacts): Unwanted or unnatural sounds that can sometimes appear in synthesized audio, often indicating a limitation of the model or input quality.

Checkpoints: Saved files that capture the trained state (weights and parameters) of a machine learning model at a specific point in time, necessary for running the model during inference.

Cross-Lingual Cloning: A model’s ability to speak new text in a target language while maintaining the cloned tone color of the reference speaker, even if the reference audio was in a different source language.

Inference: The process of using a trained machine learning model to make a prediction or generate an output, like synthesized speech, based on new input data.

Large Language Model (LLM): A type of AI model, often based on transformer architecture, trained on massive amounts of text data. While not the core of OpenVoice, LLMs often influence or are integrated with advanced TTS/voice cloning systems (like Bark’s token generation).

Model Architecture (Decoupled): In the context of voice cloning, a system design that physically separates the control over different speech components (e.g., separating the base TTS model from the tone color converter).

Phoneme: A distinct unit of sound in a spoken language.

Universal Phoneme System: The complete collection of phonemes for all languages that are supported by OpenVoice. Ideally this would include all phonemes in all languages.

Style/Prosody: The rhythmic and expressive elements of speech, including emotion, accent, rhythm, intonation, and pauses.

Text-to-Speech (TTS): The general technology used to synthesize human-sounding speech from written text input.

Tone Color (Timbre): The unique acoustic fingerprint or identity of a voice independent of pitch, volume, or content.

Voice Cloning: The process of creating an artificial voice model that accurately reproduces the unique acoustic qualities (timbre, tone) of a specific person’s voice, allowing the model to speak novel text.

Zero-Shot Cloning: The ability of a model to accurately clone a voice using only a single, short reference audio clip (often just a few seconds) without requiring specific training data for that speaker.

The Voice Cloning Strategy

For more specific information, the paper on OpenVoice is available here. This section will give a brief and simplified overview of how voice cloning works in OpenVoice and why it is so innovative in the field of text-to-speech.

OpenVoice is an instant voice cloning strategy that only requires a short audio clip from a reference speaker for generation of new speech in multiple languages. This is possible because it uses a decoupled framework that separates voice style and language from tone color. This removes the need for speaker-specific training.
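As a rough mental model, the decoupling can be sketched with stub functions (this is illustrative pseudocode in Python, not the real OpenVoice API; the actual calls appear in the workflow sections below):

```python
# Conceptual sketch of OpenVoice's decoupled design (stubs, not the real API).
# The base TTS controls language and style; the converter controls only tone color.

def base_tts(text):
    # stands in for a base speaker TTS such as MeloTTS
    return f"wav[{text}]"

def extract_tone_color(reference_clip):
    # stands in for the tone color (speaker embedding) extraction
    return f"se[{reference_clip}]"

def convert_tone_color(wav, target_se):
    # stands in for the tone color converter
    return f"{wav}+{target_se}"

def clone(text, reference_clip):
    # tone color is applied after TTS, which is why no
    # speaker-specific training is needed
    return convert_tone_color(base_tts(text), extract_tone_color(reference_clip))
```

Because the two stages are independent, swapping the base TTS or the reference speaker does not require retraining anything.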

Additionally, it uses a universal phoneme system to remove language dependencies and limitations. Other existing methods rely on a Massive Speaker Multi-Lingual (MSML) dataset in order to generate speech in a specific language. The universal phoneme system lets OpenVoice pick out specific phonetics and generate output without reference to the MSML set, meaning output does not depend on the target language being present in the dataset during training or generation.

The paper provides the following figure which demonstrates the generation process nicely.

Practical Implementation: Setting Up OpenVoice 2

While the implementation details are well documented within the paper, installation and usage documentation is a bit sparse. For reference, this is the documentation that the OpenVoice team provides for using their method. The goal of this section is to set up custom voice cloning with user input, filling in where the documentation doesn’t go into enough detail.

Environment and Prerequisites

  • Install the Python dependencies from the cloned OpenVoice repository:
pip install -r requirements.txt
  • Install MeloTTS, the base TTS used by OpenVoice v2, along with its dictionary data:
pip install git+https://github.com/myshell-ai/MeloTTS.git
python -m unidic download
  • If on Linux, you will have to install ffmpeg:
sudo apt install ffmpeg
  • You may have to download additional files (the script will indicate if you have to and how to do so, usually through a Python terminal):
import nltk; nltk.download(...)
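Putting the prerequisites together, a typical end-to-end setup might look like the following (a sketch assuming a conda environment and the myshell-ai/OpenVoice repository; adjust names and versions to your system):

```shell
# clone the OpenVoice repository and install it into a fresh environment
git clone https://github.com/myshell-ai/OpenVoice.git
cd OpenVoice
conda create -n openvoice python=3.9 -y
conda activate openvoice
pip install -e .

# install MeloTTS, the base TTS used by OpenVoice v2, and its dictionary data
pip install git+https://github.com/myshell-ai/MeloTTS.git
python -m unidic download
```

You will also need the checkpoints_v2 archive linked from the OpenVoice repository, extracted into the project root, since the scripts below reference paths like checkpoints_v2/converter.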

Hardware Requirements

A reasonable GPU would be helpful. A blog here benchmarked a few different GPUs and recommended an RTX 2070 or better. The one caveat is that the 40-series GPUs were not compatible at the time the blog post was written. That said, the script can work on a CPU as well; it just takes much longer to execute (seconds on a GPU vs. minutes on a CPU).

Voice Generation and Cloning Workflows

In this section we will go through an example of how voice cloning works within OpenVoice v2. There are a number of sample speakers in the GitHub repository (in the resources folder), but for this example I wanted to use something more recognizable. After considering different movie monologues I decided to go with Liam Neeson’s monologue from Taken.

Something important to note is that OpenVoice v2 is specifically for voice cloning. This means that the tone color extraction is where OpenVoice does all of its work. The tone color, however, does not control the emotion that the underlying base speaker model uses. MeloTTS (also by myshell-ai) is the TTS we use here before tone cloning, and it does not support emotion. There seems to be some disagreement between the two development communities (MeloTTS and OpenVoice) on where emotion should fall in the diagram above. Using another emotionally colored TTS like Bark before applying tone coloring with OpenVoice may be a way around this.

All that is to say, if Liam Neeson doesn’t sound as angry or serious as he does in his monologue, the choice of underlying TTS may be a factor. Here is the reference audio I used for tone coloring. You’ll notice I did very little preprocessing, for two reasons. First, there was not a lot of distortion or background noise in the audio clip. Second, I wanted to see how easy it would be to clone the voice with minimal effort. I didn’t want to use Ableton to remove the background audio, because that isn’t something everyone following this introduction would be expected to do. Voice isolation could be an entire guide in itself, but in this case it is left as an exercise for the reader.

The code

The file I will be referencing throughout the next two sections can be dropped into the top level of the OpenVoice directory we cloned earlier. I will copy portions of the code in the following subsections for reference. If you read through the demo sections in the openvoice repo (demo_part3.ipynb, in particular) you’ll notice that the code is mostly identical. I adapted the demo code to something that allows more rapid iteration/development from your preferred shell.

Basic Text-to-Speech (TTS) Generation

Voice cloning is relatively independent of text-to-speech. Our goal in this section is to transform text into clear speech. The following code snippet, pulled from the test file above, demonstrates a simple text-to-speech conversion using a reference text_file.

# import the text-to-speech library
# this doesn't have to be MeloTTS; it can be any TTS that outputs a wav file
from melo.api import TTS

...

# grab the text from our text file
with open(text_file) as f:
    text = f.read()

...

model = TTS(language=language, device=device)

# grab the speaker ids from your model or select one manually
# in our case language="EN_NEWEST" corresponds to "EN-Newest"
speaker_ids = model.hps.data.spk2id

...

# write our tts intermediate file to src_path
# it's called src_path here because the output is used as the src
# for the tone color converter in the next section
model.tts_to_file(text, speaker_id, src_path, speed=speed)

Now we have our output file. In the test case we use the following input text (adapted from the demo text to have more of a Taken flavor):

Did you ever hear a folk tale about a giant turtle? ... That turtle has a very special set of skills... Waddling... Swimming... Chewing on seaweed... Don't make me get the turtle.

The ellipses are necessary in order to break up speech timing in MeloTTS. Other TTS options have ways of breaking speech up with laughter, pauses, and other vocal flair, but for the purposes of this introduction the goal is to keep it simple.
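If you assemble input text programmatically, the pauses can be inserted with a trivial helper (a convenience function of my own, not part of MeloTTS):

```python
def add_pauses(sentences, pause="... "):
    """Join sentence fragments with ellipses so MeloTTS breaks up the speech timing."""
    return pause.join(s.strip() for s in sentences)

# the joined string can then be fed to model.tts_to_file as in the snippet above
text = add_pauses(["Waddling", "Swimming", "Chewing on seaweed"])
```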

The intermediate audio output can be found here. On listening, you’ll notice that there is no tone cloning whatsoever. That is intentional. OpenVoice’s model works on the output of the TTS and applies tone to that intermediate audio, as we’ll see in the next section. That also means that once the basic TTS is done on each voice clip, the tone from multiple reference voices can be cloned onto those clips without reprocessing the input audio.

There are also noticeable artifacts at the end of some of the words. “Waddling,” “swimming,” and “seaweed” all seem to have an upward inflection that make them sound like they have an extra “-y” at the end: “seaweedy” is the most prominent of the three.

Instant Voice Cloning Procedure

Now that we have the intermediate audio output we can apply tone coloring to the sample.

import torch
from openvoice import se_extractor
from openvoice.api import ToneColorConverter

...

# checkpoint converter we downloaded from OpenVoice
ckpt_converter = 'checkpoints_v2/converter'

# determine if we're using gpu or cpu. both seemed to work relatively well for testing
device = "cuda:0" if torch.cuda.is_available() else "cpu"

# initialize the tone color converter from our checkpoints
tone_color_converter = ToneColorConverter(f'{ckpt_converter}/config.json', device=device)
tone_color_converter.load_ckpt(f'{ckpt_converter}/checkpoint.pth')

# extract the target speaker embedding from the reference audio using the openvoice extractor
target_se, _ = se_extractor.get_se(reference_file, tone_color_converter, vad=True)

# load the source speaker embedding from the checkpoints we downloaded

...

source_se = torch.load(f'checkpoints_v2/base_speakers/ses/{speaker_key}.pth', map_location=device)

# encode a watermark message in the audio so it can be identified as AI generated audio
encode_message = "@MyShell"

# run the tone color converter on the audio clip from the previous section
tone_color_converter.convert(
    audio_src_path=src_path,
    src_se=source_se,
    tgt_se=target_se,
    output_path=save_path,
    message=encode_message
)

From this relatively short bit of code we get this output. Without any major adjustments or preprocessing, it is a pretty convincing rendering of the input audio (from MeloTTS) in the voice of our reference audio.

Controlling Style and Delivery

Techniques for adjusting rhythm, emotion, and pace so that generated sections match existing narration would make the previous audio clip better. However, these qualities depend on the input audio, not on the tone coloring/cloning. There are text-to-speech engines like Bark that allow for emotional expression in TTS audio. The issue is that the checkpoint files OpenVoice expects as input are tuned to the voices that MeloTTS generates. This is where the previous version (v1) comes in. With the TTS used for that version you can specify a style from several options, and its checkpoint files are compatible with v2. In a more extensible version of the file that I created, you can specify v1 or v2 for either the TTS or the tone coloring.

usage: tone_color_tester.py [-h] [-r REFERENCE] [-o OUTPUT_DIR] [-t TEXT_FILE] [-l {EN_NEWEST,EN,ES,FR,ZH,JP,KR,en_newest,en,es,fr,zh,jp,kr}] [-s SPEED]
                            [-i INTERMEDIATE_FILE] [--tts-v1] [--tc-v1] [--style {friendly,cheerful,excited,sad,angry,terrified,shouting,whispering}]

optional arguments:
  -h, --help            show this help message and exit
  -r REFERENCE, --reference REFERENCE
                        reference mp3 file to use for tone coloring
  -o OUTPUT_DIR, --output-dir OUTPUT_DIR
                        output directory
  -t TEXT_FILE, --text-file TEXT_FILE
                        text input to TTS
  -l {EN_NEWEST,EN,ES,FR,ZH,JP,KR,en_newest,en,es,fr,zh,jp,kr}, --language {EN_NEWEST,EN,ES,FR,ZH,JP,KR,en_newest,en,es,fr,zh,jp,kr}
                        language for output
  -s SPEED, --speed SPEED
                        speed of output speech, where 1.0 is standard; allowed range is 0.1-5.0, rounded to tenths
  -i INTERMEDIATE_FILE, --intermediate-file INTERMEDIATE_FILE
                        intermediate file to use instead of text to speech; will only run tone coloring on the supplied wav file
  --tts-v1              if flag is supplied, script will use tts v1 instead of v2
  --tc-v1               if flag is supplied, script will use openvoice v1 instead of v2
  --style {friendly,cheerful,excited,sad,angry,terrified,shouting,whispering}
                        style of voicing to use for tts

For example, you can run:

python3 tone_color_tester.py -r resources/taken.mp3 -t resources/input.txt --style whispering --tts-v1

Which will use resources/taken.mp3 as a reference for tone coloring, text from the input.txt file, and style it as whispering in TTS v1. This works fairly well for generating tone coloring with emotional expression.
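The speed handling described in the help text (1.0 is standard, allowed range 0.1-5.0, rounded to tenths) boils down to a small normalization step. A sketch of that validation (the helper name is my own, not from the script):

```python
def normalize_speed(speed):
    """Clamp a requested speech speed to the allowed 0.1-5.0 range, rounded to tenths."""
    clamped = min(max(float(speed), 0.1), 5.0)
    return round(clamped, 1)
```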

Other Applications & Future Considerations

Intermediate file difficulties

Since the motivation of this guide is to support audiobook narration, I thought it best to separate the quality of text-to-speech from tone cloning: use reference audio that already has emotion and inflection taken care of, then clone a tone onto that. The difficulty here is that there is no source segments file to reference for arbitrary audio. I got around this by forcing the conversion to use the newest English segments provided in checkpoints v2. However, the raw TTS output sounds nothing like the voice those segments represent, and the result is a garbled mess. If this were productionized, there would have to be a way to generate the segments file from the intermediate input. This would likely be computationally expensive and would drastically increase the run time for the first input. After that, the cached segments file could be reused whenever the user invokes tone coloring.
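The caching idea can be sketched independently of the models themselves (a hypothetical wrapper; `extractor` stands in for whatever expensive segment/embedding extraction is used):

```python
# hypothetical per-reference cache so the expensive extraction
# only runs the first time a given reference voice is used
_se_cache = {}

def get_cached_se(reference_path, extractor):
    """Return the extracted segments for reference_path, computing them only once."""
    if reference_path not in _se_cache:
        _se_cache[reference_path] = extractor(reference_path)  # expensive first call
    return _se_cache[reference_path]
```

In practice the cache would be persisted to disk (much like the .pth files under checkpoints_v2) rather than held in memory.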

AI Model Volatility

At the beginning of this project the plan was originally to use Coqui-AI’s TTS which had the capability to self-train models. Shortly after adopting and beginning work on this project, Coqui-AI announced that they were no longer supporting the product. This led to some scrambling to find another open source product that would do what the other claimed to, and the selection of OpenVoice. The TTS that Coqui-AI created is maintained in a fork, but this appearance and quick disappearance of emergent tech seems to plague the open source community surrounding AI. Whether it’s running out of funding or a lack of interest from supporters, a lot of new AI projects pop up only to fade into irrelevance. This can make it difficult to choose what library to learn when selecting among several unknown options.

Quality

Based on my experience toying with the various open source tone manipulation tools, there is a pretty big disparity between the closed source market offerings for voice manipulation (see elevenlabs.io for a free example) and the open source options. Some of the quality issues can be ascribed to the choice of using a reference file for tone cloning, but it seems that more people (and corporations) are moving away from tone cloning and instead offering their own pregenerated voices. This was a problem when researching the text-to-speech offerings, since training my own model was a big motivator for the project.

Conclusion & Future Plans

This was a fun foray into a section of computer science and software engineering in which I had no prior experience. While the outputs from this project may not be a production-ready innovation that allows audiobook narrators to cut their workload, it was a good experience for learning what the landscape of voice generation and manipulation looks like. I will continue to refine and develop the script I’ve written for tone color testing and incorporate other methods of training. As next steps, I would like to incorporate the fork of Coqui-AI’s TTS and investigate segment generation for inputs to tone coloring so that model training is easier. Figuring out how to Dockerize the setup while still allowing GPU access would also be a big upgrade to the usability of the project outputs.