In Text-to-Speech synthesis (TTS), Instant Voice Cloning (IVC) enables the TTS model to clone the voice of any reference speaker using a short audio sample, without requiring additional training for the reference speaker. This technique is also known as Zero-Shot Text-to-Speech Synthesis. The Instant Voice Cloning approach allows for flexible customization of the generated voice and demonstrates significant value across a wide range of real-world situations, including customized chatbots, content creation, and interactions between humans and Large Language Models (LLMs).
Although current voice cloning frameworks perform well, they still face a few challenges. The first is flexible voice style control: models lack the ability to manipulate voice styles flexibly after cloning the voice. The second major roadblock is zero-shot cross-lingual voice cloning: for training, current models require access to an extensive massive-speaker multilingual (MSML) dataset, irrespective of the language.
To tackle these issues and contribute to the enhancement of instant voice cloning models, developers have worked on OpenVoice, a versatile instant voice cloning framework that replicates the voice of any user and generates speech in multiple languages using a short audio clip from the reference speaker. OpenVoice demonstrates that Instant Voice Cloning models can replicate the tone color of the reference speaker and achieve granular control over voice styles including accent, rhythm, intonation, pauses, and even emotions. What’s more impressive is that the OpenVoice framework also achieves zero-shot cross-lingual voice cloning for languages outside the MSML dataset, allowing it to clone voices into new languages without extensive pre-training for that language. OpenVoice delivers superior instant voice cloning results while remaining computationally viable, with operating costs up to 10 times lower than currently available APIs that offer inferior performance.
In this article, we will talk about the OpenVoice framework in depth, and we will uncover its architecture that allows it to deliver superior performance across instant voice cloning tasks. So let’s get started.
As mentioned earlier, Instant Voice Cloning, also referred to as Zero-Shot Text-to-Speech Synthesis, allows the TTS model to clone the voice of any reference speaker using a short audio sample, without any additional training for the reference speaker. Instant Voice Cloning has long been a hot research topic. Existing works such as the XTTS and VALL-E frameworks extract speaker embeddings and/or acoustic tokens from the reference audio, which then serve as a condition for an auto-regressive model. The auto-regressive model generates acoustic tokens sequentially, and these tokens are then decoded into a raw audio waveform.
Although auto-regressive instant voice cloning models clone the tone color remarkably well, they fall short in manipulating other style parameters including accent, emotion, pauses, and rhythm. Furthermore, auto-regressive models suffer from low inference speed, and their operational costs are quite high. Existing approaches like the YourTTS framework employ a non-autoregressive design that delivers significantly faster inference speed than autoregressive frameworks, but they are still unable to give users flexible control over style parameters. Moreover, both autoregressive and non-autoregressive instant voice cloning frameworks need access to a large massive-speaker multilingual (MSML) dataset for cross-lingual voice cloning.
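The sequential dependency described above can be illustrated with a toy loop. This is a minimal sketch and not the XTTS or VALL-E implementation: `next_token` is a hypothetical stand-in for a learned model, and the codebook size and arithmetic are arbitrary assumptions. It only shows that each acoustic token depends on the speaker condition and on every token generated before it, which is why generation proceeds one step at a time.

```python
from typing import List

VOCAB_SIZE = 1024  # assumed codebook size, chosen only for illustration


def next_token(speaker_condition: List[int], history: List[int]) -> int:
    """Hypothetical stand-in for a learned model: a deterministic toy rule."""
    return (sum(speaker_condition) + 31 * sum(history) + len(history)) % VOCAB_SIZE


def generate_tokens(speaker_condition: List[int], num_steps: int) -> List[int]:
    """Generate acoustic tokens one at a time, each conditioned on the past."""
    tokens: List[int] = []
    for _ in range(num_steps):
        # Step t cannot start until steps 0..t-1 have produced their tokens.
        tokens.append(next_token(speaker_condition, tokens))
    return tokens
```

In a real system the resulting token sequence would be passed to a decoder that produces the waveform; that stage is omitted here.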
To tackle the challenges faced by current instant voice cloning frameworks, developers have worked on OpenVoice, an open source instant voice cloning library that aims to resolve the following challenges faced by current IVC frameworks.
- The first challenge is to give IVC frameworks flexible control over style parameters beyond tone color, including accent, rhythm, intonation, and pauses. Style parameters are crucial for generating natural, in-context conversations and speech rather than narrating the input text monotonously.
- The second challenge is to enable IVC frameworks to clone cross-lingual voices in a zero-shot setting.
- The final challenge is to achieve high real-time inference speeds without deteriorating the quality.
To tackle the first two hurdles, the architecture of the OpenVoice framework is designed to decouple the components of a voice as far as possible. OpenVoice generates tone color, language, and other voice features independently, enabling the framework to flexibly manipulate individual language types and voice styles. The framework addresses the third challenge by design, since the decoupled structure reduces computational complexity and model size requirements.
OpenVoice : Methodology and Architecture
The technical approach behind the OpenVoice framework is effective and surprisingly simple to implement. Cloning the tone color of any speaker, adding new languages, and enabling flexible control over voice parameters simultaneously is challenging, because doing all three at once requires a large amount of combinatorial data in which the controlled parameters intersect. By contrast, in regular single-speaker text to speech synthesis, which does not require voice cloning, it is easier to add control over other style parameters. Building on these observations, the OpenVoice framework decouples the instant voice cloning task into subtasks. The model uses a base speaker text to speech model to control the language and style parameters, and employs a tone color converter to embed the reference tone color into the generated voice. The following figure demonstrates the architecture of the framework.
At its core, the OpenVoice framework employs two components: a tone color converter and a base speaker text to speech (TTS) model. The base speaker TTS model is either a single-speaker or a multi-speaker model that allows precise control over style parameters, language, and accent. The voice it generates is then passed on to the tone color converter, which changes the base speaker tone color to the tone color of the reference speaker.
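The two-stage design described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the OpenVoice API: `Speech`, `base_speaker_tts`, `tone_color_converter`, and `clone_voice` are hypothetical names, and audio is represented only by labels for its attributes. The point is that the first stage decides text, language, and style, while the second stage touches nothing but the tone color.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Speech:
    """Toy stand-in for an audio clip, described only by its attributes."""
    text: str
    language: str
    style: str
    tone_color: str


def base_speaker_tts(text: str, language: str, style: str) -> Speech:
    """Hypothetical base speaker TTS: controls language and style only."""
    return Speech(text=text, language=language, style=style,
                  tone_color="base_speaker")


def tone_color_converter(speech: Speech, reference_tone_color: str) -> Speech:
    """Hypothetical converter: swaps tone color, leaves everything else alone."""
    return replace(speech, tone_color=reference_tone_color)


def clone_voice(text: str, language: str, style: str,
                reference_tone_color: str) -> Speech:
    """Two-stage pipeline: generate with the base speaker, then convert."""
    base = base_speaker_tts(text, language, style)
    return tone_color_converter(base, reference_tone_color)
```

Because the stages only communicate through the intermediate speech, either one could in principle be swapped out without changing the other, which is the flexibility the decoupled design is after.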
The OpenVoice framework offers a lot of flexibility when it comes to the base speaker text to speech model since it can employ the VITS model with slight modification allowing it to accept language and style embeddings in its duration predictor and text encoder. The framework can also employ models like Microsoft TTS that are commercially cheap or it can deploy models like InstructTTS that are capable of accepting style prompts. For the time being, the OpenVoice framework employs the VITS model although the other models are also a feasible option.
Coming to the second component, the tone color converter is an encoder-decoder structure with an invertible normalizing flow in the center. The encoder is a one-dimensional CNN that takes the short-time Fourier transform (STFT) spectrum of the speech produced by the base speaker text to speech model as its input and generates feature maps as output. The tone color extractor is a simple two-dimensional CNN that operates on the mel-spectrogram of the input voice and generates a single feature vector that encodes the tone color information. The normalizing flow layers take the feature maps generated by the encoder as input and produce a feature representation that preserves all style properties but eliminates the tone color information. The OpenVoice framework then applies the normalizing flow layers in the inverse direction: they take this feature representation together with the tone color of the reference speaker as input, and output feature maps that carry the reference tone color. Finally, the framework decodes these feature maps into raw waveforms using a stack of transposed one-dimensional convolutions.
The entire architecture of the OpenVoice framework is feed-forward, without any auto-regressive component. The tone color converter is conceptually similar to voice conversion but differs in functionality, training objectives, and the inductive bias in the model structure. The normalizing flow layers share the same structure as those in flow-based text to speech models but differ in functionality and training objectives.
Although other approaches to extracting feature representations exist, the method implemented by the OpenVoice framework delivers better audio quality. It is also worth noting that the OpenVoice framework does not set out to invent new components: both main components, the tone color converter and the base speaker TTS model, are sourced from existing works. The primary aim of the OpenVoice framework is to form a decoupled framework that separates language control and voice style from tone color cloning. Although the approach is quite simple, it is quite effective, especially on tasks that control styles and accents or that require generalization to new languages. Achieving the same control with a coupled framework requires a large amount of compute and data, and it does not generalize well to new languages.
At its core, the main philosophy of the OpenVoice framework is to decouple the generation of language and voice styles from the generation of tone color. One of the major strengths of the OpenVoice framework is that the cloned voice is fluent and of high quality as long as the base speaker TTS model speaks fluently.
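The forward-then-inverse use of the flow can be illustrated with a toy invertible transform. This is a minimal sketch under loud assumptions, not the OpenVoice model: the learned normalizing flow is replaced by an element-wise affine map, the 2D CNN tone color extractor is replaced by a per-bin average over time, and the feature vector is assumed to have the same length as the tone color vector. All function names are hypothetical.

```python
from typing import List, Tuple

Vector = List[float]


def extract_tone_color(mel_frames: List[Vector]) -> Vector:
    """Toy extractor (assumption): average each mel bin over time."""
    num_frames = len(mel_frames)
    num_bins = len(mel_frames[0])
    return [sum(frame[i] for frame in mel_frames) / num_frames
            for i in range(num_bins)]


def _scale_and_shift(tone_color: Vector) -> Tuple[Vector, Vector]:
    """Derive strictly positive scales and shifts from the tone color."""
    scales = [1.0 + abs(c) for c in tone_color]
    shifts = list(tone_color)
    return scales, shifts


def flow_forward(features: Vector, tone_color: Vector) -> Vector:
    """Forward direction: remove the tone color from the features."""
    scales, shifts = _scale_and_shift(tone_color)
    return [(x - b) / s for x, s, b in zip(features, scales, shifts)]


def flow_inverse(latent: Vector, tone_color: Vector) -> Vector:
    """Inverse direction: embed a tone color into tone-free features."""
    scales, shifts = _scale_and_shift(tone_color)
    return [z * s + b for z, s, b in zip(latent, scales, shifts)]


def convert(features: Vector, source_tone: Vector,
            reference_tone: Vector) -> Vector:
    """Strip the source tone color, then apply the reference tone color."""
    return flow_inverse(flow_forward(features, source_tone), reference_tone)
```

Because the map is invertible, running the forward direction on the converted features with the reference tone color recovers the same tone-free representation, which is the property that lets the style information survive the conversion.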
OpenVoice : Experiment and Results
Evaluating voice cloning is hard for several reasons. First, existing works often use different training and test data, which makes numerical comparisons between them intrinsically unfair. Although crowd-sourcing can be used to evaluate metrics like Mean Opinion Score, the difficulty and diversity of the test data influence the outcome significantly. Second, different voice cloning methods are trained on different data, and the diversity and scale of this data influence the results significantly. Finally, the primary objectives of existing works often differ from one another, so the methods differ in functionality.
Due to the three reasons mentioned above, it is unfair to compare existing voice cloning frameworks numerically. Instead, it makes much more sense to compare these methods qualitatively.
Accurate Tone Color Cloning
To analyze its performance, the developers built a test set whose reference speakers include anonymous individuals, game characters, and celebrities, and which covers a wide voice distribution including both neutral samples and unique expressive voices. The OpenVoice framework is able to clone the reference tone color and generate speech in multiple languages and accents for any of the reference speakers and the 4 base speakers.
Flexible Control on Voice Styles
One of the objectives of the OpenVoice framework is to control speech styles flexibly, using a tone color converter that can modify the tone color while preserving all other voice features and properties.
Experiments indicate that the model preserves the voice styles after converting to the reference tone color. In some cases however, the model neutralizes the emotions slightly, a problem that can be resolved by passing less information to the flow layers so that they are unable to get rid of the emotion. The OpenVoice framework is able to preserve the styles from the base voice thanks to its use of a tone color converter. It allows the OpenVoice framework to manipulate the base speaker text to speech model to easily control the voice styles.
Cross-Lingual Voice Clone
The OpenVoice framework does not use any massive-speaker data for an unseen language, yet it achieves near zero-shot cross-lingual voice cloning. Its cross-lingual voice cloning capabilities are twofold:
- The model clones the tone color of the reference speaker accurately even when the language of the reference speaker is unseen in the massive-speaker multilingual (MSML) dataset.
- Furthermore, when the language of the reference speaker is unseen, the OpenVoice framework can still clone the voice of the reference speaker and speak in that language, on the condition that the base speaker text to speech model supports the language.
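The two points above amount to a simple rule that can be written down directly. The sketch below is only an illustration of that rule, with hypothetical names and made-up language codes: whether the reference language was seen during training is recorded but does not block cloning, while the language of the generated speech must be supported by the base speaker TTS model.

```python
from dataclasses import dataclass
from typing import Set


@dataclass(frozen=True)
class CloningCheck:
    reference_language_seen: bool  # was the reference language in the MSML data?
    can_generate: bool             # can speech be produced in the target language?


def check_cross_lingual(reference_language: str,
                        target_language: str,
                        msml_languages: Set[str],
                        base_tts_languages: Set[str]) -> CloningCheck:
    """Apply the toy rule: only base TTS support limits the target language."""
    return CloningCheck(
        reference_language_seen=reference_language in msml_languages,
        can_generate=target_language in base_tts_languages,
    )
```

For example, with an assumed MSML set of `{"en", "zh"}` and a base TTS that supports `{"en", "ja"}`, a Japanese reference speaker is unseen in training, yet generation in Japanese is still possible under this rule.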
Final Thoughts
In this article we have talked about OpenVoice, a versatile instant voice cloning framework that replicates the voice of any user and generates speech in multiple languages using a short audio clip from the reference speaker. The primary intuition behind OpenVoice is that as long as a model does not have to perform tone color cloning of the reference speaker, a framework can employ a base speaker TTS model to control the language and the voice styles.
OpenVoice demonstrates that Instant Voice Cloning models can replicate the tone color of the reference speaker and achieve granular control over voice styles including accent, rhythm, intonation, pauses, and even emotions. OpenVoice delivers superior instant voice cloning results while remaining computationally viable, with operating costs up to 10 times lower than currently available APIs that offer inferior performance.