HierSpeech++: Hierarchical Variational Inference for Zero-shot Speech Synthesis


Recent progress in the capabilities of large language models has played a crucial role in the advancement of LLM-based frameworks for audio generation and speech synthesis, especially in the zero-shot setting. Traditional speech synthesis frameworks have also improved significantly by integrating additional components such as neural audio codecs for discrete audio and speech units. Although these speech and audio synthesis frameworks deliver satisfactory results, there is still room for improvement, as current LLM-based audio frameworks have three major limitations:

  1. They tend to generate audio autoregressively, which hurts robustness, slows inference, and results in mispronunciation, skipping, or repetition. 
  2. They tend to over-rely on discrete speech units or pre-trained neural audio codecs. 
  3. They often require a large amount of training data. 

To tackle the issues mentioned above and improve the capabilities of LLM-based audio and speech synthesis models, developers have come up with HierSpeech++, a robust and efficient zero-shot speech synthesizer for voice conversion and text-to-speech, or TTS. The HierSpeech++ framework builds upon the lessons of hierarchical speech synthesis frameworks, boosting not only the robustness but also the expressiveness of the synthetic speech output, along with the naturalness and speaker similarity of artificially generated speech, even in a zero-shot setting. 

In this article, we will talk about the HierSpeech++ framework in detail and have a look at the model's architecture, how it works, and its results compared against state-of-the-art text and audio generation models. So let's get started. 

HierSpeech++ is a fast, robust, and efficient zero-shot speech synthesis framework that uses a hierarchical speech synthesis pipeline. By adopting an end-to-end speech synthesis approach, the HierSpeech++ model maximizes the potential of high-quality waveform generation and hierarchically bridges the gap between semantic and acoustic representations, adopting a self-supervised speech representation as the semantic speech representation and thus attempting to solve the current limitations of style adaptation. The end-to-end speech synthesis approach was first introduced by the VITS model, which adopts a VAE, or Variational AutoEncoder, augmented with adversarial training and normalizing flows. Furthermore, VAE-based frameworks with an end-to-end training pipeline can generate high-quality waveform audio whose perceptual quality is significantly better than that of other speech synthesis frameworks. 
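
As a rough sketch of what this means (this is the generic conditional-VAE bound that VITS-style models optimize, not the exact HierSpeech++ objective; the adversarial and other auxiliary losses are omitted):

\[ \log p_\theta(x \mid c) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z)\right] \;-\; D_{\mathrm{KL}}\!\left(q_\phi(z \mid x)\,\|\,p_\theta(z \mid c)\right), \]

where x is the waveform, c is the conditioning information (text or a semantic representation), z is the latent acoustic variable, and the normalizing flow is used to make the prior p_\theta(z | c) more expressive.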

The audio reconstruction quality of these frameworks can be enhanced further by using a hierarchical conditional Variational AutoEncoder, as in the HierSpeech framework. Despite their potential, models with an end-to-end training pipeline have certain limitations, especially in a zero-shot setting: although they can synthesize high-quality speech samples, speaker similarity in zero-shot voice cloning still falls short, and they suffer from high computational complexity. Diffusion-based speech synthesis models, on the other hand, perform well in terms of speaker adaptation, but they are still far from perfect: their iterative generation process slows down inference, they are often vulnerable to noisy data, and because of the mismatch between training and inference in the two-stage generation process that goes from Mel-spectrogram to waveform, the audio quality is not up to the mark. 

To tackle the issues faced by its predecessors, the HierSpeech++ model employs a hierarchical speech synthesizer, a speech super-resolution module, and a text-to-vec component, and introduces an improved hierarchical speech synthesizer built on the hierarchical conditional VAE, or Variational AutoEncoder. In an attempt to push audio quality beyond perceptual quality, the HierSpeech++ framework adopts a dual-audio acoustic encoder to enrich the acoustic posterior, and it enhances out-of-distribution generalization by employing a hierarchical adaptive generator equipped with both conditional and unconditional generation. Furthermore, to disentangle speech components and enhance both speaker-related and speaker-agnostic semantic information, the HierSpeech++ framework adopts a source-filter theory-based multi-path semantic encoder. As a result of employing a Variational AutoEncoder, the HierSpeech++ model can connect and learn representations hierarchically and progressively adapt to the target voice style to infer the waveform audio. Additionally, the HierSpeech++ framework deploys a bidirectional network of normalizing flow Transformers to enhance adaptation and to reduce the mismatch between training and inference. 

Overall, the HierSpeech++ model is a fully-parallel, novel, and robust hierarchical speech synthesis framework aimed at synthesizing speech samples in a zero-shot setting, and it makes the following contributions:

  • Using a hierarchical speech synthesis framework to control and transfer voice styles and prosody. 
  • Enabling data scalability and high-resolution speech synthesis by upsampling the waveform audio from 16 to 48 kHz. 
  • Achieving human-level quality in zero-shot voice conversion and text-to-speech tasks. 

HierSpeech++ : Model Components and Architecture

As discussed, HierSpeech++ is a zero-shot speech synthesis model that attempts to achieve human-level accuracy in terms of voice similarity and speech naturalness. 

The HierSpeech++ model consists of several components, including a hierarchical speech synthesizer, a speech super-resolution module, and a text-to-vec or TTV model, that work in sync with one another; each component can be trained separately and can effectively utilize a large amount of low-resolution speech data for voice cloning. Let's break down the framework and talk about each component. 

Speech Representations

As most of the information in human speech lies below 4 kHz, the HierSpeech++ framework downsamples audio to 16 kHz for speech synthesis. Furthermore, to reconstruct a voice signal, the sampling rate must be at least double the highest frequency component of the voice, which is why the audio is downsampled to 16 kHz rather than lower. To attain enhanced perceptual quality, the HierSpeech++ framework makes use of a speech super-resolution, or SpeechSR, component to upsample the audio from 16 to 48 kHz, and it relies on the low-resolution audio for both semantic and acoustic representations. 
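
As an illustrative sketch of this pre-processing step (the file path is hypothetical and the use of torchaudio is an assumption for illustration, not a statement about the HierSpeech++ codebase):

import torchaudio

# Load an arbitrary speech file (path is hypothetical).
waveform, sr = torchaudio.load("speech_48k.wav")

# Downsample to 16 kHz. By the Nyquist criterion, a 16 kHz sampling rate
# preserves frequency content up to 8 kHz, comfortably above the ~4 kHz band
# that carries most speech information.
waveform_16k = torchaudio.functional.resample(waveform, orig_freq=sr, new_freq=16000)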

For acoustic representations, a traditional text-to-speech or TTS framework employs a Mel-spectrogram as its intermediate acoustic feature, which is transformed from the waveform with the help of an STFT, or Short-Time Fourier Transform. However, it is worth noting that acoustic features are rich representations comprising various attributes, including content and pronunciation, voice information, and more, which makes it difficult for the framework to infer these representations, a situation that often leads to mispronunciations, a lack of similarity, or over-smoothing of the speech. 
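
For reference, a minimal sketch of how such an intermediate Mel-spectrogram is typically computed (the hyperparameters shown are common defaults, not HierSpeech++'s exact settings):

import torch
import torchaudio

# One second of dummy 16 kHz audio stands in for a real utterance.
waveform = torch.randn(1, 16000)

# STFT-based Mel-spectrogram: window the signal, take the Short-Time Fourier
# Transform, then map the linear-frequency bins onto the Mel scale.
mel_transform = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000,
    n_fft=1024,       # STFT window size
    hop_length=256,   # frame shift
    n_mels=80,        # number of Mel bands
)
mel = mel_transform(waveform)  # (channels, n_mels, frames)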

Moving along, to extract a continuous semantic representation from a waveform, the HierSpeech++ framework utilizes a massively multilingual Wav2Vec-based model rather than the popular monolingual self-supervised speech representations. Although a monolingual representation is a good alternative for a single resource-rich language, it degrades the zero-shot voice cloning abilities of a model in terms of both robustness and expressiveness, especially on multilingual speech synthesis tasks. 
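
A minimal sketch of extracting such a continuous representation with the Hugging Face transformers library; the checkpoint name and the choice of layer are illustrative assumptions rather than the exact HierSpeech++ configuration:

import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

# Any Wav2Vec 2.0-style checkpoint can stand in here; "facebook/mms-300m" is
# used purely as an example of a massively multilingual model.
extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/mms-300m")
model = Wav2Vec2Model.from_pretrained("facebook/mms-300m")

waveform = torch.randn(16000)  # one second of dummy 16 kHz audio
inputs = extractor(waveform.numpy(), sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# Take a middle Transformer layer as the semantic representation
# (layer index 7 is an arbitrary illustrative choice).
semantic = outputs.hidden_states[7]  # (batch, frames, hidden_dim)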

Hierarchical Speech Synthesizer

The Hierarchical Speech Synthesizer component is the foundation stone of the HierSpeech++ framework, as it can be trained without any labels such as text transcripts or speaker IDs, relying solely on speech data. To increase the acoustic capacity, previous state-of-the-art speech synthesis models replaced the Mel-spectrogram with a linear spectrogram; however, this approach still falls short in terms of pitch periodicity, PESQ, voiced/unvoiced score, and even Mel-spectrogram distance. The Hierarchical Speech Synthesizer therefore employs a Dual-audio Acoustic Encoder, designed to capture richer and more comprehensive acoustic representations, to address the limitations of relying on a linear spectrogram alone. The framework employs a waveform encoder to distill information from the raw waveform audio, concatenates it with the linear-spectrogram representation, and finally projects this concatenation into the acoustic representation. 
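
A toy PyTorch sketch of this dual-audio idea; every layer size below is invented for illustration and does not come from the HierSpeech++ implementation:

import torch
import torch.nn as nn

class DualAudioAcousticEncoder(nn.Module):
    # Toy dual-audio encoder: one branch reads the raw waveform, the other
    # reads the linear spectrogram; their features are concatenated and
    # projected into a single acoustic representation.
    def __init__(self, spec_bins=513, hidden=192):
        super().__init__()
        # Waveform branch: strided 1-D convolutions distill the raw signal
        # down to roughly the spectrogram frame rate (hop of 256 samples).
        self.wave_encoder = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=16, stride=8, padding=4), nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel_size=8, stride=4, padding=2), nn.GELU(),
            nn.Conv1d(hidden, hidden, kernel_size=8, stride=8),
        )
        # Spectrogram branch: point-wise projection of the linear spectrogram.
        self.spec_encoder = nn.Conv1d(spec_bins, hidden, kernel_size=1)
        # Joint projection of the concatenated representation.
        self.project = nn.Conv1d(2 * hidden, hidden, kernel_size=1)

    def forward(self, waveform, linear_spec):
        w = self.wave_encoder(waveform)     # (B, hidden, T_w)
        s = self.spec_encoder(linear_spec)  # (B, hidden, T_s)
        # Align frame counts before concatenating (toy truncation).
        t = min(w.size(-1), s.size(-1))
        return self.project(torch.cat([w[..., :t], s[..., :t]], dim=1))

# Usage: 1 s of 16 kHz audio plus its linear spectrogram (hop 256 -> ~63 frames).
encoder = DualAudioAcousticEncoder()
acoustic = encoder(torch.randn(1, 1, 16000), torch.randn(1, 513, 63))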

Furthermore, to deal with speaker-agnostic and speaker-related semantic representations, the HierSpeech++ framework utilizes a multi-path self-supervised speech representation, where each individual representation is used for hierarchical style adaptation and the semantic representation used to obtain linguistic information is extracted from a middle layer of MMS. The framework also utilizes the fundamental frequency to enhance speech disentanglement, which makes it possible to control the pitch contour manually. It uses the linguistic representation as conditional information to generate waveform audio hierarchically, along with an enhanced linguistic representation derived from the self-supervised representation. It is also worth noting that the acoustic representations extracted during training from the waveform and the linear spectrogram are used to reconstruct the raw waveform audio, and hierarchical variational inference is used to link the acoustic representations with the multi-path linguistic representations. The framework also employs a hierarchical adaptive generator (HAG) to generate waveforms from semantic representations, and the generated representations, comprising a style representation and an acoustic representation, are fed to the source and waveform generators. 
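
As a rough, generic sketch of what linking representations hierarchically means here (this is the textbook bound for a hierarchical conditional VAE with an acoustic latent z_a and a linguistic latent z_s, not the exact HierSpeech++ objective):

\[ \log p(x \mid c) \;\ge\; \mathbb{E}_{q(z_a \mid x)}\!\left[\log p(x \mid z_a)\right] \;-\; \mathbb{E}_{q(z_s \mid x)}\!\left[ D_{\mathrm{KL}}\!\left(q(z_a \mid x)\,\|\,p(z_a \mid z_s, c)\right) \right] \;-\; D_{\mathrm{KL}}\!\left(q(z_s \mid x)\,\|\,p(z_s \mid c)\right), \]

where the acoustic latent is tied to the waveform through reconstruction, and to the linguistic latent and the conditioning information through the hierarchical prior.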

Text to Vec

For text-to-speech synthesis, the HierSpeech++ framework employs a text-to-vec or TTV model that generates a fundamental frequency and a semantic representation from a text sequence, and it utilizes a monotonic alignment search coupled with a variational autoencoder to align the speech and text internally. The HierSpeech++ framework then replaces the linear spectrogram with a self-supervised speech representation and reconstructs that same representation as the output of the TTV model. 
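
For intuition, here is a minimal NumPy sketch of the monotonic alignment search idea (the dynamic program popularized by Glow-TTS and VITS); the log-likelihood matrix is random dummy data:

import numpy as np

def monotonic_alignment_search(log_probs):
    # Find the monotonic, no-skip alignment of text tokens to speech frames
    # that maximizes the total log-likelihood.
    # log_probs: (T_text, T_speech) matrix of per-pair log-likelihoods.
    t_text, t_speech = log_probs.shape
    Q = np.full((t_text, t_speech), -np.inf)
    Q[0, 0] = log_probs[0, 0]
    for j in range(1, t_speech):
        for i in range(min(j + 1, t_text)):
            stay = Q[i, j - 1]                               # stay on same token
            advance = Q[i - 1, j - 1] if i > 0 else -np.inf  # move to next token
            Q[i, j] = max(stay, advance) + log_probs[i, j]
    # Backtrack from the last token/frame pair to recover the alignment path.
    alignment = np.zeros((t_text, t_speech), dtype=np.int64)
    i = t_text - 1
    for j in range(t_speech - 1, -1, -1):
        alignment[i, j] = 1
        if i > 0 and (i == j or Q[i - 1, j - 1] > Q[i, j - 1]):
            i -= 1
    return alignment

# Toy usage: align 5 text tokens to 20 speech frames of dummy scores.
alignment = monotonic_alignment_search(np.random.randn(5, 20))
print(alignment.sum(axis=1))  # number of frames assigned to each token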

Additionally, the HierSpeech++ framework predicts the fundamental frequency at four times the resolution of the self-supervised speech representation and makes use of a conditional text representation as the prior information. Because the self-supervised speech representation carries semantic information, the framework is able to transfer the prosody style within the text-to-vec model, and it feeds a latent representation to the phoneme encoder to enhance the linguistic content of the representation. 
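
A small sketch of what four-times-larger resolution means in practice, using assumed frame rates for a 16 kHz signal (a 20 ms hop for the self-supervised features versus a 5 ms hop for F0):

import torch

# 2 s of 16 kHz audio: 100 self-supervised frames at a 20 ms hop,
# 400 F0 frames at a 5 ms hop (hop sizes assumed for illustration).
semantic = torch.randn(1, 100, 256)  # (batch, frames, dim)
f0 = torch.randn(1, 400)             # (batch, frames * 4)

# To pair each F0 value with a semantic frame, repeat every semantic frame
# four times along the time axis.
semantic_4x = semantic.repeat_interleave(4, dim=1)  # (1, 400, 256)
assert semantic_4x.size(1) == f0.size(1)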

SpeechSR or Speech Super Resolution

For data efficiency and availability, the HierSpeech++ framework is trained on relatively low-resolution speech, and it up-samples the low-resolution speech waveform to a high-resolution one, from 16 to 48 kHz. The framework also replaces the transposed convolution with a nearest-neighbor upsampler, which has previously been shown to alleviate the artifacts caused by transposed convolutions. 
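
A toy sketch of this design choice, with a nearest-neighbor interpolation followed by ordinary convolutions standing in for SpeechSR's actual blocks; all layer sizes are placeholders:

import torch
import torch.nn as nn

class ToyUpsampler(nn.Module):
    # Nearest-neighbor upsampling (16 kHz -> 48 kHz is a 3x ratio) followed
    # by a convolution, instead of a transposed convolution, to avoid the
    # checkerboard-style artifacts transposed convolutions can introduce.
    def __init__(self, channels=32):
        super().__init__()
        self.pre = nn.Conv1d(1, channels, kernel_size=7, padding=3)
        self.up = nn.Upsample(scale_factor=3, mode="nearest")
        self.post = nn.Conv1d(channels, 1, kernel_size=7, padding=3)

    def forward(self, x):
        return self.post(self.up(self.pre(x)))

# 1 s at 16 kHz in, 1 s at 48 kHz out.
out = ToyUpsampler()(torch.randn(1, 1, 16000))
print(out.shape)  # torch.Size([1, 1, 48000])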

Architecture

The content encoder of the text-to-vec model consists of 16 non-causal WaveNet layers with a kernel size of 5 and a hidden size of 256, whereas the content decoder consists of 8 non-causal WaveNet layers with a kernel size of 5 and a hidden size of 512. The text encoder consists of three prosody-conditional Transformer networks and three unconditional Transformer networks with a kernel size of 9, a filter size of 1024, and a hidden size of 256, and it uses a dropout rate of 0.2. To encode adjacent information and to enhance prosody style adaptation, the framework adopts a CNN with a kernel size of 5 inside the Transformer blocks. SpeechSR, on the other hand, comprises a single AMP block with 32 initial channels and no upsampling layer; it instead makes use of a nearest-neighbor upsampler to upsample the hidden representations, and it utilizes an MPD as the discriminator with six different window sizes along with four sub-band discriminators. 
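
Collected into a plain configuration sketch for readability (the values simply restate the numbers above; nothing beyond them should be read into it):

# Hyperparameters of the text-to-vec and SpeechSR components as described above.
ttv_config = {
    "content_encoder": {"type": "non-causal WaveNet", "layers": 16, "kernel_size": 5, "hidden_size": 256},
    "content_decoder": {"type": "non-causal WaveNet", "layers": 8, "kernel_size": 5, "hidden_size": 512},
    "text_encoder": {
        "prosody_conditional_transformers": 3,
        "unconditional_transformers": 3,
        "kernel_size": 9,
        "filter_size": 1024,
        "hidden_size": 256,
        "dropout": 0.2,
    },
}
speechsr_config = {"amp_blocks": 1, "initial_channels": 32, "upsampler": "nearest neighbor"}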

The figure above demonstrates the inference pipeline of the HierSpeech++ framework, which starts by extracting the semantic representation from 16 kHz audio and the fundamental frequency using the YAAPT algorithm. Before the fundamental frequency is fed to the Hierarchical Synthesizer, it is normalized using the mean and standard deviation of the source audio, and the normalized fundamental frequency is then denormalized using the mean and standard deviation of the target audio. For text-to-speech, the HierSpeech++ framework extracts textual representations instead of speech representations and employs the text-to-vec model to generate a semantic representation conditioned on the prosody prompt. 
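
A small sketch of the F0 normalization and denormalization step described above; the pitch contours are dummy arrays standing in for the output of a pitch tracker such as YAAPT:

import numpy as np

def adapt_f0(source_f0, target_f0):
    # Normalize the source F0 with the source statistics, then denormalize it
    # with the target statistics, so the contour follows the source prosody but
    # sits in the target speaker's pitch range. Unvoiced frames (F0 == 0) are
    # excluded from the statistics and kept at zero.
    src_voiced = source_f0[source_f0 > 0]
    tgt_voiced = target_f0[target_f0 > 0]
    normalized = (source_f0 - src_voiced.mean()) / (src_voiced.std() + 1e-8)
    adapted = normalized * tgt_voiced.std() + tgt_voiced.mean()
    return np.where(source_f0 > 0, adapted, 0.0)

# Dummy pitch contours (Hz).
source = np.abs(np.random.randn(400)) * 20 + 120  # ~120 Hz source speaker
target = np.abs(np.random.randn(400)) * 30 + 220  # ~220 Hz target speaker
converted_f0 = adapt_f0(source, target)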

Experiment and Results

The framework utilizes the publicly available LibriTTS dataset to train the hierarchical synthesizer component, with the first step being to train the model on the train-clean subsets of the dataset and then utilize the remaining data to enable enhanced transfer of voice style. Additionally, to improve diversity and robustness, the framework scales up the training dataset, as demonstrated in the following figure. 

Reconstruction, Resynthesis Tasks, and Voice Conversion

To evaluate the performance of the HierSpeech++ framework on reconstruction and resynthesis tasks, the developers used seven objective metrics, and the results are demonstrated in the following figures for the reconstruction and resynthesis tasks respectively. 

For voice conversion tasks, the framework is evaluated with two subjective metrics, a voice similarity MOS (sMOS) and a naturalness mean opinion score (nMOS), along with three objective naturalness metrics and two objective similarity metrics. 

Moving along, the primary aim of the HierSpeech++ framework is to enable zero-shot speech synthesis, and to evaluate its zero-shot performance, it is compared against other baseline models such as AutoVC, VoiceMixer, and diffusion-based models, with the results demonstrated in the following figure. 

The following figures demonstrate the zero-shot text-to-speech results with noisy prompts and very noisy prompts, respectively. 

Final Thoughts

In this article, we have talked about HierSpeech++, a novel model that enables robust and effective speech synthesis in a zero-shot setting and overcomes the limitations of current speech synthesis frameworks, including their reliance on large amounts of training data, their over-reliance on discrete speech units or pre-trained neural audio codecs, and their tendency to generate audio autoregressively, which hurts robustness, slows inference, and results in mispronunciation, skipping, or repetition. The HierSpeech++ model is a fully-parallel, novel, and robust hierarchical speech synthesis framework aimed at synthesizing speech samples in a zero-shot setting, and it makes the following contributions:

  • Using a hierarchical speech synthesis framework to control and transfer voice styles and prosody. 
  • Enabling data scalability and high-resolution speech synthesis by upsampling the waveform audio from 16 to 48 kHz. 
  • Achieving human-level quality in zero-shot voice conversion and text-to-speech tasks. 
