Zero-shot speech generation benchmark
Synthesizing speech with a speaker identity not seen during training presents a significant challenge. Traditionally, achieving this required extensive training on many speakers to ensure a continuous speaker space[1]. Even the most performant methods, such as RVC, still need minimal fine-tuning on ~10 minutes of target speaker data to achieve reasonable quality. However, approaches leveraging large models are gaining momentum. For instance, Microsoft’s VALL-E[2] boldly claims to clone a speaker’s voice from just 3 seconds of reference speech. In this blog post, we present a benchmark of voice conversion technologies, comparing Revoice to widely used zero-shot VC baselines.
Testsets
Typical evaluations of Voice Conversion systems rely on objective metrics collected from running conversion on unseen multi-speaker corpora.
We design the evaluation to be insightful for the Revoice use case: multi-speaker corpora serve as the source (input) audio, and a library of speakers from the Revoice app serves as the target (reference) audio. Input audio is derived from:
- VCTK - a classical voice conversion benchmark with clean recordings and multiple accents.
- DAPS corpus[3] - emulated mobile-device recordings in various conditions. This dataset more closely resembles the audio quality we receive as a Voice Conversion service.
Metrics
We measure three model-based objective metrics for the converted speech:
- Speaker similarity: we measure the cosine distance between latent speaker representations of the converted speech and the reference audio, extracted with the ECAPA[4] speaker encoder from SpeechBrain (a minimal computation sketch follows this list).
- Speech intelligibility: we run speech recognition on the converted speech with NVIDIA’s Conformer-Transducer ASR model and measure the Character Error Rate (CER) against the reference transcription.
- Naturalness: we use UTMOS[5], a pre-trained Mean Opinion Score estimator released by its authors.
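As an illustration, here is a minimal sketch of how the similarity and intelligibility metrics can be computed with off-the-shelf tools. File paths and reference texts are placeholders, the SpeechBrain checkpoint shown is a commonly used one and may differ from our exact setup, and the ASR transcript is assumed to come from the Conformer-Transducer model mentioned above (not shown here).

```python
import torch
import torchaudio
import jiwer
from speechbrain.pretrained import EncoderClassifier

# Speaker encoder used for the similarity metric (ECAPA embeddings from SpeechBrain).
spk_encoder = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained/ecapa",
)

def speaker_distance(converted_wav: str, reference_wav: str) -> float:
    """Cosine distance between ECAPA embeddings of converted and reference audio."""
    embeddings = []
    for path in (converted_wav, reference_wav):
        signal, sr = torchaudio.load(path)
        # The encoder expects 16 kHz mono input.
        signal = torchaudio.functional.resample(signal.mean(dim=0, keepdim=True), sr, 16000)
        embeddings.append(spk_encoder.encode_batch(signal).squeeze())
    cos = torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0)
    return 1.0 - cos.item()

def character_error_rate(reference_text: str, asr_transcript: str) -> float:
    """CER between the ASR transcript of the converted audio and the reference text."""
    return jiwer.cer(reference_text, asr_transcript)

# Example usage with placeholder paths/texts:
# dist = speaker_distance("converted.wav", "reference.wav")
# cer = character_error_rate(ground_truth_text, asr_transcript)
```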
Baselines
We select two widely used systems as baselines. Both are trained on a large number of speakers and are capable of zero-shot speech generation.
- YourTTS[6] (from 2021) is a VITS-based model with adjustments, trained on the VCTK and LibriTTS datasets. It uses an invertible normalizing flow to disentangle speaker identity from the spectrogram representation. A handy tutorial on how to run it can be found here; a minimal conversion sketch also follows this list.
- BARK (from 2023) is a large (350M-parameter) decoder-only transformer that generates speech from “semantic tokens”. These are self-supervised representations extracted with HuBERT[7] that effectively disentangle content (semantics) from speaker characteristics. Running Voice Conversion with BARK is not straightforward, because the extraction of semantic tokens is not released; Suno.ai only provides prediction of semantic tokens from text. Fortunately, there are community-contributed semantic token extractors that are compatible with BARK. This makes it possible to create custom voice profiles and perform voice conversion by adjusting semantic tokens and voice profiles, as shown in this notebook and sketched below.
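For YourTTS, zero-shot voice conversion can be run through the Coqui TTS Python API in a few lines. This is only a sketch with placeholder file paths, using the YourTTS checkpoint name published in the Coqui model zoo:

```python
from TTS.api import TTS

# Load the released YourTTS checkpoint from the Coqui TTS model zoo.
tts = TTS(model_name="tts_models/multilingual/multi-dataset/your_tts")

# Zero-shot voice conversion: re-synthesize the source utterance
# with the speaker identity taken from the reference recording.
tts.voice_conversion_to_file(
    source_wav="input_utterance.wav",    # content to convert (placeholder path)
    target_wav="reference_speaker.wav",  # reference speaker audio (placeholder path)
    file_path="converted.wav",
)
```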
The autoregressive transformer decoder in BARK is significantly slower than parallel conversion in YourTTS, but it has greater potential due to the model’s scalability.
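And here is a rough sketch of the BARK conversion pipeline described above. `extract_semantic_tokens` is a hypothetical stand-in for one of the community-contributed extractors (their APIs differ between implementations), and the voice profile path is a placeholder; only `semantic_to_waveform` and `SAMPLE_RATE` come from the BARK package itself.

```python
import numpy as np
from scipy.io.wavfile import write as write_wav

from bark import SAMPLE_RATE
from bark.api import semantic_to_waveform

# Hypothetical helper standing in for a community-contributed HuBERT-based
# extractor that maps input audio to BARK-compatible semantic tokens.
from my_semantic_extractor import extract_semantic_tokens  # placeholder, not part of BARK

# 1. Extract semantic tokens (the content) from the source utterance.
semantic_tokens = np.array(extract_semantic_tokens("input_utterance.wav"))

# 2. Choose the target voice profile: a BARK "history prompt" .npz that
#    carries the reference speaker's semantic/coarse/fine prompts.
voice_profile = "speaker_profiles/target_speaker.npz"  # placeholder path

# 3. Re-generate audio from the semantic tokens while conditioning the
#    coarse and fine stages on the target voice profile.
audio = semantic_to_waveform(semantic_tokens, history_prompt=voice_profile)

write_wav("converted.wav", SAMPLE_RATE, audio)
```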
Results
We present the results of the evaluations in the tables below.
Here is the performance of the systems on VCTK:
Model | Naturalness (MOS, ↑) | Intelligibility (CER, %, ↓) | Similarity (cosine distance, ↓) |
---|---|---|---|
no model | 4.06 | 0.17 | - |
YourTTS* | 3.21 | 1.08 | 0.613 |
BARK | 3.49 | 2.58 | 0.692 |
Revoice | 3.45 | 1.36 | 0.614 |
And here is the performance on DAPS:
Model | Naturalness (MOS, ↑) | Intelligibility (CER, %, ↓) | Similarity (cosine distance, ↓) |
---|---|---|---|
no model | 2.39 | 2.755 | - |
YourTTS | 2.08 | 26.7 | 0.655 |
BARK | 2.85 | 14.77 | 0.738 |
Revoice | 2.81 | 16.56 | 0.564 |
A small example of how the systems actually sound. For these inputs:
The systems produce the following outputs:
YourTTS shows excellent performance on VCTK but degrades significantly on noisier inputs.
BARK consistently delivers clean and intelligible audio, but the speaker similarity lags.
Revoice competes with BARK in terms of naturalness and intelligibility while making a leap
forward in terms of speaker similarity.
References
[1] Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
[2] Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
[3] Can we Automatically Transform Speech Recorded on Common Consumer Devices in Real-World Environments into Professional Production Quality Speech? — A Dataset, Insights, and Challenges
[4] ECAPA-TDNN Embeddings for Speaker Diarization
[5] UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022
[6] YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
[7] HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units
* YourTTS uses VCTK in training, which might give slightly overly optimistic results.