3 minute read

Neural text-to-speech brought unprecedented improvements in the naturalness of synthetic speech. But it came at a cost. While parametric and concatenative speech synthesis systems produce tens of seconds of audio in a single second of wall time (>10 xRT*) on one CPU core, neural TTS requires far more computational power: you often need a GPU to achieve compelling latency for responsive applications. Fortunately, where there is a will, there is a way. Let’s dive into on-device Neural TTS and see what Balacoon has to offer.

On-device Neural TTS recap

Several milestones of Neural TTS evolution are worth mentioning in this regard. Generating the raw waveform is the most computationally expensive part of synthesis. WaveRNN[1] from Google pioneered real-time synthesis on CPU. The authors used sparsification (dropping most of the neural network weights) and subscaling (generating multiple samples simultaneously) to achieve remarkable results. Later, LPCNet[2] brought these advances, as well as the idea of mixing classic signal processing with neural networks, to the public. And finally, as GAN-based vocoding took over the domain, MB-MelGAN[3] broke the curse of auto-regressive waveform generation.
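To make the sparsification idea a bit more concrete, here is a rough, hypothetical numpy sketch of magnitude-based block pruning. It only illustrates the end result: the actual WaveRNN/LPCNet recipes prune gradually during training on a schedule and keep the surviving blocks contiguous so they can be multiplied with vectorized (SIMD) kernels.

```python
import numpy as np

def block_sparsify(w: np.ndarray, keep_ratio: float = 0.05, block: int = 16) -> np.ndarray:
    """Keep only the highest-magnitude blocks of `block` consecutive weights.

    Illustrative sketch only, not the WaveRNN/LPCNet training-time procedure.
    """
    assert w.shape[1] % block == 0
    blocks = w.reshape(w.shape[0], -1, block)         # (rows, n_blocks, block)
    scores = np.abs(blocks).mean(axis=-1)             # magnitude score per block
    n_keep = max(1, int(scores.size * keep_ratio))
    threshold = np.sort(scores, axis=None)[-n_keep]   # global cut-off value
    mask = (scores >= threshold)[..., None]           # broadcast over the block dimension
    return (blocks * mask).reshape(w.shape)

# e.g. a recurrent weight matrix: ~95% of weights dropped, block structure preserved
w = np.random.randn(1024, 1024).astype(np.float32)
w_sparse = block_sparsify(w)
print(1.0 - np.count_nonzero(w_sparse) / w_sparse.size)  # ~0.95
```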

Acoustic feature prediction was a less acute problem and scaled down reasonably well. The widely adopted FastSpeech2[4] has just 30M parameters and already runs reasonably fast. And with LightSpeech[5], Microsoft has shown that it is possible to shrink it down to 2M parameters.

So once VITS[6] and JETS[7] paved the way to end-to-end speech synthesis, combining acoustic feature prediction and vocoding, it was already clear that low-resource end-to-end TTS was just around the corner. Indeed, NIX-TTS[8] came into the game, squeezing the whole Neural TTS backend into 5M parameters that run 0.5xRT on a Raspberry Pi 3B.

Implementations available

While LPCNet is not so widely used anymore, it is worth mentioning because its implementation contains valuable engineering insights, such as sparsification and vectorization. TensorFlowTTS combines the aforementioned FastSpeech2 and MB-MelGAN in an Android example powered by TFLite. The Nix-TTS authors have released their code and models. And lastly, there is Piper, which competes with Nix-TTS in terms of performance (also 5M-parameter models) but, instead of distillation, simply downscales the VITS architecture.

Introducing Light💡

We built our own lightweight TTS model, called Light. It has fewer parameters than the default JETS models, so it compromises on quality and on multi-speaker, multi-lingual capabilities. It also delivers only 16kHz audio instead of 24kHz. On the other hand, it provides an order of magnitude faster synthesis on the CPU.

Degradation compared to the full-scale model on the held-out test set of the “92” Hi-Fi speaker:

| Model | Naturalness (MOS↑) | Intelligibility (CER, %↓) |
| --- | --- | --- |
| recordings | 3.92 | 0.32 |
| en_us_hifi_jets_cpu.addon | 4.0 | 0.28 |
| en_us_hifi92_light_cpu.addon | 3.89 | 0.32 |

Synthesis speed on AMD Ryzen Threadripper 1950X:

| Model/System | faster than real-time (xRT↑) |
| --- | --- |
| en_us_hifi_jets_cpu.addon | 6.02 |
| Piper (ljspeech) | 29.15 |
| en_us_hifi92_light_cpu.addon | 50.86 |

Synthesis speed on Raspberry Pi 3B with Cortex-A53:

| Model/System | faster than real-time (xRT↑) |
| --- | --- |
| Piper (ljspeech) | 1.13 |
| en_us_hifi92_light_cpu.addon | 2.33 |

You can try out en_us_hifi92_light_cpu.addon in our huggingface space and use it with the balacoon_tts Python package, as described in the tutorial.
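For reference, here is a minimal usage sketch with the balacoon_tts package. It follows my reading of the tutorial and the huggingface space demo; treat the addon path, the returned sample format, and the exact method names as assumptions and defer to the tutorial if they differ.

```python
# pip install balacoon-tts  (wheels are distributed via Balacoon's package index; see the tutorial)
import wave

from balacoon_tts import TTS

# path to the addon downloaded from the huggingface space (assumed local path)
tts = TTS("./en_us_hifi92_light_cpu.addon")

# the light addon is single-speaker, but the API still expects a speaker name
speaker = tts.get_speakers()[-1]

# synthesize() is assumed to return 16-bit PCM samples (numpy int16)
samples = tts.synthesize("On-device neural TTS is finally practical.", speaker)

# Light produces 16kHz audio, as mentioned above
with wave.open("out.wav", "wb") as f:
    f.setnchannels(1)
    f.setsampwidth(2)  # 16-bit samples
    f.setframerate(16000)
    f.writeframes(samples.tobytes())
```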


[1] Efficient Neural Audio Synthesis

[2] LPCNet: Improving Neural Speech Synthesis Through Linear Prediction

[3] Multi-Band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech

[4] FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

[5] LightSpeech: Lightweight and Fast Text-to-Speech with Neural Architecture Search

[6] Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

[7] JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech

[8] NIX-TTS: Lightweight and End-to-End Text-to-Speech via Module-wise Distillation

* There is some confusion around “xRT” (times real-time) terminology. Some people mean “how much audio is produced in one second of wall time”; others mean “how much time it takes to synthesize one second of audio”. While the latter is generally more popular, I stick with the former because numbers like “30xRT” and “50xRT” are easier to comprehend and compare than “0.033xRT” and “0.02xRT”.
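To illustrate the convention used throughout this post with a small worked example:

```python
audio_seconds = 10.0  # duration of the synthesized speech
wall_seconds = 0.2    # time it took to synthesize it

xrt = audio_seconds / wall_seconds  # 50 xRT under the convention used here
rtf = wall_seconds / audio_seconds  # 0.02 under the inverse ("real-time factor") convention
print(f"{xrt:.0f} xRT, RTF {rtf:.3f}")
```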