Speech Generation Evaluation and Leaderboard
As speech generation research grows, many new models have emerged; see open_tts_tracker for a comprehensive list. Similar to the arena-style leaderboards for LLMs, TTSArena (and its clone) ranks popular systems based on human preferences.
While ELO-based ranking relies on human judgment, objective metrics are also commonly used for leaderboards. Although speech generation lacks a universal standard for objective evaluation, the community has developed several useful metrics (a minimal sketch of how they are typically computed follows this list):
- Intelligibility - measured by speech recognition error rates on synthetic speech
- Naturalness - predicted using models trained on human naturalness ratings
- Similarity - for voice cloning systems, measured by cosine similarity between speaker embeddings of reference and generated speech
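To make these metrics concrete, here is a minimal sketch of how intelligibility and speaker similarity are often computed. The specific model choices (openai-whisper for transcription, jiwer for the error rate, Resemblyzer for speaker embeddings) are illustrative assumptions, not necessarily what speech_gen_eval uses; naturalness would similarly be scored by a learned MOS predictor such as UTMOS.

```python
# Illustrative sketch only -- speech_gen_eval may use different ASR and embedding models.
import numpy as np
import jiwer                                          # pip install jiwer
import whisper                                        # pip install openai-whisper
from resemblyzer import VoiceEncoder, preprocess_wav  # pip install resemblyzer


def intelligibility_wer(generated_wav: str, reference_text: str) -> float:
    """Transcribe the synthetic speech with ASR and compute the word error rate."""
    asr = whisper.load_model("base")
    hypothesis = asr.transcribe(generated_wav)["text"]
    return jiwer.wer(reference_text.lower().strip(), hypothesis.lower().strip())


def speaker_similarity(reference_wav: str, generated_wav: str) -> float:
    """Cosine similarity between speaker embeddings of the reference and generated audio."""
    encoder = VoiceEncoder()
    ref = encoder.embed_utterance(preprocess_wav(reference_wav))
    gen = encoder.embed_utterance(preprocess_wav(generated_wav))
    return float(np.dot(ref, gen) / (np.linalg.norm(ref) * np.linalg.norm(gen)))


if __name__ == "__main__":
    print("WER:", intelligibility_wer("generated.wav", "the expected transcript"))
    print("Similarity:", speaker_similarity("reference.wav", "generated.wav"))
```

Lower error rates and higher similarity are better; in practice these scores are averaged over a whole test set of utterances.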
We would like to introduce:
- speech_gen_eval (github) - an open-source library for objective evaluation of speech generation models
- speech_gen_eval_testsets (huggingface) - a collection of test sets for evaluating speech generation models
- speech_gen_baselines (huggingface) - a dataset containing synthetic speech from common models and their evaluation results
- TTSLeaderboard (huggingface) - a leaderboard for speech generation models
There are other tools for objective evaluation of speech generation, including ZS-TTS-Evaluation, seed-tts-eval, tts-scores, evaluate-zero-shot-tts, and another leaderboard called TTSDS Scores. Our contribution focuses on:
- Simple addition of new metrics and test sets
- Support for various speech generation models (vocoders, zero-shot TTS, zero-shot VC, etc.), so the same tooling can be reused across tasks
- Mature engineering with easy installation and fast evaluation
- Letting you listen to the generated speech side by side, to build intuition about how the metrics relate to perceived quality
We have been using most of these tools internally for a while, and we are now making them publicly available.
We plan to expand the leaderboard with more models and add new metrics to the library. We welcome your suggestions and contributions!