
As speech generation research grows, many new models have emerged. For a comprehensive list, see open_tts_tracker. As with LLM evaluation, TTSArena (and its clone) provides rankings of popular systems.

While ELO-based ranking relies on human judgment, objective metrics are also commonly used for leaderboards. Though speech generation lacks a universal standard for objective evaluation, the community has developed several useful metrics:

  • Intelligibility - measured by speech recognition error rates on synthetic speech
  • Naturalness - predicted using models trained on human naturalness ratings
  • Similarity - for voice cloning systems, measured by cosine similarity between speaker embeddings of reference and generated speech
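
These metrics can typically be computed with off-the-shelf components. The snippet below is a minimal sketch of the intelligibility and similarity metrics, assuming Whisper (via transformers) for transcription, jiwer for word error rate, and Resemblyzer for speaker embeddings; it illustrates the general recipe rather than the exact models or implementation used in speech_gen_eval.

```python
# Illustrative sketch only, not speech_gen_eval's actual implementation.
# Assumed dependencies: transformers, jiwer, resemblyzer, numpy.
import numpy as np
from jiwer import wer
from resemblyzer import VoiceEncoder, preprocess_wav
from transformers import pipeline

# ASR model for intelligibility; any reasonably strong recognizer works.
asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")
# Speaker encoder for similarity between reference and generated speech.
encoder = VoiceEncoder()


def intelligibility(generated_wav: str, reference_text: str) -> float:
    """Word error rate of the ASR transcript of the generated audio (lower is better)."""
    hypothesis = asr(generated_wav)["text"]
    return wer(reference_text.lower(), hypothesis.lower())


def speaker_similarity(reference_wav: str, generated_wav: str) -> float:
    """Cosine similarity between speaker embeddings of reference and generated speech."""
    ref = encoder.embed_utterance(preprocess_wav(reference_wav))
    gen = encoder.embed_utterance(preprocess_wav(generated_wav))
    return float(np.dot(ref, gen) / (np.linalg.norm(ref) * np.linalg.norm(gen)))

# Naturalness is typically predicted with a pretrained MOS model (e.g., a UTMOS-style
# predictor), omitted here to keep the sketch short.
```

Note that text normalization (casing, punctuation) strongly affects word error rate, so real evaluation pipelines normalize transcripts more carefully than the simple lowercasing shown here.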

We would like to introduce:

speech_gen_eval (github)

An open-source library for objective evaluation of speech generation models

speech_gen_eval_testsets (huggingface)

A collection of test sets for evaluating speech generation models

speech_gen_baselines (huggingface)

A dataset containing synthetic speech from common models and their evaluation results

TTSLeaderboard (huggingface)

A leaderboard for speech generation models

There are other tools for objective evaluation of speech generation, including ZS-TTS-Evaluation, seed-tts-eval, tts-scores, evaluate-zero-shot-tts, and another leaderboard called TTSDS Scores. Our contribution focuses on:

  • Simple addition of new metrics and test sets
  • Support for various kinds of speech generation models (vocoders, zero-shot TTS, zero-shot VC, etc.), so the same tooling can be reused across them
  • Mature engineering with easy installation and fast evaluation
  • Listening to the generated speech side-by-side, to relate the metrics to perceived quality

After using most of these tools internally for a while, we are now making them publicly available.

We plan to expand the leaderboard with more models and add new metrics to the library. We welcome your suggestions and contributions!
