Posts

2025

Speech Generation Evaluation and Leaderboard

1 minute read

As speech generation research grows, many new models have emerged. For a comprehensive list, see open_tts_tracker. Similar to LLM evaluation, TTSArena (and i...

Tracing mHuBERT

1 minute read

Clustering of self-supervised embeddings is commonly used as “semantic” tokens in audio generation. Typically wav2vec 2.0 or HuBERT outputs are used. A 95M-p...

Super-resolution for TTS data

2 minute read

Zero-shot speech generation requires massive amounts of data. Large-scale speech datasets, however, are commonly collected for ASR and therefore sampled at 16k...

2024

Streaming Inference with Convolutional Layers

14 minute read

In this post, we explore how to apply convolutional layers to infinitely long inputs, specifically focusing on how to process inputs in chunks to minimize la...

2023

Dissecting BARK

5 minute read

Things started to get stale after the ubiquitous switch to Neural Text-to-Speech. A long-awaited leap forward was introduced thanks to ideas from the blossom...

Zero-shot speech generation benchmark

3 minute read

Synthesizing speech with a speaker identity not seen during training presents a significant challenge. Traditionally, achieving this required extensive train...

Ukrainian language in Balacoon

1 minute read

Fast, convenient, and high-quality neural synthesis of Ukrainian speech is now available in Balacoon. Integrating a synthesis library has never been this simple: Pyth...

Balacoon TTS on-device

3 minute read

Neural text-to-speech brought unprecedented improvements in the naturalness of synthetic speech. But it came at a cost. While parametric and concatenative ...

Balacoon TTS as a service

2 minute read

In recent years, text-to-speech technology has made tremendous strides, thanks in large part to advances in machine learning and artificial intelligence. As ...

Balacoon TTS version 0.1.0

2 minute read

We’re excited to announce the release of Balacoon TTS 0.1.0, the latest version of our text-to-speech package. This new version includes two major updates th...

Balacoon phonemeset

5 minute read

Text-to-speech assumes the implicit or explicit conversion of input text into a sequence of sounds to be pronounced. Defining a set of all possible sounds (o...

en-US abbreviation detection

10 minute read

Detecting abbreviations is crucial for proper text normalization and subsequent pronunciation generation. In broad terms, “abbreviation” means shortening, co...