Balacoon

Streaming Inference with Convolutional Layers

2024-04-20T00:00:00+00:00

In this post, we explore how to apply convolutional layers to infinitely long inputs, specifically focusing on how to process inputs in chunks to minimize latency. For instance, in text-to-speech applications, instead of synthesizing an entire sentence at once, we prefer to generate and play back audio in segments. While recurrent or autoregressive networks are inherently causal and thus well-suited for streaming processing, convolutional layers present more challenges and require careful handling.

Conv1d

First, let’s examine a standard convolutional layer. By default, convolutions are non-causal, meaning the output at any given time may depend on both past and future input values.

Non-causal convolution

To achieve output of the same size as the input, we pad the input on both sides by the receptive_field of the convolution layer, defined as kernel_size // 2:

import torch

x = torch.randn(1, 1, 100)  # (batch x channels x time)
kernel_size = 7
receptive_field = kernel_size // 2
non_causal_conv_layer = torch.nn.Conv1d(
    1,  # input channels
    1,  # output channels
    kernel_size,
    bias=False,
    padding=receptive_field,
)

y = non_causal_conv_layer(x)
assert x.shape == y.shape
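The padding choice can be sanity-checked with the usual length formula for a 1-D convolution with stride 1 and no dilation, L_out = L_in + 2 * padding - kernel_size + 1. The helper below is a minimal sketch in plain Python; the name conv1d_output_length is ours and is not part of PyTorch.

```python
def conv1d_output_length(input_length: int, kernel_size: int, padding: int) -> int:
    """Output length of a 1-D convolution with stride 1 and dilation 1."""
    return input_length + 2 * padding - kernel_size + 1

# odd kernel: padding by kernel_size // 2 keeps the length unchanged
same_length = conv1d_output_length(100, kernel_size=7, padding=7 // 2)

# even kernel: the same rule yields one extra output sample
even_kernel_length = conv1d_output_length(100, kernel_size=8, padding=8 // 2)
```

Symmetric padding by kernel_size // 2 therefore preserves the length only for odd kernel sizes.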

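For chunked inference, padding must be applied manually. Each chunk spans chunk_size + 2 * receptive_field input samples, and the window is shifted by chunk_size for each subsequent chunk, so consecutive chunks overlap by 2 * receptive_field.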

Non-causal convolution in chunks

This can be implemented as follows:

non_causal_chunk_conv_layer = torch.nn.Conv1d(
    1,  # input channels
    1,  # output channels
    kernel_size,
    bias=False,
    padding=0,  # we will do padding manually
)
# copy the weights from the original conv layer
non_causal_chunk_conv_layer.weight = non_causal_conv_layer.weight
# pad the input by receptive field on both sides
padded_x = torch.nn.functional.pad(x, (receptive_field, receptive_field))

# run inference in a loop on chunk_size
chunk_outputs = []
chunk_size = 20
i = 0
while i < padded_x.size(2) - 2 * receptive_field:
    chunk = padded_x[:, :, i: i + chunk_size + 2 * receptive_field]
    chunk_outputs.append(
        non_causal_chunk_conv_layer(chunk)
    )
    i += chunk_size
chunked_y = torch.cat(chunk_outputs, 2)
assert chunked_y.shape == y.shape
assert torch.all(chunked_y == y)

If you have a stack of convolutional layers, their receptive fields simply add up, but the method remains the same.
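To illustrate how receptive fields add up, here is a small sketch in plain Python with integer toy kernels, so that the comparison is exact. The helpers conv1d_valid and run_stack are hypothetical names introduced for this example. Both the full and the chunked run use the same input, padded once by the total receptive field; this is an assumption of the sketch and differs at the edges from a stack in which every layer pads its own input.

```python
from typing import List

def conv1d_valid(signal: List[int], kernel: List[int]) -> List[int]:
    """Naive cross-correlation without padding (like Conv1d with padding=0)."""
    k = len(kernel)
    return [
        sum(signal[t + j] * kernel[j] for j in range(k))
        for t in range(len(signal) - k + 1)
    ]

def run_stack(signal: List[int], kernels: List[List[int]]) -> List[int]:
    """Applies the unpadded layers one after another."""
    for kernel in kernels:
        signal = conv1d_valid(signal, kernel)
    return signal

kernels = [[1, 2, 1], [1, 0, -1, 0, 1]]  # toy kernels of size 3 and 5
receptive_field = sum(len(k) // 2 for k in kernels)  # 1 + 2 = 3 per side

x = list(range(40))
padded_x = [0] * receptive_field + x + [0] * receptive_field
full_y = run_stack(padded_x, kernels)

# chunked run: same slicing rule as for a single layer
chunk_size = 10
chunked_y = []
for i in range(0, len(x), chunk_size):
    chunk = padded_x[i: i + chunk_size + 2 * receptive_field]
    chunked_y.extend(run_stack(chunk, kernels))
```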

Causal Conv1d

For online processing (such as live denoising or voice conversion), latency is influenced by both chunk_size and the receptive_field of the convolutional kernel on the right, also known as lookahead. While chunk size is adjustable, the receptive field is limited by the architecture. To reduce latency, one should aim to design a convolution with an asymmetrical receptive field. In the extreme case, with no lookahead, this results in a causal convolutional layer:

Causal convolution

This is achieved by asymmetrically padding the convolution, padding only on the left by kernel_size - 1:

causal_conv_layer = torch.nn.Conv1d(
    1,  # input channels
    1,  # output channels
    kernel_size,
    bias=False,
    padding=0,  # need to do padding manually for asymmetric case
)
padded_x = torch.nn.functional.pad(x, (kernel_size - 1, 0))

y = causal_conv_layer(padded_x)
assert x.shape == y.shape

Inference in chunks does not differ significantly from a regular convolution, except that there is only one receptive field located on the left of the input.

# run inference in a loop on chunk_size
chunk_outputs = []
chunk_size = 20
i = 0
receptive_field = kernel_size - 1
while i < padded_x.size(2) - receptive_field:
    chunk = padded_x[:, :, i: i + chunk_size + receptive_field]
    chunk_outputs.append(
        causal_conv_layer(chunk)
    )
    i += chunk_size
chunked_y = torch.cat(chunk_outputs, 2)
assert chunked_y.shape == y.shape
assert torch.all(chunked_y == y)
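In a live setting the whole padded input is not available up front, so the left context has to be carried from one call to the next. The class below is a minimal sketch of that idea in plain Python; StreamingCausalConv is a hypothetical helper, not part of PyTorch, and it uses integer toy values so that the comparison is exact.

```python
from typing import List

class StreamingCausalConv:
    """Toy causal convolution that carries its left context between calls."""

    def __init__(self, kernel: List[int]):
        self.kernel = list(kernel)
        # left context starts as zeros, which plays the role of the left padding
        self.cache = [0] * (len(self.kernel) - 1)

    def process(self, chunk: List[int]) -> List[int]:
        k = len(self.kernel)
        buf = self.cache + list(chunk)
        out = [
            sum(buf[t + j] * self.kernel[j] for j in range(k))
            for t in range(len(chunk))
        ]
        # remember the last kernel_size - 1 samples for the next call
        self.cache = buf[len(buf) - (k - 1):]
        return out

kernel = [1, -2, 3, 1, 0, 2, -1]  # kernel_size = 7
x = list(range(50))

# reference: one call on the whole signal
full_y = StreamingCausalConv(kernel).process(x)

# streaming: chunks of arbitrary, even uneven, sizes
stream = StreamingCausalConv(kernel)
streamed_y = []
for start, stop in [(0, 20), (20, 23), (23, 50)]:
    streamed_y.extend(stream.process(x[start:stop]))
```

Because the cache always holds kernel_size - 1 samples, the chunk size can change from call to call without affecting the result.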

Transposed Conv1d

In audio or image processing, low-dimensional latent representations often need to be upsampled back to samples or pixels. This is achieved through transposed convolution with strides. A detailed explanation of this can be found in a blogpost on the topic. In short, each input point expands into multiple output points. The stride determines the degree of upsampling performed by the transposed convolution, and the kernel size is usually set so that kernel_size = stride * 2 to prevent checkerboard artifacts. With this choice, two neighboring input points contribute to each output point. Padding in this case actually reduces the number of output points at the edges, ensuring that stride * len(input) output points are produced.

Transposed convolution with stride

import torch

x = torch.randn(1, 1, 100)  # (batch x channels x time), same shape as before

upsample_rate = 4
kernel_size = upsample_rate * 2 + upsample_rate % 2
padding = (kernel_size - upsample_rate) // 2

transposed_conv_layer = torch.nn.ConvTranspose1d(
    in_channels=1,
    out_channels=1,
    kernel_size=kernel_size,
    stride=upsample_rate,
    padding=padding,
    bias=False,
)

y = transposed_conv_layer(x)  # (1, 1, 400)
print(y.shape)
assert y.shape == (x.size(0), x.size(1), x.size(2) * upsample_rate)

Running transposed convolution in chunks is similar to regular convolution: edges of the output are trimmed, input is padded, and inference is performed on overlapping chunks.

Transposed convolution with stride in chunks

Computing parameters for streaming inference differs from regular convolution:

# we will run inference with overlap,
# which needs to be taken into account
# in the slicing
extra_samples = (kernel_size - upsample_rate) * 3 // 2 - upsample_rate % 2  # how much extra output samples on the left and right
transposed_chunk_conv_layer = torch.nn.ConvTranspose1d(
    in_channels=1,
    out_channels=1,
    kernel_size=kernel_size,
    stride=upsample_rate,
    padding=extra_samples,
    bias=False
)
transposed_chunk_conv_layer.weight = transposed_conv_layer.weight

chunk_outputs = []
chunk_size = 20
i = 0
# each output contributed by 2 inputs, so overlap is 1
overlap = kernel_size // (2 * upsample_rate)
# need to pad so edges are handled correctly,
# this padding is taken into account in slicing
padded_x = torch.nn.functional.pad(x, (overlap, overlap))
while i < padded_x.size(2) - 2 * overlap:
    chunk = padded_x[:, :, i: i + chunk_size + 2 * overlap]
    res = transposed_chunk_conv_layer(chunk)
    chunk_outputs.append(res)
    i += chunk_size
chunked_y = torch.cat(chunk_outputs, 2)
assert chunked_y.shape == y.shape
assert torch.all(chunked_y == y)
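The parameter choices above can be checked without running any model, using the output length formula of a transposed convolution with dilation 1 and no output padding: L_out = (L_in - 1) * stride - 2 * padding + kernel_size. The sketch below is plain Python; the helper names are ours, and the three upsampling rates are arbitrary examples.

```python
def conv_transpose1d_output_length(
    input_length: int, kernel_size: int, stride: int, padding: int
) -> int:
    """Output length of a transposed convolution (dilation=1, output_padding=0)."""
    return (input_length - 1) * stride - 2 * padding + kernel_size

def upsampling_params(upsample_rate: int):
    """Mirrors the parameter choices used in the snippets above."""
    kernel_size = upsample_rate * 2 + upsample_rate % 2
    padding = (kernel_size - upsample_rate) // 2
    extra_samples = (kernel_size - upsample_rate) * 3 // 2 - upsample_rate % 2
    overlap = kernel_size // (2 * upsample_rate)
    return kernel_size, padding, extra_samples, overlap

chunk_size = 20
lengths = {}
for rate in (4, 5, 64):
    kernel_size, padding, extra_samples, overlap = upsampling_params(rate)
    # whole input of 100 points with the regular padding
    whole = conv_transpose1d_output_length(100, kernel_size, rate, padding)
    # one chunk with overlap on both sides and the larger streaming padding
    chunk = conv_transpose1d_output_length(
        chunk_size + 2 * overlap, kernel_size, rate, extra_samples
    )
    lengths[rate] = (whole, chunk)
```

In every case the whole input yields 100 * rate outputs and each chunk yields chunk_size * rate outputs, which is what makes the concatenation line up.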

Fourier transform

Many image and audio processing techniques still incorporate elements from classical signal processing. For audio, it’s common to extract a spectrogram to downsample the redundant audio signal while preserving the most relevant information. During training, this can be achieved using torch.stft. When deploying the model, however, there are challenges in tracing this operation across different CPU and GPU precisions. A workaround involves reformulating spectrogram extraction as a convolution with strides. This approach is already implemented in nnAudio. Here, the STFT is executed with a precomputed convolution where the kernel size matches the number of FFT points and the stride equals the hop size between windows.

Extracting a spectrogram looks like this:

import torch
from nnAudio.features.stft import STFT

win_length = 1024
downsample_rate = 320
stft = STFT(
    n_fft=win_length,
    win_length=win_length,
    hop_length=downsample_rate,
    # disabling padding
    # https://github.com/KinWaiCheuk/nnAudio/blob/9e9a4bad230d175f7ad541309829483f1274a3e5/Installation/nnAudio/features/stft.py#L278
    center=False,
    output_format="Magnitude",
    pad_mode="constant"
)

total_frames = 30
total_samples = win_length + (total_frames - 1) * downsample_rate
x = torch.randn(1, total_samples)
y = stft(x)
assert y.size(2) == total_frames

When computing the spectrogram in chunks, the same approach is applied as with causal convolution:

chunk_size = 5
chunk_size_samples = chunk_size * downsample_rate
# overlap between the frames
receptive = win_length - downsample_rate
start = 0
chunked_y_lst = []
while start <= x.size(1) - chunk_size_samples - receptive:
    chunk = x[:, start:start + chunk_size_samples + receptive]
    chunked_y_lst.append(stft(chunk))
    start += chunk_size_samples
chunked_y = torch.cat(chunked_y_lst, dim=2)
assert chunked_y.shape == y.shape
assert torch.all(torch.abs(chunked_y - y) < 1e-3)
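The slicing above relies on simple frame arithmetic, which can be verified on its own. The following sketch assumes no centering or padding, as in the snippets above; num_frames and samples_for_frames are hypothetical helper names.

```python
def num_frames(num_samples: int, win_length: int, hop_length: int) -> int:
    """Number of STFT frames without centering or padding."""
    return (num_samples - win_length) // hop_length + 1

def samples_for_frames(frames: int, win_length: int, hop_length: int) -> int:
    """Shortest input that yields exactly `frames` frames."""
    return win_length + (frames - 1) * hop_length

win_length, hop_length = 1024, 320
receptive = win_length - hop_length  # 704 samples shared by neighbouring chunks

total_samples = samples_for_frames(30, win_length, hop_length)
chunk_frames = 5
chunk_samples = chunk_frames * hop_length + receptive

# chunk start positions, shifted by a whole number of hops
starts = list(range(0, total_samples - chunk_samples + 1, chunk_frames * hop_length))
frames_per_chunk = [num_frames(chunk_samples, win_length, hop_length) for _ in starts]
```

Six chunks of five frames each cover the 30 frames of the full spectrogram exactly.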

Inverse Fourier transform

The inverse Fourier transform is surprisingly more complex. Let’s revisit the audio example to understand why. Overlapping frames create interesting patterns that influence which frames affect which samples in the output.

Overlapping frames in the Inverse Fourier transform

In the illustration above, a chunk of 6 frames is shown with framing parameters of n_fft = 1024 and hop_length = 320. Since n_fft % hop_length != 0, the number of frames that affect the output samples varies between 3 and 4. For the edges of the input, it is fewer, and these regions should be considered the receptive field.
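The coverage pattern can be reproduced by counting, for every output sample, how many frames overlap it. This is a small sketch in plain Python under the framing parameters quoted above; the left and right trims are the ones used in the chunked iSTFT code further down.

```python
n_fft, hop_length, frames = 1024, 320, 6
total_samples = n_fft + (frames - 1) * hop_length

# coverage[t] = number of frames whose window contains output sample t
coverage = [0] * total_samples
for frame in range(frames):
    start = frame * hop_length
    for t in range(start, start + n_fft):
        coverage[t] += 1

# edge regions that are trimmed from every chunk
left = n_fft - n_fft % hop_length   # 960
right = n_fft - hop_length          # 704
interior = coverage[left: total_samples - right]
```

Inside the trimmed region every sample is covered by 3 or 4 frames, while the first and last samples of the chunk are covered by a single frame.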

Just as before, we start by running the inverse Short-Time Fourier Transform (iSTFT) on the entire input:

import torch
from nnAudio.features.stft import iSTFT

win_length = 1024
upsample_rate = 320
istft = iSTFT(
    n_fft=win_length,
    win_length=win_length,
    hop_length=upsample_rate,
    center=False,
)

total_frames = 100
x = torch.randn(1, win_length//2 + 1, total_frames, 2)
y = istft(x, onesided=True)

receptive = win_length - upsample_rate
# to have an even upsampling, we should slice half of the receptive.
# this leaves some edge effects however 
y_padded = y[:, receptive // 2:-receptive // 2]
assert y_padded.size(1) == total_frames * upsample_rate
# keeping only output that doesn't have edge effects,
# we need to slice off entire receptive field
y = y[:, receptive:-receptive]

Running in chunks requires manual slicing of the iSTFT output, to keep only the regions that are free of boundary effects:

chunked_y_lst = []
start = 0
chunk_size = 5
overlap = int(win_length / upsample_rate)
while start <= total_frames - chunk_size:
    chunk = x[:, :, start:start + chunk_size + overlap]
    chunk_out = istft(chunk, onesided=True)
    left = win_length - win_length % upsample_rate
    right = receptive
    chunk_out = chunk_out[:, left:-right]

    chunked_y_lst.append(chunk_out)
    start += chunk_size

chunked_y = torch.cat(chunked_y_lst, dim=1)
# some of the original output is lost
lost = upsample_rate - win_length % upsample_rate
y_with_lost = y[:, lost:chunked_y.size(1) + lost]

assert torch.mean(torch.abs(chunked_y - y_with_lost)) < 1e-5
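The trims in the loop above are chosen so that every full chunk contributes exactly chunk_size * hop_length samples. The sketch below checks this length identity in plain Python; it only covers lengths, not the numerical values, and the helper names are ours.

```python
def istft_length(frames: int, n_fft: int, hop_length: int) -> int:
    """Number of samples produced by overlap-add of `frames` frames (no centering)."""
    return n_fft + (frames - 1) * hop_length

def trimmed_chunk_length(chunk_size: int, n_fft: int, hop_length: int) -> int:
    """Samples kept from one chunk after removing both edge regions."""
    overlap = n_fft // hop_length
    left = n_fft - n_fft % hop_length
    right = n_fft - hop_length
    return istft_length(chunk_size + overlap, n_fft, hop_length) - left - right

kept = trimmed_chunk_length(5, n_fft=1024, hop_length=320)

# the identity holds for other framing parameters as well
other_configs = [(1024, 256), (512, 160), (400, 100)]
kept_others = [trimmed_chunk_length(5, n, h) for n, h in other_configs]
```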

Putting it all together (Transposed Conv)

Let’s integrate everything and examine how layers might interact in a typical audio-to-audio stack, where audio is first downsampled to a latent representation and then upsampled back. The model might look something like this:

from typing import List
import torch
from nnAudio.features.stft import STFT

def create_conv_stack(kernels: List[int], in_channels: int = 1) -> torch.nn.Sequential:
    """
    Creates a dummy convolutional stack
    """
    lst = []
    for i, k in enumerate(kernels):
        ic = in_channels if i == 0 else 1
        lst.append(
            torch.nn.Conv1d(
                ic,  # input channels
                1,  # output channels
                k,
                bias=False,
                padding=0,
            )
        )
    return torch.nn.Sequential(*lst)

def create_transpose_conv(upsample_rate: int) -> torch.nn.ConvTranspose1d:
    """
    Creates dummy transposed convolutional layer that upsamples the input signal
    by given ratio
    """
    kernel_size = upsample_rate * 2 + upsample_rate % 2
    extra_samples = (kernel_size - upsample_rate) * 3 // 2 - upsample_rate % 2
    return torch.nn.ConvTranspose1d(
        in_channels=1,
        out_channels=1,
        kernel_size=kernel_size,
        stride=upsample_rate,
        padding=extra_samples,
        bias=False
    )

"""
Finally the model, which is a stack of
STFT -> conv_stack -> in_conv -> upsampling -> conv_stack -> upsampling -> conv_stack -> out_conv
"""
model = torch.nn.Sequential(
    STFT(
        n_fft=1024,
        win_length=1024,
        hop_length=320,
        center=False,  # disables the padding
        output_format="Magnitude",
        pad_mode="constant"
    ),
    create_conv_stack([5, 5, 5], 513),
    create_conv_stack([7]),
    create_transpose_conv(5),
    create_conv_stack([3, 5, 11]),
    create_transpose_conv(64),
    create_conv_stack([3, 5, 11]),
    create_conv_stack([7]),
)

Given what we’ve learnt so far, let’s define the receptive field for each layer, to understand how much context on the left and on the right our model requires:

# define receptive fields of each layer in the stack
# for each layer, specify (left_receptive, right_receptive, resolution)
# notice how STFT and transposed conv layers change the resolution
receptives_with_resolutions = [
    (1024 - 320, 0, 1),  # STFT: requires (win_len - hop_size) on the left 
    ((5 - 1) * 3, 0, 320),  # conv_stack: 3 causal layers with receptive of (kernel_size - 1) 
    (7 // 2, 7 // 2, 320),  # in_conv: non-causal conv layer with symmetric receptive field of kernel_size // 2
    (1, 1, 320),  # transposed conv: receptive is defined by overlap = kernel_size // (2 * upsample_rate) == 1
    (2 + 4 + 10, 0, 64),  # conv_stack: 3 causal layers with varying kernel_size
    (1, 1, 64),  # transposed conv: different upsample rate, but it doesn't affect overlap
    (2 + 4 + 10, 0, 1),  # conv_stack: another 3 causal layers with varying kernel_size
    (7 // 2, 7 // 2, 1),  # out_conv: another non-causal conv layer with symmetric receptive field
]
# bring all to the same resolution (samples)
receptives = [(left * res, right * res) for left, right, res in receptives_with_resolutions]
# this is our overlap in case of chunked synthesis
left = sum([left for left, _ in receptives])  # 6931
# and this is our architectural latency, in case of online processing
right = sum([right for _, right in receptives])  # 1347
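To get a feeling for what these numbers mean in time, the totals can be converted to milliseconds. The sketch below repeats the summation in a small helper (total_receptive is a hypothetical name) and assumes a 16 kHz sample rate and a chunk of 10 hops; neither value is stated in this post, so treat both as placeholders.

```python
from typing import List, Tuple

def total_receptive(layers: List[Tuple[int, int, int]]) -> Tuple[int, int]:
    """Sums left and right receptive fields given as (left, right, resolution)."""
    left = sum(l * res for l, _, res in layers)
    right = sum(r * res for _, r, res in layers)
    return left, right

layers = [
    (1024 - 320, 0, 1), ((5 - 1) * 3, 0, 320), (7 // 2, 7 // 2, 320), (1, 1, 320),
    (2 + 4 + 10, 0, 64), (1, 1, 64), (2 + 4 + 10, 0, 1), (7 // 2, 7 // 2, 1),
]
left, right = total_receptive(layers)

sample_rate = 16000        # assumption: not stated in the post
chunk_samples = 320 * 10   # assumption: an example chunk of 10 hops

lookahead_ms = 1000 * right / sample_rate
chunk_ms = 1000 * chunk_samples / sample_rate
# a chunk can only be processed once the chunk and its lookahead have arrived
latency_ms = chunk_ms + lookahead_ms
```

Under these assumptions the lookahead alone costs about 84 ms, and the chunk duration comes on top of that.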

To confirm that we calculated the receptive field correctly, we can run the model on a dummy input and check the input/output dimensionality. The input length should be compatible with model downsampling. In our case, the input length should generate a whole number of outputs for the STFT layer. This is the case if input_length = hop_length * N + win_length.

input_length = 320 * 50 + 1024
x = torch.zeros(1, input_length)
y = model(x)
# if we calculated receptive field correctly,
# model should strip off left/right receptives
assert input_length - left - right == y.size(2)

From previous examples, for inference in chunks, we need to shift the input by chunk_size, which is the input size without receptive fields. In our case, the chunk_size is 320 * N + 1024 - left - right. For N==50, it is 8746. There is a problem, however. We can only shift the input by the stride of the downsampling layer(s), in our case by M * 320. For most architectures, there is no way to satisfy both requirements:

Shift by chunk_size
Shift by M * downsampling_stride

To overcome this issue, we’ll have to drop some extra samples from the output, in order to be able to do inference in chunks:

chunk_size = input_length - left - right  # 8746
extra_to_drop = chunk_size % 320  # 106
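The same bookkeeping can be wrapped into a small helper so that it is easy to repeat for other architectures. This is a sketch in plain Python; plan_chunks is a hypothetical name, and the numbers are the ones computed above.

```python
from typing import Tuple

def plan_chunks(input_length: int, left: int, right: int, stride: int) -> Tuple[int, int, int]:
    """
    Returns (chunk_size, extra_to_drop, shift):
    chunk_size    - valid outputs produced from one input window
    extra_to_drop - trailing outputs discarded so that the shift is a multiple of stride
    shift         - how far the input window moves between chunks
    """
    chunk_size = input_length - left - right
    extra_to_drop = chunk_size % stride
    shift = chunk_size - extra_to_drop
    return chunk_size, extra_to_drop, shift

plan = plan_chunks(320 * 50 + 1024, left=6931, right=1347, stride=320)
# share of each chunk's output that is computed and then thrown away
wasted_fraction = plan[1] / plan[0]
```

For this model a little over 1% of every chunk's output is discarded, which is the price of keeping the input shift aligned with the STFT hop.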

Now we are all set to check if inference in chunks works. As before, for a dummy input we will run inference on the whole input, and then on chunks of the input and compare the results.

# some random input
x = torch.randn((1, 100000))
# inference on the whole input
y = model(x)

# running inference on the first chunk
y_chunk_1 = model(x[:, :input_length])
# running inference on the second chunk, carefully shifting input
start = chunk_size - extra_to_drop
y_chunk_2 = model(x[:, start:(start + input_length)])
# now we need to slice off the `extra_to_drop` from both outputs
# (the output is batch x channels x time, so we slice the last dimension)
y_chunk_list = [y_chunk_1, y_chunk_2]
y_chunk_list = [chunk[:, :, :-extra_to_drop] for chunk in y_chunk_list]
y_chunk = torch.cat(y_chunk_list, dim=2)

# finally compare the chunked inference to the original one.
# we run inference only on two chunks, omitting handling the
# padding on the right for simplicity. So the comparison
# is done only for beginning of the output.
diff = y[:, :, :y_chunk.size(2)] - y_chunk
# should be the same
assert torch.all(torch.abs(diff) < 1e-3)

Putting it all together (iSTFT)

Now, let’s do the same, but replace the upsampling by transposed convolutions with the inverse STFT (iSTFT). Once again, we will verify that carefully computing the total receptive field of all layers allows us to run inference on chunks of input.

Model definition:

from typing import List
import torch
from nnAudio.features.stft import STFT, iSTFT

def create_conv_stack(kernels: List[int], in_channels: int, out_channels: int) -> torch.nn.Sequential:
    """
    Creates a dummy convolutional stack
    """
    lst = []
    for i, k in enumerate(kernels):
        ic = in_channels if i == 0 else out_channels
        lst.append(
            torch.nn.Conv1d(
                ic,  # input channels
                out_channels,  # output channels
                k,
                bias=False,
                padding=0,
            )
        )
    return torch.nn.Sequential(*lst)

class iSTFTWrap(torch.nn.Module):
    def __init__(self, n_fft, hop_length):
        super().__init__()
        self._istft = iSTFT(
            n_fft=n_fft,
            win_length=n_fft,
            hop_length=hop_length,
            center=False,
        )
        self._left = n_fft - n_fft % hop_length
        self._right = n_fft - hop_length
    
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # the input x is (batch x 1026 x frames)
        x = x.view(x.size(0), 2, x.size(1) // 2, x.size(2))  # (batch x 2 x 513 x frames)
        x = x.permute(0, 2, 3, 1)  # (batch x 513 x frames x 2)
        y = self._istft(x, onesided=True)
        return y[:, self._left:-self._right]

"""
Finally the model, which is a stack of
STFT -> conv_stack -> out_conv -> iSTFT
"""
model = torch.nn.Sequential(
    STFT(
        n_fft=1024,
        win_length=1024,
        hop_length=320,
        center=False,  # disables the padding
        output_format="Magnitude",
        pad_mode="constant"
    ),
    create_conv_stack([5, 5, 5], 513, 1),
    create_conv_stack([7], 1, 513 * 2),
    iSTFTWrap(
        n_fft=1024,
        hop_length=320,
    )
)

# define receptive fields of each layer in the stack
# for each layer, specify (left_receptive, right_receptive, resolution)
receptives_with_resolutions = [
    (1024 - 320, 0, 1),  # STFT: requires (win_len - hop_size) on the left 
    ((5 - 1) * 3, 0, 320),  # conv_stack: 3 causal layers with receptive of (kernel_size - 1) 
    (7 // 2, 7 // 2, 320),  # projection conv: non-causal conv layer with symmetric receptive field of kernel_size // 2
    (960, 0, 1),  # iSTFT: output is 704 samples longer than frames * hop_length, the wrapper trims 960 + 704, so the net loss is 960 (counted on the left)
]
# bring all to the same resolution (samples)
receptives = [(left * res, right * res) for left, right, res in receptives_with_resolutions]
# this is our overlap in case of chunked synthesis
left = sum([left for left, _ in receptives])  # 6464
# and this is our architectural latency, in case of online processing
right = sum([right for _, right in receptives])  # 960

Now, let’s compute the expected input/output lengths. Notice that extra_to_drop is zero this time because the chunk size is divisible by the hop_size.

input_length = 320 * 50 + 1024
chunk_size = input_length - left - right  # 9600
extra_to_drop = chunk_size % 320  # 0, nothing to drop

Finally, let’s confirm that running inference on chunks produces the same result as when processing the entire input.

# some random input
x = torch.randn((1, 100000))
# inference on the whole input
y = model(x)

# running inference on the first chunk
y_chunk_1 = model(x[:, :input_length])
# running inference on the second chunk, carefully shifting input
start = chunk_size - extra_to_drop
y_chunk_2 = model(x[:, start:(start + input_length)])
# now we need to slice off the `extra_to_drop` from both outputs
y_chunk_list = [y_chunk_1, y_chunk_2]
if extra_to_drop > 0:
    y_chunk_list = [chunk[:, :-extra_to_drop] for chunk in y_chunk_list]
y_chunk = torch.cat(y_chunk_list, dim=1)

y_trimmed = y[:, :y_chunk.size(1)]
# finally compare the chunked inference to the original one.
# we run inference only on two chunks, omitting handling the
# padding on the right for simplicity. So the comparison
# is done only for beginning of the output.
diff = y_trimmed - y_chunk
# should be the same
assert torch.mean(torch.abs(diff)) < 1e-5

Takeaways

In this post, we delved into how to perform streaming inference (a.k.a., inference in chunks) for models consisting of various convolutional layers. This insight is crucial for building online audio or image processing applications. It boils down to carefully computing the receptive field of the resulting model and managing overlap between input chunks. When multiple layers are combined, however, additional attention should be paid to feeding your model with input of proper length, which in turn requires more sophisticated input/output handling. Hopefully, the explanations and code snippets provided will help you navigate these challenges in your own architectural designs.

Dissecting BARK

2023-08-15T00:00:00+00:00

Things started to get stale after the ubiquitous switch to Neural Text-to-Speech. A long-awaited leap forward was introduced thanks to ideas from the blossoming image generation field. TorToiSe adopted techniques introduced in DALL-E[1] and simultaneously pushed frontiers of:

expressive speech synthesis
paralinguistic generation
voice cloning

These improvements became possible due to more powerful generative models and an unprecedented training scale. Instead of the traditional 20 hours of speech and models with 10-50M parameters, TorToiSe used thousands of hours and hundreds of millions of trainable parameters. It sparked a whole series of papers that developed the approach further, with AudioLM[2], VALL-E[3], and SPEAR-TTS[4] being the most prominent. In this post, we will explore the internals of BARK, an open-source implementation of this new speech synthesis paradigm.

Architecture

BARK follows the VALL-E architecture quite closely, utilizing a large autoregressive transformer decoder to operate on discrete speech representations.

Tokenization

Discrete speech representations or tokens are obtained from two pre-trained models - HuBERT[5] and Encodec[6]:

Tokenization in BARK

HuBERT is a self-supervised model that converts audio to discrete “semantic tokens.” The objective of the HuBERT training makes the extracted representations (quasi-)independent of speaker and prosody. Think of a pseudo-phoneme annotation at the frame level.

Encodec is a neural vocoder that works on multi-level discrete representations extracted from audio in an auto-encoding manner using vector quantization. Think of mel-spectrogram frames but discrete. Low-order representations are called “coarse tokens,” and higher-order ones are “fine tokens.”

Acoustic Modeling

By discretizing the speech into tokens, we reformulate the speech production task into predicting coarse and fine tokens from semantic tokens. Such a formulation allows us to use a state-of-the-art generative model: an autoregressive transformer decoder. The same kind of model powers ChatGPT. A lot of data is needed to train this beast: billions of tokens or thousands of hours of speech. This magnitude is only possible in a multi-speaker scenario, which requires conditioning the generative model on speaker identity. It is done via “prompting” (natural language processing parallels intensify), where the prompt is an utterance of the speaker that precedes the current one. The prompt carries information about speaker identity, recording conditions, and even some high-level prosody aspects, but not the actual content. Fine tokens just refine the acoustic information from coarse tokens and don’t need as powerful modeling. A transformer encoder (i.e. a parallel architecture) is used for fine token prediction to speed things up.

Acoustic modeling in BARK

Coarse and fine tokens have multiple levels. To plug them into the models, token sequences are simply flattened.
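Flattening can be pictured as interleaving the levels frame by frame and shifting each level into its own range of token ids. The sketch below is an illustration in plain Python under that assumption; the function names, the toy codebook size, and the exact ordering are ours, and the scheme used in BARK itself may differ in details.

```python
from typing import List

def flatten_levels(levels: List[List[int]], codebook_size: int) -> List[int]:
    """Interleaves token levels frame by frame, shifting level n by n * codebook_size."""
    flat = []
    for frame in zip(*levels):
        for level, token in enumerate(frame):
            flat.append(token + level * codebook_size)
    return flat

def unflatten_levels(flat: List[int], num_levels: int, codebook_size: int) -> List[List[int]]:
    """Inverse of flatten_levels."""
    return [
        [token - level * codebook_size for token in flat[level::num_levels]]
        for level in range(num_levels)
    ]

coarse = [[1, 2, 3], [4, 5, 6]]  # 2 levels x 3 frames, toy values
flat = flatten_levels(coarse, codebook_size=10)
restored = unflatten_levels(flat, num_levels=2, codebook_size=10)
```

The flattened sequence is num_levels times longer than the number of frames, which matters for the context length the model has to handle.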

Neural Frontend

Semantic tokens are extracted from the audio. For a text-to-speech task, they need to be predicted from the input text. It is done with yet another autoregressive transformer decoder. This task also requires a powerful generative model since it contributes to the resulting intonation. Converting text to semantic tokens is a sequence-to-sequence task, where correspondence is defined by durations of sounds and overall speech pace.

Neural Frontend in BARK

A sufficiently sizeable neural frontend has no problems generating semantic tokens for inputs in multiple languages or consistently annotated paralinguistics (laughs, breaths, gasps, etc.).

Performance

Generating speech with BARK is not fast. Let’s have a look into which components are the most demanding. Measurements are done on GPU, averaging inference time for multiple utterances of roughly 10 seconds each.

Model	Function	Parameters	Average Inference time, s
HuBERT	audio → semantic tokens	95M	0.035
Encodec Decoder	coarse/fine tokens → audio	15M	0.025
Text AR Transformer Decoder	text → semantic tokens	446M	11
AR Transformer Decoder	semantic tokens → coarse tokens	328M	45
Transformer Encoder	coarse tokens → fine tokens	319M	0.37

Flattening coarse/fine tokens requires AR Transformer Decoder to work on a very long context. This makes it the slowest component of the whole pipeline by far.
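A back-of-the-envelope calculation shows how quickly the context grows. The frame rate and the number of coarse levels below are assumptions made for illustration (they are not given in this post); only the 10-second utterance length comes from the measurements above.

```python
frame_rate_hz = 75        # assumption: codec frames per second
coarse_levels = 2         # assumption: number of coarse token levels
utterance_seconds = 10    # matches the ~10 s utterances measured above

frames = frame_rate_hz * utterance_seconds
flattened_tokens = frames * coarse_levels

# self-attention cost grows with the square of the sequence length,
# so flattening two levels makes attention roughly 4x more expensive
relative_attention_cost = (flattened_tokens / frames) ** 2
```

Under these assumptions a single 10-second utterance already requires generating 1500 coarse tokens one by one.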

Speeding things up

It’s not the first time slow autoregressive modeling has ruled out real-time speech generation. An autoregressive version of WaveNet[7] also puzzled the community back in the day with outstanding quality at the cost of extremely slow inference. But things got faster both with inference optimizations and modeling advances. The same applies in this case. For example, NaturalSpeech 2[8] proposes employing a parallel diffusion model instead of an autoregressive transformer decoder as a possible mitigation. We will have a brief look into possible inference optimizations.

RWKV

The quadratic complexity of attention fuels interest in so-called attention-free architectures. RWKV is a parallelizable RNN with the performance of a classical transformer and linear complexity with respect to context length. Here are some code samples that let us drag-race a dummy RWKV model and estimate the expected gains. Creating a dummy model of 350M parameters (from within RWKV-v4):

from src.model import GPT, GPTConfig
import torch
model = GPT(GPTConfig(12096, 1024, model_type="RWKV", n_layer=24, n_embd=1024))
device = torch.device("cuda")
model.to(device)
torch.save(model.state_dict(), "rwkv_gpt2_medium.pth")

Run generation with a dummy model using rwkv from pip:

import time
import os
os.environ['RWKV_JIT_ON'] = '1'
os.environ["RWKV_CUDA_ON"] = '1'

from rwkv.model import RWKV

model = RWKV(model='rwkv_gpt2_medium.pth', strategy='cuda fp16')
state = None
# warm up
for i in range(100):
    _, state = model.forward([100], state)
# measure generation of 1k tokens
state = None
start = time.time()
for _ in range(1000):
    _, state = model.forward([100], state)
print(time.time() - start)

It takes ~12 seconds, which is already a valuable improvement. The gain will become even more prominent with longer contexts and bigger models.

Faster Transformer

Faster Transformer is a library by NVIDIA which implements heavily optimized inference of Large Language Models. It runs in a dedicated docker container with custom CUDA kernels for particular models. Here is a small snippet of code to check out the performance on a 350M-parameter GPT model:

# pulling container and building FT for your GPU
docker pull nvcr.io/nvidia/pytorch:22.09-py3
nvidia-docker run -ti --shm-size 5g --rm nvcr.io/nvidia/pytorch:22.09-py3 bash
git clone https://github.com/NVIDIA/FasterTransformer.git
mkdir -p FasterTransformer/build
cd FasterTransformer/build
git submodule init && git submodule update
# from https://github.com/NVIDIA/FasterTransformer/issues/90 for RTX3090
cmake -DSM=86 -DCMAKE_BUILD_TYPE=Release -DBUILD_PYT=ON ..
make -j12

# pulling 350M GPT model
pip install -r ../examples/pytorch/gpt/requirement.txt
git clone https://huggingface.co/gpt2-medium
curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
apt-get install git-lfs
cd gpt2-medium && git lfs pull && cd ..
python ../examples/pytorch/gpt/utils/huggingface_gpt_convert.py -i gpt2-medium/ \
    -o ../models/huggingface-models/c-model/gpt2-medium -i_g 1
echo "hello world" > context.txt

# run generation on a GPU for 1k tokens
time CUDA_VISIBLE_DEVICES=1 python ../examples/pytorch/gpt/multi_gpu_gpt_example.py \
    --ckpt_path ../models/huggingface-models/c-model/gpt2-medium/1-gpu/ \
    --time --inference_data_type fp16 --tensor_para_size 1 --pipeline_para_size 1 \
    --beam_width 1 --top_k 1 --top_p 0 --temperature 1.0 --return_cum_log_probs 0 \
    --output_len 1000  --vocab_file gpt2-medium/vocab.json --merges_file gpt2-medium/merges.txt  \
    --max_batch_size 1 --min_length 1000 --lib_path lib/libth_transformer.so \
    --sample_input_file context.txt

It takes only 1.5 seconds, a mind-blowing speed up compared to the original performance.

References

[1] Hierarchical Text-Conditional Image Generation with CLIP Latents

[2] AudioLM: a Language Modeling Approach to Audio Generation

[3] Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

[4] Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision

[5] HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

[6] High Fidelity Neural Audio Compression

[7] WaveNet: A Generative Model for Raw Audio

[8] NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

Zero-shot speech generation benchmark

2023-07-31T00:00:00+00:00

Synthesizing speech with a speaker identity not seen during training presents a significant challenge. Traditionally, achieving this required extensive training on many speakers to ensure a continuous speaker space[1]. The most performant methods, such as RVC, still need minimal fine-tuning with ~10 minutes of target speaker data to achieve reasonable quality. However, approaches leveraging the power of big models are gaining momentum. For instance, Microsoft’s VALL-E[2] boldly claims to clone a speaker’s voice with just 3 seconds of speech as a reference. In this blog post, we aim to present a benchmark of voice conversion technologies, comparing Revoice to widely used zero-shot VC baselines.

Testsets

Typical evaluations of Voice Conversion systems rely on objective metrics collected from running conversion on unseen multi-speaker corpora. We design the evaluation to be insightful for the Revoice use-case. We use multi-speaker corpora as a source or input audio and a library of speakers from Revoice app as a target or reference audio. Input audio is derived from:

VCTK - classical voice conversion benchmark. Clean recordings, multiple accents.
DAPS corpus[3] - emulated mobile device recordings in various conditions. This dataset resembles the audio quality we obtain as a Voice Conversion service more closely.

Metrics

We measure three model-based objective metrics for the converted speech:

Speaker similarity: we measure a cosine distance between a latent speaker representation from converted speech and reference audio. We use ECAPA[4] speaker encoder by Speechbrain to extract speaker representation.
Speech intelligibility: we run speech recognition with Conformer-Transducer ASR model by NVIDIA on the converted speech and measure the Character Error Rate with respect to the transcription.
Naturalness: we use a pre-trained Mean Opinion Score estimator UTMOS[5] released by the authors.
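
The first two metrics reduce to simple formulas once the model outputs are available. Below is a minimal sketch, not the evaluation code used for this benchmark: it assumes speaker embeddings are already extracted as plain lists of floats and that the ASR output is already a string. The function names are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def character_error_rate(reference, hypothesis):
    """Levenshtein distance between the strings divided by the reference length."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (ref_char != hyp_char)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(reference)
```

In practice the embeddings would come from the speaker encoder and the hypothesis from the ASR model; both are left out here to keep the sketch self-contained.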

Baselines

We select two widely used systems as the baselines. Both are trained on a large number of speakers and are capable of zero-shot speech generation.

YourTTS[6] (from 2021) is a VITS architecture model with adjustments trained on VCTK + LibriTTS datasets. It uses an invertible normalizing flow to disentangle speaker identity from the spectrogram representation. A handy tutorial on how to run it can be found here.
BARK (from 2022) is a large (350M parameters) decoder-only transformer that generates speech from “semantic tokens.” Those are self-supervised representations extracted with HuBERT[7] that effectively disentangle content (semantics) and speaker characteristics. Running Voice Conversion with BARK is not straightforward, because the extraction of semantic tokens is not released: Suno.ai only provides prediction of semantic tokens from text. Fortunately, there are community-contributed semantic token extractors that are compatible with BARK. This addition makes it possible to create your own voice profiles and perform voice conversion by adjusting semantic tokens and voice profiles in this notebook.

The autoregressive transformer decoder in BARK is significantly slower than parallel conversion in YourTTS, but it has greater potential due to the model’s scalability.

Results

We present results of the evaluations in the tables below. Here is performance of the systems on VCTK:

Model	Naturalness(MOS↑)	Intelligibility(CER, %↓)	Similarity(inverted cosine distance↓)
no model	4.06	0.17	-
YourTTS*	3.21	1.08	0.613
BARK	3.49	2.58	0.692
Revoice	3.45	1.36	0.614

And performance on DAPS:

Model	Naturalness(MOS↑)	Intelligibility(CER, %↓)	Similarity(inverted cosine distance↓)
no model	2.39	2.755	-
YourTTS	2.08	26.7	0.655
BARK	2.85	14.77	0.738
Revoice	2.81	16.56	0.564

A small example of how the systems actually sound. For these inputs:

Source audio

Reference of target voice

The systems produce the following outputs:

YourTTS

BARK

Revoice

YourTTS shows excellent performance on VCTK but degrades significantly on more noisy inputs. BARK consistently delivers clean and intelligible audio, but the speaker similarity lags. Revoice competes with BARK in terms of naturalness and intelligibility while making a leap forward in terms of speaker similarity.

References

[1] Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis

[2] Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

[3] Can we Automatically Transform Speech Recorded on Common Consumer Devices in Real-World Environments into Professional Production Quality Speech? — A Dataset, Insights, and Challenges

[4] ECAPA-TDNN Embeddings for Speaker Diarization

[5] UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022

[6] YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone

[7] HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

* YourTTS uses VCTK in training, which might give slightly overly optimistic results.

Ukrainian language in Balacoon

2023-07-08T00:00:00+00:00

Fast, convenient, high-quality neural synthesis of Ukrainian speech is now available in Balacoon. Integrating a synthesis library has never been this simple: dependency-free Python packages for real-time generation on CPU, a Docker container capable of handling dozens of parallel requests on GPU, and the fastest on-device synthesizer, which allows real-time synthesis even on a RaspberryPi. And all of this is now freely available for the Ukrainian language under the MIT license.

Example:

Generate more examples in our online demo.

Release

We thank the Ukrainian speech synthesis community for creating, popularizing, and maintaining open datasets. Based on them, we built 2 models:

JETS - a standard multi-speaker model with a 24kHz sampling rate. It supports all available voices: Lada, Tetiana, and Mykyta. It is distributed in two variants:
- uk_ltm_jets_cpu.addon - for synthesis on CPU with the balacoon_tts Python package.
- uk_ltm_jets_gpu.addon - for the service in a Docker container using a GPU.
Light - a lightweight model with a 16kHz sampling rate for ultra-fast generation. It supports Tetiana's voice. Only a CPU variant is distributed: uk_tetiana_light_cpu.addon.

For text analysis, all models use espeak with an additional stress dictionary.

What is missing

It would be good to update the approach to text analysis, namely:

Build text normalization rules with Finite-State Transducers. Balacoon supports this technology and has an implementation for English. Such an approach is easier to maintain and extend by adding new rules.
Stress detection requires a solution with contextualized pronunciation generation[1],[2]. This approach is unfortunately not yet supported in Balacoon, but we hope to add a general solution that would be useful for all languages with homographs. As a temporary workaround, users can specify the desired stress with “acute accents”.

We also plan to add support for multilingual synthesis. Currently, the problem of generating Latin-script text is handled with simple rules. But a modern solution would be a synthesis system that supports many languages. Balacoon works with a unified phoneme set, which should simplify such a transition.

Support and feedback

Join our slack channel. Be sure to tell us how you use Balacoon, what works well, and what does not.

References

[1] SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation

[2] Homograph disambiguation with contextual word embeddings for TTS systems

Balacoon TTS on-device

2023-04-15T00:00:00+00:00

Neural text-to-speech brought unprecedented improvements in the naturalness of synthetic speech. But it came with a cost. While parametric and concatenative speech synthesis systems produce tens of seconds of audio in just 1 second of wall time (they deliver >10 xRT*) on a single CPU core, neural TTS requires way more computational power. You often need a GPU to provide compelling latency for responsive applications. Fortunately, when there is a will, there is a way. Let’s dive into on-device Neural TTS and see what Balacoon has to offer.

On-device Neural TTS recap

Several milestones of Neural TTS evolution are worth mentioning in this regard. Generating raw waveform is the most computationally expensive part of synthesis. WaveRNN[1] from Google pioneered real-time synthesis on CPU. The authors used sparsification (dropping most neural network weights) and subscaling (generating multiple samples simultaneously) to achieve remarkable results. Later LPCNet[2] brought these advances, as well as an idea of mixing signal processing with neural networks, to the public. And finally, in a trend of GAN-based vocoding overtaking the domain, MB-MelGAN[3] came forward by breaking the curse of auto-regressive waveform generation.

Acoustic features prediction was a less acute problem and down-scaled reasonably well. The most widely spread FastSpeech2[4] already has only 30M parameters and runs reasonably fast. And with LightSpeech[5], Microsoft has shown that it is possible to shrink it down to 2M parameters.

So once VITS[6] and JETS[7] paved the way to end-to-end speech synthesis, combining acoustic features prediction and vocoding, it was already clear that low resource end-to-end TTS is just around the corner. Indeed NIX-TTS[8] came into the game, squishing the whole Neural TTS backend into 5M parameters that run 0.5xRT on a Raspberry PI 3B.

Implementations available

While LPCNet is not so widely used anymore, it is worth mentioning because the implementation contains valuable engineering insights, such as sparsification and vectorization. TensorFlowTTS combines the aforementioned FastSpeech2 and MB-MelGAN in an Android example powered by TFLite. The Nix-TTS authors released their code and models. And lastly, there is Piper, which competes with Nix-TTS in terms of performance (also 5M-parameter models), but instead of distillation, it simply downscales the VITS architecture.

Introducing Light💡

We composed our own version of the lightweight TTS model called Light. It has fewer parameters compared to default JETS models. Therefore it compromises quality and multi-speaker, multi-lingual capabilities. It also delivers only 16kHz audio instead of 24kHz. On the other hand, it provides an order of magnitude faster synthesis on the CPU.

Degradation compared to the full-scale model on the held-out test set of the “92” Hi-Fi speaker:

Model	Naturalness (MOS↑)	Intelligibility (CER, %↓)
recordings	3.92	0.32
en_us_hifi_jets_cpu.addon	4.0	0.28
en_us_hifi92_light_cpu.addon	3.89	0.32

Synthesis speed on AMD Ryzen Threadripper 1950X:

Model/System	faster than real-time (xRT↑)
en_us_hifi_jets_cpu.addon	6.02
Piper (ljspeech)	29.15
en_us_hifi92_light_cpu.addon	50.86

Synthesis speed on Raspberry PI 3B with Cortex-A53:

Model/System	faster than real-time (xRT↑)
Piper (ljspeech)	1.13
en_us_hifi92_light_cpu.addon	2.33

You can try out en_us_hifi92_light_cpu.addon in our huggingface space and use it with balacoon_tts python package as described in a tutorial.

References

[1] Efficient Neural Audio Synthesis

[2] LPCNet: Improving Neural speech synthesis through linear prediction

[3] Multi-Band MelGAN: Faster waveform generation for high-quality Text-to-Speech

[4] FastSpeech 2: Fast and High-Quality End-to-End Text to Speech

[5] LightSpeech: Lightweight and fast Text-to-Speech with Neural Architecture Search

[6] Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

[7] JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech

[8] NIX-TTS: Lightweight and end-to-end Text-to-Speech via module-wise distillation

* There is a certain confusion around “xRT” (times real-time) terminology. Some people mean “how much audio is produced in one second of walltime”; others refer to “how much time it takes to synthesize one second of audio”. While the latter is generally more popular, I stick with the former because numbers like “30xRT” and “50xRT” are easier to comprehend and compare than “0.033xRT” and “0.02xRT”.
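
To make the two conventions concrete, here is a small sketch of both computations. The function names are made up for illustration; only the first convention is the one used in the tables of this post.

```python
def xrt_audio_per_second(audio_seconds, wall_seconds):
    """Convention used in this post: seconds of audio produced per second of wall time."""
    return audio_seconds / wall_seconds


def xrt_time_per_audio_second(audio_seconds, wall_seconds):
    """Alternative convention: wall time needed to synthesize one second of audio."""
    return wall_seconds / audio_seconds


# 30 seconds of audio synthesized in 1 second of wall time
print(xrt_audio_per_second(30.0, 1.0))                  # 30.0
print(round(xrt_time_per_audio_second(30.0, 1.0), 3))   # 0.033
```

The two numbers are reciprocals of each other, so converting a figure reported elsewhere only requires taking 1 / x.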

Balacoon TTS as a service

2023-03-20T00:00:00+00:00

In recent years, text-to-speech technology has made tremendous strides, thanks in large part to advances in machine learning and artificial intelligence. As a result, synthetic speech is now almost indistinguishable from human speech, and is being used in a variety of applications, from voice assistants to audiobooks.

However, while there are many cloud-based text-to-speech services available, (AWS Polly, Azure Text-to-speech, Google cloud Text-to-speech to name a few) these services can be expensive, and may not always be the best fit for every use case. That’s why we’re excited to announce the release of our new self-hosted text-to-speech service, which is available as a Docker image that you can spin up on a GPU instance.

With our self-hosted text-to-speech service, you can get state-of-the-art speech synthesis within your own infrastructure, without having to rely on cloud service providers. This can be especially useful for practitioners who need to power their app or service with synthetic speech in production, and who may have concerns about cost or security.

As the rest of the post delves into the internal workings of the service, we recommend taking a moment to review the usage documentation, which demonstrates how straightforward it is to establish a TTS endpoint.

How far 1 GPU can take you

This section aims to set expectations regarding the efficiency of Balacoon TTS, specifically in terms of how many users can be served using just one GPU to handle requests. Two primary metrics to consider are:

Latency - the amount of time a user must wait before obtaining the first chunk of audio.
Real-time factor (RTF) - the ratio of the duration of the synthesized audio to the time it took to produce it.

Configuring the endpoint involves finding a balance between these two metrics. Balacoon TTS server uses NVIDIA Triton Server internally, which enables batching of inference requests. The greater the number of requests that are batched and processed in parallel, the better the real-time factor will be. However, this comes at the cost of increased latency since processing more data in parallel requires more time. You have control over the maximum batch size to process, when you are launching the endpoint.

Balacoon TTS Service performance

It can be observed that beyond a certain point, increasing the batch size does not result in any significant increase in the amount of audio produced. In total, it is possible to generate 3.5 hours of speech in just 30 seconds, with each user starting to receive audio in as little as 100 milliseconds after the request. Check out the performance of classical combination of Tacotron2 and Waveglow for comparison.

There are other parameters that affect Latency/RTF, but these are hardcoded into the server and cannot be adjusted:

Chunk size - the amount of audio synthesized in a single processing unit. It is more efficient to synthesize larger chunks of audio, but this can increase latency. The chunk size for Balacoon TTS is set at 2 seconds.
Batching queue delay - the time to wait for the new requests before sending previously obtained ones as a batch. Balacoon TTS aggregates requests for 10ms.

Balacoon TTS version 0.1.0

2023-03-15T00:00:00+00:00

We’re excited to announce the release of Balacoon TTS 0.1.0, the latest version of our text-to-speech package. This new version includes two major updates that will significantly enhance its functionality.

We switch to the use of ONNX as the neural backend. It allowed us to drop the torch libraries and reduce the package size by a factor of 3, making it much more lightweight and easy to use. Using ONNX also provided a 1.4x speedup in synthesis speed.
We add streaming synthesis API for low latency applications. While streaming synthesis is generally 2x slower due to redundant computations, it allows for audio to be sent back to the user immediately after the first chunk is produced, making it ideal for real-time applications. You can find the usage example in the docs.

One caveat is that the updates required us to retrain the TTS models. So you will need to update both package and addons.

ONNX Runtime

ONNX Runtime is a powerful open-source engine that provides a universal neural backend for deploying and optimizing deep learning models trained with different frameworks. It simplifies the release of a library to different platforms (Windows, RaspberryPi, Android are in the roadmap) and allows for different optimizations. Additionally, it enables the export of models to even faster backends such as TensorRT, which we will explore in the future. At present, we plan to use ONNX as a backend for CPU inference, although there are still some unresolved issues to address, such as half-precision inference on CPU.

Streaming synthesis

Streaming speech synthesis is an important technology that enables real-time generation of speech while reducing perceived latency[1]. This approach to speech synthesis breaks down the process of speech generation into smaller chunks, allowing the system to produce and deliver audio output in near real-time. This is particularly important for applications where low latency is critical, such as voice assistants, interactive voice response (IVR) systems, and chatbots. While streaming speech synthesis offers a faster response time, it comes at the cost of overall inference speed, as the system is constantly generating small audio segments in real-time. Despite this, streaming synthesis remains essential for applications where real-time audio feedback is necessary.

Streaming synthesis in action

The picture above illustrates the operating principle of streaming synthesis. The process begins with a frontend that takes in textual input and sends it to an encoder, which processes the input at the phoneme level. The encoder then upsamples the phonemes to create frame-level representations. A decoder then slides across these frame-level representations, converting them into audio output one small chunk at a time. By breaking down the speech synthesis process into smaller pieces, the system can produce and deliver speech output in real-time, reducing latency and enabling applications that require fast response times.
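
The flow described above can be sketched in a few lines of plain Python. This is only an illustration of the control flow under simplifying assumptions, not the Balacoon implementation: `upsample`, `stream_decode` and `toy_decoder` are hypothetical names, the "decoder" is a stand-in that emits two samples per frame, and the overlapping context that a real decoder needs between chunks is omitted.

```python
def upsample(phoneme_encodings, durations):
    """Repeat each phoneme-level item according to its duration in frames."""
    frames = []
    for encoding, duration in zip(phoneme_encodings, durations):
        frames.extend([encoding] * duration)
    return frames


def stream_decode(frames, decode_chunk, chunk_size):
    """Slide over frame-level representations and yield audio chunk by chunk."""
    for start in range(0, len(frames), chunk_size):
        yield decode_chunk(frames[start:start + chunk_size])


def toy_decoder(chunk):
    """Stand-in decoder: pretend every frame becomes 2 audio samples."""
    return [0.0] * (2 * len(chunk))


frames = upsample(["h", "i"], durations=[3, 4])
chunks = list(stream_decode(frames, toy_decoder, chunk_size=5))
```

Because `stream_decode` is a generator, the caller can start playing the first chunk while later chunks are still being computed, which is where the latency reduction comes from.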

References

[1] High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency. arxiv

Balacoon phonemeset

2023-01-16T00:00:00+00:00

Text-to-speech assumes the implicit or explicit conversion of input text into a sequence of sounds to be pronounced. Defining a set of all possible sounds (or phonemes) for the language can spark quite a debate. Fortunately, neural speech synthesis is quite flexible and can tolerate almost any annotation as long as it is consistent. This post describes how the phoneme set for the Balacoon Frontend is composed.

Ideally, we prefer a single unified phoneme set for all the locales. It would reduce the mess with mapping phonemes to inputs of neural networks and make multi-lingual TTS easier, as a study shows [1]. The standard approach is to use the International Phonetic Alphabet (IPA) or its ASCII counterpart X-SAMPA. Of course, modern computers support Unicode and can deal with IPA, but editing pronunciation dictionaries by hand can be challenging. Other TTS frontends either take this path or maintain a mapping from a locale-specific phoneme set to X-SAMPA.

Basic phoneme set

To define a basic phoneme set, we go to the X-SAMPA wiki page and collect all the phonemes listed there. Some of them belong to exotic locales but let’s keep them just in case. Putting together all the pulmonic and non-pulmonic consonants, vowels, affricates, and coarticulated phonemes, we get 107 items. It might be incomplete since pronunciation dictionaries commonly introduce additional locale-specific phonemes, such as merged vowels. For example, for US English, one can typically find:

“aI” as in “price”
“aU” as in “flower”
“eI” as in “shade”
“OI” as in “choice”
“oU” as in “boat”

We make the questionable decision to keep those phonemes separate. It’s easier to merge phonemes rather than split them, and we would like to keep the phoneme set as narrow as possible. On the negative side, we must decide which of the two merged phonemes should get stress or modification marks, if any.

Allophones

Basic phonemes can have a great degree of variation in their pronunciation. Modifications of a phoneme are called allophones. Phonemes can be prolonged, nasalized, palatalized, etc. The number of variations is significant. Moreover, they can overlap. Fortunately, a particular modification can be applied only to a subset of phonemes. We compose a set of allophones based on espeak-ng and Google Cloud Text-to-Speech.

Both support multiple locales and will allow us to oversee all the possibilities.

: marks prolongation and can be applied to 41 phonemes.
~ marks nasalization and can be stacked with prolongation. It applies to 29 phonemes.
' or _j denotes palatalization, which can affect 21 consonants.
` indicates rhotacization, applied to 11 phonemes.
_d means dental consonants. It occurs for 9 phonemes.
_h marks aspiration, which appears for 13 phonemes.
_<, _>, _o, _" denote implosives, ejectives, lowered and centralized vowels. They did not occur in espeak-ng or Google docs, so we include only the 13 explicitly mentioned on the X-SAMPA wiki.

The wiki mentions other modifications too, but since they do not appear in TTS systems, we omit them to keep the size of the phoneme set manageable. Combining collected allophones with the basic phonemes, we get a set of 244 entries.
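
The total of 244 entries follows from the counts listed above, assuming each listed allophone is a separate entry added on top of the 107 basic phonemes. A quick check (the dictionary keys are just labels for this sketch):

```python
BASIC_PHONEMES = 107

# number of phonemes each modification applies to, as listed above
ALLOPHONE_COUNTS = {
    "prolongation": 41,
    "nasalization": 29,
    "palatalization": 21,
    "rhotacization": 11,
    "dental": 9,
    "aspiration": 13,
    "implosive/ejective/lowered/centralized": 13,
}

total_allophones = sum(ALLOPHONE_COUNTS.values())
phoneme_set_size = BASIC_PHONEMES + total_allophones
print(total_allophones, phoneme_set_size)  # 137 244
```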

Stress and tone

While stress and tone are also modifications of basic phonemes, we treat them differently. Both stress and tone apply to all vowels and their allophones. Adding them as-is would increase the size of the phoneme set immensely. Instead, we maintain stress and tone as a separate input stream for speech modeling.

Usually, stress marks apply to syllables. But since Balacoon Frontend does not perform syllabification, the nucleus of the syllable is marked instead:

" marks primary stress
% denotes secondary stress.

Both symbols are added in front of a phoneme. Tones are marked as:

_B for extra-low tone
_L for low tone
_F for falling tone
_B_L for low-rising tone
_M for mid-tone
_R_F for rising-falling tone
_R for rising tone
_H for high tone
_H_T for high-rising tone
_T for extra-high tone

Mapping phonemes to integers for neural speech production models

Neural TTS takes phoneme indices as inputs. Having a fixed phoneme set, we can define a unified mapping. That simplifies model deployment and makes combining datasets for multi-lingual text-to-speech easier. Apart from phonemes, it is also common to augment input with some service tokens, such as pauses, word boundaries, etc. We reserve the first 10 positions of the mapping for service tokens:

0 - end token: added at the end of the encoded sequence. This matches the default padding value for modeling frameworks. It is an artificial token, and it always has zero duration.
1 - start token: added at the beginning of the encoded sequence. Mirrors end token. It may have zero or non-zero duration.
2 - word boundary without pause. It is a token that signals the boundary between the words in fluent speech. It always has zero duration.
3 - word boundary with a pause. It is a token that indicates a break between two words. It always has a non-zero duration.
4 - falling tone sentence boundary. It is a token inserted on terminal punctuation, associated with falling intonation (. or !). It can happen at the end of an utterance, just before the end token, or in the middle if multiple sentences are synthesized together. Sentence boundary may or may not have silence associated with it.
5 - rising tone sentence boundary. Same as above, but inserted on terminal punctuation associated with rising intonation (?).

We reserve indices from 6 to 9 for any additional service tokens that might be helpful for Speech Synthesis. The phoneme set defined previously starts from index 10 and spans to 253. According to the definition above, we separately encode stress and tone in order, starting from index 1. Index 0 is reserved for phonemes without stress and for padding.
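
The index layout described above can be sketched as follows. This is an illustration under stated assumptions, not the code from learn_to_pronounce: the token names, the helper functions and the placeholder phoneme names are made up, and stress is given its own mapping here (tone would be built the same way).

```python
SERVICE_TOKENS = {
    "end": 0,
    "start": 1,
    "word_boundary": 2,
    "word_boundary_pause": 3,
    "sentence_boundary_falling": 4,
    "sentence_boundary_rising": 5,
}
FIRST_PHONEME_INDEX = 10  # indices 6..9 stay reserved for future service tokens


def build_phoneme_mapping(phonemes):
    """Assign consecutive indices to phonemes, starting after the service tokens."""
    return {p: FIRST_PHONEME_INDEX + i for i, p in enumerate(phonemes)}


def build_modifier_mapping(marks):
    """Stress or tone marks are encoded from 1; 0 means 'no mark' and padding."""
    return {m: i for i, m in enumerate(marks, start=1)}


# placeholder phoneme names standing in for the real 244-entry set
placeholder_phonemes = [f"ph{i}" for i in range(244)]
phoneme_to_id = build_phoneme_mapping(placeholder_phonemes)
stress_to_id = build_modifier_mapping(['"', "%"])
```

With 244 phonemes starting at index 10, the last phoneme lands on index 253, which matches the range given above.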

The complete phoneme set can be found within learn_to_pronounce. We tried out the freshly composed unified phoneme set by mapping CMUDict and its underlying ARPABET to it. It worked smoothly; notes on the process can be found here. We hope that pronunciation dictionaries in other locales won’t pose a difficulty either, which will pave the way to frictionless multi-lingual TTS.

References

[1] Sanchez, A., Falai, A., Zhang, Z., Angelini, O., & Yanagisawa, K. (2022). Unify and Conquer: How Phonetic Feature Representation Affects Polyglot Text-To-Speech (TTS). link

en-US abbreviation detection

2023-01-15T00:00:00+00:00

Detecting abbreviations is crucial for proper text normalization and subsequent pronunciation generation. In broad terms, “abbreviation” means shortening, contraction, initialism, or acronym. In this post, we will focus on initialisms - entities made up of initial letters of words, which are pronounced as separate letters. It is essential to detect those reliably in TTS Frontend because custom pronunciation generation is needed. Additionally, proper detection of initialisms reduces the ambiguity of identifying sentence boundaries.

Usually, initialisms are written in capital letters, where each letter is followed by a dot, for example “F.B.I.” It is also common to write initialisms without dots, i.e., “AI”, “USA” or “IBM”. Unfortunately, sometimes initialisms are written in lower case: “hr” or “www”.

As a result, the following disambiguation problems are typical:

Confusing with single-letter words. A simple case, since the set of such words is minimal (“I” and “a”).
Confusing with roman numerals. Even though roman numerals are somewhat rare, they are written as a sequence of capital letters from a certain set, which can easily be confused with initialisms.
Confusing with capitalized words. The most frequent case is when a capitalized word is recognized as an initialism while the capitalization is just an editing choice.
Confusing with acronyms. Acronyms follow the same patterns as initialisms, but they are not spelled out, e.g. “NATO” or “AIDS”.
Lower-case abbreviations. A rather rare situation, when an abbreviation is neither capitalized nor otherwise marked.

The typical solution for the overall problem of abbreviations is summarized in [1]:

A solid solution to the problem of upper-case tokens is to assume that in a well-developed system all genuine acronyms will appear in the lexicon. Upon encountering a capitalised token, we first check whether it is in the lexicon and, if so, just treat it as a normal word. If not, we then split the token into a series of single-character tokens and designate each as a letter. There is a possible alternative, where we examine upper-case tokens and attempt to classify them as either acronyms or letter sequences. Such a classification would be based on some notion of “pronounceability” such that if the token contains vowel characters in certain places then this might lead us to think that it is an acronym instead of a letter sequence. Experience has shown that this approach seldom works (at least in English) since there are plenty of potentially pronounceable tokens (e.g. USA and IRA) that sound absurd if we treat them as acronyms (e.g. /y uw s ax/ and /ay r ax/). If a mistake is to be made, it is generally better to pronounce an acronym as a letter sequence rather than the other way round.
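
The lexicon-first strategy from the quote can be sketched as below. The function name and the toy lexicon are hypothetical and only illustrate the decision; a real lexicon would be the full pronunciation dictionary of the system.

```python
def normalize_upper_token(token, lexicon):
    """Return the units to pronounce for an upper-case token."""
    if token.lower() in lexicon:
        # known to the lexicon (e.g. a genuine acronym): treat as a normal word
        return [token]
    # unknown: fall back to spelling it letter by letter
    return list(token)


toy_lexicon = {"nato", "aids"}
print(normalize_upper_token("NATO", toy_lexicon))  # ['NATO']
print(normalize_upper_token("IBM", toy_lexicon))   # ['I', 'B', 'M']
```

The fallback direction follows the quote: when in doubt, a token is read as a letter sequence rather than as a word.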

Simple cases

To narrow down the task, let’s first outline use cases that should be spelled letter by letter without any ambiguity.

[A-Za-z] character with the dot after it

This applies to both lower and upper case (both “f.b.i.” and “F.B.I.” are possible). We also assume that a single character with a dot after it should be spelled out too. It might be incorrect if the article “a” or the pronoun “I” is at the end of a sentence. But those cases are grammatically incorrect, so let’s assume they are initialisms, shortened first names, or list indices, i.e., they should be spelled out. Another potential issue is a single roman numeral at the end of a sentence. For those cases, let’s assume it’s an abbreviation unless it is in the list of exceptions for roman numerals, such as “Clement I.” This rule can be expressed by a simple acceptor [2]:

import pynini
from pynini.lib import pynutil
delete_dot = pynutil.delete(".")
dot_abbr = pynini.closure((UPPER | (LOWER @ TO_UPPER)) + delete_dot, 1)

consonants-only

When a word consists of only consonants (both upper- and lower-case), there is no way to read it other than spelling. The only gotcha is the letter “y,” which can act like a vowel, for example, “by.” The acceptor looks like this:

CONSONANTS = 'bcdfghjklmnpqrstvwxz'

def _any_element_lower(element_lst_):
    return pynini.union(*[pynini.accep(x.lower()) @ pynini.closure(TO_UPPER) for x in element_lst_])

def _any_element_upper(element_lst_):
    return pynini.union(*[pynini.accep(x.upper()) for x in element_lst_])

def _any_element(element_lst_):
    return _any_element_lower(element_lst_) | _any_element_upper(element_lst_)

consonant_abbr = pynini.closure(_any_element(CONSONANTS), 1)

vowels only

When a word contains only vowels, it should also be spelled out. However, a single letter is likely not an abbreviation (consider “I” or “U”). For upper case, it makes sense to treat any sequence of two or more vowels as an abbreviation (e.g. “AI” or “IEEE”). For lower case, there are two-letter vowel-only words (e.g. the interjection “oi”). So to be safe, a lower-case word should be considered an abbreviation only if it contains 3 or more vowels.

VOWELS = 'aeiou'

vowel_abbr = (
    pynini.closure(_any_element_upper(VOWELS), 2)
    | pynini.closure(_any_element_lower(VOWELS), 3)
)
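
To see which tokens the three simple rules accept, they can be approximated with regular expressions. This is a detection-only sketch that swaps the FST acceptors for `re` patterns: it does not upper-case letters or delete dots as the transducers do, and the names below are made up for illustration.

```python
import re

# one or more letters, each followed by a dot: "F.B.I.", "f.b.i.", "a."
DOT_ABBR = re.compile(r"(?:[A-Za-z]\.)+")
# consonants only, any casing; "y" is deliberately excluded
CONSONANT_ABBR = re.compile(r"[bcdfghjklmnpqrstvwxz]+", re.IGNORECASE)
# 2+ upper-case vowels or 3+ lower-case vowels
VOWEL_ABBR = re.compile(r"[AEIOU]{2,}|[aeiou]{3,}")


def is_simple_abbreviation(token):
    """True if the whole token matches one of the unambiguous patterns."""
    patterns = (DOT_ABBR, CONSONANT_ABBR, VOWEL_ABBR)
    return any(p.fullmatch(token) is not None for p in patterns)


for token in ["F.B.I.", "www", "hr", "AI", "IEEE", "by", "oi", "I", "NATO"]:
    print(token, is_simple_abbreviation(token))
```

Tokens such as “NATO” mix vowels and consonants, so none of the simple rules fire and they are left for the vocabulary-based step.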

Vocabulary-based

As specified in the quote from [1], for non-obvious cases, dictionary-based disambiguation is needed. Abbreviations without dots that have both vowels and consonants (both upper- and lower-case) can be confusing, and an abbreviation dictionary makes it possible to hot-fix issues. However, one may want to maintain multiple dictionaries.

Acronyms vocabulary

Acronyms are initialisms that are pronounced rather than spelled. In other words, the purpose of this vocabulary is to contain “negative” examples that shouldn’t be classified as initialisms and should be pronounced following g2p rules for common words. Typical examples are “NASA,” “NATO,” or “AIDS.”

Cased vocabulary

Some abbreviations should be treated as such only in a specific case. For example, “US” is a country, and “us” is a first-person plural pronoun in the objective case. At first glance, only upper case is expected in abbreviations, but there might be a mix of cases, for example, “mRNA.”

Uncased vocabulary

Some abbreviations are case-independent, and it doesn’t matter which case they are written in. They should be spelled. For example, “usa,” “uk” or “iq.”
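
One way the three vocabularies could be combined is shown below. The order of the checks (acronyms first, then cased, then uncased) is an assumption of this sketch, and the sets contain only the examples mentioned above.

```python
ACRONYMS = {"NASA", "NATO", "AIDS"}  # pronounced as words, never spelled
CASED = {"US", "mRNA"}               # spelled only in this exact casing
UNCASED = {"usa", "uk", "iq"}        # spelled regardless of casing


def should_spell(token):
    """Decide whether a token should be spelled letter by letter."""
    if token.upper() in ACRONYMS:
        return False
    return token in CASED or token.lower() in UNCASED


for token in ["US", "us", "Uk", "NATO", "mRNA"]:
    print(token, should_spell(token))
```

Tokens found in none of the vocabularies return False here and would be passed on to the remaining heuristics.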

Difficult-to-pronounce

When looking at a sequence of letters, even in an unknown word, it is possible to say if it can be pronounced or should be spelled. Maintaining a comprehensive dictionary of all abbreviations is very laborious work. Thus, it is worth having an additional heuristic that would capture at least the most apparent cases of abbreviations that are not in vocab.

To define what “difficult-to-pronounce” means, it is worth collecting a list of abbreviations and comparing it with a list of common words. One can even build a classifier having such datasets, but we would be looking to keep it simple to be able to integrate such a solution into FST-based text-normalization.

Datasets

Two datasets are composed and shared in this repository. One is slightly cleaner but significantly smaller; the other is much larger, but likely contains annotation errors.

#1: cmudict & wiki based

The abbreviation dataset is composed by merging 3 resources: cmudict acronyms [3]; acronyms parsed from wikipedia [4]; book of acronyms and initialisms [5]. Those resources are combined, excluding duplicates and regular words, to compose a dataset of 3.3k abbreviations. As negative examples, a list of 125k words from cmudict is used.

#2: kestrel based

Kestrel[6] is a Google text normalization system. By normalizing Wikipedia, a dataset for text-normalization research was composed [7]. The dataset is not perfectly clean and certainly contains both false alarms (regular words marked as abbreviations) as well as misses (abbreviations marked as regular words). However, the quantity of the data is massive, which may outweigh the annotation mistakes. To avoid the long tail of semantically unreasonable input, both abbreviations and words can be filtered by frequency. For example, taking only abbreviations/words that occur at least 4 times gives a dataset of 48k abbreviations and 688k common words.

Abbreviation n-grams

Without building a classifier, the easiest way to detect sequences of characters that are difficult to read is to count n-grams and check which are frequent for abbreviations but rare for common words. We can compose an FST acceptor, which will detect abbreviations that are not in any of the vocabularies. For both datasets, only n-grams that are never observed in common words are selected. Below are some examples to give a taste of what kind of letter sequences are abbreviation-specific. “^” means the start of a word; “$” means the end of it.
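
Before the examples, here is a minimal sketch of how such a selection could be done. It is not the script used to produce the lists: the function names and the toy data are made up, and lower-casing everything before counting is an assumption.

```python
from collections import Counter


def char_ngrams(word, n):
    """N-grams over the word wrapped in start (^) and end ($) markers."""
    padded = f"^{word.lower()}$"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def abbreviation_specific_ngrams(abbreviations, words, n, min_count=1):
    """N-grams seen in abbreviations at least min_count times and never in words."""
    abbr_counts = Counter(g for a in abbreviations for g in char_ngrams(a, n))
    word_ngrams = {g for w in words for g in char_ngrams(w, n)}
    return {
        g for g, count in abbr_counts.items()
        if count >= min_count and g not in word_ngrams
    }


toy_abbreviations = ["FBI", "IBM", "KGB"]
toy_words = ["fib", "big", "rim", "kit"]
specific = abbreviation_specific_ngrams(toy_abbreviations, toy_words, n=2)
print(sorted(specific))
```

On real data, `min_count` would be raised so that only n-grams that are actually frequent among abbreviations survive.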

2-grams examples