<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" ><generator uri="https://jekyllrb.com/" version="3.10.0">Jekyll</generator><link href="https://balacoon.com/feed.xml" rel="self" type="application/atom+xml" /><link href="https://balacoon.com/" rel="alternate" type="text/html" /><updated>2025-02-17T22:13:28+00:00</updated><id>https://balacoon.com/feed.xml</id><title type="html">Balacoon</title><subtitle>Text-to-speech tools and resources</subtitle><author><name>Balacoon</name></author><entry><title type="html">Speech Generation Evaluation and Leaderboard</title><link href="https://balacoon.com/blog/tts_leaderboard/" rel="alternate" type="text/html" title="Speech Generation Evaluation and Leaderboard" /><published>2025-02-16T00:00:00+00:00</published><updated>2025-02-16T05:20:02+00:00</updated><id>https://balacoon.com/blog/tts_leaderboard</id><content type="html" xml:base="https://balacoon.com/blog/tts_leaderboard/"><![CDATA[<p>As speech generation research grows, many new models have emerged.
For a comprehensive list, see <a href="https://huggingface.co/datasets/Pendrokar/open_tts_tracker">open_tts_tracker</a>.
Similar to LLM evaluation, <a href="https://huggingface.co/spaces/TTS-AGI/TTS-Arena">TTSArena</a> (and its <a href="https://huggingface.co/spaces/Pendrokar/TTS-Spaces-Arena">clone</a>)
provides rankings of popular systems.</p>

<p>While ELO-based ranking relies on human judgment, objective metrics are also commonly used for leaderboards.
Though speech generation lacks a universal standard for objective evaluation,
the community has developed several useful metrics:</p>

<ul>
  <li>Intelligibility - measured by speech recognition error rates on synthetic speech</li>
  <li>Naturalness - predicted using models trained on human naturalness ratings</li>
  <li>Similarity - for voice cloning systems, measured by cosine similarity between speaker embeddings of reference and generated speech</li>
</ul>
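<p>As a rough illustration of the intelligibility metric, the sketch below computes a character error rate (CER) as the Levenshtein distance between a reference text and an ASR transcript, divided by the reference length. The helper names (<code class="language-plaintext highlighter-rouge">edit_distance</code>, <code class="language-plaintext highlighter-rouge">character_error_rate</code>) and the example strings are made up for illustration and are not the <code class="language-plaintext highlighter-rouge">speech_gen_eval</code> API; real pipelines typically also normalize the text before scoring.</p>

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def character_error_rate(ref: str, hyp: str) -> float:
    """CER in percent, relative to the reference length."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


reference = "speech generation"
transcript = "speech generatian"  # hypothetical ASR output with one substitution
cer = character_error_rate(reference, transcript)
print(f"CER: {cer:.2f}%")
```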

<h2 id="we-would-like-to-introduce">We would like to introduce:</h2>

<h3 id="speech_gen_eval-github"><code class="language-plaintext highlighter-rouge">speech_gen_eval</code> (<a href="https://github.com/balacoon/speech_gen_eval">github</a>)</h3>
<p>An open-source library for objective evaluation of speech generation models</p>

<h3 id="speech_gen_eval_testsets-huggingface"><code class="language-plaintext highlighter-rouge">speech_gen_eval_testsets</code> (<a href="https://huggingface.co/datasets/balacoon/speech_gen_eval_testsets">huggingface</a>)</h3>
<p>A collection of test sets for evaluating speech generation models</p>

<h3 id="speech_gen_baselines-huggingface"><code class="language-plaintext highlighter-rouge">speech_gen_baselines</code> (<a href="https://huggingface.co/datasets/balacoon/speech_gen_baselines">huggingface</a>)</h3>
<p>A dataset containing synthetic speech from common models and their evaluation results</p>

<h3 id="ttsleaderboard-huggingface"><code class="language-plaintext highlighter-rouge">TTSLeaderboard</code> (<a href="https://huggingface.co/spaces/balacoon/TTSLeaderboard">huggingface</a>)</h3>
<p>A leaderboard for speech generation models</p>

<p>There are other tools for objective evaluation of speech generation, including <a href="https://github.com/Edresson/ZS-TTS-Evaluation">ZS-TTS-Evaluation</a>, <a href="https://github.com/BytedanceSpeech/seed-tts-eval">seed-tts-eval</a>, <a href="https://github.com/neonbjb/tts-scores">tts-scores</a>, <a href="https://github.com/keonlee9420/evaluate-zero-shot-tts">evaluate-zero-shot-tts</a>, and another leaderboard called <a href="https://huggingface.co/spaces/ttsds/benchmark">TTSDS Scores</a>.
Our contribution focuses on:</p>
<ul>
  <li>Simple addition of new metrics and test sets</li>
  <li>Support for various speech generation models (vocoders, zero-shot TTS, zero-shot VC, etc.), where tools can be reused</li>
  <li>Mature engineering with easy installation and fast evaluation</li>
  <li>Side-by-side listening to the generated speech, to help relate metrics to perceived quality</li>
</ul>

<p>We have been using most of these tools internally for a while and are now making them publicly available.</p>

<p>We plan to expand the leaderboard with more models and add new metrics to the library.
We welcome your suggestions and contributions!</p>]]></content><author><name>Balacoon</name></author><category term="Blog" /><category term="text-to-speech" /><category term="leaderboard" /><category term="evaluation" /><summary type="html"><![CDATA[As speech generation research grows, many new models have emerged. For a comprehensive list, see open_tts_tracker. Similar to LLM evaluation, TTSArena (and its clone) provides rankings of popular systems.]]></summary></entry><entry><title type="html">Tracing mHuBERT</title><link href="https://balacoon.com/blog/mhubert_tracing/" rel="alternate" type="text/html" title="Tracing mHuBERT" /><published>2025-02-10T00:00:00+00:00</published><updated>2025-02-10T05:20:02+00:00</updated><id>https://balacoon.com/blog/mhubert_tracing</id><content type="html" xml:base="https://balacoon.com/blog/mhubert_tracing/"><![CDATA[<p>Clustering of self-supervised embeddings is commonly used as “semantic” tokens in audio generation.
Typically wav2vec 2.0 or HuBERT outputs are used.
A 95M-parameter multilingual HuBERT model (147 languages) was <a href="https://huggingface.co/utter-project/mHuBERT-147">released</a> recently. Despite its modest size, the model is competitive on the SUPERB leaderboard.
This makes it a strong candidate for use as a semantic token extractor in multilingual speech generation
experiments.</p>

<h2 id="tracing">Tracing</h2>

<p>Converting the model into a TorchScript (Torch JIT) file makes it possible to run inference with minimal dependencies (just PyTorch) or to deploy it on inference servers (e.g. Triton).
In this tracing, we integrate the clustering step into the Torch module, eliminating the need to carry around custom clustering code.</p>
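<p>The idea of baking the clustering step into the traced module can be sketched as follows. This is a minimal illustration and not the code from the notebook: <code class="language-plaintext highlighter-rouge">ToyEncoder</code> is a stand-in for the real encoder, the centroids are random, and plain nearest-centroid assignment is assumed for the clustering step.</p>

```python
import io

import torch


class ToyEncoder(torch.nn.Module):
    """Stand-in for a self-supervised encoder (NOT mHuBERT): waveform to frame embeddings."""

    def __init__(self, dim: int = 16):
        super().__init__()
        self.conv = torch.nn.Conv1d(1, dim, kernel_size=10, stride=5)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # (batch, samples) to (batch, frames, dim)
        return self.conv(wav.unsqueeze(1)).transpose(1, 2)


class EncoderWithClustering(torch.nn.Module):
    """Encoder followed by nearest-centroid assignment, so the output is discrete codes."""

    def __init__(self, encoder: torch.nn.Module, centroids: torch.Tensor):
        super().__init__()
        self.encoder = encoder
        self.register_buffer("centroids", centroids)  # (num_clusters, dim)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(wav)  # (batch, frames, dim)
        # squared Euclidean distance from every frame to every centroid
        dists = (feats.unsqueeze(2) - self.centroids).pow(2).sum(dim=-1)
        return dists.argmin(dim=-1)  # (batch, frames)


torch.manual_seed(0)
model = EncoderWithClustering(ToyEncoder(), torch.randn(8, 16)).eval()
traced = torch.jit.trace(model, torch.randn(1, 1600))

# round-trip through a serialized file; only PyTorch is needed to load it
buffer = io.BytesIO()
torch.jit.save(traced, buffer)
buffer.seek(0)
loaded = torch.jit.load(buffer)

wav = torch.randn(2, 3200)
codes = loaded(wav)
print(codes.shape)
```

<p>Because the centroids are stored as a buffer, they travel inside the saved file, and the consumer of the traced model does not need any clustering library.</p>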

<p>Unfortunately, the FAISS index for the clustering step is only available for the model after the second iteration.
As a result, the traced model is slightly less capable.</p>

<p>The traced model is available at <a href="https://huggingface.co/balacoon/mhubert-147">balacoon/mhubert-147</a>.
The full notebook used for tracing and testing can be found <a href="https://github.com/balacoon/balacoon.github.io/blob/master/assets/posts/mhubert/trace_hubert.ipynb">here</a>.</p>

<p>Many thanks to @dathudeptrai for <a href="https://huggingface.co/utter-project/mHuBERT-147/discussions/6">posting</a> a snippet on discrete tokens extraction.</p>

<h2 id="notes">Notes</h2>

<p>Here are some notes and practical findings from the mHuBERT model tracing.
Please drop a message if any of these are incorrect or incomplete.</p>
<ul>
  <li>The attention mask is ignored by mHuBERT. As a result, during batched inference, you can get different discrete codes depending on the padding:</li>
</ul>
<figure style="width: 500px" class="align-center">
  <img src="https://balacoon.com/assets/images/posts/mhubert/batching.png" alt="" />
  <figcaption class="figure-caption text-center">Effect of padding during batching</figcaption>
</figure>
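<p>To see why ignoring the attention mask makes outputs depend on padding, here is a toy sketch with a single PyTorch attention layer (not mHuBERT itself). Without a padding mask, the valid frames also attend to the padded ones, so their outputs change; with a mask they match the unpadded result. The shapes and sizes are arbitrary.</p>

```python
import torch

torch.manual_seed(0)
attention = torch.nn.MultiheadAttention(embed_dim=8, num_heads=2, batch_first=True).eval()

frames = torch.randn(1, 5, 8)                               # 5 valid frames
padded = torch.cat([frames, torch.zeros(1, 3, 8)], dim=1)   # plus 3 padding frames
padding_mask = torch.tensor([[False] * 5 + [True] * 3])     # True marks padding

reference, _ = attention(frames, frames, frames)
unmasked, _ = attention(padded, padded, padded)
masked, _ = attention(padded, padded, padded, key_padding_mask=padding_mask)

# compare only the 5 valid frames against the unpadded run
unmasked_matches = torch.allclose(unmasked[:, :5], reference, atol=1e-5)
masked_matches = torch.allclose(masked[:, :5], reference, atol=1e-5)
print(unmasked_matches, masked_matches)
```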

<ul>
  <li>mHuBERT applies mean/std normalization to input audio.</li>
  <li><code class="language-plaintext highlighter-rouge">faiss</code> has a lot of clustering methods implemented. Fortunately, a linear transformation was used for clustering in mHuBERT, allowing it to be extracted into a transformation matrix. See <code class="language-plaintext highlighter-rouge">TorchFaiss</code> in the notebook for details.</li>
</ul>]]></content><author><name>Balacoon</name></author><category term="Blog" /><category term="text-to-speech" /><category term="HuBERT" /><category term="tracing" /><summary type="html"><![CDATA[Clustering of self-supervised embeddings is commonly used as “semantic” tokens in audio generation. Typically wav2vec 2.0 or HuBERT outputs are used. A 95M-parameter multilingual HuBERT model (147 languages) was released recently. Despite its modest size, the model is competitive on the SUPERB leaderboard. This makes it a strong candidate for use as a semantic token extractor in multilingual speech generation experiments.]]></summary></entry><entry><title type="html">Super-resolution for TTS data</title><link href="https://balacoon.com/blog/dataset-super-resolution/" rel="alternate" type="text/html" title="Super-resolution for TTS data" /><published>2025-01-27T00:00:00+00:00</published><updated>2025-01-27T05:20:02+00:00</updated><id>https://balacoon.com/blog/dataset-super-resolution</id><content type="html" xml:base="https://balacoon.com/blog/dataset-super-resolution/"><![CDATA[<p>Zero-shot speech generation requires massive amounts of data.
However, large-scale speech datasets are commonly collected for ASR and are therefore sampled at 16 kHz.
The LibriTTS-R<a href="#1">[1]</a> work suggests that audio enhancement and super-resolution methods can be beneficial for TTS data processing. This post compares a few open-source upsampling methods,
targeting the use case of preparing a TTS dataset (24 kHz) from an ASR one (16 kHz).</p>

<h2 id="methods">Methods</h2>

<p>We compare the following methods:</p>

<ul>
  <li>SoX - vanilla upsampling via interpolation, the higher frequencies are not restored.</li>
  <li><a href="https://github.com/resemble-ai/resemble-enhance">Resemble Enhance</a> - diffusion-based denoising / enhancement / upsampling.</li>
  <li><a href="https://github.com/haoheliu/versatile_audio_super_resolution">AudioSR</a> - diffusion-based audio super resolution.</li>
  <li><a href="https://yxlu-0102.github.io/AP-BWE/">AP-BWE</a> - GAN-based bandwidth extension in spectral domain.</li>
</ul>
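<p>The interpolation baseline can be illustrated with a short sketch. It uses SciPy's <code class="language-plaintext highlighter-rouge">resample_poly</code> as a stand-in for SoX (an assumption, the two use different filters) and white noise instead of speech. The point is only that plain resampling changes the sample rate without creating content in the new upper band.</p>

```python
import numpy as np
from scipy.signal import resample_poly

rng = np.random.default_rng(0)
sr_in, sr_out = 16000, 24000
audio_16k = rng.standard_normal(sr_in)  # one second of white noise as a stand-in for speech

# 24000 / 16000 = 3 / 2: upsample by 3, low-pass filter, downsample by 2
audio_24k = resample_poly(audio_16k, up=3, down=2)

# Share of energy above 10 kHz. The 16 kHz input cannot represent anything above 8 kHz;
# we measure from 10 kHz to stay clear of the low-pass filter's transition band.
spectrum = np.abs(np.fft.rfft(audio_24k)) ** 2
freqs = np.fft.rfftfreq(len(audio_24k), d=1.0 / sr_out)
high_band_share = spectrum[freqs > 10000].sum() / spectrum.sum()
print(len(audio_24k), high_band_share)
```

<p>The super-resolution models in the list are meant to fill exactly this empty band with plausible content.</p>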

<p>Processing time per file on a single RTX 3090:</p>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th style="text-align: center">Resemble Enhance</th>
      <th style="text-align: center">AudioSR</th>
      <th style="text-align: center">AP-BWE</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>Processing time per file, sec</td>
      <td style="text-align: center">8.0</td>
      <td style="text-align: center">5.0</td>
      <td style="text-align: center">0.035</td>
    </tr>
  </tbody>
</table>
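<p>To put these numbers in perspective, the sketch below extrapolates them to a whole dataset. The dataset size is hypothetical, and the estimate assumes sequential processing on one GPU with files similar in length to the ones we timed.</p>

```python
seconds_per_file = {"Resemble Enhance": 8.0, "AudioSR": 5.0, "AP-BWE": 0.035}
num_files = 1_000_000  # hypothetical dataset size

# total processing time in GPU-hours for each method
gpu_hours = {name: sec * num_files / 3600 for name, sec in seconds_per_file.items()}
speedup = seconds_per_file["AudioSR"] / seconds_per_file["AP-BWE"]

for name, hours in gpu_hours.items():
    print(f"{name}: {hours:.1f} GPU-hours")
print(f"AP-BWE vs AudioSR speedup: {speedup:.1f}x")
```

<p>For a million files this comes to roughly 2,200, 1,400 and 10 GPU-hours respectively.</p>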

<p>We measure intelligibility, audio quality and speaker similarity of the processed audio on two datasets: VCTK<a href="#2">[2]</a> and DAPS<a href="#3">[3]</a>. The first represents clean audio that simply lacks higher frequencies. The second is a more challenging use case: speech recorded on consumer microphones with background noise present.
We use the ECAPA<a href="#4">[4]</a> speaker encoder by <a href="https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb">SpeechBrain</a> to extract speaker representations and measure speaker similarity. We run speech recognition with the <a href="https://huggingface.co/nvidia/stt_en_conformer_transducer_xlarge#model-architecture">Conformer-Transducer ASR model by NVIDIA</a> to evaluate intelligibility in terms of Character Error Rate. Finally, we use a pre-trained Mean Opinion Score estimator, UTMOS<a href="#5">[5]</a>, to assess naturalness.
Keep in mind that all the metrics are computed on 16 kHz audio, so they mainly track whether upsampling
changes the information that is already there.</p>
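<p>For the similarity column, a minimal sketch of the computation is shown below. It assumes the reported value is one minus the cosine similarity between the speaker embeddings of the original and the processed audio, which is consistent with the original audio scoring 0. The vectors are random placeholders rather than real ECAPA embeddings, and the dimension is arbitrary.</p>

```python
import numpy as np


def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """1 - cosine similarity between two embedding vectors."""
    similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return float(1.0 - similarity)


rng = np.random.default_rng(0)
original = rng.standard_normal(192)                     # placeholder embedding of the original audio
processed = original + 0.1 * rng.standard_normal(192)   # placeholder embedding of the processed audio

same = cosine_distance(original, original)
shifted = cosine_distance(original, processed)
print(same, shifted)
```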

<h2 id="vctk-testset">VCTK testset</h2>

<p>A 2k subset is sampled.</p>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th style="text-align: center">Naturalness(MOS↑)</th>
      <th style="text-align: center">Intelligibility(CER, %↓)</th>
      <th style="text-align: center">Similarity(inverted cosine distance↓)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>original audio</td>
      <td style="text-align: center">4.078</td>
      <td style="text-align: center">0.178</td>
      <td style="text-align: center">0</td>
    </tr>
    <tr>
      <td>SoX</td>
      <td style="text-align: center">4.075</td>
      <td style="text-align: center">0.178</td>
      <td style="text-align: center">0.002</td>
    </tr>
    <tr>
      <td>Resemble Enhance</td>
      <td style="text-align: center">3.86</td>
      <td style="text-align: center">0.18</td>
      <td style="text-align: center">0.079</td>
    </tr>
    <tr>
      <td>AudioSR</td>
      <td style="text-align: center">4.05</td>
      <td style="text-align: center">0.178</td>
      <td style="text-align: center">0.039</td>
    </tr>
    <tr>
      <td>AP-BWE</td>
      <td style="text-align: center">4.06</td>
      <td style="text-align: center">0.178</td>
      <td style="text-align: center">0.042</td>
    </tr>
  </tbody>
</table>

<p>Example of the upsampling in spectral domain:</p>

<figure style="width: 800px" class="align-center">
  <img src="https://balacoon.com/assets/images/posts/dataset_super_resolution/vctk_example.png" alt="" />
</figure>

<p>Audio samples:</p>

<iframe src="https://balacoonwebsite.s3.eu-north-1.amazonaws.com/posts/dataset_super_resolution/sr_vctk_demo.html" width="800" height="600"></iframe>

<h2 id="daps-testset">DAPS testset</h2>

<p>The dataset is segmented into sentence-level utterances, and a 2k subset is sampled.</p>

<table>
  <thead>
    <tr>
      <th>Method</th>
      <th style="text-align: center">Naturalness(MOS↑)</th>
      <th style="text-align: center">Intelligibility(CER, %↓)</th>
      <th style="text-align: center">Similarity(inverted cosine distance↓)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>original audio</td>
      <td style="text-align: center">2.48</td>
      <td style="text-align: center">2.75</td>
      <td style="text-align: center">0</td>
    </tr>
    <tr>
      <td>SoX</td>
      <td style="text-align: center">2.48</td>
      <td style="text-align: center">2.748</td>
      <td style="text-align: center">0.008</td>
    </tr>
    <tr>
      <td>Resemble Enhance</td>
      <td style="text-align: center">3.32</td>
      <td style="text-align: center">12.98</td>
      <td style="text-align: center">0.43</td>
    </tr>
    <tr>
      <td>AudioSR</td>
      <td style="text-align: center">2.45</td>
      <td style="text-align: center">3.15</td>
      <td style="text-align: center">0.07</td>
    </tr>
    <tr>
      <td>AP-BWE</td>
      <td style="text-align: center">2.456</td>
      <td style="text-align: center">2.73</td>
      <td style="text-align: center">0.007</td>
    </tr>
  </tbody>
</table>

<p>Example of the upsampling in spectral domain:</p>

<figure style="width: 800px" class="align-center">
  <img src="https://balacoon.com/assets/images/posts/dataset_super_resolution/daps_example.png" alt="" />
</figure>

<p>Audio samples:</p>

<iframe src="https://balacoonwebsite.s3.eu-north-1.amazonaws.com/posts/dataset_super_resolution/sr_daps_demo.html" width="800" height="600"></iframe>

<h2 id="conclusions">Conclusions</h2>

<p><code class="language-plaintext highlighter-rouge">Resemble-Enhance</code> also attempts denoising and enhancement.
It substantially corrupts the noisy audio files, which is reflected in greatly degraded intelligibility (CER rises from 2.75% to 12.98% on DAPS).
Both <code class="language-plaintext highlighter-rouge">AudioSR</code> and <code class="language-plaintext highlighter-rouge">AP-BWE</code> are very gentle to existing information and barely change the metrics.
The former adds more detail and blends with existing high-frequency information more smoothly.
The latter, however, is almost 150x faster. Our pick is <code class="language-plaintext highlighter-rouge">AudioSR</code> if the amount of data is manageable, otherwise <code class="language-plaintext highlighter-rouge">AP-BWE</code>.</p>

<h2 id="references">References</h2>
<p><a id="1">[1]</a>
<a href="https://arxiv.org/abs/2305.18802">LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus</a></p>

<p><a id="2">[2]</a>
<a href="https://datashare.ed.ac.uk/handle/10283/3443">CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit</a></p>

<p><a id="3">[3]</a>
<a href="https://ccrma.stanford.edu/~gautham/Site/daps_files/mysore-spl2015.pdf">Device and Produced Speech (DAPS) Dataset</a></p>

<p><a id="4">[4]</a>
<a href="https://arxiv.org/abs/2104.01466">ECAPA-TDNN Embeddings for Speaker Diarization</a></p>

<p><a id="5">[5]</a>
<a href="https://arxiv.org/abs/2204.02152">UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022</a></p>]]></content><author><name>Balacoon</name></author><category term="Blog" /><category term="text-to-speech" /><category term="data" /><category term="super resolution" /><category term="upsampling" /><summary type="html"><![CDATA[Zero-shot speech generation requires massive amounts of data. However, large-scale speech datasets are commonly collected for ASR and are therefore sampled at 16 kHz. The LibriTTS-R[1] work suggests that audio enhancement and super-resolution methods can be beneficial for TTS data processing. This post compares a few open-source upsampling methods, targeting the use case of preparing a TTS dataset (24 kHz) from an ASR one (16 kHz).]]></summary></entry><entry><title type="html">Streaming Inference with Convolutional Layers</title><link href="https://balacoon.com/blog/streaming_inference/" rel="alternate" type="text/html" title="Streaming Inference with Convolutional Layers" /><published>2024-04-20T00:00:00+00:00</published><updated>2024-04-20T05:20:02+00:00</updated><id>https://balacoon.com/blog/streaming_inference</id><content type="html" xml:base="https://balacoon.com/blog/streaming_inference/"><![CDATA[<p>In this post, we explore how to apply convolutional layers to infinitely long inputs, specifically focusing on how to process inputs in chunks to minimize latency. For instance, in text-to-speech applications, instead of synthesizing an entire sentence at once, we prefer to generate and play back audio in segments. While recurrent or autoregressive networks are inherently <code class="language-plaintext highlighter-rouge">causal</code> and thus well-suited for streaming processing, convolutional layers present more challenges and require careful handling.</p>

<h1 id="conv1d">Conv1d</h1>

<p>First, let’s examine a standard convolutional layer. By default, convolutions are <code class="language-plaintext highlighter-rouge">non-causal</code>, meaning the output at any given time may depend on both past and future input values.</p>
<figure style="width: 400px" class="align-center">
  <img src="https://balacoon.com/assets/images/posts/streaming_inference/non_causal_conv.png" alt="" />
  <figcaption class="figure-caption text-center">Non-causal convolution</figcaption>
</figure>

<p>To get an output of the same size as the input, we pad the input on both sides by the <code class="language-plaintext highlighter-rouge">receptive_field</code> of the convolution layer, defined here as <code class="language-plaintext highlighter-rouge">kernel_size // 2</code> (this assumes an odd <code class="language-plaintext highlighter-rouge">kernel_size</code>):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>

<span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">100</span><span class="p">)</span>  <span class="c1"># (batch x channels x time)
</span><span class="n">kernel_size</span> <span class="o">=</span> <span class="mi">7</span>
<span class="n">receptive_field</span> <span class="o">=</span> <span class="n">kernel_size</span> <span class="o">//</span> <span class="mi">2</span>
<span class="n">non_causal_conv_layer</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Conv1d</span><span class="p">(</span>
    <span class="mi">1</span><span class="p">,</span>  <span class="c1"># input channels
</span>    <span class="mi">1</span><span class="p">,</span>  <span class="c1"># output channels
</span>    <span class="n">kernel_size</span><span class="p">,</span>
    <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
    <span class="n">padding</span><span class="o">=</span><span class="n">receptive_field</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">y</span> <span class="o">=</span> <span class="n">non_causal_conv_layer</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">x</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="n">y</span><span class="p">.</span><span class="n">shape</span>
</code></pre></div></div>

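<p>For chunked inference, padding must be applied manually. Each chunk then covers <code class="language-plaintext highlighter-rouge">chunk_size + 2 * receptive_field</code> input samples, and the window is
shifted by <code class="language-plaintext highlighter-rouge">chunk_size</code> for each subsequent chunk, so consecutive chunks overlap by <code class="language-plaintext highlighter-rouge">2 * receptive_field</code> samples.</p>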
<figure style="width: 400px" class="align-center">
  <img src="https://balacoon.com/assets/images/posts/streaming_inference/non_causal_conv_chunks.png" alt="" />
  <figcaption class="figure-caption text-center">Non-causal convolution in chunks</figcaption>
</figure>

<p>This can be implemented as follows:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">non_causal_chunk_conv_layer</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Conv1d</span><span class="p">(</span>
    <span class="mi">1</span><span class="p">,</span>  <span class="c1"># input channels
</span>    <span class="mi">1</span><span class="p">,</span>  <span class="c1"># output channels
</span>    <span class="n">kernel_size</span><span class="p">,</span>
    <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
    <span class="n">padding</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>  <span class="c1"># we will do padding manually
</span><span class="p">)</span>
<span class="c1"># copy the weights from the original conv layer
</span><span class="n">non_causal_chunk_conv_layer</span><span class="p">.</span><span class="n">weight</span> <span class="o">=</span> <span class="n">non_causal_conv_layer</span><span class="p">.</span><span class="n">weight</span>
<span class="c1"># pad the input by receptive field on both sides
</span><span class="n">padded_x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">functional</span><span class="p">.</span><span class="n">pad</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="p">(</span><span class="n">receptive_field</span><span class="p">,</span> <span class="n">receptive_field</span><span class="p">))</span>

<span class="c1"># run inference in a loop on chunk_size
</span><span class="n">chunk_outputs</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">chunk_size</span> <span class="o">=</span> <span class="mi">20</span>
<span class="n">i</span> <span class="o">=</span> <span class="mi">0</span>
<span class="k">while</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">padded_x</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="o">-</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">receptive_field</span><span class="p">:</span>
    <span class="n">chunk</span> <span class="o">=</span> <span class="n">padded_x</span><span class="p">[:,</span> <span class="p">:,</span> <span class="n">i</span><span class="p">:</span> <span class="n">i</span> <span class="o">+</span> <span class="n">chunk_size</span> <span class="o">+</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">receptive_field</span><span class="p">]</span>
    <span class="n">chunk_outputs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span>
        <span class="n">non_causal_chunk_conv_layer</span><span class="p">(</span><span class="n">chunk</span><span class="p">)</span>
    <span class="p">)</span>
    <span class="n">i</span> <span class="o">+=</span> <span class="n">chunk_size</span>
<span class="n">chunked_y</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">(</span><span class="n">chunk_outputs</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">chunked_y</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="n">y</span><span class="p">.</span><span class="n">shape</span>
<span class="k">assert</span> <span class="n">torch</span><span class="p">.</span><span class="nb">all</span><span class="p">(</span><span class="n">chunked_y</span> <span class="o">==</span> <span class="n">y</span><span class="p">)</span>
</code></pre></div></div>

<p>If you have a stack of convolutional layers, their receptive fields simply add up, but the method remains the same.</p>
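<p>A minimal sketch of the stacked case, assuming stride 1 and no dilation. The two-layer stack and its sizes are arbitrary. The input is padded once by the summed receptive field, and every chunk carries that much context on both sides.</p>

```python
import torch

torch.manual_seed(0)
x = torch.randn(1, 1, 100)  # (batch x channels x time)

# two unpadded layers with kernel sizes 7 and 5
stack = torch.nn.Sequential(
    torch.nn.Conv1d(1, 4, 7, padding=0),
    torch.nn.Conv1d(4, 1, 5, padding=0),
)
total_receptive_field = 7 // 2 + 5 // 2  # 3 + 2 = 5 samples on each side

padded_x = torch.nn.functional.pad(x, (total_receptive_field, total_receptive_field))
chunk_size = 20
with torch.no_grad():
    full_y = stack(padded_x)
    chunk_outputs = []
    for i in range(0, x.size(2), chunk_size):
        # each chunk carries the summed context on both sides
        chunk = padded_x[:, :, i: i + chunk_size + 2 * total_receptive_field]
        chunk_outputs.append(stack(chunk))
    chunked_y = torch.cat(chunk_outputs, 2)

print(full_y.shape, torch.allclose(chunked_y, full_y, atol=1e-6))
```

<p>Note that padding only the input of the stack is not identical to padding inside every layer: the results differ near the edges of the signal, but the chunking logic is the same.</p>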

<h1 id="causal-conv1d">Causal Conv1d</h1>

<p>For online processing (such as live denoising or voice conversion), latency is influenced by both <code class="language-plaintext highlighter-rouge">chunk_size</code> and the <code class="language-plaintext highlighter-rouge">receptive_field</code> of the convolutional kernel on the right, also known as lookahead. While chunk size is adjustable, the receptive field is limited by the architecture. To reduce latency, one should aim to design a convolution with an asymmetrical receptive field. In the extreme case, with no lookahead, this results in a <code class="language-plaintext highlighter-rouge">causal</code> convolutional layer:</p>
<figure style="width: 400px" class="align-center">
  <img src="https://balacoon.com/assets/images/posts/streaming_inference/causal_conv.png" alt="" />
  <figcaption class="figure-caption text-center">Causal convolution</figcaption>
</figure>
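<p>A back-of-the-envelope sketch of the algorithmic latency, i.e. how much audio must arrive before a chunk can be processed, ignoring compute time. The sample rate and chunk size are hypothetical, and the helper name is made up for illustration.</p>

```python
def algorithmic_latency_ms(chunk_size: int, lookahead: int, sample_rate: int) -> float:
    """Audio (in ms) that has to be buffered before a chunk can be processed."""
    return 1000.0 * (chunk_size + lookahead) / sample_rate


sample_rate = 24000  # hypothetical
chunk_size = 480     # 20 ms of audio, hypothetical
kernel_size = 7

# symmetric kernel: kernel_size // 2 samples of lookahead; causal kernel: none
non_causal = algorithmic_latency_ms(chunk_size, kernel_size // 2, sample_rate)
causal = algorithmic_latency_ms(chunk_size, 0, sample_rate)
print(non_causal, causal)
```

<p>For a single small kernel the difference is tiny, but the lookahead of stacked layers adds up, so in deep networks it can dominate the chunk size.</p>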

<p>This is achieved by asymmetrically padding the convolution, padding only on the left by <code class="language-plaintext highlighter-rouge">kernel_size - 1</code>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">causal_conv_layer</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Conv1d</span><span class="p">(</span>
    <span class="mi">1</span><span class="p">,</span>  <span class="c1"># input channels
</span>    <span class="mi">1</span><span class="p">,</span>  <span class="c1"># output channels
</span>    <span class="n">kernel_size</span><span class="p">,</span>
    <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
    <span class="n">padding</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>  <span class="c1"># need to do padding manually for the asymmetric case
</span><span class="p">)</span>
<span class="n">padded_x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">functional</span><span class="p">.</span><span class="n">pad</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="p">(</span><span class="n">kernel_size</span> <span class="o">-</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">))</span>

<span class="n">y</span> <span class="o">=</span> <span class="n">causal_conv_layer</span><span class="p">(</span><span class="n">padded_x</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">x</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="n">y</span><span class="p">.</span><span class="n">shape</span>
</code></pre></div></div>

<p>Inference in chunks does not differ significantly from a regular convolution, except that there is only one receptive field located on the left of the input.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># run inference in a loop on chunk_size
</span><span class="n">chunk_outputs</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">chunk_size</span> <span class="o">=</span> <span class="mi">20</span>
<span class="n">i</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">receptive_field</span> <span class="o">=</span> <span class="n">kernel_size</span> <span class="o">-</span> <span class="mi">1</span>
<span class="k">while</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">padded_x</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="o">-</span> <span class="n">receptive_field</span><span class="p">:</span>
    <span class="n">chunk</span> <span class="o">=</span> <span class="n">padded_x</span><span class="p">[:,</span> <span class="p">:,</span> <span class="n">i</span><span class="p">:</span> <span class="n">i</span> <span class="o">+</span> <span class="n">chunk_size</span> <span class="o">+</span> <span class="n">receptive_field</span><span class="p">]</span>
    <span class="n">chunk_outputs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span>
        <span class="n">causal_conv_layer</span><span class="p">(</span><span class="n">chunk</span><span class="p">)</span>
    <span class="p">)</span>
    <span class="n">i</span> <span class="o">+=</span> <span class="n">chunk_size</span>
<span class="n">chunked_y</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">(</span><span class="n">chunk_outputs</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">chunked_y</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="n">y</span><span class="p">.</span><span class="n">shape</span>
<span class="k">assert</span> <span class="n">torch</span><span class="p">.</span><span class="nb">all</span><span class="p">(</span><span class="n">chunked_y</span> <span class="o">==</span> <span class="n">y</span><span class="p">)</span>
</code></pre></div></div>
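<p>In a live setting the whole padded input is not available up front. One common way to handle this, sketched here under our own naming (<code class="language-plaintext highlighter-rouge">StreamingCausalConv1d</code> is a hypothetical helper, not a PyTorch class), is to keep the last <code class="language-plaintext highlighter-rouge">kernel_size - 1</code> input samples as state and prepend them to each incoming chunk.</p>

```python
import torch


class StreamingCausalConv1d(torch.nn.Module):
    """Wraps an unpadded Conv1d and carries the left context between calls."""

    def __init__(self, conv: torch.nn.Conv1d):
        super().__init__()
        self.conv = conv
        self.context = conv.kernel_size[0] - 1
        self.state = None  # last `context` input samples seen so far

    def forward(self, chunk: torch.Tensor) -> torch.Tensor:
        if self.state is None:
            # the very first chunk is preceded by zeros, like the left padding offline
            self.state = chunk.new_zeros(chunk.size(0), chunk.size(1), self.context)
        padded = torch.cat([self.state, chunk], dim=2)
        # remember the tail of the input for the next call
        self.state = padded[:, :, padded.size(2) - self.context:]
        return self.conv(padded)


torch.manual_seed(0)
conv = torch.nn.Conv1d(1, 1, kernel_size=7, bias=False)
x = torch.randn(1, 1, 100)

with torch.no_grad():
    offline_y = conv(torch.nn.functional.pad(x, (6, 0)))  # left padding of kernel_size - 1
    streamer = StreamingCausalConv1d(conv)
    streamed_y = torch.cat([streamer(c) for c in x.split(20, dim=2)], dim=2)

print(torch.allclose(streamed_y, offline_y, atol=1e-6))
```

<p>The state has to be reset (set back to <code class="language-plaintext highlighter-rouge">None</code>) between independent streams.</p>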

<h1 id="transposed-conv1d">Transposed Conv1d</h1>

<p>In audio or image processing, low-dimensional latent representations often need to be upsampled back to samples or pixels. This is achieved through transposed convolution with strides. A detailed explanation of this can be found in a <a href="https://medium.com/@santi.pdp/how-pytorch-transposed-convs1d-work-a7adac63c4a5">blogpost</a> on the topic. In short, each input point expands into multiple output points. The <code class="language-plaintext highlighter-rouge">stride</code> determines the degree of upsampling performed by the transposed convolution, and the kernel size is usually set so that <code class="language-plaintext highlighter-rouge">kernel_size = stride * 2</code> to prevent <a href="https://distill.pub/2016/deconv-checkerboard/">checkerboard artifacts</a>. Two neighboring input points then contribute to each output point. Padding in this case actually reduces the number of output points at the edges, ensuring that <code class="language-plaintext highlighter-rouge">stride * len(input)</code> output points are produced.</p>

<figure style="width: 500px" class="align-center">
  <img src="https://balacoon.com/assets/images/posts/streaming_inference/transposed_conv.png" alt="" />
  <figcaption class="figure-caption text-center">Transposed convolution with stride</figcaption>
</figure>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>

<span class="n">upsample_rate</span> <span class="o">=</span> <span class="mi">4</span>
<span class="n">kernel_size</span> <span class="o">=</span> <span class="n">upsample_rate</span> <span class="o">*</span> <span class="mi">2</span> <span class="o">+</span> <span class="n">upsample_rate</span> <span class="o">%</span> <span class="mi">2</span>
<span class="n">padding</span> <span class="o">=</span> <span class="p">(</span><span class="n">kernel_size</span> <span class="o">-</span> <span class="n">upsample_rate</span><span class="p">)</span> <span class="o">//</span> <span class="mi">2</span>

<span class="n">transposed_conv_layer</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">ConvTranspose1d</span><span class="p">(</span>
    <span class="n">in_channels</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
    <span class="n">out_channels</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
    <span class="n">kernel_size</span><span class="o">=</span><span class="n">kernel_size</span><span class="p">,</span>
    <span class="n">stride</span><span class="o">=</span><span class="n">upsample_rate</span><span class="p">,</span>
    <span class="n">padding</span><span class="o">=</span><span class="n">padding</span><span class="p">,</span>
    <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">y</span> <span class="o">=</span> <span class="n">transposed_conv_layer</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>  <span class="c1"># (1, 1, 400)
</span><span class="k">print</span><span class="p">(</span><span class="n">y</span><span class="p">.</span><span class="n">shape</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">y</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="n">x</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">),</span> <span class="n">x</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="o">*</span> <span class="n">upsample_rate</span><span class="p">)</span>
</code></pre></div></div>
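
<p>The claim that this kernel size and padding yield exactly <code class="language-plaintext highlighter-rouge">stride * len(input)</code> output points can be checked with plain arithmetic. The sketch below assumes the output-length formula documented for <code class="language-plaintext highlighter-rouge">ConvTranspose1d</code> with <code class="language-plaintext highlighter-rouge">dilation=1</code> and <code class="language-plaintext highlighter-rouge">output_padding=0</code>; the helper names are hypothetical and only serve the illustration.</p>

```python
def conv_transpose1d_out_len(length, kernel_size, stride, padding):
    """Output length of a transposed conv, assuming dilation=1, output_padding=0."""
    return (length - 1) * stride - 2 * padding + kernel_size


def upsampling_layout(upsample_rate):
    """Kernel size and padding chosen the same way as in the snippet above."""
    kernel_size = upsample_rate * 2 + upsample_rate % 2
    padding = (kernel_size - upsample_rate) // 2
    return kernel_size, padding


out_lens = {}
for rate in (2, 3, 4, 5, 64):
    kernel_size, padding = upsampling_layout(rate)
    # 100 input points should always become 100 * rate output points
    out_lens[rate] = conv_transpose1d_out_len(100, kernel_size, rate, padding)
    print(rate, kernel_size, padding, out_lens[rate])
```

<p>The extra <code class="language-plaintext highlighter-rouge">upsample_rate % 2</code> term in the kernel size is what keeps the output length exact for odd upsampling rates.</p>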

<p>Running transposed convolution in chunks is similar to regular convolution: edges of the output are trimmed, input is padded, and inference is performed on overlapping chunks.</p>
<figure style="width: 600px" class="align-center">
  <img src="https://balacoon.com/assets/images/posts/streaming_inference/transposed_conv_chunks.png" alt="" />
  <figcaption class="figure-caption text-center">Transposed convolution with stride in chunks</figcaption>
</figure>

<p>Computing parameters for streaming inference differs from regular convolution:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># we will run inference with overlap,
# which needs to be taken into account
# in the slicing
</span><span class="n">extra_samples</span> <span class="o">=</span> <span class="p">(</span><span class="n">kernel_size</span> <span class="o">-</span> <span class="n">upsample_rate</span><span class="p">)</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">//</span> <span class="mi">2</span> <span class="o">-</span> <span class="n">upsample_rate</span> <span class="o">%</span> <span class="mi">2</span>  <span class="c1"># number of extra output samples to trim on the left and right
</span><span class="n">transposed_chunk_conv_layer</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">ConvTranspose1d</span><span class="p">(</span>
    <span class="n">in_channels</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
    <span class="n">out_channels</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
    <span class="n">kernel_size</span><span class="o">=</span><span class="n">kernel_size</span><span class="p">,</span>
    <span class="n">stride</span><span class="o">=</span><span class="n">upsample_rate</span><span class="p">,</span>
    <span class="n">padding</span><span class="o">=</span><span class="n">extra_samples</span><span class="p">,</span>
    <span class="n">bias</span><span class="o">=</span><span class="bp">False</span>
<span class="p">)</span>
<span class="n">transposed_chunk_conv_layer</span><span class="p">.</span><span class="n">weight</span> <span class="o">=</span> <span class="n">transposed_conv_layer</span><span class="p">.</span><span class="n">weight</span>

<span class="n">chunk_outputs</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">chunk_size</span> <span class="o">=</span> <span class="mi">20</span>
<span class="n">i</span> <span class="o">=</span> <span class="mi">0</span>
<span class="c1"># 2 inputs contribute to each output point, so overlap is 1
</span><span class="n">overlap</span> <span class="o">=</span> <span class="n">kernel_size</span> <span class="o">//</span> <span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">upsample_rate</span><span class="p">)</span>
<span class="c1"># need to pad so edges are handled correctly,
# this padding is taken into account in slicing
</span><span class="n">padded_x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">functional</span><span class="p">.</span><span class="n">pad</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="p">(</span><span class="n">overlap</span><span class="p">,</span> <span class="n">overlap</span><span class="p">))</span>
<span class="k">while</span> <span class="n">i</span> <span class="o">&lt;</span> <span class="n">padded_x</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="o">-</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">overlap</span><span class="p">:</span>
    <span class="n">chunk</span> <span class="o">=</span> <span class="n">padded_x</span><span class="p">[:,</span> <span class="p">:,</span> <span class="n">i</span><span class="p">:</span> <span class="n">i</span> <span class="o">+</span> <span class="n">chunk_size</span> <span class="o">+</span> <span class="mi">2</span> <span class="o">*</span> <span class="n">overlap</span><span class="p">]</span>
    <span class="n">res</span> <span class="o">=</span> <span class="n">transposed_chunk_conv_layer</span><span class="p">(</span><span class="n">chunk</span><span class="p">)</span>
    <span class="n">chunk_outputs</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">res</span><span class="p">)</span>
    <span class="n">i</span> <span class="o">+=</span> <span class="n">chunk_size</span>
<span class="n">chunked_y</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">(</span><span class="n">chunk_outputs</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">chunked_y</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="n">y</span><span class="p">.</span><span class="n">shape</span>
<span class="k">assert</span> <span class="n">torch</span><span class="p">.</span><span class="nb">all</span><span class="p">(</span><span class="n">chunked_y</span> <span class="o">==</span> <span class="n">y</span><span class="p">)</span>
</code></pre></div></div>
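
<p>To see why the larger padding and the one-point overlap line up, it may help to replay the same procedure without any framework. The following is a minimal sketch on Python lists with integer toy data (so the comparison is exact); <code class="language-plaintext highlighter-rouge">conv_transpose1d</code> here is a hypothetical naive helper, not the PyTorch layer, and the weights are made up.</p>

```python
def conv_transpose1d(x, w, stride, padding):
    """Naive single-channel transposed convolution on lists (no bias)."""
    full = [0] * ((len(x) - 1) * stride + len(w))
    for i, x_i in enumerate(x):
        for j, w_j in enumerate(w):
            full[i * stride + j] += x_i * w_j
    # padding trims output points from both edges
    return full[padding:len(full) - padding]


upsample_rate = 4
kernel_size = upsample_rate * 2 + upsample_rate % 2
padding = (kernel_size - upsample_rate) // 2
extra_samples = (kernel_size - upsample_rate) * 3 // 2 - upsample_rate % 2
overlap = kernel_size // (2 * upsample_rate)

# made-up integer input and kernel (kernel length equals kernel_size)
x = [(7 * n) % 11 - 5 for n in range(60)]
w = [1, -2, 3, 5, -1, 4, 2, -3]

# reference: the whole input at once
y = conv_transpose1d(x, w, upsample_rate, padding)

# chunked: pad the input, run overlapping chunks, trim more output per chunk
chunk_size = 20
padded_x = [0] * overlap + x + [0] * overlap
chunked_y = []
for start in range(0, len(x), chunk_size):
    chunk = padded_x[start:start + chunk_size + 2 * overlap]
    chunked_y += conv_transpose1d(chunk, w, upsample_rate, extra_samples)

print(len(y), chunked_y == y)
```

<p>Each chunk sees one extra input point on both sides, and <code class="language-plaintext highlighter-rouge">extra_samples</code> removes exactly the output points that those neighbors only partially determine.</p>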

<h1 id="fourier-transform">Fourier transform</h1>

<p>Many image and audio processing techniques still incorporate elements from classical signal processing. For audio, it’s common to extract a spectrogram to downsample the redundant audio signal while preserving the most relevant information. During training, this can be achieved using <code class="language-plaintext highlighter-rouge">torch.stft</code>. When deploying the model, however, there are challenges in tracing this operation across different CPU and GPU precisions. A workaround involves reformulating spectrogram extraction as a convolution with strides. This approach is already implemented in <a href="https://github.com/KinWaiCheuk/nnAudio">nnAudio</a>. Here, the STFT is executed with a precomputed convolution where the kernel size matches the number of FFT points and the stride equals the hop size between windows.</p>

<p>Extracting a spectrogram looks like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">nnAudio.features.stft</span> <span class="kn">import</span> <span class="n">STFT</span>

<span class="n">win_length</span> <span class="o">=</span> <span class="mi">1024</span>
<span class="n">downsample_rate</span> <span class="o">=</span> <span class="mi">320</span>
<span class="n">stft</span> <span class="o">=</span> <span class="n">STFT</span><span class="p">(</span>
    <span class="n">n_fft</span><span class="o">=</span><span class="n">win_length</span><span class="p">,</span>
    <span class="n">win_length</span><span class="o">=</span><span class="n">win_length</span><span class="p">,</span>
    <span class="n">hop_length</span><span class="o">=</span><span class="n">downsample_rate</span><span class="p">,</span>
    <span class="c1"># disabling padding
</span>    <span class="c1"># https://github.com/KinWaiCheuk/nnAudio/blob/9e9a4bad230d175f7ad541309829483f1274a3e5/Installation/nnAudio/features/stft.py#L278
</span>    <span class="n">center</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
    <span class="n">output_format</span><span class="o">=</span><span class="s">"Magnitude"</span><span class="p">,</span>
    <span class="n">pad_mode</span><span class="o">=</span><span class="s">"constant"</span>
<span class="p">)</span>

<span class="n">total_frames</span> <span class="o">=</span> <span class="mi">30</span>
<span class="n">total_samples</span> <span class="o">=</span> <span class="n">win_length</span> <span class="o">+</span> <span class="p">(</span><span class="n">total_frames</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">downsample_rate</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">total_samples</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">stft</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">y</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span> <span class="o">==</span> <span class="n">total_frames</span>
</code></pre></div></div>

<p>When computing the spectrogram in chunks, the same approach is applied as with causal convolution:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">chunk_size</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">chunk_size_samples</span> <span class="o">=</span> <span class="n">chunk_size</span> <span class="o">*</span> <span class="n">downsample_rate</span>
<span class="c1"># overlap between the frames
</span><span class="n">receptive</span> <span class="o">=</span> <span class="n">win_length</span> <span class="o">-</span> <span class="n">downsample_rate</span>
<span class="n">start</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">chunked_y_lst</span> <span class="o">=</span> <span class="p">[]</span>
<span class="k">while</span> <span class="n">start</span> <span class="o">&lt;=</span> <span class="n">x</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o">-</span> <span class="n">chunk_size_samples</span> <span class="o">-</span> <span class="n">receptive</span><span class="p">:</span>
    <span class="n">chunk</span> <span class="o">=</span> <span class="n">x</span><span class="p">[:,</span> <span class="n">start</span><span class="p">:</span><span class="n">start</span> <span class="o">+</span> <span class="n">chunk_size_samples</span> <span class="o">+</span> <span class="n">receptive</span><span class="p">]</span>
    <span class="n">chunked_y_lst</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">stft</span><span class="p">(</span><span class="n">chunk</span><span class="p">))</span>
    <span class="n">start</span> <span class="o">+=</span> <span class="n">chunk_size_samples</span>
<span class="n">chunked_y</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">(</span><span class="n">chunked_y_lst</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>
<span class="k">assert</span> <span class="n">chunked_y</span><span class="p">.</span><span class="n">shape</span> <span class="o">==</span> <span class="n">y</span><span class="p">.</span><span class="n">shape</span>
<span class="k">assert</span> <span class="n">torch</span><span class="p">.</span><span class="nb">all</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="nb">abs</span><span class="p">(</span><span class="n">chunked_y</span> <span class="o">-</span> <span class="n">y</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mf">1e-3</span><span class="p">)</span>
</code></pre></div></div>
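
<p>The reason the chunked frames match the full ones is that every analysis window starts at the same sample in both cases. A small bookkeeping sketch of this, assuming no centering or padding (as with <code class="language-plaintext highlighter-rouge">center=False</code> above); <code class="language-plaintext highlighter-rouge">frame_starts</code> is a hypothetical helper:</p>

```python
win_length = 1024
hop_length = 320
total_frames = 30
total_samples = win_length + (total_frames - 1) * hop_length


def frame_starts(num_samples, offset=0):
    """Start index of every full window that fits into num_samples."""
    num_frames = 1 + (num_samples - win_length) // hop_length
    return [offset + f * hop_length for f in range(num_frames)]


chunk_size = 5
chunk_size_samples = chunk_size * hop_length
receptive = win_length - hop_length

chunked_starts = []
start = 0
while start <= total_samples - chunk_size_samples - receptive:
    # each chunk carries `receptive` extra samples so its last window is complete
    chunked_starts += frame_starts(chunk_size_samples + receptive, offset=start)
    start += chunk_size_samples

print(chunked_starts == frame_starts(total_samples))
```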

<h1 id="inverse-fourier-transform">Inverse Fourier transform</h1>

<p>The inverse Fourier transform is surprisingly more complex. Let’s revisit the audio example to understand why. Overlapping frames create interesting patterns that influence which frames affect which samples in the output.</p>
<figure style="width: 400px" class="align-center">
  <img src="https://balacoon.com/assets/images/posts/streaming_inference/iverse_fourier_transform.png" alt="" />
  <figcaption class="figure-caption text-center">Overlapping frames in the Inverse Fourier transform</figcaption>
</figure>
<p>In the illustration above, a chunk of 6 frames is shown with framing parameters of <code class="language-plaintext highlighter-rouge">n_fft = 1024</code> and <code class="language-plaintext highlighter-rouge">hop_length = 320</code>. Since <code class="language-plaintext highlighter-rouge">n_fft % hop_length != 0</code>, the number of frames that affect the output samples varies between 3 and 4. For the edges of the input, it is fewer, and these regions should be considered the receptive field.</p>
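
<p>The varying overlap can be reproduced by counting how many frames cover each output sample. This is only a counting sketch for the 6-frame example above, not an actual iSTFT:</p>

```python
n_fft = 1024
hop_length = 320
num_frames = 6

num_samples = n_fft + (num_frames - 1) * hop_length
coverage = [0] * num_samples
for f in range(num_frames):
    # frame f spans n_fft samples starting at f * hop_length
    for t in range(f * hop_length, f * hop_length + n_fft):
        coverage[t] += 1

# away from the edges every sample is covered by 3 or 4 frames
receptive = n_fft - hop_length
interior = coverage[receptive:num_samples - receptive]
print(sorted(set(interior)))
# the very first and last samples are covered by a single frame
print(coverage[0], coverage[-1])
```

<p>Samples within <code class="language-plaintext highlighter-rouge">n_fft - hop_length</code> of either edge are missing contributions from frames that lie outside the chunk, which is why they are treated as the receptive field.</p>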

<p>Just like before, we start by executing the inverse Short-Time Fourier Transform (iSTFT) on the entire input:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">nnAudio.features.stft</span> <span class="kn">import</span> <span class="n">iSTFT</span>

<span class="n">win_length</span> <span class="o">=</span> <span class="mi">1024</span>
<span class="n">upsample_rate</span> <span class="o">=</span> <span class="mi">320</span>
<span class="n">istft</span> <span class="o">=</span> <span class="n">iSTFT</span><span class="p">(</span>
    <span class="n">n_fft</span><span class="o">=</span><span class="n">win_length</span><span class="p">,</span>
    <span class="n">win_length</span><span class="o">=</span><span class="n">win_length</span><span class="p">,</span>
    <span class="n">hop_length</span><span class="o">=</span><span class="n">upsample_rate</span><span class="p">,</span>
    <span class="n">center</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
<span class="p">)</span>

<span class="n">total_frames</span> <span class="o">=</span> <span class="mi">100</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">win_length</span><span class="o">//</span><span class="mi">2</span> <span class="o">+</span> <span class="mi">1</span><span class="p">,</span> <span class="n">total_frames</span><span class="p">,</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">istft</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">onesided</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>

<span class="n">receptive</span> <span class="o">=</span> <span class="n">win_length</span> <span class="o">-</span> <span class="n">upsample_rate</span>
<span class="c1"># to have an even upsampling, we should slice half of the receptive.
# this leaves some edge effects however 
</span><span class="n">y_padded</span> <span class="o">=</span> <span class="n">y</span><span class="p">[:,</span> <span class="n">receptive</span> <span class="o">//</span> <span class="mi">2</span><span class="p">:</span><span class="o">-</span><span class="n">receptive</span> <span class="o">//</span> <span class="mi">2</span><span class="p">]</span>
<span class="k">assert</span> <span class="n">y_padded</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o">==</span> <span class="n">total_frames</span> <span class="o">*</span> <span class="n">upsample_rate</span>
<span class="c1"># keeping only output that doesn't have edge effects,
# we need to slice off entire receptive field
</span><span class="n">y</span> <span class="o">=</span> <span class="n">y</span><span class="p">[:,</span> <span class="n">receptive</span><span class="p">:</span><span class="o">-</span><span class="n">receptive</span><span class="p">]</span>
</code></pre></div></div>

<p>Running in chunks requires manually slicing the output of the iSTFT
to keep only the regions without boundary effects:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="n">chunked_y_lst</span> <span class="o">=</span> <span class="p">[]</span>
<span class="n">start</span> <span class="o">=</span> <span class="mi">0</span>
<span class="n">chunk_size</span> <span class="o">=</span> <span class="mi">5</span>
<span class="n">overlap</span> <span class="o">=</span> <span class="nb">int</span><span class="p">(</span><span class="n">win_length</span> <span class="o">/</span> <span class="n">upsample_rate</span><span class="p">)</span>
<span class="k">while</span> <span class="n">start</span> <span class="o">&lt;=</span> <span class="n">total_frames</span> <span class="o">-</span> <span class="n">chunk_size</span><span class="p">:</span>
    <span class="n">chunk</span> <span class="o">=</span> <span class="n">x</span><span class="p">[:,</span> <span class="p">:,</span> <span class="n">start</span><span class="p">:</span><span class="n">start</span> <span class="o">+</span> <span class="n">chunk_size</span> <span class="o">+</span> <span class="n">overlap</span><span class="p">]</span>
    <span class="n">chunk_out</span> <span class="o">=</span> <span class="n">istft</span><span class="p">(</span><span class="n">chunk</span><span class="p">,</span> <span class="n">onesided</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
    <span class="n">left</span> <span class="o">=</span> <span class="n">win_length</span> <span class="o">-</span> <span class="n">win_length</span> <span class="o">%</span> <span class="n">upsample_rate</span>
    <span class="n">right</span> <span class="o">=</span> <span class="n">receptive</span>
    <span class="n">chunk_out</span> <span class="o">=</span> <span class="n">chunk_out</span><span class="p">[:,</span> <span class="n">left</span><span class="p">:</span><span class="o">-</span><span class="n">right</span><span class="p">]</span>

    <span class="n">chunked_y_lst</span><span class="p">.</span><span class="n">append</span><span class="p">(</span><span class="n">chunk_out</span><span class="p">)</span>
    <span class="n">start</span> <span class="o">+=</span> <span class="n">chunk_size</span>

<span class="n">chunked_y</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">(</span><span class="n">chunked_y_lst</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>
<span class="c1"># some of the original output is lost
</span><span class="n">lost</span> <span class="o">=</span> <span class="n">upsample_rate</span> <span class="o">-</span> <span class="n">win_length</span> <span class="o">%</span> <span class="n">upsample_rate</span>
<span class="n">y_with_lost</span> <span class="o">=</span> <span class="n">y</span><span class="p">[:,</span> <span class="n">lost</span><span class="p">:</span><span class="n">chunked_y</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o">+</span> <span class="n">lost</span><span class="p">]</span>

<span class="k">assert</span> <span class="n">torch</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="nb">abs</span><span class="p">(</span><span class="n">chunked_y</span> <span class="o">-</span> <span class="n">y_with_lost</span><span class="p">))</span> <span class="o">&lt;</span> <span class="mf">1e-5</span>
</code></pre></div></div>
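
<p>The slicing offsets are easier to follow when written out as absolute sample positions. The sketch below only does the index bookkeeping for the loop above (no iSTFT is computed) and assumes that a chunk of <code class="language-plaintext highlighter-rouge">n</code> frames produces <code class="language-plaintext highlighter-rouge">win_length + (n - 1) * upsample_rate</code> samples:</p>

```python
win_length = 1024
upsample_rate = 320
total_frames = 100
chunk_size = 5

receptive = win_length - upsample_rate
overlap = win_length // upsample_rate
left = win_length - win_length % upsample_rate
right = receptive

kept = []  # (first, last + 1) absolute sample index kept from each chunk
start = 0
while start <= total_frames - chunk_size:
    # slicing past the end of the input simply yields fewer frames
    frames = min(chunk_size + overlap, total_frames - start)
    chunk_len = win_length + (frames - 1) * upsample_rate
    offset = start * upsample_rate
    kept.append((offset + left, offset + chunk_len - right))
    start += chunk_size

full_len = win_length + (total_frames - 1) * upsample_rate
lost = upsample_rate - win_length % upsample_rate
print(kept[0], kept[-1], full_len)
```

<p>The kept ranges tile the output without gaps, start <code class="language-plaintext highlighter-rouge">lost</code> samples after the left receptive field, and end exactly where the right receptive field of the full output begins.</p>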

<h1 id="putting-it-all-together-transposed-conv">Putting it all together (Transposed Conv)</h1>

<p>Let’s integrate everything and examine how layers might interact in a typical audio-to-audio stack, where audio is first downsampled to a latent representation and then upsampled back. The model might look something like this:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">nnAudio.features.stft</span> <span class="kn">import</span> <span class="n">STFT</span>

<span class="k">def</span> <span class="nf">create_conv_stack</span><span class="p">(</span><span class="n">kernels</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">int</span><span class="p">],</span> <span class="n">in_channels</span><span class="p">:</span> <span class="nb">int</span> <span class="o">=</span> <span class="mi">1</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">:</span>
    <span class="s">"""
    Creates a dummy convolutional stack
    """</span>
    <span class="n">lst</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">kernels</span><span class="p">):</span>
        <span class="n">ic</span> <span class="o">=</span> <span class="n">in_channels</span> <span class="k">if</span> <span class="n">i</span> <span class="o">==</span> <span class="mi">0</span> <span class="k">else</span> <span class="mi">1</span>
        <span class="n">lst</span><span class="p">.</span><span class="n">append</span><span class="p">(</span>
            <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Conv1d</span><span class="p">(</span>
                <span class="n">ic</span><span class="p">,</span>  <span class="c1"># input channels
</span>                <span class="mi">1</span><span class="p">,</span>  <span class="c1"># output channels
</span>                <span class="n">k</span><span class="p">,</span>
                <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
                <span class="n">padding</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
            <span class="p">)</span>
        <span class="p">)</span>
    <span class="k">return</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span><span class="o">*</span><span class="n">lst</span><span class="p">)</span>

<span class="k">def</span> <span class="nf">create_transpose_conv</span><span class="p">(</span><span class="n">upsample_rate</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">ConvTranspose1d</span><span class="p">:</span>
    <span class="s">"""
    Creates dummy transposed convolutional layer that upsamples the input signal
    by given ratio
    """</span>
    <span class="n">kernel_size</span> <span class="o">=</span> <span class="n">upsample_rate</span> <span class="o">*</span> <span class="mi">2</span> <span class="o">+</span> <span class="n">upsample_rate</span> <span class="o">%</span> <span class="mi">2</span>
    <span class="n">extra_samples</span> <span class="o">=</span> <span class="p">(</span><span class="n">kernel_size</span> <span class="o">-</span> <span class="n">upsample_rate</span><span class="p">)</span> <span class="o">*</span> <span class="mi">3</span> <span class="o">//</span> <span class="mi">2</span> <span class="o">-</span> <span class="n">upsample_rate</span> <span class="o">%</span> <span class="mi">2</span>
    <span class="k">return</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">ConvTranspose1d</span><span class="p">(</span>
        <span class="n">in_channels</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
        <span class="n">out_channels</span><span class="o">=</span><span class="mi">1</span><span class="p">,</span>
        <span class="n">kernel_size</span><span class="o">=</span><span class="n">kernel_size</span><span class="p">,</span>
        <span class="n">stride</span><span class="o">=</span><span class="n">upsample_rate</span><span class="p">,</span>
        <span class="n">padding</span><span class="o">=</span><span class="n">extra_samples</span><span class="p">,</span>
        <span class="n">bias</span><span class="o">=</span><span class="bp">False</span>
    <span class="p">)</span>

<span class="s">"""
Finally the model, which is a stack of
STFT -&gt; conv_stack -&gt; in_conv -&gt; upsampling -&gt; conv_stack -&gt; upsampling -&gt; conv_stack -&gt; out_conv
"""</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
    <span class="n">STFT</span><span class="p">(</span>
        <span class="n">n_fft</span><span class="o">=</span><span class="mi">1024</span><span class="p">,</span>
        <span class="n">win_length</span><span class="o">=</span><span class="mi">1024</span><span class="p">,</span>
        <span class="n">hop_length</span><span class="o">=</span><span class="mi">320</span><span class="p">,</span>
        <span class="n">center</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>  <span class="c1"># disables the padding
</span>        <span class="n">output_format</span><span class="o">=</span><span class="s">"Magnitude"</span><span class="p">,</span>
        <span class="n">pad_mode</span><span class="o">=</span><span class="s">"constant"</span>
    <span class="p">),</span>
    <span class="n">create_conv_stack</span><span class="p">([</span><span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">],</span> <span class="mi">513</span><span class="p">),</span>
    <span class="n">create_conv_stack</span><span class="p">([</span><span class="mi">7</span><span class="p">]),</span>
    <span class="n">create_transpose_conv</span><span class="p">(</span><span class="mi">5</span><span class="p">),</span>
    <span class="n">create_conv_stack</span><span class="p">([</span><span class="mi">3</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">11</span><span class="p">]),</span>
    <span class="n">create_transpose_conv</span><span class="p">(</span><span class="mi">64</span><span class="p">),</span>
    <span class="n">create_conv_stack</span><span class="p">([</span><span class="mi">3</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">11</span><span class="p">]),</span>
    <span class="n">create_conv_stack</span><span class="p">([</span><span class="mi">7</span><span class="p">]),</span>
<span class="p">)</span>
</code></pre></div></div>
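
<p>Before looking at receptive fields, it can be useful to trace how the sequence length changes through this stack. The sketch below is plain arithmetic under the assumptions used throughout the post (unpadded stride-1 <code class="language-plaintext highlighter-rouge">Conv1d</code>, STFT without centering, transposed convolution with <code class="language-plaintext highlighter-rouge">dilation=1</code> and <code class="language-plaintext highlighter-rouge">output_padding=0</code>); the function names are hypothetical.</p>

```python
def stft_len(n, n_fft=1024, hop=320):
    # frames produced without centering or padding
    return 1 + (n - n_fft) // hop


def conv_stack_len(n, kernels):
    # an unpadded stride-1 convolution drops (kernel_size - 1) points per layer
    for k in kernels:
        n -= k - 1
    return n


def transpose_conv_len(n, upsample_rate):
    # same kernel size and padding as create_transpose_conv
    kernel_size = upsample_rate * 2 + upsample_rate % 2
    extra_samples = (kernel_size - upsample_rate) * 3 // 2 - upsample_rate % 2
    return (n - 1) * upsample_rate - 2 * extra_samples + kernel_size


def model_out_len(num_samples):
    n = stft_len(num_samples)
    n = conv_stack_len(n, [5, 5, 5])
    n = conv_stack_len(n, [7])
    n = transpose_conv_len(n, 5)
    n = conv_stack_len(n, [3, 5, 11])
    n = transpose_conv_len(n, 64)
    n = conv_stack_len(n, [3, 5, 11])
    n = conv_stack_len(n, [7])
    return n


num_samples = 1024 + 29 * 320  # enough audio for 30 STFT frames
print(model_out_len(num_samples))
```

<p>Every layer consumes some context, so the output is much shorter than the input; each additional hop of 320 input samples adds 320 output samples.</p>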

<p>Given what we’ve learnt so far, let’s define the receptive field for each layer
to understand how much context on the left and on the right our model requires:

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># define receptive fields of each layer in the stack
# for each layer, specify (left_receptive, right_receptive, resolution)
# notice how STFT and transposed conv layers change the resolution
</span><span class="n">receptives_with_resolutions</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">(</span><span class="mi">1024</span> <span class="o">-</span> <span class="mi">320</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>  <span class="c1"># STFT: requires (win_len - hop_size) on the left 
</span>    <span class="p">((</span><span class="mi">5</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">320</span><span class="p">),</span>  <span class="c1"># conv_stack: 3 causal layers with receptive of (kernel_size - 1) 
</span>    <span class="p">(</span><span class="mi">7</span> <span class="o">//</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">7</span> <span class="o">//</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">320</span><span class="p">),</span> <span class="c1"># in_conv: non-causal conv layer with symmetric receptive of kernel_size // 2
</span>    <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">320</span><span class="p">),</span>  <span class="c1"># transposed conv: receptive is defined by overlap = kernel_size // (2 * upsample_rate) == 1
</span>    <span class="p">(</span><span class="mi">2</span> <span class="o">+</span> <span class="mi">4</span> <span class="o">+</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>  <span class="c1"># conv_stack: 3 causal layers with varying kernel_size
</span>    <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">64</span><span class="p">),</span>  <span class="c1"># transposed conv: different upsample rate, but it doesn't affect overlap
</span>    <span class="p">(</span><span class="mi">2</span> <span class="o">+</span> <span class="mi">4</span> <span class="o">+</span> <span class="mi">10</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>  <span class="c1"># conv_stack: another 3 causal layers with varying kernel_size
</span>    <span class="p">(</span><span class="mi">7</span> <span class="o">//</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">7</span> <span class="o">//</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="c1"># out_conv: another non-causal conv layer with symmetric receptive
</span><span class="p">]</span>
<span class="c1"># bring all to the same resolution (samples)
</span><span class="n">receptives</span> <span class="o">=</span> <span class="p">[(</span><span class="n">left</span> <span class="o">*</span> <span class="n">res</span><span class="p">,</span> <span class="n">right</span> <span class="o">*</span> <span class="n">res</span><span class="p">)</span> <span class="k">for</span> <span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">,</span> <span class="n">res</span> <span class="ow">in</span> <span class="n">receptives_with_resolutions</span><span class="p">]</span>
<span class="c1"># this is our overlap in case of chunked synthesis
</span><span class="n">left</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">([</span><span class="n">left</span> <span class="k">for</span> <span class="n">left</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">receptives</span><span class="p">])</span>  <span class="c1"># 6931
# and this is our architectural latency, in case of online processing
</span><span class="n">right</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">([</span><span class="n">right</span> <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">right</span> <span class="ow">in</span> <span class="n">receptives</span><span class="p">])</span>  <span class="c1"># 1347
</span></code></pre></div></div>
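
<p>The same bookkeeping can be wrapped in a small helper. This is a minimal sketch in plain Python; the function name <code class="language-plaintext highlighter-rouge">total_receptive</code> is an assumption made for illustration, and the layer list is copied from the snippet above.</p>

```python
from typing import List, Tuple


def total_receptive(layers: List[Tuple[int, int, int]]) -> Tuple[int, int]:
    """Sum per-layer (left, right) receptive fields, in samples.

    Each layer is (left_receptive, right_receptive, resolution), where
    resolution is the number of samples per step at that layer.
    `total_receptive` is a hypothetical helper name, not part of any library.
    """
    left = sum(l * res for l, _, res in layers)
    right = sum(r * res for _, r, res in layers)
    return left, right


layers = [
    (1024 - 320, 0, 1),     # STFT
    ((5 - 1) * 3, 0, 320),  # conv_stack, 3 causal layers
    (7 // 2, 7 // 2, 320),  # in_conv, non-causal
    (1, 1, 320),            # transposed conv
    (2 + 4 + 10, 0, 64),    # conv_stack, kernels 3, 5, 11
    (1, 1, 64),             # transposed conv
    (2 + 4 + 10, 0, 1),     # conv_stack, kernels 3, 5, 11
    (7 // 2, 7 // 2, 1),    # out_conv, non-causal
]
left, right = total_receptive(layers)
print(left, right)  # 6931 1347
```

<p>Keeping the layers in one list makes it easy to recompute the overlap and latency whenever a kernel size or stride changes.</p>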

<p>To confirm that we calculated the receptive field correctly, we can run the model on a dummy input and check the input/output dimensionality. The input length should be compatible with model downsampling. In our case, the input length should generate a whole number of outputs for the <code class="language-plaintext highlighter-rouge">STFT</code> layer. This is the case if <code class="language-plaintext highlighter-rouge">input_length = hop_length * N + win_length</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">input_length</span> <span class="o">=</span> <span class="mi">320</span> <span class="o">*</span> <span class="mi">50</span> <span class="o">+</span> <span class="mi">1024</span>
<span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">zeros</span><span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="n">input_length</span><span class="p">)</span>
<span class="n">y</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>
<span class="c1"># if we calculated receptive field correctly,
# model should strip off left/right receptives
</span><span class="k">assert</span> <span class="n">input_length</span> <span class="o">-</span> <span class="n">left</span> <span class="o">-</span> <span class="n">right</span> <span class="o">==</span> <span class="n">y</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">2</span><span class="p">)</span>
</code></pre></div></div>

<p>As in the previous examples, inference in chunks requires shifting the input by <code class="language-plaintext highlighter-rouge">chunk_size</code>, which is the input size minus the receptive fields. In our case, <code class="language-plaintext highlighter-rouge">chunk_size</code> is <code class="language-plaintext highlighter-rouge">320 * N + 1024 - left - right</code>. For <code class="language-plaintext highlighter-rouge">N == 50</code>, it is <code class="language-plaintext highlighter-rouge">8746</code>. There is a problem, however: we can only shift the input by a multiple of the stride of the downsampling layer(s), in our case by <code class="language-plaintext highlighter-rouge">M * 320</code>. For most architectures, there is no way to satisfy both requirements:</p>
<ul>
  <li>Shift by <code class="language-plaintext highlighter-rouge">chunk_size</code></li>
  <li>Shift by <code class="language-plaintext highlighter-rouge">M * downsampling_stride</code></li>
</ul>

<p>To overcome this issue, we’ll have to drop some extra samples from the output in order to be able to do inference in chunks:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">chunk_size</span> <span class="o">=</span> <span class="n">input_length</span> <span class="o">-</span> <span class="n">left</span> <span class="o">-</span> <span class="n">right</span>  <span class="c1"># 8746
</span><span class="n">extra_to_drop</span> <span class="o">=</span> <span class="n">chunk_size</span> <span class="o">%</span> <span class="mi">320</span>  <span class="c1"># 106
</span></code></pre></div></div>
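
<p>The arithmetic for an arbitrary number of frames can be sketched as follows. The helper name <code class="language-plaintext highlighter-rouge">plan_chunks</code> and its return layout are assumptions made for illustration; the formulas are the ones used in the text.</p>

```python
from typing import Tuple


def plan_chunks(
    n_frames: int, win_length: int, hop_length: int, left: int, right: int
) -> Tuple[int, int, int, int]:
    """Return (input_length, chunk_size, extra_to_drop, shift) in samples.

    `plan_chunks` is a hypothetical helper, not part of any library.
    """
    # input that yields a whole number of STFT frames
    input_length = hop_length * n_frames + win_length
    # output samples produced by one model call
    chunk_size = input_length - left - right
    # samples that have to be discarded so that the shift is a multiple of the hop
    extra_to_drop = chunk_size % hop_length
    # how far the input window moves between consecutive model calls
    shift = chunk_size - extra_to_drop
    return input_length, chunk_size, extra_to_drop, shift


print(plan_chunks(50, 1024, 320, 6931, 1347))  # (17024, 8746, 106, 8640)
```

<p>The resulting shift of 8640 samples equals 27 hops, so every chunk starts on a frame boundary of the STFT.</p>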

<p>Now we are all set to check if inference in chunks works. As before, for a dummy input we will run inference on the whole input, and then on chunks of the input and compare the results.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># some random input
</span><span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span> <span class="mi">100000</span><span class="p">))</span>
<span class="c1"># inference on the whole input
</span><span class="n">y</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>

<span class="c1"># running inference on the first chunk
</span><span class="n">y_chunk_1</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">x</span><span class="p">[:,</span> <span class="p">:</span><span class="n">input_length</span><span class="p">])</span>
<span class="c1"># running inference on the second chunk, carefully shifting input
</span><span class="n">start</span> <span class="o">=</span> <span class="n">chunk_size</span> <span class="o">-</span> <span class="n">extra_to_drop</span>
<span class="n">y_chunk_2</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">x</span><span class="p">[:,</span> <span class="n">start</span><span class="p">:(</span><span class="n">start</span> <span class="o">+</span> <span class="n">input_length</span><span class="p">)])</span>
<span class="c1"># now we need to slice off the `extra_to_drop` from both outputs
</span><span class="n">y_chunk_list</span> <span class="o">=</span> <span class="p">[</span><span class="n">y_chunk_1</span><span class="p">,</span> <span class="n">y_chunk_2</span><span class="p">]</span>
<span class="n">y_chunk_list</span> <span class="o">=</span> <span class="p">[</span><span class="n">chunk</span><span class="p">[:,</span> <span class="p">:,</span> <span class="p">:</span><span class="o">-</span><span class="n">extra_to_drop</span><span class="p">]</span> <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">y_chunk_list</span><span class="p">]</span>
<span class="n">y_chunk</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">(</span><span class="n">y_chunk_list</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">2</span><span class="p">)</span>

<span class="c1"># finally compare the chunked inference to the original one.
# we run inference only on two chunks, omitting handling the
# padding on the right for simplicity. So the comparison
# is done only for beginning of the output.
</span><span class="n">diff</span> <span class="o">=</span> <span class="n">y</span><span class="p">[:,</span> <span class="p">:,</span> <span class="p">:</span><span class="n">y_chunk</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">2</span><span class="p">)]</span> <span class="o">-</span> <span class="n">y_chunk</span>
<span class="c1"># should be the same
</span><span class="k">assert</span> <span class="n">torch</span><span class="p">.</span><span class="nb">all</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="nb">abs</span><span class="p">(</span><span class="n">diff</span><span class="p">)</span> <span class="o">&lt;</span> <span class="mf">1e-3</span><span class="p">)</span>
</code></pre></div></div>
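
<p>The two-chunk check generalizes to any number of chunks. The sketch below only computes index ranges, so it runs without a model; <code class="language-plaintext highlighter-rouge">chunk_spans</code> is a hypothetical helper, and it skips the trailing part of the signal that does not fill a whole chunk, just like the example above.</p>

```python
from typing import Iterator, Tuple


def chunk_spans(
    total_length: int, input_length: int, left: int, right: int, hop_length: int
) -> Iterator[Tuple[int, int, int, int]]:
    """Yield (in_start, in_end, out_start, out_end) for every full chunk.

    Input indices address the waveform; output indices address the output of
    a single run over the whole waveform. Hypothetical helper for illustration.
    """
    chunk_size = input_length - left - right
    # samples kept per chunk, after dropping the remainder modulo the hop
    keep = chunk_size - chunk_size % hop_length
    start = 0
    while start + input_length <= total_length:
        # the kept outputs of a chunk starting at `start` line up with
        # outputs [start, start + keep) of the full run
        yield start, start + input_length, start, start + keep
        start += keep


spans = list(chunk_spans(100000, 17024, 6931, 1347, 320))
print(len(spans))  # 10
print(spans[1])    # (8640, 25664, 8640, 17280)
```

<p>Consecutive output ranges touch without gaps or overlaps, which is what allows the kept parts to be concatenated directly.</p>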

<h1 id="putting-it-all-together-istft">Putting it all together (iSTFT)</h1>

<p>Now, let’s do the same, but use the inverse STFT (iSTFT) for upsampling instead of transposed convolutions. Once again, we will verify that carefully computing the total receptive field of all layers allows us to run inference on chunks of input.</p>

<p>Model definition:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">typing</span> <span class="kn">import</span> <span class="n">List</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="kn">from</span> <span class="nn">nnAudio.features.stft</span> <span class="kn">import</span> <span class="n">STFT</span><span class="p">,</span> <span class="n">iSTFT</span>

<span class="k">def</span> <span class="nf">create_conv_stack</span><span class="p">(</span><span class="n">kernels</span><span class="p">:</span> <span class="n">List</span><span class="p">[</span><span class="nb">int</span><span class="p">],</span> <span class="n">in_channels</span><span class="p">:</span> <span class="nb">int</span><span class="p">,</span> <span class="n">out_channels</span><span class="p">:</span> <span class="nb">int</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">:</span>
    <span class="s">"""
    Creates a dummy convolutional stack
    """</span>
    <span class="n">lst</span> <span class="o">=</span> <span class="p">[]</span>
    <span class="k">for</span> <span class="n">i</span><span class="p">,</span> <span class="n">k</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">kernels</span><span class="p">):</span>
        <span class="n">ic</span> <span class="o">=</span> <span class="n">in_channels</span> <span class="k">if</span> <span class="n">i</span> <span class="o">==</span> <span class="mi">0</span> <span class="k">else</span> <span class="n">out_channels</span>
        <span class="n">lst</span><span class="p">.</span><span class="n">append</span><span class="p">(</span>
            <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Conv1d</span><span class="p">(</span>
                <span class="n">ic</span><span class="p">,</span>  <span class="c1"># input channels
</span>                <span class="n">out_channels</span><span class="p">,</span>  <span class="c1"># output channels
</span>                <span class="n">k</span><span class="p">,</span>
                <span class="n">bias</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
                <span class="n">padding</span><span class="o">=</span><span class="mi">0</span><span class="p">,</span>
            <span class="p">)</span>
        <span class="p">)</span>
    <span class="k">return</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span><span class="o">*</span><span class="n">lst</span><span class="p">)</span>

<span class="k">class</span> <span class="nc">iSTFTWrap</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Module</span><span class="p">):</span>
    <span class="k">def</span> <span class="nf">__init__</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">n_fft</span><span class="p">,</span> <span class="n">hop_length</span><span class="p">):</span>
        <span class="nb">super</span><span class="p">().</span><span class="n">__init__</span><span class="p">()</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_istft</span> <span class="o">=</span> <span class="n">iSTFT</span><span class="p">(</span>
            <span class="n">n_fft</span><span class="o">=</span><span class="n">n_fft</span><span class="p">,</span>
            <span class="n">win_length</span><span class="o">=</span><span class="n">n_fft</span><span class="p">,</span>
            <span class="n">hop_length</span><span class="o">=</span><span class="n">hop_length</span><span class="p">,</span>
            <span class="n">center</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>
        <span class="p">)</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_left</span> <span class="o">=</span> <span class="n">n_fft</span> <span class="o">-</span> <span class="n">n_fft</span> <span class="o">%</span> <span class="n">hop_length</span>
        <span class="bp">self</span><span class="p">.</span><span class="n">_right</span> <span class="o">=</span> <span class="n">n_fft</span> <span class="o">-</span> <span class="n">hop_length</span>
    
    <span class="k">def</span> <span class="nf">forward</span><span class="p">(</span><span class="bp">self</span><span class="p">,</span> <span class="n">x</span><span class="p">:</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="n">torch</span><span class="p">.</span><span class="n">Tensor</span><span class="p">:</span>
        <span class="c1"># the input x is (batch x 1026 x frames)
</span>        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">view</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">0</span><span class="p">),</span> <span class="mi">2</span><span class="p">,</span> <span class="n">x</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">)</span> <span class="o">//</span> <span class="mi">2</span><span class="p">,</span> <span class="n">x</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">2</span><span class="p">))</span>  <span class="c1"># (batch x 2 x 513 x frames)
</span>        <span class="n">x</span> <span class="o">=</span> <span class="n">x</span><span class="p">.</span><span class="n">permute</span><span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>  <span class="c1"># (batch x 513 x frames x 2)
</span>        <span class="n">y</span> <span class="o">=</span> <span class="bp">self</span><span class="p">.</span><span class="n">_istft</span><span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">onesided</span><span class="o">=</span><span class="bp">True</span><span class="p">)</span>
        <span class="k">return</span> <span class="n">y</span><span class="p">[:,</span> <span class="bp">self</span><span class="p">.</span><span class="n">_left</span><span class="p">:</span><span class="o">-</span><span class="bp">self</span><span class="p">.</span><span class="n">_right</span><span class="p">]</span>

<span class="s">"""
Finally the model, which is a stack of
STFT -&gt; conv_stack -&gt; out_conv -&gt; iSTFT
"""</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">nn</span><span class="p">.</span><span class="n">Sequential</span><span class="p">(</span>
    <span class="n">STFT</span><span class="p">(</span>
        <span class="n">n_fft</span><span class="o">=</span><span class="mi">1024</span><span class="p">,</span>
        <span class="n">win_length</span><span class="o">=</span><span class="mi">1024</span><span class="p">,</span>
        <span class="n">hop_length</span><span class="o">=</span><span class="mi">320</span><span class="p">,</span>
        <span class="n">center</span><span class="o">=</span><span class="bp">False</span><span class="p">,</span>  <span class="c1"># disables the padding
</span>        <span class="n">output_format</span><span class="o">=</span><span class="s">"Magnitude"</span><span class="p">,</span>
        <span class="n">pad_mode</span><span class="o">=</span><span class="s">"constant"</span>
    <span class="p">),</span>
    <span class="n">create_conv_stack</span><span class="p">([</span><span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">,</span> <span class="mi">5</span><span class="p">],</span> <span class="mi">513</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>
    <span class="n">create_conv_stack</span><span class="p">([</span><span class="mi">7</span><span class="p">],</span> <span class="mi">1</span><span class="p">,</span> <span class="mi">513</span> <span class="o">*</span> <span class="mi">2</span><span class="p">),</span>
    <span class="n">iSTFTWrap</span><span class="p">(</span>
        <span class="n">n_fft</span><span class="o">=</span><span class="mi">1024</span><span class="p">,</span>
        <span class="n">hop_length</span><span class="o">=</span><span class="mi">320</span><span class="p">,</span>
    <span class="p">)</span>
<span class="p">)</span>

<span class="c1"># define receptive fields of each layer in the stack
# for each layer, specify (left_receptive, right_receptive, resolution)
</span><span class="n">receptives_with_resolutions</span> <span class="o">=</span> <span class="p">[</span>
    <span class="p">(</span><span class="mi">1024</span> <span class="o">-</span> <span class="mi">320</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span>  <span class="c1"># STFT: requires (win_len - hop_size) on the left 
</span>    <span class="p">((</span><span class="mi">5</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="mi">3</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">320</span><span class="p">),</span>  <span class="c1"># conv_stack: 3 causal layers with receptive of (kernel_size - 1) 
</span>    <span class="p">(</span><span class="mi">7</span> <span class="o">//</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">7</span> <span class="o">//</span> <span class="mi">2</span><span class="p">,</span> <span class="mi">320</span><span class="p">),</span> <span class="c1"># projection conv: non-causal conv layer with symmetric receptive of kernel_size // 2
</span>    <span class="p">(</span><span class="mi">960</span><span class="p">,</span> <span class="mi">0</span><span class="p">,</span> <span class="mi">1</span><span class="p">),</span> <span class="c1"># iSTFT: requires hop_length * overlap * 2 on the left and receptive on the right
</span><span class="p">]</span>
<span class="c1"># bring all to the same resolution (samples)
</span><span class="n">receptives</span> <span class="o">=</span> <span class="p">[(</span><span class="n">left</span> <span class="o">*</span> <span class="n">res</span><span class="p">,</span> <span class="n">right</span> <span class="o">*</span> <span class="n">res</span><span class="p">)</span> <span class="k">for</span> <span class="n">left</span><span class="p">,</span> <span class="n">right</span><span class="p">,</span> <span class="n">res</span> <span class="ow">in</span> <span class="n">receptives_with_resolutions</span><span class="p">]</span>
<span class="c1"># this is our overlap in case of chunked synthesis
</span><span class="n">left</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">([</span><span class="n">left</span> <span class="k">for</span> <span class="n">left</span><span class="p">,</span> <span class="n">_</span> <span class="ow">in</span> <span class="n">receptives</span><span class="p">])</span>  <span class="c1"># 6464
# and this is our architectural latency, in case of online processing
</span><span class="n">right</span> <span class="o">=</span> <span class="nb">sum</span><span class="p">([</span><span class="n">right</span> <span class="k">for</span> <span class="n">_</span><span class="p">,</span> <span class="n">right</span> <span class="ow">in</span> <span class="n">receptives</span><span class="p">])</span>  <span class="c1"># 960
</span></code></pre></div></div>

<p>Now, let’s compute the expected input/output lengths. Notice that <code class="language-plaintext highlighter-rouge">extra_to_drop</code> is zero here, because the chunk size is divisible by <code class="language-plaintext highlighter-rouge">hop_length</code>.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">input_length</span> <span class="o">=</span> <span class="mi">320</span> <span class="o">*</span> <span class="mi">50</span> <span class="o">+</span> <span class="mi">1024</span>
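<span class="n">chunk_size</span> <span class="o">=</span> <span class="n">input_length</span> <span class="o">-</span> <span class="n">left</span> <span class="o">-</span> <span class="n">right</span>  <span class="c1"># 9600
</span><span class="n">extra_to_drop</span> <span class="o">=</span> <span class="n">chunk_size</span> <span class="o">%</span> <span class="mi">320</span>  <span class="c1"># 0, nothing to drop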
</span></code></pre></div></div>
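
<p>As a sanity check, the chunk size can also be derived by following the tensor length through each layer. This is a sketch under one stated assumption: the iSTFT without centering produces <code class="language-plaintext highlighter-rouge">(frames - 1) * hop_length + n_fft</code> samples via overlap-add. The function name <code class="language-plaintext highlighter-rouge">trace_lengths</code> is hypothetical.</p>

```python
from typing import Sequence, Tuple


def trace_lengths(
    input_length: int,
    n_fft: int = 1024,
    hop: int = 320,
    kernels: Sequence[int] = (5, 5, 5, 7),
) -> Tuple[int, int, int]:
    """Return (frames_before_istft, istft_length, trimmed_length).

    Hypothetical helper that mirrors the STFT -> convs -> iSTFT model above.
    """
    # STFT without padding: one frame per hop once a full window fits
    frames = (input_length - n_fft) // hop + 1
    # every unpadded Conv1d shortens the sequence by (kernel_size - 1)
    for k in kernels:
        frames -= k - 1
    # assumption: overlap-add output length of an iSTFT with center=False
    istft_length = (frames - 1) * hop + n_fft
    # the slicing done in iSTFTWrap.forward
    strip_left = n_fft - n_fft % hop
    strip_right = n_fft - hop
    return frames, istft_length, istft_length - strip_left - strip_right


print(trace_lengths(320 * 50 + 1024))  # (33, 11264, 9600)
```

<p>The trimmed length of 9600 samples matches the chunk size obtained from the receptive fields.</p>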

<p>Finally, let’s confirm that running inference on chunks produces the same result as when processing the entire input.</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># some random input
</span><span class="n">x</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">randn</span><span class="p">((</span><span class="mi">1</span><span class="p">,</span> <span class="mi">100000</span><span class="p">))</span>
<span class="c1"># inference on the whole input
</span><span class="n">y</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">x</span><span class="p">)</span>

<span class="c1"># running inference on the first chunk
</span><span class="n">y_chunk_1</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">x</span><span class="p">[:,</span> <span class="p">:</span><span class="n">input_length</span><span class="p">])</span>
<span class="c1"># running inference on the second chunk, carefully shifting input
</span><span class="n">start</span> <span class="o">=</span> <span class="n">chunk_size</span> <span class="o">-</span> <span class="n">extra_to_drop</span>
<span class="n">y_chunk_2</span> <span class="o">=</span> <span class="n">model</span><span class="p">(</span><span class="n">x</span><span class="p">[:,</span> <span class="n">start</span><span class="p">:(</span><span class="n">start</span> <span class="o">+</span> <span class="n">input_length</span><span class="p">)])</span>
<span class="c1"># now we need to slice off the `extra_to_drop` from both outputs
</span><span class="n">y_chunk_list</span> <span class="o">=</span> <span class="p">[</span><span class="n">y_chunk_1</span><span class="p">,</span> <span class="n">y_chunk_2</span><span class="p">]</span>
<span class="k">if</span> <span class="n">extra_to_drop</span> <span class="o">&gt;</span> <span class="mi">0</span><span class="p">:</span>
    <span class="n">y_chunk_list</span> <span class="o">=</span> <span class="p">[</span><span class="n">chunk</span><span class="p">[:,</span> <span class="p">:</span><span class="o">-</span><span class="n">extra_to_drop</span><span class="p">]</span> <span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">y_chunk_list</span><span class="p">]</span>
<span class="n">y_chunk</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">cat</span><span class="p">(</span><span class="n">y_chunk_list</span><span class="p">,</span> <span class="n">dim</span><span class="o">=</span><span class="mi">1</span><span class="p">)</span>

<span class="n">y_trimmed</span> <span class="o">=</span> <span class="n">y</span><span class="p">[:,</span> <span class="p">:</span><span class="n">y_chunk</span><span class="p">.</span><span class="n">size</span><span class="p">(</span><span class="mi">1</span><span class="p">)]</span>
<span class="c1"># finally compare the chunked inference to the original one.
# we run inference only on two chunks, omitting handling the
# padding on the right for simplicity. So the comparison
# is done only for beginning of the output.
</span><span class="n">diff</span> <span class="o">=</span> <span class="n">y_trimmed</span> <span class="o">-</span> <span class="n">y_chunk</span>
<span class="c1"># should be the same
</span><span class="k">assert</span> <span class="n">torch</span><span class="p">.</span><span class="n">mean</span><span class="p">(</span><span class="n">torch</span><span class="p">.</span><span class="nb">abs</span><span class="p">(</span><span class="n">diff</span><span class="p">))</span> <span class="o">&lt;</span> <span class="mf">1e-5</span>
</code></pre></div></div>

<h1 id="takeaways">Takeaways</h1>

<p>In this post, we delved into how to perform streaming inference (a.k.a., inference in chunks) for models consisting of various convolutional layers. This insight is crucial for building online audio or image processing applications. It boils down to carefully computing the receptive field of the resulting model and managing overlap between input chunks. When multiple layers are combined, however, additional attention should be paid to feeding your model with input of proper length, which in turn requires more sophisticated input/output handling. Hopefully, the explanations and code snippets provided will help you navigate these challenges in your own architectural designs.</p>]]></content><author><name>Balacoon</name></author><category term="Blog" /><category term="text-to-speech" /><category term="streaming" /><category term="convolution" /><summary type="html"><![CDATA[In this post, we explore how to apply convolutional layers to infinitely long inputs, specifically focusing on how to process inputs in chunks to minimize latency. For instance, in text-to-speech applications, instead of synthesizing an entire sentence at once, we prefer to generate and play back audio in segments. While recurrent or autoregressive networks are inherently causal and thus well-suited for streaming processing, convolutional layers present more challenges and require careful handling.]]></summary></entry><entry><title type="html">Dissecting BARK</title><link href="https://balacoon.com/blog/dissecting_bark/" rel="alternate" type="text/html" title="Dissecting BARK" /><published>2023-08-15T00:00:00+00:00</published><updated>2023-08-15T05:20:02+00:00</updated><id>https://balacoon.com/blog/dissecting_bark</id><content type="html" xml:base="https://balacoon.com/blog/dissecting_bark/"><![CDATA[<p>Things started to get stale after the ubiquitous switch to Neural Text-to-Speech.
A long-awaited leap forward was introduced thanks to ideas from the blossoming image generation field.
<a href="https://github.com/neonbjb/tortoise-tts">TorToiSe</a> adopted techniques
introduced in DALL-E<a href="#1">[1]</a> and simultaneously pushed the frontiers of:</p>

<ul>
  <li>expressive speech synthesis</li>
  <li>paralinguistic generation</li>
  <li>voice cloning</li>
</ul>

<p>These improvements became possible due to more powerful generative models and an unprecedented training scale.
Instead of the traditional 20 hours of speech and models with 10-50M parameters,
TorToiSe used thousands of hours and hundreds of millions of trainable parameters.
It sparked a whole series of papers that developed the approach further,
with AudioLM<a href="#2">[2]</a>, VALL-E<a href="#3">[3]</a>, and SPEAR-TTS<a href="#4">[4]</a> being the most prominent among others.
In this post, we will explore the internals of
<a href="https://github.com/suno-ai/bark">BARK</a>, an open-source implementation of this new speech synthesis paradigm.</p>

<h2 id="architecture">Architecture</h2>

<p>BARK quite closely follows the VALL-E architecture, utilizing a large autoregressive transformer
decoder to operate on discrete speech representations.</p>

<h3 id="tokenization">Tokenization</h3>

<p>Discrete speech representations or tokens are obtained from
two pre-trained models - HuBERT<a href="#5">[5]</a> and Encodec<a href="#6">[6]</a>:</p>
<figure style="width: 600px" class="align-center">
  <img src="https://balacoon.com/assets/images/bark_tokenization.png" alt="" />
  <figcaption class="figure-caption text-center">Tokenization in BARK</figcaption>
</figure>
<p>HuBERT is a self-supervised model that converts audio to discrete “semantic tokens.”
The HuBERT training objective makes the extracted representations (quasi-)independent of speaker and prosody.
Think of them as frame-level pseudo-phoneme annotations.</p>

<p>Encodec is a neural audio codec that works on multi-level discrete representations extracted
from audio in an auto-encoding manner using vector quantization.
Think of mel-spectrogram frames, but discrete.
Low-order representations are called “coarse tokens,” and higher-order ones are “fine tokens.”</p>

<h3 id="acoustic-modeling">Acoustic Modeling</h3>

<p>By discretizing speech into tokens, we reformulate the speech production task
as predicting coarse and fine tokens from semantic tokens.
This formulation allows us to use a state-of-the-art generative model: an autoregressive transformer decoder.
The same kind of model is used in ChatGPT.
A lot of data is needed to train this beast: billions of tokens, or thousands of hours of speech.
This magnitude is only possible in a multi-speaker scenario,
which requires conditioning the generative model on speaker identity.
This is done via “prompting” (natural language processing parallels intensify),
where the prompt is an utterance by the same speaker that precedes the current one.
The prompt carries information about speaker identity, recording conditions,
and even some high-level prosody aspects, but not the actual content.
Fine tokens only refine the acoustic information from coarse tokens and don’t need modeling that is as powerful.
A transformer encoder (i.e. a parallel architecture) is used for fine token prediction to speed things up.</p>
<figure style="width: 600px" class="align-center">
  <img src="https://balacoon.com/assets/images/bark_acoustic.png" alt="" />
  <figcaption class="figure-caption text-center">Acoustic modeling in BARK</figcaption>
</figure>
<p>Coarse and fine tokens have multiple levels. To plug them into the models, token sequences are simply flattened.</p>
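<p>To make the flattening concrete, here is a minimal sketch rather than BARK's actual code. It assumes that every level uses a codebook of 1024 entries and that levels are interleaved frame by frame; the function names and the offset scheme are illustrative choices.</p>

```python
from typing import List


def flatten_codebooks(codes: List[List[int]], codebook_size: int = 1024) -> List[int]:
    """Interleave per-level token rows into one sequence, frame by frame."""
    n_frames = len(codes[0])
    if any(len(level) != n_frames for level in codes):
        raise ValueError("all levels must have the same number of frames")
    flat = []
    for t in range(n_frames):
        for level, row in enumerate(codes):
            # Offset each level so tokens from different codebooks never collide.
            flat.append(row[t] + level * codebook_size)
    return flat


def unflatten_codebooks(flat: List[int], n_levels: int, codebook_size: int = 1024) -> List[List[int]]:
    """Invert flatten_codebooks: split the sequence back into per-level rows."""
    if len(flat) % n_levels != 0:
        raise ValueError("sequence length must be a multiple of n_levels")
    return [
        [token - level * codebook_size for token in flat[level::n_levels]]
        for level in range(n_levels)
    ]


codes = [[1, 2, 3], [4, 5, 6]]  # 2 levels, 3 frames
flat = flatten_codebooks(codes)
print(flat)  # [1, 1028, 2, 1029, 3, 1030]
```

<p>With two levels, the flattened sequence is twice as long as the number of frames, which is why the sequence length grows with every level that is flattened.</p>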

<h3 id="neural-frontend">Neural Frontend</h3>

<p>Semantic tokens are extracted from the audio.
For a text-to-speech task, they need to be predicted from the input text.
It is done with yet another autoregressive transformer decoder.
This task also requires a powerful generative model since it contributes to the resulting intonation.
Converting text to semantic tokens is a sequence-to-sequence task,
where the correspondence between the two sequences is defined by the durations of sounds and the overall speech pace.</p>
<figure style="width: 300px" class="align-center">
  <img src="https://balacoon.com/assets/images/bark_text.png" alt="" />
  <figcaption class="figure-caption text-center">Neural Frontend in BARK</figcaption>
</figure>
<p>A sufficiently sizeable neural frontend has no problems generating semantic tokens
for inputs in multiple languages or consistently annotated paralinguistics (laughs, breaths, gasps, etc.).</p>

<h2 id="performance">Performance</h2>

<p>Generating speech with BARK is not fast.
Let’s look at which components are the most demanding.
Measurements were taken on a GPU, averaging inference time over multiple utterances of roughly 10 seconds each.</p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th style="text-align: center">Function</th>
      <th style="text-align: center">Parameters</th>
      <th style="text-align: center">Average Inference time, s</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>HuBERT</td>
      <td style="text-align: center">audio → semantic tokens</td>
      <td style="text-align: center">95M</td>
      <td style="text-align: center">0.035</td>
    </tr>
    <tr>
      <td>Encodec Decoder</td>
      <td style="text-align: center">coarse/fine tokens → audio</td>
      <td style="text-align: center">15M</td>
      <td style="text-align: center">0.025</td>
    </tr>
    <tr>
      <td>Text AR Transformer Decoder</td>
      <td style="text-align: center">text → semantic tokens</td>
      <td style="text-align: center">446M</td>
      <td style="text-align: center">11</td>
    </tr>
    <tr>
      <td>AR Transformer Decoder</td>
      <td style="text-align: center">semantic tokens → coarse tokens</td>
      <td style="text-align: center">328M</td>
      <td style="text-align: center"><strong>45</strong></td>
    </tr>
    <tr>
      <td>Transformer Encoder</td>
      <td style="text-align: center">coarse tokens → fine tokens</td>
      <td style="text-align: center">319M</td>
      <td style="text-align: center">0.37</td>
    </tr>
  </tbody>
</table>

<p>Flattening coarse/fine tokens forces the AR Transformer Decoder to work on a very long context.
This makes it by far the slowest component of the whole pipeline.</p>
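<p>A back-of-the-envelope sketch of why the sequence gets long. The frame rate of 75 Hz and the two flattened coarse levels are assumptions made for illustration, not numbers measured in this post.</p>

```python
def autoregressive_steps(seconds: float, frame_rate_hz: float = 75.0, n_levels: int = 2) -> int:
    """Number of tokens an autoregressive decoder has to emit one by one.

    frame_rate_hz and n_levels are illustrative assumptions.
    """
    return round(seconds * frame_rate_hz * n_levels)


print(autoregressive_steps(10.0))              # 1500 steps with two flattened levels
print(autoregressive_steps(10.0, n_levels=1))  # 750 steps if only one level were generated
```

<p>Each step is a full forward pass through the decoder, so doubling the number of flattened levels doubles the number of sequential passes, on top of the extra attention cost of the longer context.</p>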

<h2 id="speeding-things-up">Speeding things up</h2>

<p>It’s not the first time slow autoregressive modeling has ruled out real-time speech generation.
The autoregressive version of WaveNet<a href="#7">[7]</a> also puzzled the community back in the day with outstanding
quality at the cost of extremely slow inference.
But things got faster thanks to both inference optimizations and modeling advances.
The same applies in this case.
For example, NaturalSpeech 2<a href="#8">[8]</a> proposes employing a parallel diffusion model instead of
an autoregressive transformer decoder as a possible mitigation.
Below, we take a brief look at possible inference optimizations.</p>

<h3 id="rwkv"><a href="https://github.com/BlinkDL/RWKV-LM">RWKV</a></h3>

<p>The quadratic complexity of attention fuels interest in so-called attention-free architectures.
RWKV is a parallelizable RNN with the performance of
a classical transformer and linear complexity with respect to context length.
Here are some code samples that let us drag-race a dummy RWKV model and estimate the expected gains.
Creating a dummy model of 350M parameters (from within <a href="https://github.com/BlinkDL/RWKV-LM/tree/main/RWKV-v4">RWKV-v4</a>):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="nn">src.model</span> <span class="kn">import</span> <span class="n">GPT</span><span class="p">,</span> <span class="n">GPTConfig</span>
<span class="kn">import</span> <span class="nn">torch</span>
<span class="n">model</span> <span class="o">=</span> <span class="n">GPT</span><span class="p">(</span><span class="n">GPTConfig</span><span class="p">(</span><span class="mi">12096</span><span class="p">,</span> <span class="mi">1024</span><span class="p">,</span> <span class="n">model_type</span><span class="o">=</span><span class="s">"RWKV"</span><span class="p">,</span> <span class="n">n_layer</span><span class="o">=</span><span class="mi">24</span><span class="p">,</span> <span class="n">n_embd</span><span class="o">=</span><span class="mi">1024</span><span class="p">))</span>
<span class="n">device</span> <span class="o">=</span> <span class="n">torch</span><span class="p">.</span><span class="n">device</span><span class="p">(</span><span class="s">"cuda"</span><span class="p">)</span>
<span class="n">model</span><span class="p">.</span><span class="n">to</span><span class="p">(</span><span class="n">device</span><span class="p">)</span>
<span class="n">torch</span><span class="p">.</span><span class="n">save</span><span class="p">(</span><span class="n">model</span><span class="p">.</span><span class="n">state_dict</span><span class="p">(),</span> <span class="s">"rwkv_gpt2_medium.pth"</span><span class="p">)</span>
</code></pre></div></div>

<p>Run generation with a dummy model using <a href="https://pypi.org/project/rwkv/">rwkv from pip</a>:</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">time</span>
<span class="kn">import</span> <span class="nn">os</span>
<span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">'RWKV_JIT_ON'</span><span class="p">]</span> <span class="o">=</span> <span class="s">'1'</span>
<span class="n">os</span><span class="p">.</span><span class="n">environ</span><span class="p">[</span><span class="s">"RWKV_CUDA_ON"</span><span class="p">]</span> <span class="o">=</span> <span class="s">'1'</span>

<span class="kn">from</span> <span class="nn">rwkv.model</span> <span class="kn">import</span> <span class="n">RWKV</span>

<span class="n">model</span> <span class="o">=</span> <span class="n">RWKV</span><span class="p">(</span><span class="n">model</span><span class="o">=</span><span class="s">'rwkv_gpt2_medium.pth'</span><span class="p">,</span> <span class="n">strategy</span><span class="o">=</span><span class="s">'cuda fp16'</span><span class="p">)</span>
<span class="n">state</span> <span class="o">=</span> <span class="bp">None</span>
<span class="c1"># warm up
</span><span class="k">for</span> <span class="n">i</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">100</span><span class="p">):</span>
    <span class="n">_</span><span class="p">,</span> <span class="n">state</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">forward</span><span class="p">([</span><span class="mi">100</span><span class="p">],</span> <span class="n">state</span><span class="p">)</span>
<span class="c1"># measure generation of 1k tokens
</span><span class="n">state</span> <span class="o">=</span> <span class="bp">None</span>
<span class="n">start</span> <span class="o">=</span> <span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span>
<span class="k">for</span> <span class="n">_</span> <span class="ow">in</span> <span class="nb">range</span><span class="p">(</span><span class="mi">1000</span><span class="p">):</span>
    <span class="n">_</span><span class="p">,</span> <span class="n">state</span> <span class="o">=</span> <span class="n">model</span><span class="p">.</span><span class="n">forward</span><span class="p">([</span><span class="mi">100</span><span class="p">],</span> <span class="n">state</span><span class="p">)</span>
<span class="k">print</span><span class="p">(</span><span class="n">time</span><span class="p">.</span><span class="n">time</span><span class="p">()</span> <span class="o">-</span> <span class="n">start</span><span class="p">)</span>
</code></pre></div></div>

<p>It takes ~12 seconds, which is already a valuable improvement.
The gain will become even more prominent with longer contexts and bigger models.</p>

<h3 id="faster-transformer"><a href="https://github.com/NVIDIA/FasterTransformer">Faster Transformer</a></h3>

<p>Faster Transformer is a library by NVIDIA that implements heavily optimized inference for Large Language Models.
It runs in a dedicated Docker container with custom CUDA kernels for particular models.
Here is a small snippet to check out the performance on a 350M-parameter GPT model:</p>

<div class="language-bash highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c"># pulling container and building FT for your GPU</span>
docker pull nvcr.io/nvidia/pytorch:22.09-py3
nvidia-docker run <span class="nt">-ti</span> <span class="nt">--shm-size</span> 5g <span class="nt">--rm</span> nvcr.io/nvidia/pytorch:22.09-py3 bash
git clone https://github.com/NVIDIA/FasterTransformer.git
<span class="nb">mkdir</span> <span class="nt">-p</span> FasterTransformer/build
<span class="nb">cd </span>FasterTransformer/build
git submodule init <span class="o">&amp;&amp;</span> git submodule update
<span class="c"># from https://github.com/NVIDIA/FasterTransformer/issues/90 for RTX3090</span>
cmake <span class="nt">-DSM</span><span class="o">=</span>86 <span class="nt">-DCMAKE_BUILD_TYPE</span><span class="o">=</span>Release <span class="nt">-DBUILD_PYT</span><span class="o">=</span>ON ..
make <span class="nt">-j12</span>

<span class="c"># pulling 350M GPT model</span>
pip <span class="nb">install</span> <span class="nt">-r</span> ../examples/pytorch/gpt/requirement.txt
git clone https://huggingface.co/gpt2-medium
curl <span class="nt">-s</span> https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash
apt-get <span class="nb">install </span>git-lfs
<span class="nb">cd </span>gpt2-medium <span class="o">&amp;&amp;</span> git lfs pull <span class="o">&amp;&amp;</span> <span class="nb">cd</span> ..
python ../examples/pytorch/gpt/utils/huggingface_gpt_convert.py <span class="nt">-i</span> gpt2-medium/ <span class="se">\</span>
    <span class="nt">-o</span> ../models/huggingface-models/c-model/gpt2-medium <span class="nt">-i_g</span> 1
<span class="nb">echo</span> <span class="s2">"hello world"</span> <span class="o">&gt;</span> context.txt

<span class="c"># run generation on a GPU for 1k tokens</span>
<span class="nb">time </span><span class="nv">CUDA_VISIBLE_DEVICES</span><span class="o">=</span>1 python ../examples/pytorch/gpt/multi_gpu_gpt_example.py <span class="se">\</span>
    <span class="nt">--ckpt_path</span> ../models/huggingface-models/c-model/gpt2-medium/1-gpu/ <span class="se">\</span>
    <span class="nt">--time</span> <span class="nt">--inference_data_type</span> fp16 <span class="nt">--tensor_para_size</span> 1 <span class="nt">--pipeline_para_size</span> 1 <span class="se">\</span>
    <span class="nt">--beam_width</span> 1 <span class="nt">--top_k</span> 1 <span class="nt">--top_p</span> 0 <span class="nt">--temperature</span> 1.0 <span class="nt">--return_cum_log_probs</span> 0 <span class="se">\</span>
    <span class="nt">--output_len</span> 1000  <span class="nt">--vocab_file</span> gpt2-medium/vocab.json <span class="nt">--merges_file</span> gpt2-medium/merges.txt  <span class="se">\</span>
    <span class="nt">--max_batch_size</span> 1 <span class="nt">--min_length</span> 1000 <span class="nt">--lib_path</span> lib/libth_transformer.so <span class="se">\</span>
    <span class="nt">--sample_input_file</span> context.txt
</code></pre></div></div>

<p>It takes only 1.5 seconds, a mind-blowing speedup compared to the original performance.</p>

<h2 id="references">References</h2>

<p><a id="1">[1]</a>
<a href="https://cdn.openai.com/papers/dall-e-2.pdf">Hierarchical Text-Conditional Image Generation with CLIP Latents</a></p>

<p><a id="2">[2]</a>
<a href="https://arxiv.org/pdf/2209.03143.pdf">AudioLM: a Language Modeling Approach to Audio Generation</a></p>

<p><a id="3">[3]</a>
<a href="https://arxiv.org/pdf/2301.02111.pdf">Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers</a></p>

<p><a id="4">[4]</a>
<a href="https://arxiv.org/abs/2302.03540">Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision</a></p>

<p><a id="5">[5]</a>
<a href="https://arxiv.org/abs/2106.07447">HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units</a></p>

<p><a id="6">[6]</a>
<a href="https://arxiv.org/abs/2210.13438">High Fidelity Neural Audio Compression</a></p>

<p><a id="7">[7]</a>
<a href="https://arxiv.org/abs/1609.03499">WaveNet: A Generative Model for Raw Audio</a></p>

<p><a id="8">[8]</a>
<a href="https://arxiv.org/abs/2304.09116">NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers</a></p>]]></content><author><name>Balacoon</name></author><category term="Blog" /><category term="text-to-speech" /><category term="BARK" /><category term="zero-shot" /><summary type="html"><![CDATA[Things started to get stale after the ubiquitous switch to Neural Text-to-Speech. A long-awaited leap forward was introduced thanks to ideas from the blossoming image generation field. TorToiSe adopted techniques introduced in DALL-E[1] and simultaneously pushed frontiers of:]]></summary></entry><entry><title type="html">Zero-shot speech generation benchmark</title><link href="https://balacoon.com/blog/zero-shot-benchmark/" rel="alternate" type="text/html" title="Zero-shot speech generation benchmark" /><published>2023-07-31T00:00:00+00:00</published><updated>2023-07-31T21:20:02+00:00</updated><id>https://balacoon.com/blog/zero-shot-benchmark</id><content type="html" xml:base="https://balacoon.com/blog/zero-shot-benchmark/"><![CDATA[<p>Synthesizing speech with a speaker identity not seen during training presents a significant challenge. Traditionally, achieving this required extensive training on many speakers to ensure a continuous speaker space<a href="#1">[1]</a>. The most performant methods, such as <a href="https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI/blob/main/docs/README.en.md">RVC</a>, still need minimal fine-tuning with ~10 minutes of target speaker data to achieve reasonable quality. However, the approaches leveraging the power of big models are gaining momentum. For instance, Microsoft’s VALL-E<a href="#2">[2]</a> boldly claims to clone a speaker’s voice with just 3 seconds of speech as a reference. 
In this blog post, we present a benchmark of voice conversion technologies, comparing <a href="https://play.google.com/store/apps/details?id=com.app.vc&amp;hl=en_US">Revoice</a> to widely used zero-shot VC baselines.</p>

<h2 id="testsets">Testsets</h2>

<p>Typical evaluations of Voice Conversion systems rely on objective metrics collected by running conversion on unseen multi-speaker corpora.
We design the evaluation to be insightful for the <code class="language-plaintext highlighter-rouge">Revoice</code> use case. We use multi-speaker corpora as the source (input) audio and
a library of speakers from the Revoice app as the target (reference) audio. Input audio is derived from:</p>

<ul>
  <li><a href="https://datashare.ed.ac.uk/handle/10283/3443">VCTK</a> - classical voice conversion benchmark. Clean recordings, multiple accents.</li>
  <li>DAPS corpus<a href="#3">[3]</a> - emulated mobile device recordings in various conditions. This dataset more closely resembles
the audio quality we receive as a Voice Conversion service.</li>
</ul>

<h2 id="metrics">Metrics</h2>

<p>We measure three model-based objective metrics for the converted speech:</p>

<ul>
  <li>Speaker similarity: we measure the cosine distance between latent speaker representations of the converted speech and the reference audio.
We use the ECAPA<a href="#4">[4]</a> speaker encoder by <a href="https://huggingface.co/speechbrain/spkrec-ecapa-voxceleb">Speechbrain</a> to extract the speaker representations.</li>
  <li>Speech intelligibility: we run speech recognition with <a href="https://huggingface.co/nvidia/stt_en_conformer_transducer_xlarge#model-architecture">Conformer-Transducer ASR model by NVIDIA</a> on the converted speech and measure the Character Error Rate with respect to the transcription.</li>
  <li>Naturalness: we use a pre-trained Mean Opinion Score estimator UTMOS<a href="#5">[5]</a> <a href="https://huggingface.co/spaces/sarulab-speech/UTMOS-demo">released</a> by the authors.</li>
</ul>
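<p>As a rough illustration of the first two metrics, here is a minimal sketch in plain Python. It is not the evaluation code used for this benchmark: it assumes speaker embeddings and ASR transcripts have already been extracted, takes cosine distance to mean one minus cosine similarity, and uses hypothetical function names.</p>

```python
import math
from typing import Sequence


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """1 - cosine similarity between two speaker embeddings (assumed convention)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("embeddings must be non-zero")
    return 1.0 - dot / norm


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Levenshtein distance between the strings, divided by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    prev = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_char == hyp_char else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(reference)


print(cosine_distance([1.0, 0.0], [0.0, 1.0]))        # 1.0 for orthogonal embeddings
print(character_error_rate("hello world", "hallo world"))  # one substitution out of 11 characters
```

<p>In a real pipeline, the embeddings would come from the speaker encoder and the hypothesis from the ASR model; text normalization before computing the error rate (casing, punctuation) is left out of this sketch.</p>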

<h2 id="baselines">Baselines</h2>

<p>We select two widely used systems as the baselines. Both are trained on a large number of speakers and are capable of zero-shot speech generation.</p>

<ul>
  <li>YourTTS<a href="#6">[6]</a> (from 2021) is a VITS-based model with adjustments, trained on the VCTK and LibriTTS datasets.
It uses an invertible normalizing flow to disentangle speaker identity from the spectrogram representation.
A handy tutorial on how to run it can be found <a href="https://colab.research.google.com/drive/1gjdwOKCZuavPn_5oy8QA01sKmXpEq5AZ?usp=sharing#scrollTo=jeQ9O6llm8D5">here</a>.</li>
  <li><a href="https://github.com/suno-ai/bark">BARK</a> (from 2023) is a large (350M parameters) decoder-only transformer that generates speech from “semantic tokens.” Those are self-supervised representations extracted with HuBERT<a href="#7">[7]</a> that effectively disentangle content (semantics) from speaker characteristics. Running Voice Conversion with BARK is not straightforward, because the semantic token extractor has not been released.
Suno.ai only provides prediction of semantic tokens from text. Fortunately, there are <a href="https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer">community-contributed semantic token extractors</a> that are compatible with BARK. This addition makes it possible to <a href="https://github.com/gitmylo/bark-voice-cloning-HuBERT-quantizer/blob/master/notebook.ipynb">create your own voice profiles</a> and perform voice conversion, adjusting semantic tokens and voice profiles in <a href="https://github.com/serp-ai/bark-with-voice-clone/blob/main/generate.ipynb">this notebook</a>.</li>
</ul>

<p>The autoregressive transformer decoder in BARK is significantly slower than parallel conversion in YourTTS, but it has greater potential due to the model’s scalability.</p>

<h2 id="results">Results</h2>

<p>We present the results of the evaluations in the tables below.
Here is the performance of the systems on <code class="language-plaintext highlighter-rouge">VCTK</code>:</p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th style="text-align: center">Naturalness(MOS↑)</th>
      <th style="text-align: center">Intelligibility(CER, %↓)</th>
      <th style="text-align: center">Similarity(inverted cosine distance↓)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>no model</td>
      <td style="text-align: center">4.06</td>
      <td style="text-align: center">0.17</td>
      <td style="text-align: center">-</td>
    </tr>
    <tr>
      <td>YourTTS<a href="#0">*</a></td>
      <td style="text-align: center">3.21</td>
      <td style="text-align: center"><strong>1.08</strong></td>
      <td style="text-align: center">0.613</td>
    </tr>
    <tr>
      <td>BARK</td>
      <td style="text-align: center"><strong>3.49</strong></td>
      <td style="text-align: center">2.58</td>
      <td style="text-align: center">0.692</td>
    </tr>
    <tr>
      <td>Revoice</td>
      <td style="text-align: center">3.45</td>
      <td style="text-align: center">1.36</td>
      <td style="text-align: center">0.614</td>
    </tr>
  </tbody>
</table>

<p>And performance on <code class="language-plaintext highlighter-rouge">DAPS</code>:</p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th style="text-align: center">Naturalness(MOS↑)</th>
      <th style="text-align: center">Intelligibility(CER, %↓)</th>
      <th style="text-align: center">Similarity(inverted cosine distance↓)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>no model</td>
      <td style="text-align: center">2.39</td>
      <td style="text-align: center">2.755</td>
      <td style="text-align: center">-</td>
    </tr>
    <tr>
      <td>YourTTS</td>
      <td style="text-align: center">2.08</td>
      <td style="text-align: center">26.7</td>
      <td style="text-align: center">0.655</td>
    </tr>
    <tr>
      <td>BARK</td>
      <td style="text-align: center"><strong>2.85</strong></td>
      <td style="text-align: center"><strong>14.77</strong></td>
      <td style="text-align: center">0.738</td>
    </tr>
    <tr>
      <td>Revoice</td>
      <td style="text-align: center">2.81</td>
      <td style="text-align: center">16.56</td>
      <td style="text-align: center"><strong>0.564</strong></td>
    </tr>
  </tbody>
</table>

<p>Here is a small example of how the systems actually sound. For these inputs:</p>

<figure>
<figcaption>
  Source audio
</figcaption>
<audio controls="">
  <source src="/assets/posts/vc_benchmark/source.mp3" type="audio/mpeg" />
</audio>
</figure>
<figure>
<figcaption>
  Reference of target voice
</figcaption>
<audio controls="">
  <source src="/assets/demo_audio/vc/kratos_short.mp3" type="audio/mpeg" />
</audio>
</figure>

<p>The systems produce the following outputs:</p>

<figure>
<figcaption>
  YourTTS
</figcaption>
<audio controls="">
  <source src="/assets/posts/vc_benchmark/yourtts.mp3" type="audio/mpeg" />
</audio>
</figure>
<figure>
<figcaption>
  BARK
</figcaption>
<audio controls="">
  <source src="/assets/posts/vc_benchmark/bark.mp3" type="audio/mpeg" />
</audio>
</figure>
<figure>
<figcaption>
  Revoice
</figcaption>
<audio controls="">
  <source src="/assets/posts/vc_benchmark/revoice.mp3" type="audio/mpeg" />
</audio>
</figure>

<p>YourTTS shows excellent performance on <code class="language-plaintext highlighter-rouge">VCTK</code> but degrades significantly on noisier inputs.
BARK consistently delivers clean and intelligible audio, but its speaker similarity lags behind.
Revoice competes with BARK in terms of naturalness and intelligibility while making a leap
forward in terms of speaker similarity.</p>

<h2 id="references">References</h2>
<p><a id="1">[1]</a>
<a href="https://arxiv.org/abs/1806.04558">Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis</a></p>

<p><a id="2">[2]</a>
<a href="https://arxiv.org/abs/2301.02111">Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers</a></p>

<p><a id="3">[3]</a>
<a href="https://ccrma.stanford.edu/~gautham/Site/daps_files/mysore-spl2015.pdf">Can we Automatically Transform Speech Recorded on Common Consumer Devices in Real-World Environments into Professional Production Quality Speech? — A Dataset, Insights, and Challenges</a></p>

<p><a id="4">[4]</a>
<a href="https://arxiv.org/abs/2104.01466">ECAPA-TDNN Embeddings for Speaker Diarization</a></p>

<p><a id="5">[5]</a>
<a href="https://arxiv.org/abs/2204.02152">UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022</a></p>

<p><a id="6">[6]</a>
<a href="https://arxiv.org/abs/2112.02418">YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone</a></p>

<p><a id="7">[7]</a>
<a href="https://arxiv.org/pdf/2106.07447.pdf">HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units</a></p>

<hr />

<p><a id="0">*</a>
YourTTS uses <code class="language-plaintext highlighter-rouge">VCTK</code> in training, which might give slightly overly optimistic results.</p>]]></content><author><name>Balacoon</name></author><category term="Blog" /><category term="text-to-speech" /><category term="voice conversion" /><category term="zero-shot" /><summary type="html"><![CDATA[Synthesizing speech with a speaker identity not seen during training presents a significant challenge. Traditionally, achieving this required extensive training on many speakers to ensure a continuous speaker space[1]. The most performant methods, such as RVC, still need minimal fine-tuning with ~10 minutes of target speaker data to achieve reasonable quality. However, the approaches leveraging the power of big models are gaining momentum. For instance, Microsoft’s VALL-E[2] boldly claims to clone a speaker’s voice with just 3 seconds of speech as a reference. In this blog post, we aim to present a benchmark of voice conversion technologies, comparing Revoice to the widely spread zero-shot VC baselines.]]></summary></entry><entry><title type="html">Ukrainian language in Balacoon</title><link href="https://balacoon.com/blog/uk_release/" rel="alternate" type="text/html" title="Ukrainian language in Balacoon" /><published>2023-07-08T00:00:00+00:00</published><updated>2023-07-08T21:20:02+00:00</updated><id>https://balacoon.com/blog/uk_release</id><content type="html" xml:base="https://balacoon.com/blog/uk_release/"><![CDATA[<p>Fast, convenient, and high-quality neural synthesis of Ukrainian speech is now available in Balacoon.
<a href="https://balacoon.com/use/">Integrating the synthesis library</a> has never been easier:
<a href="https://balacoon.com/use/tts/package">dependency-free Python packages</a> for real-time generation on CPU,
a <a href="https://balacoon.com/use/tts/service">Docker container</a> capable of handling dozens of parallel requests on GPU,
and <a href="https://balacoon.com/blog/on-device/">the fastest on-device synthesizer</a>, which enables real-time synthesis even on a Raspberry Pi.
All of this is now freely available for Ukrainian under the <a href="https://balacoon.com/license.html">MIT license</a>.</p>

<figure>
<figcaption>
  Example:
</figcaption>
<audio controls="">
  <source src="/assets/demo_audio/uk_example.mp3" type="audio/mpeg" />
</audio>
</figure>

<p>Generate more examples in our <a href="https://huggingface.co/spaces/balacoon/tts">online demo</a>.</p>

<h1 id="реліз">Release</h1>

<p>We thank the <a href="https://t.me/speech_synthesis_uk">Ukrainian speech synthesis community</a>
for creating, promoting, and maintaining <a href="https://github.com/egorsmkv/ukrainian-tts-datasets">open datasets</a>.
Based on them, we built <a href="https://huggingface.co/balacoon/tts">2 models</a>:</p>

<ul>
  <li><a href="https://arxiv.org/abs/2203.16852">JETS</a> - a standard multi-speaker model with a 24 kHz sampling rate.
It supports all available voices: Lada, Tetiana, and Mykyta. It is distributed
in two variants:
    <ul>
      <li><code class="language-plaintext highlighter-rouge">uk_ltm_jets_cpu.addon</code> - for synthesis on CPU using the <code class="language-plaintext highlighter-rouge">balacoon_tts</code> Python package.</li>
      <li><code class="language-plaintext highlighter-rouge">uk_ltm_jets_gpu.addon</code> - for the service in a Docker container using a GPU.</li>
    </ul>
  </li>
  <li><a href="https://balacoon.com/blog/on-device/#introducing-light">Light</a> - a lightweight model with a 16 kHz sampling rate for ultra-fast generation.
It supports the Tetiana voice. Only the CPU variant is distributed: <code class="language-plaintext highlighter-rouge">uk_tetiana_light_cpu.addon</code>.</li>
</ul>

<p>For text analysis, all models use <a href="https://github.com/balacoon/espeak-ng">espeak</a>
with an additional <a href="https://github.com/lang-uk/ukrainian-word-stress">stress dictionary</a>.</p>

<h1 id="чого-бракує">What is missing</h1>

<p>It would be good to update the approach to text analysis, namely:</p>

<ul>
  <li>Build text normalization rules using Finite-State Transducers.
Balacoon <a href="https://github.com/balacoon/learn_to_normalize">supports this technology</a>
and has an <a href="https://github.com/balacoon/en_us_normalization">implementation for English</a>.
This approach is easier to maintain and extend by adding new rules.</li>
  <li>Stress assignment requires a solution based on contextualized pronunciation generation<a href="#1">[1]</a>,<a href="#2">[2]</a>.
Unfortunately, this approach is <a href="https://github.com/balacoon/learn_to_pronounce">not yet supported in Balacoon</a>, but we hope to add
a general solution that would be useful for all languages with homographs.
As a temporary workaround, users can specify the desired stress using
<a href="https://uk.wikipedia.org/wiki/%D0%90%D0%BA%D1%83%D1%82">“acute accents”</a>.</li>
</ul>
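<p>A small sketch of marking stress by hand. It assumes that stress is written with the Unicode combining acute accent (U+0301) placed right after the stressed vowel; whether the synthesizer expects exactly this encoding is an assumption, and <code>mark_stress</code> is a hypothetical helper, not part of any Balacoon package.</p>

```python
COMBINING_ACUTE = "\u0301"  # assumed encoding of the stress mark
UKRAINIAN_VOWELS = "аеєиіїоуюяАЕЄИІЇОУЮЯ"


def mark_stress(word: str, vowel_index: int) -> str:
    """Return the word with a combining acute accent after the chosen vowel."""
    if vowel_index not in range(len(word)):
        raise IndexError("vowel_index is outside of the word")
    if word[vowel_index] not in UKRAINIAN_VOWELS:
        raise ValueError("the character at vowel_index is not a vowel")
    return word[: vowel_index + 1] + COMBINING_ACUTE + word[vowel_index + 1 :]


# The homograph "замок": stress on the first syllable means "castle",
# stress on the second syllable means "lock".
print(mark_stress("замок", 1))
print(mark_stress("замок", 3))
```

<p>Marking stress this way resolves the homograph explicitly, which is exactly the decision a contextualized pronunciation model would otherwise have to make from the surrounding text.</p>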

<p>Support for multilingual synthesis is also planned.
Currently, the problem of generating Latin-script text is handled with <a href="https://github.com/balacoon/espeak-ng/blob/master/packing_info/uk/phoneme_mapping">simple rules</a>.
A modern solution, however, would be to build a synthesis system that supports many languages.
Balacoon works with a <a href="https://balacoon.com/blog/balacoon_phonemeset/">unified phoneme set</a>, which should simplify this transition.</p>

<h1 id="підтримка-та-відгуки">Support and feedback</h1>

<p>Join our <a href="https://join.slack.com/t/balacoon/shared_invite/zt-1syqpvq75-s7iCBJhZcQrsmrLrAU3fhw">Slack channel</a>.
Be sure to tell us how you use Balacoon, what works well, and what does not.</p>

<h2 id="посилання">References</h2>
<p><a id="1">[1]</a>
<a href="https://arxiv.org/pdf/2207.13703.pdf">SoundChoice: Grapheme-to-Phoneme Models with Semantic Disambiguation</a></p>

<p><a id="2">[2]</a>
<a href="https://assets.amazon.science/c3/db/23ca18d7450d8dbb5b80a11fcdd3/homograph-disambiguation-with-contextual-word-embeddings-for-tts-systems.pdf">Homograph disambiguation with contextual word embeddings for TTS systems</a></p>]]></content><author><name>Balacoon</name></author><category term="Blog" /><category term="text-to-speech" /><category term="ukrainian" /><summary type="html"><![CDATA[Fast, convenient, and high-quality neural synthesis of Ukrainian speech is now available in Balacoon. Integrating the synthesis library has never been easier: dependency-free Python packages for real-time generation on CPU, a Docker container capable of handling dozens of parallel requests on GPU, and the fastest on-device synthesizer, which enables real-time synthesis even on a Raspberry Pi. All of this is now freely available for Ukrainian under the MIT license.]]></summary></entry><entry><title type="html">Balacoon TTS on-device</title><link href="https://balacoon.com/blog/on-device/" rel="alternate" type="text/html" title="Balacoon TTS on-device" /><published>2023-04-15T00:00:00+00:00</published><updated>2023-04-15T21:20:02+00:00</updated><id>https://balacoon.com/blog/on-device</id><content type="html" xml:base="https://balacoon.com/blog/on-device/"><![CDATA[<p>Neural text-to-speech brought unprecedented improvements in the naturalness of synthetic speech. But it came with a cost. While parametric and concatenative speech synthesis systems produce tens of seconds of audio in just 1 second of wall time (they deliver &gt;10 xRT<a href="#0">*</a>) on a single CPU core, neural TTS requires way more computational power. You often need a GPU to provide compelling latency for responsive applications. Fortunately, when there is a will, there is a way. Let’s dive into on-device Neural TTS and see what Balacoon has to offer.</p>

<h2 id="on-device-neural-tts-recap">On-device Neural TTS recap</h2>
<p>Several milestones in the evolution of Neural TTS are worth mentioning here. Generating the raw waveform is the most computationally expensive part of synthesis. WaveRNN<a href="#1">[1]</a> from Google pioneered real-time synthesis on CPU. The authors used sparsification (dropping most neural network weights) and subscaling (generating multiple samples simultaneously) to achieve remarkable results. Later, LPCNet<a href="#2">[2]</a> brought these advances, along with the idea of mixing signal processing with neural networks, to the public. Finally, as GAN-based vocoding took over the field, MB-MelGAN<a href="#3">[3]</a> broke the curse of auto-regressive waveform generation.</p>

<p>Acoustic feature prediction was a less acute problem and scaled down reasonably well. The widely used FastSpeech2<a href="#4">[4]</a> already has only 30M parameters and runs reasonably fast. With LightSpeech<a href="#5">[5]</a>, Microsoft showed that it can be shrunk to 2M parameters.</p>

<p>So once VITS<a href="#6">[6]</a> and JETS<a href="#7">[7]</a> paved the way to end-to-end speech synthesis, combining acoustic feature prediction and vocoding, it was clear that low-resource end-to-end TTS was just around the corner. Indeed, NIX-TTS<a href="#8">[8]</a> entered the game, squeezing the whole Neural TTS backend into 5M parameters that run at 0.5xRT on a Raspberry Pi 3B.</p>

<h2 id="implementations-available">Implementations available</h2>
<p>While <a href="https://github.com/xiph/LPCNet">LPCNet</a> is no longer widely used, it is worth mentioning because its implementation contains valuable engineering insights, such as sparsification and vectorization. TensorFlowTTS combines the aforementioned FastSpeech2 and MB-MelGAN in an <a href="https://github.com/TensorSpeech/TensorFlowTTS/tree/master/examples/android">Android example</a> powered by TFLite. The Nix-TTS authors released their <a href="https://github.com/rendchevi/nix-tts">code and models</a>. Lastly, there is <a href="https://github.com/rhasspy/piper">Piper</a>, which competes with Nix-TTS in terms of performance (also 5M-parameter models), but instead of distillation, it simply downscales the VITS architecture.</p>

<h2 id="introducing-light">Introducing Light💡</h2>
<p>We built our own lightweight TTS model called <strong>Light</strong>. It has fewer parameters than the default JETS models, so it trades away some quality as well as multi-speaker and multi-lingual capabilities. It also delivers only 16kHz audio instead of 24kHz. On the other hand, <strong>it provides an order of magnitude faster synthesis on the CPU</strong>.</p>

<p>Degradation relative to the full-scale model on the held-out test set of Hi-Fi speaker “92”:</p>

<table>
  <thead>
    <tr>
      <th>Model</th>
      <th style="text-align: center"><a href="https://arxiv.org/abs/2204.02152">Naturalness</a> (MOS↑)</th>
      <th style="text-align: center"><a href="https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo/models/stt_en_conformer_transducer_xlarge">Intelligibility</a> (CER, %↓)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>recordings</td>
      <td style="text-align: center">3.92</td>
      <td style="text-align: center">0.32</td>
    </tr>
    <tr>
      <td>en_us_hifi_jets_cpu.addon</td>
      <td style="text-align: center">4.0</td>
      <td style="text-align: center">0.28</td>
    </tr>
    <tr>
      <td>en_us_hifi92_light_cpu.addon</td>
      <td style="text-align: center">3.89</td>
      <td style="text-align: center">0.32</td>
    </tr>
  </tbody>
</table>

<p>Synthesis speed on AMD Ryzen Threadripper 1950X:</p>

<table>
  <thead>
    <tr>
      <th>Model/System</th>
      <th style="text-align: center">faster than real-time (xRT↑)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>en_us_hifi_jets_cpu.addon</td>
      <td style="text-align: center">6.02</td>
    </tr>
    <tr>
      <td><a href="https://github.com/rhasspy/piper">Piper (ljspeech)</a></td>
      <td style="text-align: center">29.15</td>
    </tr>
    <tr>
      <td>en_us_hifi92_light_cpu.addon</td>
      <td style="text-align: center"><strong>50.86</strong></td>
    </tr>
  </tbody>
</table>

<p>Synthesis speed on Raspberry Pi 3B with Cortex-A53:</p>

<table>
  <thead>
    <tr>
      <th>Model/System</th>
      <th style="text-align: center">faster than real-time (xRT↑)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td><a href="https://github.com/rhasspy/piper">Piper (ljspeech)</a></td>
      <td style="text-align: center">1.13</td>
    </tr>
    <tr>
      <td>en_us_hifi92_light_cpu.addon</td>
      <td style="text-align: center"><strong>2.33</strong></td>
    </tr>
  </tbody>
</table>
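
<p>As a rough cross-check of the speed tables above, the following Python sketch converts between the two common xRT conventions and computes relative speedups from the reported numbers. The helper names are made up for illustration; only the figures are taken from the tables.</p>

```python
def xrt(audio_seconds, wall_seconds):
    """Seconds of audio produced per second of wall time (convention used in this post)."""
    return audio_seconds / wall_seconds


def wall_time_per_audio_second(xrt_value):
    """The other common convention: seconds of wall time per second of audio."""
    return 1.0 / xrt_value


# Figures copied from the tables above (xRT, higher is better).
threadripper = {"jets": 6.02, "piper": 29.15, "light": 50.86}
raspberry_pi = {"piper": 1.13, "light": 2.33}

# Relative speedups of the Light model.
light_vs_jets = threadripper["light"] / threadripper["jets"]
light_vs_piper_desktop = threadripper["light"] / threadripper["piper"]
light_vs_piper_pi = raspberry_pi["light"] / raspberry_pi["piper"]

print(f"Light vs JETS (Threadripper):  {light_vs_jets:.2f}x")
print(f"Light vs Piper (Threadripper): {light_vs_piper_desktop:.2f}x")
print(f"Light vs Piper (Raspberry Pi): {light_vs_piper_pi:.2f}x")

# Same Raspberry Pi result expressed in the inverse convention.
pi_wall = wall_time_per_audio_second(raspberry_pi["light"])
print(f"Light on Raspberry Pi: {pi_wall:.2f} s of wall time per 1 s of audio")
```

<p>Any value above 1 xRT in this convention means audio is produced faster than it plays back.</p>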

<p>You can try out <code class="language-plaintext highlighter-rouge">en_us_hifi92_light_cpu.addon</code> in our <a href="https://huggingface.co/spaces/balacoon/tts">Hugging Face space</a> and use it with the <code class="language-plaintext highlighter-rouge">balacoon_tts</code> Python package as described in the <a href="https://balacoon.com/use/tts/package">tutorial</a>.</p>

<h2 id="references">References</h2>
<p><a id="1">[1]</a>
<a href="https://arxiv.org/pdf/1802.08435.pdf">Efficient Neural Audio Synthesis</a></p>

<p><a id="2">[2]</a>
<a href="https://jmvalin.ca/papers/lpcnet_icassp2019.pdf">LPCNet: Improving Neural speech synthesis through linear prediction</a></p>

<p><a id="3">[3]</a>
<a href="https://arxiv.org/pdf/2005.05106.pdf">Multi-Band MelGAN: Faster waveform generation for high-quality Text-to-Speech</a></p>

<p><a id="4">[4]</a>
<a href="https://arxiv.org/abs/2006.04558">FastSpeech 2: Fast and High-Quality End-to-End Text to Speech</a></p>

<p><a id="5">[5]</a>
<a href="https://arxiv.org/pdf/2102.04040.pdf">LightSpeech: Lightweight and fast Text-to-Speech with Neural Architecture Search</a></p>

<p><a id="6">[6]</a>
<a href="https://arxiv.org/abs/2106.06103">Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech</a></p>

<p><a id="7">[7]</a>
<a href="https://arxiv.org/abs/2203.16852">JETS: Jointly Training FastSpeech2 and HiFi-GAN for End to End Text to Speech</a></p>

<p><a id="8">[8]</a>
<a href="https://arxiv.org/pdf/2203.15643.pdf">NIX-TTS: Lightweight and end-to-end Text-to-Speech via module-wise distillation</a></p>

<hr />

<p><a id="0">*</a>
There is a certain confusion around “xRT” (times real-time) terminology. Some people mean “how much audio is produced in one second of walltime”; others refer to “how much time it takes to synthesize one second of audio”. While the latter is generally more popular, I stick with the former because numbers like “30xRT” and “50xRT” are easier to comprehend and compare than “0.033xRT” and “0.02xRT”.</p>]]></content><author><name>Balacoon</name></author><category term="Blog" /><category term="text-to-speech" /><category term="on-device" /><category term="RaspberryPi TTS" /><category term="neural TTS" /><summary type="html"><![CDATA[Neural text-to-speech brought unprecedented improvements in the naturalness of synthetic speech. But it came with a cost. While parametric and concatenative speech synthesis systems produce tens of seconds of audio in just 1 second of wall time (they deliver &gt;10 xRT*) on a single CPU core, neural TTS requires way more computational power. You often need a GPU to provide compelling latency for responsive applications. Fortunately, when there is a will, there is a way. Let’s dive into on-device Neural TTS and see what Balacoon has to offer.]]></summary></entry><entry><title type="html">Balacoon TTS as a service</title><link href="https://balacoon.com/blog/tts_endpoint/" rel="alternate" type="text/html" title="Balacoon TTS as a service" /><published>2023-03-20T00:00:00+00:00</published><updated>2023-03-20T21:20:02+00:00</updated><id>https://balacoon.com/blog/tts_endpoint</id><content type="html" xml:base="https://balacoon.com/blog/tts_endpoint/"><![CDATA[<p>In recent years, text-to-speech technology has made tremendous strides,
thanks in large part to advances in machine learning and artificial intelligence.
As a result, synthetic speech is now almost indistinguishable from human speech,
and is being used in a variety of applications, from voice assistants to audiobooks.</p>

<p>However, while there are many cloud-based text-to-speech services available
(<a href="https://aws.amazon.com/polly/">AWS Polly</a>,
<a href="https://azure.microsoft.com/en-us/products/cognitive-services/text-to-speech">Azure Text-to-speech</a>, and
<a href="https://cloud.google.com/text-to-speech">Google Cloud Text-to-speech</a>, to name a few),
these services can be expensive, and may not always be the best fit for every use case.
That’s why we’re excited to announce the release of our new self-hosted text-to-speech service,
which is available as a Docker image that you can spin up on a GPU instance.</p>

<p>With our self-hosted text-to-speech service, you can get state-of-the-art speech synthesis
within your own infrastructure, without having to rely on cloud service providers.
This can be especially useful for practitioners who need to power their app or service
with synthetic speech in production, and who may have concerns about cost or security.</p>

<p>Because the rest of this post delves into the internal workings of the service,
we recommend first taking a moment to review the
<a href="https://balacoon.com/use/tts/service">usage documentation</a>,
which demonstrates how straightforward it is to set up a TTS endpoint.</p>

<h2 id="how-far-1-gpu-can-take-you">How far 1 GPU can take you</h2>

<p>This section aims to set expectations regarding the efficiency of Balacoon TTS,
specifically in terms of how many users can be served using just one GPU to handle requests.
Two primary metrics to consider are:</p>

<ul>
  <li><strong>Latency</strong> - the amount of time a user must wait before obtaining the first chunk of audio.</li>
  <li><strong>Real-time factor (RTF)</strong> - the ratio of the duration of the synthesized audio to the time it took to produce it.</li>
</ul>
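
<p>The two metrics above can be measured with a few lines of Python. This is a minimal sketch, not the benchmarking code behind the numbers in this post: <code class="language-plaintext highlighter-rouge">fake_synthesizer</code> is a hypothetical stand-in for a real streaming client, and the sample rate and timings are assumptions.</p>

```python
import time


def measure_stream(chunks, sample_rate):
    """Consume an iterable of audio chunks and return (latency_s, rtf).

    latency_s: wall time until the first chunk arrives.
    rtf: total audio duration divided by total wall time (higher is faster).
    """
    start = time.perf_counter()
    first_chunk_at = None
    total_samples = 0
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = time.perf_counter()
        total_samples += len(chunk)
    end = time.perf_counter()
    if first_chunk_at is None:
        raise ValueError("the stream produced no audio")
    audio_seconds = total_samples / sample_rate
    return first_chunk_at - start, audio_seconds / (end - start)


def fake_synthesizer(n_chunks, chunk_seconds, sample_rate, work_seconds):
    """Hypothetical stand-in for a TTS stream: sleeps, then yields silence."""
    for _ in range(n_chunks):
        time.sleep(work_seconds)
        yield [0] * int(chunk_seconds * sample_rate)


latency, rtf = measure_stream(
    fake_synthesizer(n_chunks=3, chunk_seconds=2.0, sample_rate=16000, work_seconds=0.01),
    sample_rate=16000,
)
print(f"latency: {latency * 1000:.0f} ms, RTF: {rtf:.0f}")
```

<p>With a real client, the iterable would be whatever object yields audio chunks as they arrive from the endpoint.</p>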

<p>Configuring the endpoint involves finding a balance between these two metrics.
Balacoon TTS server uses <a href="https://developer.nvidia.com/nvidia-triton-inference-server">NVIDIA Triton Server</a> internally,
which enables batching of inference requests.
The greater the number of requests that are batched and processed in parallel,
the better the real-time factor will be. However, this comes at the cost of increased
latency, since processing more data in parallel takes more time. You control the
maximum batch size when launching the endpoint.</p>
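
<p>To make the trade-off concrete, here is a toy cost model in Python. It assumes, purely for illustration, that processing a batch costs a fixed overhead plus a small per-request cost. Both cost constants are invented and are not measurements of Balacoon TTS or Triton; only the 2-second chunk size comes from this post.</p>

```python
CHUNK_SECONDS = 2.0       # audio produced per request in one step (chunk size from this post)
FIXED_OVERHEAD = 0.040    # seconds per batch regardless of its size (invented)
PER_REQUEST_COST = 0.004  # extra seconds per request in the batch (invented)


def batch_time(batch_size):
    """Wall time to process one batch under the toy linear cost model."""
    return FIXED_OVERHEAD + PER_REQUEST_COST * batch_size


def aggregate_rtf(batch_size):
    """Seconds of audio produced across the whole batch per second of wall time."""
    return batch_size * CHUNK_SECONDS / batch_time(batch_size)


for batch_size in (1, 4, 16, 64, 256):
    print(
        f"batch={batch_size:3d}  "
        f"latency={batch_time(batch_size) * 1000:6.0f} ms  "
        f"aggregate RTF={aggregate_rtf(batch_size):6.1f}"
    )
```

<p>In this model the aggregate RTF levels off near <code class="language-plaintext highlighter-rouge">CHUNK_SECONDS / PER_REQUEST_COST</code> while latency keeps growing linearly with the batch size.</p>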
<figure style="width: 900px" class="align-center">
  <img src="https://balacoon.com/assets/images/tts_server_performance.png" alt="" />
  <figcaption class="figure-caption text-center">Balacoon TTS Service performance</figcaption>
</figure>
<p>It can be observed that beyond a certain point,
increasing the batch size does not result in any significant increase in the amount of audio produced.
In total, it is possible to generate <strong>3.5 hours of speech in just 30 seconds</strong>,
with each user starting to receive audio in as little as 100 milliseconds after the request.
Check out the <a href="https://developer.nvidia.com/blog/getting-real-time-factor-over-60-for-text-to-speech-using-riva/">performance of the classical combination of Tacotron2 and WaveGlow</a> for comparison.</p>
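
<p>As a quick sanity check, the headline figure can be converted into an aggregate real-time factor. Reading the result as a rough upper bound on the number of listeners that could be served at playback speed is an interpretation, not a measured claim.</p>

```python
audio_seconds = 3.5 * 3600  # 3.5 hours of generated speech
wall_seconds = 30.0         # wall time reported to generate it

# Aggregate real-time factor across all batched requests.
aggregate_rtf = audio_seconds / wall_seconds
print(f"aggregate real-time factor: {aggregate_rtf:.0f}")
```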

<p>There are other parameters that affect Latency/RTF,
but these are hardcoded into the server and cannot be adjusted:</p>

<ul>
  <li><strong>Chunk size</strong> - the amount of audio synthesized in a single processing unit.
It is more efficient to synthesize larger chunks of audio, but this can increase latency.
The chunk size for Balacoon TTS is set at 2 seconds.</li>
  <li><strong>Batching queue delay</strong> - how long the server waits for new requests before dispatching
the ones already collected as a batch. Balacoon TTS aggregates requests for 10 ms.</li>
</ul>]]></content><author><name>Balacoon</name></author><category term="Blog" /><category term="text-to-speech" /><category term="TTS service" /><summary type="html"><![CDATA[In recent years, text-to-speech technology has made tremendous strides, thanks in large part to advances in machine learning and artificial intelligence. As a result, synthetic speech is now almost indistinguishable from human speech, and is being used in a variety of applications, from voice assistants to audiobooks.]]></summary></entry><entry><title type="html">Balacoon TTS version 0.1.0</title><link href="https://balacoon.com/blog/streaming_synthesis/" rel="alternate" type="text/html" title="Balacoon TTS version 0.1.0" /><published>2023-03-15T00:00:00+00:00</published><updated>2023-03-15T21:20:02+00:00</updated><id>https://balacoon.com/blog/streaming_synthesis</id><content type="html" xml:base="https://balacoon.com/blog/streaming_synthesis/"><![CDATA[<p>We’re excited to announce the release of Balacoon TTS 0.1.0,
the latest version of our text-to-speech package.
This new version includes two major updates that will significantly enhance its functionality.</p>

<ul>
  <li>We switched to ONNX as the neural backend. This allowed us to drop the torch libraries and reduce the package size by a factor of 3, making it much more lightweight and easy to use. Using ONNX also made synthesis 1.4x faster.</li>
  <li>We added a streaming synthesis API for low-latency applications. While streaming synthesis is generally 2x slower due to redundant computations, it allows audio to be sent back to the user immediately after the first chunk is produced, making it ideal for real-time applications. You can find a usage example in the <a href="https://balacoon.com/use/tts/package#running-streaming-synthesis">docs</a>.</li>
</ul>

<p>One caveat is that the updates required us to retrain the <a href="https://huggingface.co/balacoon/tts">TTS models</a>,
so you will need to update both the package and the addons.</p>

<h2 id="onnx-runtime">ONNX Runtime</h2>

<p><a href="https://onnxruntime.ai/">ONNX Runtime</a> is a powerful open-source engine that provides a universal neural backend
for deploying and optimizing deep learning models trained with different frameworks.
It simplifies the release of a library to different platforms
(<em>Windows, RaspberryPi, and Android are on the roadmap</em>)
and allows for <a href="https://fs-eire.github.io/onnxruntime/docs/performance/tune-performance.html">different optimizations</a>.
Additionally, it enables the export of models to even faster backends such as TensorRT, which we will explore
in the future. At present, we plan to use ONNX as a backend for CPU inference,
although there are still some unresolved issues to address, such as half-precision inference on CPU.</p>

<h2 id="streaming-synthesis">Streaming synthesis</h2>

<p>Streaming speech synthesis is an important technology that enables real-time generation of speech while
reducing perceived latency<a href="#1">[1]</a>. This approach to speech synthesis breaks down the process of speech
generation into smaller chunks, allowing the system to produce and deliver audio output in near real-time.
This is particularly important for applications where low latency is critical, such as voice assistants,
interactive voice response (IVR) systems, and chatbots. While streaming speech synthesis offers a faster
response time, it comes at the cost of overall inference speed: generating many small audio segments
involves redundant computation. Despite this, streaming synthesis remains essential for applications where
real-time audio feedback is necessary.</p>
<figure style="width: 300px" class="align-center">
  <img src="https://balacoon.com/assets/images/streaming_synthesis.png" alt="" />
  <figcaption class="figure-caption text-center">Streaming synthesis in action</figcaption>
</figure>
<p>The picture above illustrates the operating principle of streaming synthesis.
The process begins with a frontend that takes in textual input and sends it to an encoder,
which processes the input at the phoneme level.
The encoder then upsamples the phonemes to create frame-level representations.
<strong>A decoder then slides across these frame-level representations,
converting them into audio output one small chunk at a time.</strong>
By breaking down the speech synthesis process into smaller pieces,
the system can produce and deliver speech output in real-time,
reducing latency and enabling applications that require fast response times.</p>
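
<p>The principle can be sketched in plain Python. This is a toy illustration under stated assumptions, not the Balacoon implementation: <code class="language-plaintext highlighter-rouge">toy_decoder</code> is a hypothetical stand-in for the neural decoder, and decoding extra context frames around each chunk is one assumed source of the redundant computation that makes streaming slower overall.</p>

```python
def upsample(phoneme_states, durations):
    """Repeat each phoneme-level state by its duration in frames."""
    frames = []
    for state, n_frames in zip(phoneme_states, durations):
        frames.extend([state] * n_frames)
    return frames


def toy_decoder(frames, samples_per_frame):
    """Stand-in for a neural decoder: emits samples_per_frame values per frame."""
    audio = []
    for frame in frames:
        audio.extend([float(frame)] * samples_per_frame)
    return audio


def stream_decode(frames, chunk_frames, context_frames, samples_per_frame):
    """Slide over the frames, yielding audio for one chunk at a time.

    Each step also decodes up to context_frames on both sides so the decoder
    sees neighbouring frames; that extra audio is trimmed before yielding.
    """
    for start in range(0, len(frames), chunk_frames):
        end = min(start + chunk_frames, len(frames))
        left = max(start - context_frames, 0)
        right = min(end + context_frames, len(frames))
        audio = toy_decoder(frames[left:right], samples_per_frame)
        trim_left = (start - left) * samples_per_frame
        trim_right = trim_left + (end - start) * samples_per_frame
        yield audio[trim_left:trim_right]


frames = upsample(phoneme_states=[1, 2, 3], durations=[3, 5, 2])
streamed = []
for chunk in stream_decode(frames, chunk_frames=4, context_frames=2, samples_per_frame=2):
    streamed.extend(chunk)  # a real application would play or send each chunk here

# For this position-independent toy decoder, streaming matches one-shot decoding.
assert streamed == toy_decoder(frames, samples_per_frame=2)
```

<p>The first chunk is available after decoding only a few frames, which is what keeps latency independent of sentence length.</p>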

<h2 id="references">References</h2>
<p><a id="1">[1]</a>
High Quality Streaming Speech Synthesis with Low, Sentence-Length-Independent Latency. <a href="https://arxiv.org/abs/2111.09052">arxiv</a></p>]]></content><author><name>Balacoon</name></author><category term="Blog" /><category term="text-to-speech" /><category term="speech synthesis" /><category term="streaming TTS" /><summary type="html"><![CDATA[We’re excited to announce the release of Balacoon TTS 0.1.0, the latest version of our text-to-speech package. This new version includes two major updates that will significantly enhance its functionality.]]></summary></entry></feed>