
Zero-shot speech generation requires massive amounts of data. Large-scale speech datasets, however, are commonly collected for ASR and therefore sampled at 16 kHz. The LibriTTS-R[1] work suggests that audio enhancement and super-resolution methods can be beneficial for TTS data processing. This blog compares a few open-source upsampling methods, targeting the use case of preparing a TTS dataset (24 kHz) from an ASR one (16 kHz).

Methods

We compare the following methods:

  • SoX - vanilla upsampling via interpolation; the higher frequencies are not restored (a minimal sketch follows the list).
  • Resemble Enhance - diffusion-based denoising / enhancement / upsampling.
  • AudioSR - diffusion-based audio super-resolution.
  • AP-BWE - GAN-based bandwidth extension in the spectral domain.
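
As a rough illustration of the plain-interpolation route, here is a minimal sketch using torchaudio's SoX bindings (file paths are illustrative, and a torchaudio build with SoX support is assumed; the standalone `sox in_16k.wav -r 24000 out_24k.wav` command does the same):

```python
import torchaudio

# Load the 16 kHz source audio.
wav, sr = torchaudio.load("in_16k.wav")

# SoX's high-quality `rate` effect resamples to 24 kHz by interpolation;
# the band above 8 kHz stays empty, no new content is synthesized.
wav_24k, sr_24k = torchaudio.sox_effects.apply_effects_tensor(
    wav, sr, effects=[["rate", "-v", "24000"]]
)
torchaudio.save("out_24k.wav", wav_24k, sr_24k)
```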

Processing speed per file on a single RTX 3090:

| Method | Resemble Enhance | AudioSR | AP-BWE |
| --- | --- | --- | --- |
| Processing speed per file, sec | 8.0 | 5.0 | 0.035 |

We measure intelligibility, audio quality, and speaker similarity of the processed audio on two datasets: VCTK[2] and DAPS[3]. The first represents clean audio that simply lacks higher frequencies; the second is a more challenging use case of speech recorded on consumer microphones with background noise present. We use the ECAPA-TDNN[4] speaker encoder from SpeechBrain to extract speaker representations and measure speaker similarity. We run speech recognition with the Conformer-Transducer ASR model by NVIDIA to evaluate intelligibility in terms of Character Error Rate (CER). Finally, we use a pre-trained Mean Opinion Score estimator, UTMOS[5], to assess naturalness. Keep in mind that all the metrics are computed on 16 kHz audio, so they mainly track whether upsampling corrupts the information that is already there.
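
The similarity metric is the inverted cosine distance between speaker embeddings. Below is a minimal sketch of how it can be computed, assuming the public `speechbrain/spkrec-ecapa-voxceleb` checkpoint (the `speechbrain.inference` import path applies to SpeechBrain >= 1.0; older releases use `speechbrain.pretrained`):

```python
import torch.nn.functional as F
import torchaudio
from speechbrain.inference.speaker import EncoderClassifier

# Public ECAPA-TDNN speaker encoder trained on VoxCeleb.
encoder = EncoderClassifier.from_hparams(source="speechbrain/spkrec-ecapa-voxceleb")

def speaker_distance(path_ref: str, path_proc: str) -> float:
    """Inverted cosine distance between embeddings: 0 means identical."""
    embs = []
    for path in (path_ref, path_proc):
        wav, sr = torchaudio.load(path)
        # All metrics are computed at 16 kHz, so bring both files there first.
        wav = torchaudio.functional.resample(wav, sr, 16000)
        embs.append(encoder.encode_batch(wav).squeeze())
    return 1.0 - F.cosine_similarity(embs[0], embs[1], dim=0).item()
```

Intelligibility and naturalness can be scripted in the same spirit. The sketch below assumes the published `stt_en_conformer_transducer_large` checkpoint and the community `tarepan/SpeechMOS` torch.hub wrapper for UTMOS; the return type of NeMo's `transcribe()` varies across versions, hence the defensive unpacking:

```python
import jiwer
import torch
import torchaudio
import nemo.collections.asr as nemo_asr

# Pre-trained Conformer-Transducer (checkpoint name as published by NVIDIA).
asr = nemo_asr.models.ASRModel.from_pretrained(
    model_name="stt_en_conformer_transducer_large"
)
# UTMOS MOS predictor loaded through torch.hub.
utmos = torch.hub.load("tarepan/SpeechMOS:v1.2.0", "utmos22_strong", trust_repo=True)

def evaluate(audio_path: str, reference: str) -> tuple[float, float]:
    # Older NeMo returns plain strings, newer returns Hypothesis objects.
    out = asr.transcribe([audio_path])[0]
    hypothesis = out if isinstance(out, str) else out.text
    cer = 100.0 * jiwer.cer(reference, hypothesis)

    wav, sr = torchaudio.load(audio_path)  # shape: (channels, time)
    mos = utmos(wav, sr).item()            # one score per clip
    return cer, mos
```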

VCTK test set

A subset of 2k utterances is sampled.

| Method | Naturalness (MOS ↑) | Intelligibility (CER, % ↓) | Similarity (inverted cosine distance ↓) |
| --- | --- | --- | --- |
| original audio | 4.078 | 0.178 | 0 |
| SoX | 4.075 | 0.178 | 0.002 |
| Resemble Enhance | 3.86 | 0.18 | 0.079 |
| AudioSR | 4.05 | 0.178 | 0.039 |
| AP-BWE | 4.06 | 0.178 | 0.042 |

Example of the upsampling in the spectral domain:

Audio samples:

DAPS test set

The dataset is segmented into sentence-level utterances, and a subset of 2k utterances is sampled.

| Method | Naturalness (MOS ↑) | Intelligibility (CER, % ↓) | Similarity (inverted cosine distance ↓) |
| --- | --- | --- | --- |
| original audio | 2.48 | 2.75 | 0 |
| SoX | 2.48 | 2.748 | 0.008 |
| Resemble Enhance | 3.32 | 12.98 | 0.43 |
| AudioSR | 2.45 | 3.15 | 0.07 |
| AP-BWE | 2.456 | 2.73 | 0.007 |

Example of the upsampling in the spectral domain:

Audio samples:

Conclusions

Resemble Enhance also attempts denoising and enhancement. It substantially corrupts the noisy audio files, which is reflected in greatly degraded intelligibility. Both AudioSR and AP-BWE are very gentle with the existing information and barely change the metrics. The former adds more detail and blends more smoothly with the existing high-frequency content; the latter, however, is almost 150x faster. Our pick is AudioSR if the amount of data is manageable, otherwise AP-BWE.

References

[1] LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus

[2] CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit

[3] Device and Produced Speech (DAPS) Dataset

[4] ECAPA-TDNN Embeddings for Speaker Diarization

[5] UTMOS: UTokyo-SaruLab System for VoiceMOS Challenge 2022
