As part of the text normalization challenge, Google released an automatically generated dataset of unnormalized/normalized pairs. It was obtained by running Google's rule-based frontend (Kestrel) over Wikipedia. More information can be found in the paper “RNN Approaches to Text Normalization: A Challenge”.

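Download and unpack the data from the Kaggle dataset page (https://www.kaggle.com/datasets/richardwilliamsproat/text-normalization-for-english-russian-and-polish). You will need to log in, but a Gmail account can be used.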

This data can be used to

  • evaluate the performance of Balacoon text normalization

  • enhance existing text normalization rules by going through mismatches

Text normalization performance

For English, the original paper reports 0.998 token-level accuracy for a seq2seq model with attention and an FST filter. Accuracy is measured on the first 100002 lines of output-00099-of-00100.

Balacoon performance is measured at the sentence level, since we use a slightly different set of semiotic classes. Google data is glued back together into utterances using ParsedUtterance and fed to the text_normalization package. We achieve 0.89 sentence-level accuracy.
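As a rough illustration of what sentence-level accuracy means here, the following is a minimal sketch rather than the actual evaluation script. The function names `glue` and `sentence_accuracy` are hypothetical, and `normalize` is a placeholder for whatever normalizer is being evaluated; it does not reflect the real text_normalization API.

```python
from typing import Callable, Iterable, List, Tuple


def glue(tokens: Iterable[str]) -> str:
    """Join tokens into one utterance string, collapsing extra whitespace."""
    return " ".join(" ".join(tokens).split())


def sentence_accuracy(
    utterances: Iterable[Tuple[List[str], List[str]]],
    normalize: Callable[[str], str],
) -> float:
    """Fraction of utterances whose normalized output matches the reference exactly.

    `utterances` yields (unnormalized_tokens, expected_tokens) pairs.
    `normalize` is a placeholder callable (assumption), not a Balacoon function.
    """
    total = 0
    correct = 0
    for unnormalized, expected in utterances:
        total += 1
        obtained = normalize(glue(unnormalized))
        # Compare whitespace-normalized strings; a single wrong token
        # makes the whole utterance count as an error.
        if glue(obtained.split()) == glue(expected):
            correct += 1
    return correct / total if total else 0.0
```

Because one mismatched token fails the whole utterance, a sentence-level score is expected to be noticeably lower than a token-level score on the same data, so the two numbers are not directly comparable.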

The vast majority of errors come from discrepancies in handling abbreviations and from non-determinism in expanding numbers:

Expected: fujitsu primergy RX two five four o m one
Obtained: fujitsu primergy RX two thousand five hundred forty M one
Original: Fujitsu Primergy RX 2540 M1

Nonetheless, some discrepancies reveal flaws in the Balacoon normalization rules:

Expected: promo CD CDRDJ six seven two one seven inches R six seven two one what ya gonna do now
Obtained: promo CD CDRDJ six thousand seven hundred twenty one seven R six thousand seven hundred twenty one what ya gonna do now
Original: Promo CD CDRDJ 6721, 7" R 6721 "What Ya Gonna Do Now?"
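When going through mismatches like the ones above, it can help to isolate only the words that differ. The sketch below uses Python's standard `difflib` module for a word-level comparison; `word_diff` is a hypothetical helper name and is not part of Balacoon.

```python
import difflib
from typing import List, Tuple


def word_diff(expected: str, obtained: str) -> List[Tuple[str, str]]:
    """Return (expected_span, obtained_span) pairs for every region that differs."""
    exp_words = expected.split()
    obt_words = obtained.split()
    matcher = difflib.SequenceMatcher(a=exp_words, b=obt_words, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # An empty string on one side means a pure insertion or deletion.
            diffs.append((" ".join(exp_words[i1:i2]), " ".join(obt_words[j1:j2])))
    return diffs


if __name__ == "__main__":
    for exp, obt in word_diff(
        "fujitsu primergy RX two five four o m one",
        "fujitsu primergy RX two thousand five hundred forty M one",
    ):
        print(f"expected: {exp!r:25} obtained: {obt!r}")
```

Grouping the differing spans across many utterances is one way to tell systematic rule flaws apart from acceptable alternative readings of numbers.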

Despite occasional inaccuracies, Balacoon rules are a solid starting point for developing text normalization fine-tuned for a particular use case.


Adapters that help to work with Google data:


Data iterator over Google text normalization data (https://www.kaggle.com/datasets/richardwilliamsproat/text-normalization-for-english-russian-and-polish).


A data structure that contains unnormalized and normalized tokens parsed from a Google data file.
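To make the idea of an iterator and a parsed-utterance container more concrete, here is a minimal sketch. It is not the actual Balacoon implementation: `UtteranceSketch` and `iter_utterances` are hypothetical names. It also assumes that each line of a Google data file holds three tab-separated fields (semiotic class, unnormalized token, normalized token), that `<eos>` lines separate sentences, that `<self>` marks a token left unchanged, and that `sil` marks a token that is not spoken.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List


@dataclass
class UtteranceSketch:
    """Hypothetical container; not the actual Balacoon ParsedUtterance class."""
    semiotic_classes: List[str] = field(default_factory=list)
    unnormalized: List[str] = field(default_factory=list)
    normalized: List[str] = field(default_factory=list)


def iter_utterances(lines: Iterable[str]) -> Iterator[UtteranceSketch]:
    """Group token lines into utterances, assuming the file layout described above."""
    current = UtteranceSketch()
    for raw in lines:
        fields = raw.rstrip("\n").split("\t")
        if fields[0] == "<eos>":
            # Sentence boundary: emit what has been collected so far.
            if current.unnormalized:
                yield current
            current = UtteranceSketch()
            continue
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        semiotic_class, token, spoken = fields
        current.semiotic_classes.append(semiotic_class)
        current.unnormalized.append(token)
        if spoken == "<self>":
            current.normalized.append(token)  # token is read as written
        elif spoken != "sil":
            current.normalized.append(spoken)  # "sil" tokens are dropped
    if current.unnormalized:
        yield current
```

An open file object can be passed directly as `lines`, for example `with open(path, encoding="utf-8") as f: utterances = list(iter_utterances(f))`, where `path` points to one of the unpacked data files.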