en_us_normalization.production

Text normalization rules for English are adapted from NVIDIA NeMo: https://github.com/NVIDIA/NeMo/tree/main/nemo_text_processing/text_normalization/en

Classification

The first step splits text into tokens and classifies the tokens into semiotic classes. This is a large FST that converts input text into a tagged sequence:

“12/04/15 at 3:30pm!”

tokens { left_punct: "“" date { month: "december" day: "4" year: "15" } } tokens { name: "at" } tokens { time { hours: "3" minutes: "30" suffix: "PM" } right_punct: "!”" }

Tokenize and classify rules

Rules to split the input string into tokens and classify them into semiotic classes. The rules parse each token into the pre-defined fields of its semiotic class and emit them in a pre-defined format that can be parsed into a protobuf structure and passed on for verbalization, i.e. conversion into spoken form.
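To make the tagged format more concrete, here is a minimal sketch of a parser for it in plain Python. It is not the project's protobuf parser; the function name `parse_tagged` and the list-of-pairs output are assumptions chosen for illustration, and the sketch only handles the `name { ... }` and `key: "value"` shapes shown in the example above.

```python
import re

# One lexer step: '{', '}', a `key:` field name, a bare block name, or a quoted value.
_LEXER = re.compile(r'\s*(?:(\{)|(\})|(\w+)\s*:|(\w+)|"((?:[^"\\]|\\.)*)")')


def parse_tagged(text):
    """Parse a tagged sequence into a list of (name, value) pairs.

    A value is a string for `key: "value"` and a nested list of pairs
    for `name { ... }`. Hypothetical helper, for illustration only.
    """
    text = text.strip()
    pos = 0

    def parse_block(closing):
        nonlocal pos
        items = []
        while pos < len(text):
            m = _LEXER.match(text, pos)
            if m is None:
                raise ValueError(f"unexpected input at offset {pos}")
            pos = m.end()
            open_brace, close_brace, key, name, quoted = m.groups()
            if close_brace:
                if not closing:
                    raise ValueError("unbalanced '}'")
                return items
            if key:
                # A field name must be followed by a quoted value.
                value = _LEXER.match(text, pos)
                if value is None or value.group(5) is None:
                    raise ValueError(f"expected quoted value after {key!r}")
                pos = value.end()
                items.append((key, value.group(5)))
            elif name:
                # A block name must be followed by '{'.
                brace = _LEXER.match(text, pos)
                if brace is None or not brace.group(1):
                    raise ValueError(f"expected '{{' after {name!r}")
                pos = brace.end()
                items.append((name, parse_block(closing=True)))
            else:
                raise ValueError(f"unexpected token before offset {pos}")
        if closing:
            raise ValueError("missing '}'")
        return items

    return parse_block(closing=False)


tagged = 'tokens { name: "at" } tokens { time { hours: "3" minutes: "30" suffix: "PM" } right_punct: "!" }'
parsed = parse_tagged(tagged)
```

Keeping fields as ordered pairs rather than a dict preserves the order in which the classifier emitted them, which a verbalizer may rely on.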

Combining everything together into a single FST:

ClassifyFst

Final class that composes all other classification grammars.
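As a rough illustration of what composing the classification grammars means, the sketch below tries a list of per-class matchers in a fixed priority order and falls back to a verbatim class. It uses regular expressions instead of FSTs, and the patterns, the priority order, and the `classify` helper are all assumptions made for this example; the real grammar may resolve ambiguity differently.

```python
import re

# Hypothetical per-class patterns, tried in order; the first full match wins.
CLASSIFIERS = [
    ("time", re.compile(r"\d{1,2}:\d{2}(?:am|pm)?", re.IGNORECASE)),
    ("date", re.compile(r"\d{1,2}/\d{1,2}/\d{2,4}")),
    ("cardinal", re.compile(r"\d+")),
    ("word", re.compile(r"[A-Za-z']+")),
]


def classify(token):
    """Return the name of the first class whose pattern matches the whole token."""
    for name, pattern in CLASSIFIERS:
        if pattern.fullmatch(token):
            return name
    # Anything that matches no class is treated as verbatim.
    return "verbatim"


classes = [classify(token) for token in "12/04/15 at 3:30pm".split()]
```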

Acceptor for words that don't require normalization:

WordFst

Finite state transducer for classifying words, i.e. anything that doesn't need verbalization: it is already normalized and contains only letters that are known to the pronunciation dictionary.

Rules for classification of different semiotic classes:

AbbreviationFst

Grammar for classifying abbreviations.

AddressFst

Finite state transducer for classifying addresses.

CardinalFst

Finite state transducer for classifying cardinals, i.e. numbers that express an amount.

DateFst

Finite state transducer for classifying dates.

DecimalFst

Finite state transducer for classifying decimals, i.e. numbers with a fractional part.

ElectronicFst

Finite state transducer for classifying electronic tokens such as URLs, email addresses, etc.

FractionFst

Finite state transducer for classifying fractions.

MeasureFst

Finite state transducer for classifying measures. It is suppletive aware, i.e. 12kg -> 12 kilograms, but 1kg -> 1 kilogram.
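The singular/plural behaviour described above can be sketched as follows. This is a plain-Python stand-in rather than the FST itself; the `UNITS` table and the `expand_measure` helper are hypothetical and cover only a few units.

```python
import re

# Hypothetical unit table: abbreviation -> (singular, plural).
UNITS = {
    "kg": ("kilogram", "kilograms"),
    "km": ("kilometer", "kilometers"),
    "h": ("hour", "hours"),
}


def expand_measure(token):
    """Expand e.g. '12kg' to '12 kilograms', choosing the unit form by the number."""
    m = re.fullmatch(r"(\d+)\s*([A-Za-z]+)", token)
    if m is None or m.group(2) not in UNITS:
        return None  # not a measure this sketch knows about
    number, unit = m.groups()
    singular, plural = UNITS[unit]
    return f"{number} {singular if int(number) == 1 else plural}"
```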

MoneyFst

Finite state transducer for classifying money, suppletive aware.

OrdinalFst

Finite state transducer for classifying ordinals, i.e. cardinals with a suffix. In English there are just four suffixes to take care of.
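A minimal sketch of ordinal detection, assuming the suffixes in question are the standard English ones (st, nd, rd, th). Whether the actual grammar also checks that the suffix agrees with the number is not stated here; this example does, and `classify_ordinal` is a hypothetical name.

```python
import re

_ORDINAL = re.compile(r"(\d+)(st|nd|rd|th)")


def expected_suffix(n):
    """Return the English ordinal suffix for a non-negative integer."""
    if 10 <= n % 100 <= 20:
        return "th"  # 11th, 12th, 13th, 112th, ...
    return {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")


def classify_ordinal(token):
    """Return the numeric part if `token` is a well-formed ordinal, else None."""
    m = _ORDINAL.fullmatch(token.lower())
    if m is None:
        return None
    number, suffix = m.groups()
    if suffix != expected_suffix(int(number)):
        return None  # e.g. '11st' is rejected
    return number
```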

RomanFst

Finite state transducer for classifying Roman numerals (III, IV, etc).
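For illustration, Roman numerals can be converted to integers with the usual subtractive rule. This is a plain-Python sketch, not the grammar itself; `roman_to_int` is a hypothetical helper and does not reject malformed numerals such as "IIII".

```python
VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral):
    """Convert a Roman numeral made of the letters in VALUES to an integer."""
    total = 0
    for i, char in enumerate(numeral):
        value = VALUES[char]
        # A smaller value placed before a larger one is subtracted (IV = 4).
        if i + 1 < len(numeral) and value < VALUES[numeral[i + 1]]:
            total -= value
        else:
            total += value
    return total
```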

ShorteningFst

Finite state transducer for discovering shortenings, such as Mrs.

TelephoneFst

Finite state transducer for classifying telephone numbers.

TimeFst

Finite state transducer for classifying time.

VerbatimFst

Finite state transducer for classifying verbatims - anything that has extra symbols and doesn't match available semiotic classes.

Sometimes tokens are not separated by whitespace but by special connectors. This requires the introduction of multi-token FSTs:

Tagging with multi-token rules

In some contexts, semiotic classes are connected by symbols that need to be read out loud. For example “5 x 3” is “five times three”, not “five eks three”. At verbalization, tagged tokens are processed separately, but at classification a single multi-token FST is needed.
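The “5 x 3” case can be sketched as a single rule that sees all three tokens at once and emits three separately tagged tokens. The tag names and the `tag_multiplication` helper are assumptions for this example; the actual grammar's output format may differ.

```python
import re

# The connector 'x' only means "times" when it sits between two numbers,
# so the rule has to look at the whole span, not at 'x' in isolation.
_MULTIPLICATION = re.compile(r"(\d+)\s*x\s*(\d+)")


def tag_multiplication(text):
    """Tag 'N x M' as cardinal, the word 'times', cardinal; return None otherwise."""
    m = _MULTIPLICATION.fullmatch(text.strip())
    if m is None:
        return None
    left, right = m.groups()
    return [("cardinal", left), ("name", "times"), ("cardinal", right)]
```

Each returned token can then be verbalized on its own, which matches the split between classification and verbalization described above.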

AttachedTokensFst

AttachedTokensFst tries to deal with multi-token strings that have a dash as a separator or have no separator at all.
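A rough sketch of the splitting idea, assuming that a token boundary falls at every dash and at every switch between letters and digits. The helper name `split_attached` and this boundary rule are assumptions; the real FST may recognise more cases.

```python
import re


def split_attached(token):
    """Split a string into runs of letters and runs of digits, dropping dashes."""
    return re.findall(r"[A-Za-z]+|\d+", token)
```

For example, `split_attached("covid-19")` and `split_attached("covid19")` both yield the word and the number as separate pieces, which can then be classified independently.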

Verbalization

Tagged tokens are parsed with text_normalization and semiotic classes are passed for verbalization. During verbalization, serialized tokens are converted to spoken form:

date|month:december|day:4|year:15|

december fourth fifteen
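The conversion above can be sketched in plain Python. This is only an illustration of the idea, not the verbalization grammar: the helper names are hypothetical, numbers are limited to 0-99, and reading the two-digit year as a plain cardinal is an assumption that happens to match the example.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]
IRREGULAR = {"one": "first", "two": "second", "three": "third", "five": "fifth",
             "eight": "eighth", "nine": "ninth", "twelve": "twelfth"}


def cardinal(n):
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def ordinal(n):
    """Spell out an ordinal in the range 0..99 by rewriting the last word."""
    words = cardinal(n).split()
    last = words[-1]
    if last in IRREGULAR:
        last = IRREGULAR[last]
    elif last.endswith("y"):
        last = last[:-1] + "ieth"  # twenty -> twentieth
    else:
        last += "th"
    return " ".join(words[:-1] + [last])


def parse_serialized(token):
    """Split 'class|key:value|...|' into the class name and a dict of fields."""
    semiotic_class, *fields = token.strip("|").split("|")
    return semiotic_class, dict(field.split(":", 1) for field in fields)


def verbalize(token):
    """Drop the field names and turn the field values of a date into words."""
    semiotic_class, fields = parse_serialized(token)
    if semiotic_class != "date":
        raise ValueError(f"this sketch only handles dates, got {semiotic_class!r}")
    day = ordinal(int(fields["day"]))
    year = cardinal(int(fields["year"]))
    return f"{fields['month']} {day} {year}"


spoken = verbalize("date|month:december|day:4|year:15|")
```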

Verbalization rules

Rules to convert semiotic classes into spoken form. Regular words are not passed for verbalization. Each semiotic class has a predefined set of fields that its verbalization grammar must handle. Verbalization grammars drop the field names and transduce the field values into words.

Combining everything together into a single FST:

VerbalizeFst

Final class that composes all other verbalization grammars.

Rules for verbalization of different semiotic classes:

CardinalFst

Finite state transducer for verbalizing cardinal numbers.

OrdinalFst

Finite state transducer for verbalizing ordinals.

DecimalFst

Finite state transducer for verbalizing decimals, i.e. numbers with an integer and a fractional part.

FractionFst

Finite state transducer for verbalizing fractions.

RomanFst

Finite state transducer for verbalizing Roman numerals.

DateFst

Finite state transducer for verbalization of dates.

VerbatimFst

Finite state transducer for verbalizing verbatim, i.e. any leftovers after classification into semiotic classes.

ElectronicFst

Finite state transducer for verbalizing electronic addresses.

MeasureFst

Finite state transducer for verbalizing measures.

MoneyFst

Finite state transducer for verbalizing money.

TimeFst

Finite state transducer for verbalizing time.