en_us_normalization.production
Text normalization rules for English are adapted from NVIDIA: https://github.com/NVIDIA/NeMo/tree/main/nemo_text_processing/text_normalization/en
Classification
The first step splits the text into tokens and classifies each token into a semiotic class. This is one large FST that converts the input text into a tagged sequence:
“12/04/15 at 3:30pm!”
tokens { left_punct: “”” date { month: “december” day: “4” year: “15” } } tokens { name: “at” } tokens { time { hours: “3” minutes: “30” suffix: “PM” } right_punct: “!”” }
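The tagging step can be illustrated with a small regex-based sketch. It is not the FST used by this package: the patterns, the month table and the function names are all hypothetical, and the result is returned as Python dictionaries rather than as the serialized tagged string shown above.

```python
import re

# Hypothetical month table for US-style MM/DD/YY dates (an assumption of this sketch).
MONTHS = ["january", "february", "march", "april", "may", "june", "july",
          "august", "september", "october", "november", "december"]

PUNCT = "\"!?,."  # characters treated as leading/trailing punctuation
DATE = re.compile(r"(0?[1-9]|1[0-2])/(\d{1,2})/(\d{2})")
TIME = re.compile(r"(\d{1,2}):(\d{2})(am|pm)", re.IGNORECASE)


def classify_token(token: str) -> dict:
    """Tag one whitespace-separated token with a semiotic class."""
    # Separate punctuation on both sides from the core of the token.
    body = token.lstrip(PUNCT)
    left = token[: len(token) - len(body)]
    core = body.rstrip(PUNCT)
    right = body[len(core):]

    tagged = {}
    if left:
        tagged["left_punct"] = left
    date = DATE.fullmatch(core)
    time = TIME.fullmatch(core)
    if date:
        month, day, year = date.groups()
        tagged["date"] = {"month": MONTHS[int(month) - 1],
                          "day": str(int(day)), "year": year}
    elif time:
        hours, minutes, suffix = time.groups()
        tagged["time"] = {"hours": str(int(hours)), "minutes": minutes,
                          "suffix": suffix.upper()}
    elif core:
        tagged["name"] = core  # plain word, needs no normalization
    if right:
        tagged["right_punct"] = right
    return tagged


def classify(text: str) -> list:
    """Split on whitespace and tag every token."""
    return [classify_token(token) for token in text.split()]


for tagged_token in classify('"12/04/15 at 3:30pm!"'):
    print(tagged_token)
```

A real grammar resolves ambiguity between classes with weights inside one composed FST; the if/elif chain here only mimics that priority ordering.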
Tokenize and classify rules
Rules to split the input string into tokens and classify them into semiotic classes. The rules parse each token into the predefined fields of its semiotic class and emit them in a predefined format. That format can be parsed into a protobuf structure and passed on for verbalization, i.e. conversion into spoken form.
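To show why a fixed tagged format is convenient, here is a minimal sketch of a parser that turns a tagged sequence into nested Python dictionaries. It is an illustration only: the package parses into a protobuf structure, whereas `parse_tagged` is a hypothetical helper that assumes straight double quotes, well-formed input, and no escape handling.

```python
import re

# One regex alternative per lexical element: quoted value, brace, field name.
TOKEN = re.compile(r'"((?:[^"\\]|\\.)*)"|([{}])|([A-Za-z_]+):?')


def parse_tagged(text: str) -> list:
    """Parse 'tokens { ... } tokens { ... }' into a list of nested dicts."""
    pieces = list(TOKEN.finditer(text))
    pos = 0

    def parse_block() -> dict:
        # Reads 'key: "value"' and 'key { ... }' entries up to a closing brace.
        nonlocal pos
        fields = {}
        while pos < len(pieces) and pieces[pos].group(2) != "}":
            key = pieces[pos].group(3)
            nxt = pieces[pos + 1]
            pos += 2
            if nxt.group(2) == "{":
                fields[key] = parse_block()
                pos += 1  # skip the closing brace of the nested block
            else:
                fields[key] = nxt.group(1)  # quoted value, quotes removed
        return fields

    result = []
    while pos < len(pieces):
        pos += 2  # skip the 'tokens' keyword and its opening brace
        result.append(parse_block())
        pos += 1  # skip the closing brace
    return result


tagged = ('tokens { date { month: "december" day: "4" year: "15" } } '
          'tokens { name: "at" }')
print(parse_tagged(tagged))
```

The top level is kept as a list because every entry shares the same `tokens` key and the order of tokens matters for verbalization.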
Combining everything together into a single FST:
Final class that composes all other classification grammars.
Acceptor for words that don't require normalization:
Finite state transducer for classifying words, i.e. anything that doesn't need verbalization because it is already normalized and contains only letters known to the pronunciation dictionary.
Rules for classification of different semiotic classes:
Grammar for classifying abbreviations.
Finite state transducer for classifying addresses.
Finite state transducer for classifying cardinals, i.e. numbers expressing an amount.
Finite state transducer for classifying dates.
Finite state transducer for classifying decimals, i.e. numbers with a fractional part.
Finite state transducer for classifying electronic tokens such as URLs and email addresses.
Finite state transducer for classifying fractions.
Finite state transducer for classifying measures. It is suppletive-aware, e.g. 12kg -> 12 kilograms, but 1kg -> 1 kilogram.
Finite state transducer for classifying money, also suppletive-aware.
Finite state transducer for classifying ordinals, i.e. cardinals with a suffix. In English there are just four suffixes to take care of.
Finite state transducer for classifying Roman numerals (III, IV, etc.).
Finite state transducer for discovering shortenings, such as Mrs.
Finite state transducer for classifying telephone numbers.
Finite state transducer for classifying time.
Finite state transducer for classifying verbatims, i.e. anything that has extra symbols and doesn't match the available semiotic classes.
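The suppletive-aware behaviour described for measures (1kg -> 1 kilogram, 12kg -> 12 kilograms) can be sketched as follows. The unit table and `expand_measure` are hypothetical and only illustrate the idea of picking the singular or plural unit name from the amount; the actual grammar expresses this as FST rules.

```python
import re
from typing import Optional

# Hypothetical unit table: abbreviation -> (singular, plural).
# "ft" shows why a lookup is needed: the plural is not formed by adding "s".
UNITS = {
    "kg": ("kilogram", "kilograms"),
    "km": ("kilometer", "kilometers"),
    "ft": ("foot", "feet"),
}

MEASURE = re.compile(r"(\d+)\s?([a-z]+)")


def expand_measure(text: str) -> Optional[str]:
    """Expand the unit abbreviation, choosing the form that agrees with the amount."""
    match = MEASURE.fullmatch(text)
    if match is None or match.group(2) not in UNITS:
        return None  # not a measure this sketch knows about
    amount, unit = match.groups()
    singular, plural = UNITS[unit]
    return f"{amount} {singular if int(amount) == 1 else plural}"


print(expand_measure("12kg"))
print(expand_measure("1kg"))
print(expand_measure("3ft"))
```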
Sometimes tokens are separated not by whitespace but by special connectors. This requires multi-token FSTs:
Tagging with multi-token rules
In some contexts, semiotic classes are connected by symbols that need to be read out loud. For example, “5 x 3” is “five times three”, not “five eks three”. At verbalization, tagged tokens are processed separately, but at classification a single multi-token FST is needed.
Attached tokens deals with multi-token strings that use a dash as a separator or have no separator at all.
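A minimal sketch of the multi-token idea, assuming a hypothetical connector table and a hypothetical `integer` field for cardinals: one pattern matches the whole expression, but the result is still a sequence of separately tagged tokens, with the connector already rewritten as a regular word.

```python
import re
from typing import List, Optional

# Hypothetical connector table: symbol -> spoken form.
CONNECTORS = {"x": "times", "+": "plus", "=": "equals"}

# One pattern spans all three tokens, with or without surrounding spaces.
EXPRESSION = re.compile(r"(\d+)\s*([x+=])\s*(\d+)")


def tag_multi_token(text: str) -> Optional[List[dict]]:
    """Tag 'number connector number' as three tokens in a single pass."""
    match = EXPRESSION.fullmatch(text)
    if match is None:
        return None
    left, symbol, right = match.groups()
    return [
        {"cardinal": {"integer": left}},
        {"name": CONNECTORS[symbol]},  # connector becomes an ordinary word
        {"cardinal": {"integer": right}},
    ]


print(tag_multi_token("5 x 3"))
print(tag_multi_token("5x3"))
```

Because the connector is only rewritten when it sits between two numbers, a standalone "x" is left for the other classifiers.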
Verbalization
Tagged tokens are parsed with text_normalization, and the semiotic classes are passed on for verbalization. During verbalization, serialized tokens are converted to spoken form:
date|month:december|day:4|year:15|
december fourth fifteen
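The conversion above can be reproduced with a short sketch. The helper names are hypothetical, number spelling is limited to 0..99, and the rule "day as ordinal, year as cardinal" is an assumption inferred from the example rather than the package's documented behaviour.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
        "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
        "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]
IRREGULAR_ORDINALS = {"one": "first", "two": "second", "three": "third",
                      "five": "fifth", "eight": "eighth", "nine": "ninth",
                      "twelve": "twelfth"}


def cardinal(n: int) -> str:
    """Spell out an integer in the range 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + (" " + ONES[ones] if ones else "")


def ordinal(n: int) -> str:
    """Turn the last word of the cardinal into its ordinal form."""
    words = cardinal(n).split()
    last = words[-1]
    if last in IRREGULAR_ORDINALS:
        words[-1] = IRREGULAR_ORDINALS[last]
    elif last.endswith("y"):
        words[-1] = last[:-1] + "ieth"  # twenty -> twentieth
    else:
        words[-1] = last + "th"
    return " ".join(words)


def parse_serialized(token: str):
    """Split 'class|key:value|...|' into the class name and a field dict."""
    semiotic_class, *pairs = token.strip("|").split("|")
    return semiotic_class, dict(pair.split(":", 1) for pair in pairs)


def verbalize_date(fields: dict) -> str:
    # Field names are dropped; only the spoken values remain.
    return " ".join([fields["month"],
                     ordinal(int(fields["day"])),
                     cardinal(int(fields["year"]))])


semiotic_class, fields = parse_serialized("date|month:december|day:4|year:15|")
print(semiotic_class, "->", verbalize_date(fields))  # date -> december fourth fifteen
```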
Verbalization rules
Rules to convert semiotic classes into spoken form. Regular words are not passed for verbalization. Each semiotic class has a predefined set of fields that its verbalization grammar must handle. Verbalization grammars drop the field names and transduce the field values into words.
Combining everything together into a single FST:
Final class that composes all other verbalization grammars.
Rules for verbalization of different semiotic classes:
Finite state transducer for verbalizing cardinal numbers.
Finite state transducer for verbalizing ordinals.
Finite state transducer for verbalizing decimals, i.e. numbers with an integer and a fractional part.
Finite state transducer for verbalizing fractions.
Finite state transducer for verbalizing Roman numerals.
Finite state transducer for verbalizing dates.
Finite state transducer for verbalizing verbatims, i.e. any leftovers after classification into semiotic classes.
Finite state transducer for verbalizing electronic addresses.
Finite state transducer for verbalizing measures.
Finite state transducer for verbalizing money.
Finite state transducer for verbalizing time.
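How the per-class verbalizers might be combined can be sketched with a dispatch table. This is only an analogy for the final composed grammar: the registry, the toy verbalizers and the `integer` and `name` field names are hypothetical, and the cardinal rule handles single digits only.

```python
from typing import Callable, Dict, List

DIGIT_WORDS = ["zero", "one", "two", "three", "four",
               "five", "six", "seven", "eight", "nine"]


def verbalize_cardinal(fields: Dict[str, str]) -> str:
    # Toy rule: only single digits are supported in this sketch.
    return DIGIT_WORDS[int(fields["integer"])]


def verbalize_verbatim(fields: Dict[str, str]) -> str:
    # Toy rule: spell the leftover string character by character.
    return " ".join(fields["name"])


# Hypothetical registry standing in for the union of all class verbalizers.
VERBALIZERS: Dict[str, Callable[[Dict[str, str]], str]] = {
    "cardinal": verbalize_cardinal,
    "verbatim": verbalize_verbatim,
}


def verbalize(tokens: List[dict]) -> str:
    """Verbalize tagged tokens one by one and join the spoken forms."""
    words = []
    for token in tokens:
        for key, value in token.items():
            if key == "name":
                words.append(value)  # regular word: passed through unchanged
            elif key in VERBALIZERS:
                words.append(VERBALIZERS[key](value))
            # Punctuation fields and unknown classes are skipped here.
    return " ".join(words)


tagged = [{"cardinal": {"integer": "5"}}, {"name": "times"},
          {"cardinal": {"integer": "3"}}]
print(verbalize(tagged))  # five times three
```

Each token is verbalized independently, which matches the earlier remark that multi-token handling is only needed at classification time.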