- class en_us_normalization.production.classify.AbbreviationFst
Grammar for classifying abbreviations, e.g.:
F.B.I. -> word: “FBI”
f.b.i. -> word: “FBI”
AI -> word: “AI”
The convention is that text normalization generally returns normalized text in lower case, but words that need to be spelled (pronounced letter by letter) are written in upper case. Abbreviations need no extra verbalization (apart from custom pronunciation generation), so after classification they are marked as regular words and are not passed on for verbalization.
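As a rough illustration of this convention (not the finite-state grammar itself), the dotted examples above can be reproduced with a plain regular expression. The helper name `classify_dotted` and its return convention are hypothetical and exist only for this sketch.

```python
import re

# One or more single letters, each followed by a dot: "f.", "F.B.I.", "f.b.i."
DOTTED = re.compile(r"(?:[A-Za-z]\.)+")


def classify_dotted(token: str):
    """Return the spelled, upper-case form of a dotted abbreviation, else None.

    Hypothetical helper: the documented class is a grammar, not a regex.
    """
    if DOTTED.fullmatch(token):
        # Drop the dots and upper-case the letters to mark the word as spelled.
        return token.replace(".", "").upper()
    return None


print(classify_dotted("F.B.I."))  # FBI
print(classify_dotted("f.b.i."))  # FBI
print(classify_dotted("hello"))   # None
```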
Rules to detect abbreviations:
Classic abbreviation - letters separated by dots, in upper or lower case, starting from a single letter, for example “f.” or “F.B.I.”
Consonants abbreviation - word contains only consonants (except “y”). Such a word can’t be pronounced and should be spelled.
Vowels abbreviation - word contains only vowels. This rule is more cautious than the consonant rule: it affects only sequences of 3 letters, and sequences of 2 letters if those are upper case. This is done to keep “a”, “i”, “oi” as is.
Acronyms - something that may look like an abbreviation but is actually pronounced as a regular word, for example “NATO” or “NASA”. These are essentially exceptions (anti-abbreviations). They are recognized using the list from abbreviations/acronyms.tsv
Cased abbreviations - some abbreviations only make sense in a specific case. For example, “US” is a country, while “us” is a regular word. Those are recognized using the list from abbreviations/abbreviations_cased.tsv
Abbreviations - some abbreviations are case-independent and should be recognized as abbreviations in any case, for example “usa”. Those are recognized using the list from abbreviations/abbreviations.tsv
Unpronounceable sequences - some sequences of letters are simply unpronounceable, and they indicate that the whole word should be spelled. The list of letter n-grams that can’t be pronounced is in abbreviations/ngrams.tsv. This rule is only applied to upper-case words.
Ampersand abbreviation - a word with upper-case letters and “&” in the middle is an abbreviation, for example “AT&T”
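The list-based and letter-based rules above can be approximated in plain Python. This is only a sketch under several assumptions: the order in which rules are checked, the case-insensitive acronym lookup, and the reading of the vowel rule (exactly 3 letters in any case, or 2 upper-case letters) are guesses, and the small sets stand in for the .tsv files, whose real contents are not shown here. The n-gram entry in particular is a made-up placeholder. The classic dotted form is left out of this sketch.

```python
import re

VOWELS = set("aeiou")
CONSONANTS = set("bcdfghjklmnpqrstvwxz")  # "y" is deliberately left out

# Stand-ins for the data files (assumed contents, for illustration only).
ACRONYMS = {"nato", "nasa"}   # abbreviations/acronyms.tsv
CASED = {"US"}                # abbreviations/abbreviations_cased.tsv
UNCASED = {"usa"}             # abbreviations/abbreviations.tsv
BAD_NGRAMS = {"xq"}           # abbreviations/ngrams.tsv (placeholder entry)

AMPERSAND = re.compile(r"[A-Z]+&[A-Z]+")


def is_abbreviation(word: str) -> bool:
    """Hypothetical check: should `word` be spelled letter by letter?"""
    lower = word.lower()
    if lower in ACRONYMS:                    # anti-abbreviations win first
        return False
    if word in CASED or lower in UNCASED:    # list-based rules
        return True
    if AMPERSAND.fullmatch(word):            # e.g. "AT&T"
        return True
    if not word.isalpha():
        return False
    letters = set(lower)
    if letters <= CONSONANTS:                # consonants only
        return True
    if letters <= VOWELS:                    # vowels only, cautious variant
        return len(word) == 3 or (len(word) == 2 and word.isupper())
    # Unpronounceable n-grams apply to upper-case words only.
    return word.isupper() and any(ng in lower for ng in BAD_NGRAMS)


for w in ["US", "us", "usa", "AI", "oi", "AT&T", "NATO", "tv"]:
    print(w, is_abbreviation(w))
```

A word for which the check returns true would then be emitted in upper case as a regular word, matching the convention described earlier.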