en_us_normalization.production.classify.WordFst

class en_us_normalization.production.classify.WordFst

Finite state transducer for classifying words: tokens that need no verbalization, i.e. they are already normalized and contain only letters known to the pronunciation dictionary. Regular words are meant to be pronounced, so a token classified as a regular word is converted to lower case.

Additionally, the word transducer normalizes Unicode letters such as “é”. The Unicode characters and their mappings are stored in “unicode_chars.tsv”.
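The character mapping described above can be illustrated in plain Python. This is only a sketch of the behavior, not the transducer itself: it assumes the file has two tab-separated columns (source character, replacement), and `SAMPLE_TSV`, `load_char_map` and `normalize_letters` are hypothetical names with made-up sample rows, since the real contents of “unicode_chars.tsv” are not shown here.

```python
import csv
import io

# Hypothetical excerpt standing in for "unicode_chars.tsv".
# Assumed layout: <unicode character> TAB <replacement>, one pair per line.
SAMPLE_TSV = "\u00e9\te\n\u00f1\tn\n\u00fc\tu\n"


def load_char_map(tsv_text):
    """Parse two-column, tab-separated text into a {source: target} dict."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def normalize_letters(word, char_map):
    """Replace every mapped character; leave all other characters unchanged."""
    return "".join(char_map.get(ch, ch) for ch in word)


CHAR_MAP = load_char_map(SAMPLE_TSV)
```

For example, `normalize_letters("caf\u00e9", CHAR_MAP)` replaces the “é” with its mapped letter, while a word without mapped characters passes through unchanged.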

Finally, the word transducer has to handle apostrophes. An apostrophe inside a word is fine, but at the beginning or end of a word it can be confused with a single quotation mark. There are a few cases when an apostrophe on a word boundary is justified:

  • It’s a shortened version of a word, for example “‘em” for “them”.

  • The apostrophe indicates possession, for example “Thomas’ watch”.
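The two boundary cases above can be approximated with a small Python check. This is a rough sketch under stated assumptions rather than the transducer's actual logic: `SHORTENED_FORMS` is a hypothetical list, and treating a trailing apostrophe as possessive only after the letter “s” is a simplification chosen for illustration.

```python
# Hypothetical list of shortened forms; the list used by the transducer
# is not shown in this documentation.
SHORTENED_FORMS = {"'em", "'til", "'cause"}


def keeps_boundary_apostrophe(token):
    """Return True if an apostrophe at the token edge is read as part of the word."""
    # Treat typographic single quotes like the plain apostrophe.
    t = token.replace("\u2019", "'").replace("\u2018", "'").lower()
    if t.startswith("'") and t.endswith("'"):
        # Wrapped on both sides: read as quotation marks in this sketch.
        return False
    if t.startswith("'"):
        # Leading apostrophe: accepted only for known shortened forms.
        return t in SHORTENED_FORMS
    if t.endswith("'"):
        # Trailing apostrophe: accepted as possession after a final "s" (assumption).
        return len(t) > 2 and t[-2] == "s"
    # No apostrophe on either boundary, e.g. "don't" or "sleep".
    return True
```

With these assumptions, “‘em” and “Thomas’” keep their apostrophe, while a token such as “‘hello” would have it read as an opening quotation mark.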

Examples of input/output strings:

  • sleep -> name: “sleep”

  • don’t -> name: “don’t”

  • Hello -> name: “hello”
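The input/output pairs above can be mimicked with a short plain-Python function. It is a sketch of the observable behavior only, not how the transducer is built: `classify_word` and `WORD_PATTERN` are hypothetical names, and the set of letters known to the pronunciation dictionary is assumed here to be ASCII a-z.

```python
import re

# Assumption: known letters are a-z; apostrophes are allowed only between letters.
WORD_PATTERN = re.compile(r"[a-z]+(?:['\u2019][a-z]+)*")


def classify_word(token):
    """Return 'name: "<lowercased token>"' for a regular word, else None."""
    lowered = token.lower()
    if WORD_PATTERN.fullmatch(lowered) is None:
        # Not a regular word under the assumptions of this sketch.
        return None
    return f'name: "{lowered}"'
```

Calling `classify_word("Hello")` yields the string `name: "hello"`, matching the third example, and a token containing digits such as `"R2D2"` returns `None`.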

__init__()