learn_to_normalize.grammar_utils.BaseFst

class learn_to_normalize.grammar_utils.BaseFst(name: str)

Base class for text normalization rules. It wraps a pynini FST and implements common functions used in tokenization and verbalization.

BaseFst implements the logic of connecting a transducer to itself, for example when a semiotic class is allowed to connect to itself. Implementations of BaseFst are expected to define self._single_fst first and may then call connect_to_self() multiple times. At usage time (when merging all transducers together), one simply refers to fst, which returns the multi-token fst if it is available and the single fst otherwise.

When reusing the transducer inside other semiotic classes, you probably want to access single_fst instead.
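The single/multi pattern described above can be pictured with a toy stand-in. This is a minimal sketch, not the library's implementation: ToyGrammar is a hypothetical class, a regular-expression string plays the role of the pynini transducer, and the simplified connect_to_self(connector) signature is an assumption made for illustration only.

```python
import re


class ToyGrammar:
    """Toy stand-in for the single/multi pattern; NOT the real BaseFst."""

    def __init__(self, name: str):
        self.name = name
        self._single_fst = None  # pattern for a single token
        self._multi_fst = None   # pattern after connecting to itself

    @property
    def single_fst(self) -> str:
        if self._single_fst is None:
            raise ValueError("_single_fst must be defined first")
        return self._single_fst

    @property
    def fst(self) -> str:
        # Return the multi-token version when available, else the single one.
        if self._multi_fst is not None:
            return self._multi_fst
        return self.single_fst

    def connect_to_self(self, connector: str) -> None:
        # Regex analogue of: fst | (fst + connector + fst)
        current = f"(?:{self.fst})"
        sep = re.escape(connector)
        self._multi_fst = f"{current}|{current}{sep}{current}"


cardinal = ToyGrammar("cardinal")
cardinal._single_fst = r"\d+"
cardinal.connect_to_self("-")

# fst accepts both a lone number and a dash-connected pair;
# single_fst still accepts only a lone number.
print(bool(re.fullmatch(cardinal.fst, "28-40")))
print(bool(re.fullmatch(cardinal.single_fst, "28-40")))
```

Keeping single_fst separate is what lets another grammar embed one cardinal without also inheriting the range logic.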

__init__(name: str)

add_tokens(fst: pynini.FstLike) → pynini.FstLike

Wraps fst in curly brackets and prepends the name of the grammar. Used in tokenization/classification.

Parameters

fst: pynini.FstLike: fst to wrap

Returns

fst: pynini.FstLike: fst wrapped with the grammar name
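At the string level, the effect of the wrapping can be pictured as follows. This is only an illustration on plain strings: wrap_tokens is a hypothetical helper, the exact spacing around the brackets is an assumption, and the field name integer is invented for the example. The real method operates on transducers, not strings.

```python
def wrap_tokens(name: str, body: str) -> str:
    """String-level picture of the wrapping: '<name> { <body> }'.

    Hypothetical helper; the spacing around the brackets is an assumption.
    """
    return f"{name} {{ {body} }}"


# 'integer' is an invented field name used only for this example.
wrapped = wrap_tokens("cardinal", 'integer: "28"')
print(wrapped)  # cardinal { integer: "28" }
```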

apply(text: str) → str

Helper method that applies the grammar to the input text.

Parameters

text: str: input string to apply transducer to

Returns

res: str: transduced string. For tokenization/classification, the result is a string that can be parsed into protobuf. For verbalization, the result is the text converted into spoken form.

connect_to_self(connector_in: Union[str, List[str]], connector_out: Union[str, List[str]], connector_spaces: str = 'any', weight: float = 1.0, to_closure: bool = False, to_closure_connector: bool = False)

Helper function that connects self.fst to itself through an intermediate connector. It should be applied at the final stage of creating a classification transducer. For example, it allows connecting cardinals with a dash, as in “28 - 40”, which denotes a range. It changes self.fst to self.fst | (self.fst + connector + self.fst).

Parameters

connector_in: Union[str, List[str]]

which connector tokens to look for: either a single connector or several

connector_out: Union[str, List[str]]

the expansion of the connector. For example, “-” in a range is expanded to “to”. If it is none, the transducer just deletes the strings matched by connector_in

connector_spaces: str

defines which spaces are allowed around the connector:

any: zero or more spaces are allowed on both the left and the right of the connector.
none_or_one: either no spaces around the connector or exactly one on each side, e.g. 1:2 or 1 : 2.
none: no spaces are allowed around the connector.

weight: float

weight added to the multi-token branch

to_closure: bool

if True, allows multiple repetitions of (connector + fst)

to_closure_connector: bool

if True, closure is also applied to the connector, so multiple occurrences of the same connector between tokens are allowed
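The interaction of connector_spaces and to_closure can be sketched with regular expressions. This is a rough analogue under stated assumptions, not the library's code: connect_pattern is a hypothetical helper, it takes a single connector and a regex for one token, and it ignores connector_out, weight, and to_closure_connector.

```python
import re


def connect_pattern(token, connector, connector_spaces="any", to_closure=False):
    """Regex analogue of fst | (fst + connector + fst); hypothetical helper."""
    sep = re.escape(connector)
    if connector_spaces == "any":
        # Zero or more spaces on each side of the connector.
        joint = f" *{sep} *"
    elif connector_spaces == "none_or_one":
        # Either no spaces at all, or exactly one space on each side.
        joint = f"(?:{sep}| {sep} )"
    elif connector_spaces == "none":
        joint = sep
    else:
        raise ValueError(f"unknown connector_spaces: {connector_spaces}")
    # to_closure allows the (connector + token) part to repeat.
    quantifier = "*" if to_closure else "?"
    return re.compile(f"(?:{token})(?:{joint}(?:{token})){quantifier}")


number = r"\d+"
range_any = connect_pattern(number, "-", "any")
ratio = connect_pattern(number, ":", "none_or_one")
chain = connect_pattern(number, "-", "none", to_closure=True)

print(bool(range_any.fullmatch("28 - 40")))
print(bool(ratio.fullmatch("1 :2")))
print(bool(chain.fullmatch("1-2-3")))
```

In this sketch, none_or_one rejects asymmetric spacing such as "1 :2", following the description of one space from each side.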

delete_tokens(fst: pynini.FstLike) → pynini.FstLike

Removes the grammar name from the string passed for verbalization.

Parameters

fst: pynini.FstLike: fst to remove grammar name from

Returns

fst: pynini.FstLike: fst without grammar name and trailing straight slash