learn_to_normalize.grammar_utils.BaseFst
- class learn_to_normalize.grammar_utils.BaseFst(name: str)[source]
Base class for text normalization rules. A wrapper around a pynini FST that implements common functions used in tokenization / verbalization.
BaseFst implements the logic for connecting a transducer to itself, for example when a semiotic class is allowed to connect to itself. Implementations of BaseFst are expected to define self._single_fst first and may then call connect_to_self() multiple times. At usage time (when merging all transducers together), refer to fst, which returns the multi or single fst depending on what is available. When reusing the fst in other semiotic classes, you probably want to access single_fst instead.
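The single/multi fallback described above can be sketched with a toy class. This is only an illustration, not the library's implementation: plain strings stand in for pynini FSTs, and the names `ToyGrammar` and `_multi_fst` are hypothetical.

```python
from typing import Optional


class ToyGrammar:
    """Hypothetical stand-in for a BaseFst subclass; strings replace pynini FSTs."""

    def __init__(self, name: str, single: str):
        self.name = name
        self._single_fst: str = single         # defined first by the subclass
        self._multi_fst: Optional[str] = None  # assumed slot filled by connect_to_self()

    @property
    def single_fst(self) -> str:
        # What other semiotic classes would typically reuse.
        return self._single_fst

    @property
    def fst(self) -> str:
        # Prefer the multi variant when it exists, else fall back to the single one.
        return self._multi_fst if self._multi_fst is not None else self._single_fst

    def connect_to_self(self, connector: str) -> None:
        # Mirrors the documented shape: fst | (fst + connector + fst)
        base = self.fst
        self._multi_fst = f"({base} | {base} {connector} {base})"


g = ToyGrammar("cardinal", "CARDINAL")
before = g.fst             # only the single fst exists so far
g.connect_to_self("-")
after = g.fst              # now the multi variant is returned
print(before)
print(after)
```

Note that `single_fst` is unchanged by `connect_to_self()`, which is why it is the safer handle when embedding one grammar inside another.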
- add_tokens(fst: pynini.FstLike) pynini.FstLike [source]
Wraps fst in curly brackets and prepends the name of the grammar. Used in tokenization/classification.
- Parameters
- fst: pynini.FstLike
fst to wrap
- Returns
- fst: pynini.FstLike
fst wrapped with grammar names
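At the string level, the effect of this wrapping can be pictured as follows. This is a rough sketch: the helper `wrap_tokens`, the field name `value`, and the exact whitespace are assumptions for illustration; the real method operates on FSTs, not strings.

```python
def wrap_tokens(grammar_name: str, body: str) -> str:
    """String-level analogue of add_tokens: prepend the grammar name, wrap body in braces."""
    return f"{grammar_name} {{ {body} }}"


# Hypothetical classified token for the cardinal "28".
wrapped = wrap_tokens("cardinal", 'value: "28"')
print(wrapped)
```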
- apply(text: str) str [source]
Helper method that applies the grammar to the input text.
- Parameters
- text: str
input string to apply transducer to
- Returns
- res: str
transduced string. For tokenize/classify grammars, this is a string parsable into protobuf. For verbalization grammars, it is the text converted into spoken form.
- connect_to_self(connector_in: Union[str, List[str]], connector_out: Union[str, List[str]], connector_spaces: str = 'any', weight: float = 1.0, to_closure: bool = False, to_closure_connector: bool = False)[source]
Helper function which connects self.fst to itself through an intermediate connector. Should be applied at the final stage of creating a classification transducer. For example, it allows connecting cardinals with a dash, i.e. “28 - 40”, which denotes a range. It changes self.fst to self.fst | (self.fst + connector + self.fst).
- Parameters
- connector_in: Union[str, List[str]]
which connector tokens to look for: either a single connector or several
- connector_out: Union[str, List[str]]
the expansion of the connector. For example, “-” in a range is expanded to “to”. If it is none, the transducer just deletes the strings from connector_in
- connector_spaces: str
defines which spaces are allowed around connector
any - there can be no spaces or any number of spaces on both the left and the right of the connector.
none_or_one - there are either no spaces around the connector or one on each side, e.g. 1:2 or 1 : 2.
none - there must not be any spaces around the connector.
- weight: float
weight to add to multi-token branch
- to_closure: bool
if True, allows multiple repetitions of (connector + fst)
- to_closure_connector: bool
if True, the closure is also applied to the connector, so multiple occurrences of the same connector between tokens are allowed
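The three connector_spaces modes can be illustrated with a regular-expression analogue. This is a sketch under stated assumptions, not how the library builds its transducer: `expand_connector`, `SPACE_PATTERNS`, and the digit-only `token` pattern are hypothetical, and `weight`, `to_closure`, and `to_closure_connector` are not modeled.

```python
import re

# Regex analogue of each documented connector_spaces mode (illustrative only).
SPACE_PATTERNS = {
    "any": lambda c: rf" *{c} *",              # zero or more spaces on each side
    "none_or_one": lambda c: rf"(?:{c}| {c} )",  # no spaces, or exactly one per side
    "none": lambda c: c,                       # no spaces allowed
}


def expand_connector(text, connector_in, connector_out,
                     connector_spaces="any", token=r"\d+"):
    """Rewrite 'token connector token' as 'token connector_out token'."""
    conn = SPACE_PATTERNS[connector_spaces](re.escape(connector_in))
    pattern = re.compile(rf"({token}){conn}({token})")
    # A function replacement avoids backslash handling in connector_out.
    return pattern.sub(
        lambda m: f"{m.group(1)} {connector_out} {m.group(2)}", text
    )


print(expand_connector("28 - 40", "-", "to"))                  # mode "any"
print(expand_connector("1 : 2", ":", "to", "none_or_one"))     # one space per side
print(expand_connector("28 - 40", "-", "to", "none"))          # spaces present, no match
```

In the "none" case the input is left unchanged because the spaces around the dash prevent a match, which corresponds to the transducer rejecting that path.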