en_us_normalization.production.classify.RomanFst

class en_us_normalization.production.classify.RomanFst(cardinal: Optional[CardinalFst] = None)

Finite state transducer for classifying Roman numerals (III, IV, etc.). To convert Roman numerals, mappings from data files are used:

  • roman/digit_teen.tsv - contains the mapping for numbers from 1 to 49.

  • roman/ties.tsv - contains the mapping for tens, i.e. 50, 60, …

  • roman/hunderds.tsv - contains the mapping for hundreds, i.e. 100, 200, …
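As a point of reference for what these mappings encode, the value of a Roman numeral can be computed in plain Python. This is a minimal sketch, not the table-driven FST approach the class uses; `ROMAN_VALUES` and `roman_to_int` are hypothetical names introduced for illustration, and no validation of malformed numerals is attempted.

```python
# Illustrative only: plain-Python conversion, not the FST-based mapping used by RomanFst.
ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral such as "XIX" to an integer using the subtractive rule."""
    total = 0
    previous = 0
    # Walk right to left: a symbol smaller than the largest one seen so far is subtracted.
    for symbol in reversed(numeral.upper()):
        value = ROMAN_VALUES[symbol]
        if value < previous:
            total -= value
        else:
            total += value
            previous = value
    return total
```

For example, `roman_to_int("IV")` evaluates to 4 and `roman_to_int("XIX")` to 19.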

The Roman transducer reuses the cardinal transducer to accept digits. Depending on the context, specifically the preceding word, it should be possible to determine whether the Roman numeral is cardinal or ordinal.

  • roman/cardinal_prefixes.tsv - contains cardinal prefixes, such as “Chapter”

  • roman/ordinal_prefixes.tsv - contains ordinal prefixes, such as “George”
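The prefix lookup can be pictured as a simple set membership test. The sketch below is an assumption about the general idea only: the two sets stand in for the TSV files and contain just the examples named above, and `numeral_kind` is a hypothetical helper, not part of the library. Case-insensitive matching is also an assumption.

```python
from typing import Optional

# Hypothetical in-memory stand-ins for the two TSV files; the real files hold more entries.
CARDINAL_PREFIXES = {"chapter"}
ORDINAL_PREFIXES = {"george"}


def numeral_kind(preceding_word: str) -> Optional[str]:
    """Return "cardinal", "ordinal", or None when the preceding word is not a known prefix."""
    word = preceding_word.lower()
    if word in CARDINAL_PREFIXES:
        return "cardinal"
    if word in ORDINAL_PREFIXES:
        return "ordinal"
    return None
```

With these stand-in sets, `numeral_kind("CHAPTER")` gives `"cardinal"`, `numeral_kind("George")` gives `"ordinal"`, and an unknown word gives `None`, which corresponds to the standalone case.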

When a Roman numeral has no known prefix, i.e. it is a standalone Roman numeral, it should be treated carefully. Typical mistakes:

  • a Roman numeral can be confused with an abbreviation

  • a Roman numeral may consist of a single character, such as “I”

  • “XXX” - denotes pornographic material, should have a bigger weight
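One way to picture this careful treatment is as a penalty added to risky standalone readings, assuming (as is common in weighted FST grammars) that a larger weight makes a reading less preferred. The sketch below is purely illustrative: the constants are arbitrary, `standalone_weight` is a hypothetical helper, and the library's actual weights are not stated in this documentation.

```python
# Arbitrary illustrative penalties; the actual weights used by RomanFst are not documented here.
BASE_WEIGHT = 1.0
SINGLE_CHAR_PENALTY = 5.0
XXX_PENALTY = 10.0


def standalone_weight(token: str) -> float:
    """Return a weight for reading a prefix-less token as a Roman numeral."""
    weight = BASE_WEIGHT
    if len(token) == 1:
        # Single characters such as "I" are easily mistaken for something else.
        weight += SINGLE_CHAR_PENALTY
    if token.upper() == "XXX":
        # "XXX" is more often a content label than the number 30.
        weight += XXX_PENALTY
    return weight
```

Under these assumed constants, "XXX" and "I" receive larger weights than an ordinary numeral such as "XIV".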

Examples of transducer input/output:

  • IV -> roman { cardinal { count: “4” } }

  • George I -> roman { prefix: “george” ordinal { order: “1” } }

  • CHAPTER XIX -> roman { prefix: “chapter” cardinal { count: “19” } }
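The tagged output format can be reproduced with a small string formatter. This is a sketch of the serialization shape only, inferred from the examples; `format_roman_token` is a hypothetical helper, not a library function, and it uses straight quotes where the rendered examples show typographic ones.

```python
from typing import Optional


def format_roman_token(kind: str, value: int, prefix: Optional[str] = None) -> str:
    """Render the tagged form shown in the examples; kind is "cardinal" or "ordinal"."""
    # Cardinals carry a "count" field, ordinals an "order" field.
    field = {"cardinal": "count", "ordinal": "order"}[kind]
    # The prefix, when present, is lowercased as in the examples.
    prefix_part = f'prefix: "{prefix.lower()}" ' if prefix else ""
    return f'roman {{ {prefix_part}{kind} {{ {field}: "{value}" }} }}'
```

Calling `format_roman_token("cardinal", 4)` yields `roman { cardinal { count: "4" } }`, and `format_roman_token("ordinal", 1, prefix="George")` yields `roman { prefix: "george" ordinal { order: "1" } }`.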

__init__(cardinal: Optional[CardinalFst] = None)

Constructor for the Roman numeral transducer.

Parameters
cardinal: Optional[CardinalFst]

Transducer for cardinal numbers to reuse. If not provided, it will be created from scratch.