Usage

In order to build a text normalization addon:

  1. get the repo

git clone git@github.com:balacoon/learn_to_normalize.git
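
If you already know which grammars you need, the submodules from step 3 can also be fetched at clone time; this relies on a standard git flag and is not specific to this project:

# optional: clone and initialize all grammar submodules in one go
# (this pulls every locale, not just the one you need)
git clone --recurse-submodules git@github.com:balacoon/learn_to_normalize.git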
  2. build the docker image that manages all the dependencies

# if "build-tn" is specified, text_normalization
# is built from sources. You need special access for it
# which you likely dont have.
bash docker/build.sh [--build-tn]
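
To confirm the build succeeded, you can list local docker images; this is a generic docker command, and the actual image name and tag are whatever docker/build.sh assigns:

# generic sanity check: the freshly built image should appear in the list
docker images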
  3. get the text normalization rules. Adjust them if needed, but don't forget to share changes as a contribution.

# text normalization rules are stored as submodules; pick the one you need
# from the grammars dir
git submodule update --init grammars/en_us_normalization/
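
To see which locales are available before initializing one, list the registered submodules with standard git; nothing here is specific to this tool:

# list the registered grammar submodules to see which locales are available
git submodule status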
  4. launch the docker container and run addon creation. This simply compiles the text normalization rules and packs them.

# the script is just a simple shortcut to start the container. Adjust it
# if needed
bash docker/run.sh
# create addon
learn_to_normalize --locale en_us --work-dir work_dir \
    --resources grammars/en_us_normalization/production/ \
    --out en_us_normalization.addon
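
After the command finishes, the packed addon should sit at the path given to --out; checking with ls is just a generic way to confirm the artifact exists:

# confirm the packed addon was produced
ls -lh en_us_normalization.addon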
  5. learn_to_normalize contains interactive demos for debugging and for showcasing how to use the obtained artifacts.

# execute a single grammar to debug it
demo_grammar --grammars grammars/en_us_normalization/production/ --module classify.time --name TimeFst
# use the packed addon built in the previous step
demo_normalize --addon en_us_normalization.addon
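
As a rough illustration of what the time grammar exercised above is meant to handle (the prompt format and the exact spoken form are assumptions; they depend on the demo and on the rules in the grammars repo):

# hypothetical demo_grammar session (input/output format is assumed):
#   input:  3:30pm
#   output: three thirty p m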
  6. finding flaws in the rules, checking stability, and evaluating the performance of the built rule set is an essential next step:

learn_to_normalize.evaluation

Text normalization is a complex, non-deterministic task with a long tail of errors.