Usage
In order to build text normalization addon:
get the repo
git clone git@github.com:balacoon/learn_to_normalize.git
build docker that manages all the dependencies
# if "build-tn" is specified, text_normalization
# is built from sources. You need special access for it
# which you likely dont have.
bash docker/build.sh [--build-tn]
get text normalization rules. Adjust those if needed, but don’t forget to share changes as a contribution.
# text normalization rules are stored as submodules, pick one you need
# from grammars dir
git submodule update --init grammars/en_us_normalization/
launch docker and execute addon creation. This will just compile text normalization rules and pack them.
# script is really simple shortcut to start container. Adjust it
# if needed
bash docker/run.sh
# create addon
learn_to_normalize --locale en_us --work-dir work_dir \
--resources grammars/en_us_normalization/production/ \
--out en_us_normalization.addon
learn_to_normalize contains interactive demos for debugging and to showcase how to use obtained artifacts.
# executing single grammar to debug it
demo_grammar --grammars grammars/en_us_normalization/production/ --module classify.time --name TimeFst
# using packed addon
demo_normalize --addon work_dir/normalization.addon
finding flaws in rules, checking stability and evaluating performance of built rule-set is essential next step:
Text normalization is a complex non-determinstic task with long tail of errors. |