learn_to_normalize.evaluation.google_data.GoogleDataIterator
- class learn_to_normalize.evaluation.google_data.GoogleDataIterator(location: str, subset: str = 'test', n_utterances: int = -1)
Data iterator over the Google text normalization data (https://www.kaggle.com/datasets/richardwilliamsproat/text-normalization-for-english-russian-and-polish). The unpacked data consists of multiple text files with one token per line, for example:
PLAIN Brillantaisia <self>
PLAIN is <self>
PLAIN a <self>
PLAIN genus <self>
PLAIN of <self>
PLAIN plant <self>
PLAIN in <self>
PLAIN family <self>
PLAIN Acanthaceae <self>
PUNCT . sil
<eos> <eos>
The data iterator parses these files and composes pairs of unnormalized/normalized utterances. In doing so it has to handle punctuation marks and spelling.
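The parsing step described above can be illustrated with a minimal sketch. It is not the library's implementation: the function name `iter_utterance_pairs` is hypothetical, and the sketch assumes that the three fields on each line are tab-separated, that `<self>` means the token is unchanged, that `sil` means the token is not spoken, and that an `<eos>` line closes the utterance. Tokens are joined with single spaces, so punctuation placement and spelling are not handled here.

```python
from typing import Iterable, Iterator, List, Tuple


def iter_utterance_pairs(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (unnormalized, normalized) utterance pairs from token lines.

    Hypothetical helper; assumes tab-separated fields:
    semiotic class, written token, normalized form.
    """
    written: List[str] = []
    spoken: List[str] = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "<eos>":
            # End of utterance: emit what has been collected so far.
            if written:
                yield " ".join(written), " ".join(spoken)
            written, spoken = [], []
            continue
        if len(fields) != 3:
            continue  # skip empty or malformed lines
        _semiotic_class, token, normalized = fields
        written.append(token)
        if normalized == "<self>":
            spoken.append(token)       # token is read as written
        elif normalized != "sil":
            spoken.append(normalized)  # token has an explicit spoken form
        # "sil" tokens (assumed: punctuation) add nothing to the spoken side
    if written:
        # Emit a trailing utterance that has no closing <eos> line.
        yield " ".join(written), " ".join(spoken)
```

Applied to the example lines above, the sketch yields a single pair whose normalized side omits the final period, while the unnormalized side keeps it as a separate space-delimited token.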
- __init__(location: str, subset: str = 'test', n_utterances: int = -1)
Constructor of the Google data iterator.
- Parameters
- location: str
Directory with the data, e.g. the downloaded and unpacked https://storage.googleapis.com/kaggle-data-sets/869240/1481083/compressed/en_with_types.tgz.zip
- subset: str
Subset of the data to iterate over. Supported values:
- test - the conventional test set of the Google dataset. For English it is the first 100002 tokens of output-00099-of-00100.
- all - iterate over all the data.
- ADDRESS, CARDINAL, … - select utterances in which the specific semiotic class is present.
- n_utterances: int
Number of utterances to read from the subset.
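How the `subset` and `n_utterances` parameters might interact can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `select_utterances` and the `Token` tuple are hypothetical, a negative `n_utterances` (such as the default -1) is assumed to mean "no limit", and the file-based `test` subset is left out because it depends on the data files themselves.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# Hypothetical token representation: (semiotic class, written token, normalized form)
Token = Tuple[str, str, str]


def select_utterances(
    utterances: Iterable[List[Token]],
    subset: str = "all",
    n_utterances: int = -1,
) -> Iterator[List[Token]]:
    """Filter utterances by semiotic class and cap how many are returned."""
    if subset == "all":
        selected = iter(utterances)
    else:
        # Keep utterances in which the requested semiotic class is present.
        selected = (
            utt for utt in utterances
            if any(semiotic_class == subset for semiotic_class, _, _ in utt)
        )
    # Assumption: a negative value means "read everything".
    limit = None if n_utterances < 0 else n_utterances
    return islice(selected, limit)


utterances = [
    [("PLAIN", "In", "<self>"), ("DATE", "2014", "twenty fourteen")],
    [("PLAIN", "a", "<self>")],
    [("CARDINAL", "3", "three")],
    [("DATE", "1999", "nineteen ninety nine")],
]
# Read at most one utterance that contains a DATE token.
first_date = list(select_utterances(utterances, subset="DATE", n_utterances=1))
```

Filtering lazily before applying the cap means that `n_utterances` counts utterances that matched the subset, not utterances scanned.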