learn_to_normalize.evaluation.google_data.GoogleDataIterator

class learn_to_normalize.evaluation.google_data.GoogleDataIterator(location: str, subset: str = 'test', n_utterances: int = -1)

Data iterator over the Google text normalization data (https://www.kaggle.com/datasets/richardwilliamsproat/text-normalization-for-english-russian-and-polish). The unpacked data consists of multiple text files with one token per line, in the following format:

PLAIN   Brillantaisia   <self>
PLAIN   is      <self>
PLAIN   a       <self>
PLAIN   genus   <self>
PLAIN   of      <self>
PLAIN   plant   <self>
PLAIN   in      <self>
PLAIN   family  <self>
PLAIN   Acanthaceae     <self>
PUNCT   .       sil
<eos>   <eos>

The data iterator parses these files and composes pairs of unnormalized/normalized utterances, handling punctuation marks and spelling along the way.
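The parsing step described above can be illustrated with a minimal sketch. It is not the library's implementation: the function name `iter_utterances` is hypothetical, the columns are assumed to be tab-separated, `<self>` is assumed to mean "spoken form equals the written token", and `sil` is assumed to mean "not spoken". Tokens are joined with single spaces, so punctuation spacing and spelled-out tokens are not handled the way the real iterator may handle them.

```python
from typing import Iterable, Iterator, List, Tuple

# (unnormalized text, normalized text, semiotic classes of the tokens)
Utterance = Tuple[str, str, List[str]]


def iter_utterances(lines: Iterable[str]) -> Iterator[Utterance]:
    """Group token lines into utterances terminated by an <eos> line.

    Hypothetical helper: assumes tab-separated columns
    (semiotic class, written token, normalized form).
    """
    written: List[str] = []
    spoken: List[str] = []
    classes: List[str] = []
    for raw in lines:
        fields = raw.rstrip("\n").split("\t")
        if fields[0] == "<eos>":
            # End of utterance: emit what was collected and reset.
            if written:
                yield " ".join(written), " ".join(spoken), classes
            written, spoken, classes = [], [], []
            continue
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        semiotic_class, token, normalized = fields
        classes.append(semiotic_class)
        written.append(token)
        if normalized == "<self>":
            spoken.append(token)  # token is read as written
        elif normalized != "sil":
            spoken.append(normalized)  # explicit normalized form
        # "sil" tokens (punctuation) add nothing to the spoken side
    if written:
        # Emit a trailing utterance that lacks a closing <eos> line.
        yield " ".join(written), " ".join(spoken), classes


# The sample from this page, with tabs between the columns.
SAMPLE = [
    "PLAIN\tBrillantaisia\t<self>",
    "PLAIN\tis\t<self>",
    "PLAIN\ta\t<self>",
    "PLAIN\tgenus\t<self>",
    "PLAIN\tof\t<self>",
    "PLAIN\tplant\t<self>",
    "PLAIN\tin\t<self>",
    "PLAIN\tfamily\t<self>",
    "PLAIN\tAcanthaceae\t<self>",
    "PUNCT\t.\tsil",
    "<eos>\t<eos>",
]

pairs = list(iter_utterances(SAMPLE))
# pairs[0][0] == "Brillantaisia is a genus of plant in family Acanthaceae ."
# pairs[0][1] == "Brillantaisia is a genus of plant in family Acanthaceae"
```

Keeping the semiotic classes alongside each pair is a design choice of this sketch; it makes class-based filtering possible without re-reading the files.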

__init__(location: str, subset: str = 'test', n_utterances: int = -1)

Constructor of the Google data iterator.

Parameters

location: str

Directory with the data, e.g. the downloaded and unpacked archive https://storage.googleapis.com/kaggle-data-sets/869240/1481083/compressed/en_with_types.tgz.zip

subset: str

Subset of the data to iterate over. Supported values:

test - conventional test set of the Google dataset; for English, this is the first 100002 tokens of output-00099-of-00100
all - iterate over all the data
ADDRESS, CARDINAL, … - select utterances in which the given semiotic class is present

n_utterances: int

Number of utterances to read from the subset.
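How subset and n_utterances might interact can be sketched as follows. This is an illustration under stated assumptions, not the library's code: `select_utterances` is a hypothetical helper, each utterance is assumed to be a tuple of (unnormalized text, normalized text, list of semiotic classes), and a negative n_utterances is assumed to mean "no limit", which the documentation above does not state. The `test` subset is left out because it depends on which file is read.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple

# (unnormalized text, normalized text, semiotic classes of the tokens)
Utterance = Tuple[str, str, List[str]]


def select_utterances(
    utterances: Iterable[Utterance],
    subset: str = "all",
    n_utterances: int = -1,
) -> Iterator[Utterance]:
    """Filter utterances by semiotic class and cap how many are returned.

    Hypothetical helper: "all" keeps everything, any other value is
    treated as a semiotic class name that must occur in the utterance.
    """
    if subset != "all":
        utterances = (u for u in utterances if subset in u[2])
    if n_utterances >= 0:
        # Assumption: a negative value (the default -1) means "no limit".
        utterances = islice(utterances, n_utterances)
    return iter(utterances)


# Illustrative data only, not taken from the dataset.
data: List[Utterance] = [
    ("a b", "a b", ["PLAIN", "PLAIN"]),
    ("3 cats", "three cats", ["CARDINAL", "PLAIN"]),
    ("7", "seven", ["CARDINAL"]),
]

cardinal_only = list(select_utterances(data, subset="CARDINAL"))
first_two = list(select_utterances(data, subset="all", n_utterances=2))
```

Filtering before truncating means n_utterances counts utterances that belong to the subset, which matches the wording "number of utterances to read from subset".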