class en_us_normalization.production.classify.ElectronicFst

Finite state transducer for classifying electronic entities, such as URLs and email addresses. The electronic semiotic class can contain several fields, all optional except the domain:

  • protocol - a URL may start with a protocol, such as “http://” or “mailto://”

  • username - the domain may be preceded by a username, separated from it with “@”. The most common case is the username in an email address, e.g. “clement@balacoon.com”. The username may be followed by a password, separated with a colon, e.g. “user:123@gmail.com”

  • domain - the part that follows the protocol and username. It can represent the whole URL and is the only mandatory field of an electronic address. The domain may have an optional prefix, such as “www.”, and must have a suffix, which helps to identify it as a domain. Suffixes are 2-3 letters long (such as “com” or “io”), are separated with dots and can be repeated (e.g. “com.ua”). Suffixes are usually spelled out letter by letter, but some are pronounced as regular words (e.g. “com”).

  • port - optional digits after the domain, separated with a colon, e.g. “google.com:8080”

  • path - anything that follows the domain after a slash “/”. Can contain any symbols.
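
The field layout described above can be illustrated with a plain regular expression. This is only a rough sketch for orientation, not how ElectronicFst is implemented (the class is a finite state transducer); the names ELECTRONIC_RE and split_electronic are hypothetical, and the pattern makes simplifying assumptions, for example that a protocol consists of lowercase letters followed by “://” and that the “www.” prefix is kept as part of the domain.

```python
import re

# Hypothetical pattern mirroring the fields listed above.
# It is an illustration only, not the grammar used by ElectronicFst.
ELECTRONIC_RE = re.compile(
    r"(?P<protocol>[a-z]+://)?"                                  # optional protocol
    r"(?:(?P<username>[^@/:\s]+)(?::(?P<password>[^@/\s]+))?@)?"  # optional user[:password]@
    r"(?P<domain>(?:[A-Za-z0-9-]+\.)+[A-Za-z]{2,3})"             # mandatory domain, 2-3 letter suffix
    r"(?::(?P<port>\d+))?"                                       # optional :port
    r"(?:/(?P<path>\S*))?"                                       # optional /path
)


def split_electronic(text):
    """Return the fields found in `text`, or None if it has no valid domain."""
    match = ELECTRONIC_RE.fullmatch(text)
    if match is None:
        return None
    # Keep only the fields that are actually present in the input.
    return {name: value for name, value in match.groupdict().items() if value is not None}


if __name__ == "__main__":
    for example in ("clement@balacoon.com", "user:123@gmail.com", "google.com:8080"):
        print(example, split_electronic(example))
```

In this sketch only the domain group is mandatory, matching the description above: a string without a dotted domain ending in a 2-3 letter suffix is rejected as a whole.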

Examples of strings that should be classified as the electronic semiotic class, and their tagging: