en_us_normalization.production.classify.ElectronicFst
- class en_us_normalization.production.classify.ElectronicFst
Finite state transducer for classifying electronic tokens, such as URLs and email addresses. The electronic semiotic class can contain multiple fields, most of them optional:
protocol - a URL can start with a protocol, such as “http://” or “mailto://”
username - a user name can precede the domain, separated from it with “@”. The most common case is the username in an email address, e.g. “clement@balacoon.com”. The username can also carry a password, separated with a colon, e.g. “user:123@gmail.com”
domain - the part that follows the protocol and username. It can represent the whole URL and is the only mandatory field of an electronic address. The domain may have an optional prefix, such as “www.”, and must have a suffix that helps to identify it as a domain. Suffixes are 2-3 letters long (such as “com” or “io”), are separated with a dot and can be repeated (e.g. “com.ua”). Suffixes are usually spelled out letter by letter when pronounced, but some are pronounced as regular words (e.g. “com”).
port - optional digits after the domain, separated with a colon, e.g. “google.com:8080”
path - anything that follows the domain after a slash “/”. It can contain any symbols.
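The domain structure described above (optional “www.” prefix, one or more 2-3 letter suffixes, some suffixes read as words) can be illustrated with a small plain-Python sketch. This is not the grammar used by ElectronicFst: `split_domain`, `suffix_reading` and `WORD_SUFFIXES` are hypothetical names introduced here, and the only word-like suffix taken from the description is “com”.

```python
# Hypothetical helpers for illustration; not part of en_us_normalization.

# Suffixes read as regular words. Only "com" comes from the description
# above; a real list would be longer.
WORD_SUFFIXES = {"com"}


def split_domain(domain):
    """Split a domain into (prefix, name, suffixes)."""
    labels = domain.lower().split(".")
    prefix = ""
    # Optional "www" prefix.
    if len(labels) > 1 and labels[0] == "www":
        prefix = labels.pop(0)
    suffixes = []
    # Peel trailing 2-3 letter labels off as suffixes,
    # always leaving at least one label as the domain name.
    while len(labels) > 1 and labels[-1].isalpha() and 2 <= len(labels[-1]) <= 3:
        suffixes.insert(0, labels.pop())
    return prefix, ".".join(labels), suffixes


def suffix_reading(suffix):
    """Return the suffix as a word, or spelled out letter by letter."""
    return suffix if suffix in WORD_SUFFIXES else " ".join(suffix)
```

Under these assumptions, `split_domain("example.com.ua")` separates the repeated suffixes “com” and “ua”, and `suffix_reading` keeps “com” as a word while spelling “ua” out as “u a”.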
Examples of strings that should be classified as the electronic semiotic class, and their tagging:
cdf1@abc.edu -> electronic { username: “cdf1” domain: “abc.edu” }
http://cat:dog@www.google.com:8080/?231eds2@90iu -> electronic { protocol: “http” username: “cat” password: “dog” domain: “www.google.com” port: “8080” path: “/?231eds2@90iu” }
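The field layout above can be approximated with a regular expression. The following is a minimal sketch that reproduces the two taggings shown, assuming a plain-Python stand-in rather than the actual transducer: `ELECTRONIC_RE`, `FIELDS` and `tag_electronic` are hypothetical names, the character sets allowed in each field are guesses, and straight quotes are used in the output instead of the typographic ones above.

```python
import re

# Hypothetical regex approximation of the electronic fields;
# the real ElectronicFst is a finite state transducer and may accept
# a different set of strings.
ELECTRONIC_RE = re.compile(
    r"""
    ^
    (?:(?P<protocol>[a-z][a-z0-9+.-]*)://)?     # optional protocol
    (?:(?P<username>[^\s:@/]+)                  # optional username ...
       (?::(?P<password>[^\s@/]+))?@)?          # ... with optional password
    (?P<domain>(?:[a-z0-9-]+\.)+[a-z]{2,3})     # domain with mandatory suffix
    (?::(?P<port>\d+))?                         # optional port
    (?P<path>/\S*)?                             # optional path
    $
    """,
    re.VERBOSE | re.IGNORECASE,
)

# Order in which fields appear in the tagging.
FIELDS = ("protocol", "username", "password", "domain", "port", "path")


def tag_electronic(token):
    """Return the tagging string for a token, or None if it does not match."""
    match = ELECTRONIC_RE.match(token)
    if match is None:
        return None
    parts = [
        f'{name}: "{match.group(name)}"'
        for name in FIELDS
        if match.group(name) is not None
    ]
    return "electronic { " + " ".join(parts) + " }"


print(tag_electronic("cdf1@abc.edu"))
print(tag_electronic("http://cat:dog@www.google.com:8080/?231eds2@90iu"))
```

A token without a dotted suffix, such as “hello”, does not match and yields `None`, which mirrors the domain being the only mandatory field.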