1. Technical Field
The present disclosure relates to classification tasks in natural language processing. More specifically, it describes systems for generating labels for rarely encountered or previously unencountered words.
2. Introduction
Predicting task labels for words that have either not been observed at all or have not been observed sufficiently frequently is one of the key challenges for empirical approaches to Natural Language (NL) tasks. For words encountered frequently in a corpus, most modeling techniques estimate statistics reliably and predict labels quite accurately. For infrequent words (including rare words and unseen words), however, the label prediction accuracy of most models is significantly lower than for frequently encountered words.
Several techniques attempt to address this issue, but each presents drawbacks and additional problems. For tagging tasks such as part-of-speech (POS) tagging, orthographic features of a word can be used during training, such as whether the word is upper case, whether it is a digit, and its prefix and suffix characters. Preferring sparse models through regularization terms is another way to generalize the models. Using prior resources, such as dictionaries in POS tagging and gazetteers in named entity recognition, is a further way to address this issue.
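The orthographic features mentioned above can be illustrated with a minimal sketch. The function name orthographic_features, the feature keys, and the affix length of three characters are assumptions chosen for illustration; they are not taken from any particular tagger.

```python
def orthographic_features(word, affix_len=3):
    """Return surface-form features of a word for use in a tagging model.

    Hypothetical helper: the feature names and the default affix length
    are illustrative assumptions, not part of the disclosure.
    """
    return {
        # True only when every cased character is upper case, e.g. "NASA"
        "is_upper": word.isupper(),
        # True when the first character is upper case, e.g. "Boston"
        "is_capitalized": word[:1].isupper(),
        # True when the word consists solely of digits, e.g. "2024"
        "is_digit": word.isdigit(),
        # Leading and trailing characters, lower-cased
        "prefix": word[:affix_len].lower(),
        "suffix": word[-affix_len:].lower(),
    }


if __name__ == "__main__":
    for w in ["Hitting", "NASA", "2024"]:
        print(w, orthographic_features(w))
```

Because these features depend only on the character string, they can be computed for a word that never appeared in the training corpus, which is why they are commonly used for rare and unseen words.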
The central problem is that in a localist representation, which includes words written as character sequences, establishing similarities between two words is non-trivial. Word similarity can be orthographic, as in morphologically related words such as hit and hitting; syntactic, as in the transitive verbs hit and kick; or semantic, as in hit and beat.
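The difficulty can be seen in a small sketch that compares words purely as character sequences. It uses the Python standard library's difflib.SequenceMatcher as a stand-in similarity measure; that choice, and the helper name char_similarity, are assumptions made for illustration only.

```python
from difflib import SequenceMatcher


def char_similarity(a, b):
    """Character-level similarity in [0, 1] between two words.

    Hypothetical helper: SequenceMatcher.ratio() is used here only as
    one convenient character-sequence measure.
    """
    return SequenceMatcher(None, a, b).ratio()


if __name__ == "__main__":
    pairs = [
        ("hit", "hitting"),  # morphologically related
        ("hit", "kick"),     # syntactically similar (transitive verbs)
        ("hit", "beat"),     # semantically similar
    ]
    for a, b in pairs:
        print(f"{a!r} vs {b!r}: {char_similarity(a, b):.2f}")
```

Under this measure the morphologically related pair scores well above the other two pairs, even though hit and beat are close in meaning. A character-sequence view therefore captures orthographic similarity but gives little signal about syntactic or semantic similarity.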
Previous work typically decouples the construction of latent representations for words from the task the latent representations are used to solve. Latent representations are learned using task-agnostic criteria based on similarity metrics of discrete representations. While such representations might result in human-interpretable lexical neighborhoods, they may not be optimal for solving the task at hand.