1. Technical Field
The present disclosure relates to text normalization and, more specifically, to normalization of text in the context of social media translations.
2. Introduction
Text normalization is a prerequisite for a variety of tasks involving speech and language. Most natural language processing (NLP) tasks require a compact vocabulary to reduce model complexity in terms of feature size. As a consequence, applications such as syntactic and semantic tagging, named entity extraction, information extraction, machine translation, and language models for speech recognition are trained using clean, normalized data restricted to a user-defined vocabulary.
Conventionally, most NLP researchers perform such normalization through rule-based mapping, which becomes unwieldy for extremely noisy text such as SMS, chat, or social media. Unnormalized text, as seen in SMS and on social media forums such as Facebook, Twitter, and message boards, exhibits a variety of spelling issues: repeated letters, dropped vowels, phonetic spellings, numbers substituted for letters (typically syllables), shorthand, and user-created abbreviations for phrases. A notable property of such text is that new variants of canonical words and phrases evolve constantly.
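For illustration only, the conventional rule-based mapping described above might look like the following minimal Python sketch. It is not taken from the disclosure: the lookup table SHORTHAND, the vocabulary VOCAB, and the function names are hypothetical and chosen purely to show the idea. The sketch handles two of the listed phenomena, shorthand or number substitution and repeated letters, by table lookup and by collapsing letter runs.

```python
import re

# Hypothetical hand-written mapping of shorthand and number substitutions.
SHORTHAND = {
    "gr8": "great",
    "2moro": "tomorrow",
    "thx": "thanks",
    "pls": "please",
    "u": "you",
}

# Hypothetical, deliberately tiny user-defined vocabulary.
VOCAB = {"that", "was", "so", "great", "see", "you", "tomorrow", "thanks", "cool"}


def squeeze_repeats(token, max_run):
    """Collapse each run of a repeated character to at most max_run characters."""
    return re.sub(
        r"(.)\1+",
        lambda m: m.group(1) * min(len(m.group(0)), max_run),
        token,
    )


def normalize_token(token):
    """Map one noisy token to a canonical form, or return it unchanged."""
    token = token.lower()
    if token in SHORTHAND:
        return SHORTHAND[token]
    if token in VOCAB:
        return token
    # Try keeping double letters first ("coool" -> "cool"), then single ("sooo" -> "so").
    for max_run in (2, 1):
        candidate = squeeze_repeats(token, max_run)
        if candidate in SHORTHAND:
            return SHORTHAND[candidate]
        if candidate in VOCAB:
            return candidate
    return token  # unseen variant: no rule applies, so it stays unnormalized


def normalize(text):
    """Normalize a whitespace-tokenized string token by token."""
    return " ".join(normalize_token(t) for t in text.split())
```

In this sketch, a variant that is absent from the table, such as "l8r", passes through unchanged. Each new variant requires another hand-written entry or rule, which is one way to see why this approach becomes unwieldy as variants keep evolving.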