1. Technical Field
The present disclosure relates to normalizing text, and more specifically to language-independent text normalization using atomic tokens and classification labels.
2. Introduction
Text normalization adapts text to a standard form, for example so that it can be compared with other normalized text or searched more easily. One approach to data-driven text normalization is to annotate text data manually in concordance format, according to a set of category labels. This approach breaks data processing into two parts: (a) a version of Named Entity extraction, and (b) subsequent actions based on the extracted entities. It seeks, approximately, to reproduce the steps that might be carried out in a traditional hand-crafted text-to-speech (TTS) system. The patterns to be classified are generally language-specific, and are typically separated by white space. This approach does not translate well to other languages. For example, when moving from English to Asian languages, two major differences arise: word boundaries must be calculated, and not all of the English labels are relevant for Asian languages. The complexity of the rules required for dealing with the broad categories of text is difficult to overcome.
In Asian languages, letter expansions are generally much simpler than for English, while number expansions are similar in complexity. One approach, exemplified by work on Chinese text, focuses solely on normalization rather than word splitting. This approach uses a Finite State Automaton (FSA) to give an initial classification, followed by a Maximum Entropy (MaxEnt) classifier to distinguish subclasses. Another approach, based on the Moses Machine Translation (MT) framework, considers normalization to be a form of machine translation. Its primary goal is to evaluate how effective Statistical Machine Translation (SMT) is for normalizing text in a language, both in terms of relying on unskilled “translators” and in terms of the pros and cons of combining SMT with language-independent and language-specific rules. None of these approaches is language neutral, and none normalizes text for both TTS and automatic speech recognition (ASR) purposes.