1. Technical Field
The present invention relates to diacritization (e.g., vowelization) of text and more particularly to a diacritization restoration system and method, which restores missing diacritics from text reproductions of speech and translated text.
2. Description of the Related Art
Arabic documents are typically written in script that omits short vowels and other diacritic marks. The written text does not indicate these vowels because readers familiar with the language can supply them without such indications. This often leads to considerable ambiguity, since several words that have different diacritic patterns may appear identical in a diacritic-less setting. Educated Modern Standard Arabic speakers are able to accurately restore diacritics in a document based on the context and their knowledge of the grammar and the lexicon of Arabic. However, a document without diacritics becomes a source of confusion for beginner readers and people with learning disabilities.
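The ambiguity described above can be illustrated with a minimal sketch. It assumes that diacritics are represented as Unicode combining marks (general category Mn), which holds for the Arabic short vowels, shedda, and sukuun; the helper name strip_diacritics and the two example words are illustrative choices and do not come from this disclosure.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop combining marks (Unicode category Mn), which include the
    Arabic short vowels, shedda, and sukuun."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# Two diacritized words built on the same consonant skeleton k-t-b.
# Escapes are used so that the code points are explicit.
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"  # kataba, "he wrote"
KUTUB = "\u0643\u064F\u062A\u064F\u0628"         # kutub, "books"

# The diacritized forms differ, but the undiacritized forms collide.
print(KATABA == KUTUB)
print(strip_diacritics(KATABA) == strip_diacritics(KUTUB))
```

A restoration system has to invert this many-to-one mapping, which is why context and lexical knowledge are needed.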
A document without diacritics is also problematic for video, speech, and natural language processing applications, where a diacritic-less setting adds another layer of ambiguity when processing the data. Examples of these applications are automatic speech-recognition, speech-to-text, information extraction, machine translation, multimedia indexing, etc.
Fully diacritized text is required for text-to-speech applications, where the mapping from graphemes to phonemes is simple (compared to languages such as English and French, for example), and in most cases there is a one-to-one relationship for such mapping. Also, using data with diacritics improves the accuracy of speech-recognition applications.
Currently, applications such as text-to-speech, speech-to-text, and others use data in which diacritics are placed manually or by rule-based methods. Such data may be tedious and time-consuming to generate, and may be less accurate. A diacritization restoration system that could restore diacritics (i.e., supply the full diacritical markings and consequently a full vocalization) would be of interest to these applications and many others. In addition, a diacritic restoration system (“diacritization” and “diacritic restoration” may be used interchangeably throughout this disclosure) would greatly benefit nonnative speakers, sufferers of dyslexia, etc. It could also assist in restoring diacritics in children's books and poetry books, a task that is currently done manually, among other things.
Until recently, relatively few studies tackled the diacritization issue in Arabic. Rule-based methods based on a morphological analyzer were proposed for vowelization. One rule-based method employed a grapheme-to-sound conversion method. The main disadvantage of rule-based methods is that it is difficult to maintain up-to-date rules, or to extend the method to new applications, due to the productive nature of any “living” spoken language.
More recently, several new studies have addressed the diacritization problem. One example adopts a top-down approach in which each utterance to be diacritized is compared to the training data for a matching sentence. If there is a match, the whole utterance is used; if not, phrases are extracted from the sentence and searched for matches. Failing that, word-level models and finally character n-gram models are used. New words are diacritized using character-based n-gram models.
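The top-down back-off described above can be sketched roughly as follows. This is a simplified illustration and not the cited study's algorithm: it assumes a greedy longest-match over word n-grams (so that whole sentences, phrases, and single words share one lookup table), and it stands in for the character n-gram model with a hypothetical char_fallback callback that by default leaves the word undiacritized. The names build_ngram_table, restore, max_n, and char_fallback are all illustrative.

```python
import unicodedata


def strip_diacritics(text):
    """Drop combining marks (Unicode category Mn)."""
    return "".join(c for c in text if unicodedata.category(c) != "Mn")


def build_ngram_table(training_sentences, max_n=4):
    """Map each undiacritized word n-gram (n = 1..max_n) seen in training
    to the first diacritized form observed for it."""
    table = {}
    for sentence in training_sentences:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                segment = tokens[i:i + n]
                key = tuple(strip_diacritics(t) for t in segment)
                table.setdefault(key, segment)
    return table


def restore(utterance, table, max_n=4, char_fallback=lambda word: word):
    """Greedy longest-match: try the longest span first, back off to shorter
    spans, and hand unseen words to the character-level fallback."""
    tokens = utterance.split()
    out, i = [], 0
    while i < len(tokens):
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            key = tuple(tokens[i:i + n])
            if key in table:
                out.extend(table[key])
                i += n
                break
        else:
            # No span starting here was seen in training.
            out.append(char_fallback(tokens[i]))
            i += 1
    return " ".join(out)


# Toy training data: "kataba al-waladu" with full diacritics.
KATABA = "\u0643\u064E\u062A\u064E\u0628\u064E"
ALWALADU = "\u0627\u0644\u0652\u0648\u064E\u0644\u064E\u062F\u064F"
TRAINING = [KATABA + " " + ALWALADU]
TABLE = build_ngram_table(TRAINING)
```

Calling restore(strip_diacritics(TRAINING[0]), TABLE) returns the diacritized training sentence, while a word that never occurred in training is passed through char_fallback unchanged. A real system would replace that callback with a trained character n-gram model.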
In another method, conversational Arabic is diacritized by combining morphological and contextual information with the acoustic signal. Here, diacritization is treated as an unsupervised tagging problem in which each word is tagged as one of the many possible diacritizations provided by a morphological analyzer. An Expectation Maximization (EM) algorithm is used to learn the tag sequences from the training data. An HMM-based diacritization method was also presented, in which diacritized sentences were decoded from non-diacritized sentences. This method took a fully word-based approach and considered only vowels (no additional diacritics).
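To make the tagging view concrete, the following sketch decodes the most likely sequence of diacritized words with a Viterbi search over per-word candidates. It covers only the decoding step, not EM training or acoustic features, and it is not the implementation of either cited method. The candidate lists stand in for the output of a morphological analyzer, the bigram log-probabilities are made-up toy values, and the words are given in Latin transliteration for readability; all names here are hypothetical.

```python
import math


def viterbi(observed, candidates, bigram_logp, unseen=-10.0):
    """Return the highest-scoring sequence of diacritized words.

    observed:    list of undiacritized words
    candidates:  dict mapping an undiacritized word to its possible
                 diacritized forms (unknown words map to themselves)
    bigram_logp: dict mapping (previous, current) to a log-probability,
                 with "<s>" as the sentence-start symbol
    unseen:      log-probability assigned to bigrams not in the table
    """
    # best[state] = (score of the best path ending in state, that path)
    best = {"<s>": (0.0, [])}
    for word in observed:
        nxt = {}
        for cand in candidates.get(word, [word]):
            score, path = max(
                ((s + bigram_logp.get((prev, cand), unseen), p)
                 for prev, (s, p) in best.items()),
                key=lambda t: t[0],
            )
            nxt[cand] = (score, path + [cand])
        best = nxt
    return max(best.values(), key=lambda t: t[0])[1]


# Toy inputs (assumed values, for illustration only).
CANDIDATES = {"ktb": ["kataba", "kutub"], "alwld": ["alwaladu"]}
BIGRAMS = {
    ("<s>", "kutub"): math.log(0.6),
    ("<s>", "kataba"): math.log(0.4),
    ("kataba", "alwaladu"): math.log(0.5),
    ("kutub", "alwaladu"): math.log(0.01),
}

print(viterbi(["ktb", "alwld"], CANDIDATES, BIGRAMS))
```

With these toy values, "kutub" is the more likely reading of "ktb" in isolation, but the following word makes "kataba" the better choice for the sequence as a whole, which is the kind of contextual disambiguation the tagging formulation is meant to capture.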
Recently, an algorithm based on weighted finite state transducers has also been proposed that employs characters and morphological units in addition to words. This method does not appear to handle two syllabification marks: shedda, which shows the doubling of the preceding consonant, and sukuun, which denotes the lack of a vowel.
Even though the methods proposed for diacritization have been maturing and improving over time, they still provide a limited solution to the problem in terms of accuracy and diacritics coverage.