The Arabic alphabet consists of twenty-eight letters, twenty-five of which represent consonants. The remaining three letters represent the long vowels of Arabic. There are six vowels in Arabic, divided into three pairs, each consisting of a short vowel and a long vowel. Each pair corresponds to a different phonetic value. A distinguishing feature of the Arabic writing system is that short vowels are not represented by letters of the alphabet. Instead, they are marked by so-called diacritics: short strokes (marks) placed either above or below the preceding consonant. The process of adding all of the diacritics to an unmarked text is called diacritization.
Modern written Arabic texts are almost never diacritized; they are composed in a script that leaves out the vowels of the words. However, native speakers can generally vocalize (diacritize) the words in a text based on their context and on knowledge of the grammar and lexicon of the language.
When vowel marks are not used in Arabic text, a multitude of vowel combinations is possible for the same set of characters that constitutes a word. On the one hand, all of these combinations are correct in the sense that each form is valid; on the other hand, not all of them are correct in the context in which the word is used. Because many words with different vowel patterns may appear identical in a vowel-less setting, considerable ambiguity exists at the word level (lexical ambiguity). Recent studies revealed that about 74% of the words in an Arabic text are lexically ambiguous. This lexical ambiguity must be resolved by contextual information that identifies all of the correct diacritics of each Arabic word, except the diacritics at word ends that signal grammatical case endings (their use is somewhat optional, depending on the formality of the language and on the speaker).
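The lexical ambiguity described above can be illustrated with a minimal Python sketch. It assumes that the diacritics of interest are the Unicode combining marks U+064B through U+0652 (tanwin, fatha, damma, kasra, shadda, sukun); the function names `strip_diacritics` and `group_by_bare_form` and the three example words are chosen here for illustration only.

```python
from collections import defaultdict

# Assumption: the diacritics are the Unicode code points U+064B..U+0652.
DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}

def strip_diacritics(word: str) -> str:
    """Return the word as it would appear in non-diacritized text."""
    return "".join(ch for ch in word if ch not in DIACRITICS)

def group_by_bare_form(words):
    """Map each non-diacritized form to the diacritized forms sharing it."""
    groups = defaultdict(set)
    for word in words:
        groups[strip_diacritics(word)].add(word)
    return dict(groups)

# Three valid words that share the same consonant skeleton (k-t-b).
KATABA = "كَتَبَ"  # "he wrote"
KUTIBA = "كُتِبَ"  # "it was written"
KUTUB = "كُتُب"   # "books"

groups = group_by_bare_form([KATABA, KUTIBA, KUTUB])
```

Here `groups` holds a single non-diacritized key whose value contains all three diacritized forms, so the written form alone cannot tell a reader (or a program) which word is meant.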
Contributing to Arabic lexical ambiguity is the fact that Arabic morphology is complex. Studies show that there are, on average, about five possible morphological analyses per Arabic word. Prefixes and suffixes can be attached to words in a concatenative manner, so a single string can comprise verb inflections, prepositions, pronouns, and connectives. Therefore, lexical disambiguation of words and vowel restoration in Arabic text is a challenging task.
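To make the effect of concatenative affixation concrete, the following Python sketch enumerates every way a string could be split into prefix, stem, and suffix. The affix inventories are small hypothetical samples, not a real lexicon, and the function name `candidate_segmentations` is an assumption made for this illustration; a real analyzer would also check each stem against a dictionary.

```python
# Hypothetical toy affix inventories (the empty string means "no affix").
PREFIXES = {"", "و", "ب", "وب", "ال"}  # "and", "with", "and with", "the"
SUFFIXES = {"", "ه", "ها", "هم"}       # "his", "her", "their"

def candidate_segmentations(word, prefixes=PREFIXES, suffixes=SUFFIXES):
    """List all (prefix, stem, suffix) splits with a known prefix and suffix.

    The stem is only required to be non-empty, so the result is an upper
    bound on the analyses a lexicon-backed analyzer would keep.
    """
    results = []
    for p in range(len(word)):                 # end of the prefix
        for s in range(p + 1, len(word) + 1):  # start of the suffix
            prefix, stem, suffix = word[:p], word[p:s], word[s:]
            if prefix in prefixes and suffix in suffixes:
                results.append((prefix, stem, suffix))
    return results

# "and with their book": connective + preposition + noun + pronoun.
WORD = "وبكتابهم"
splits = candidate_segmentations(WORD)
```

Even with these tiny inventories, the single string yields several candidate splits, one of which is the intended prefix "وب", stem "كتاب", and suffix "هم". Choosing among such candidates is part of what makes disambiguation difficult.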
Without disambiguation of Arabic words, it is impossible to determine how to pronounce a non-diacritized text. There are many words for which multiple pronunciations are possible, and software applications such as Arabic Text-To-Speech (TTS) cannot function properly. Restoring the diacritized form of Arabic script, after lexical disambiguation, would also be very helpful for non-native speakers, and could assist in diacritizing texts for beginners, such as children's school books and poetry books, a task that is currently done manually.
The problem with current methods for automatically diacritizing Arabic script is that lexical ambiguity severely degrades the word-level diacritization accuracy.
Current approaches include:
Statistically based approaches: a bigram Hidden Markov Model is used to gain contextual information and to restore vowels. However, the problem of unknown words, which are not found in the training corpus, is not addressed. The use of a sufficiently large modern corpus of diacritized text leads to a blow-up in the number of model parameters, which is quadratic in the number of word types in the training set.
Morphology based approaches: these techniques are word based and cannot disambiguate words in context. They output all possible analyses for each word in the text and rely on handcrafted rules and a lexicon that govern Arabic morphology. However, it is still unclear how the most likely parse can be chosen given the context.
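The statistically based approach can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the method of any particular system: hidden states are diacritized words, each state deterministically emits its non-diacritized form, transitions are add-one-smoothed bigram counts, and Viterbi decoding picks the best sequence. The tiny corpus, the function names, and the smoothing choice are all assumptions. Unknown words are simply left undiacritized, which mirrors the limitation noted above, and the bigram table can hold up to V × V entries for V word types, which is the quadratic growth mentioned above.

```python
import math
from collections import defaultdict

DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0653)}  # U+064B..U+0652

def strip_diacritics(word):
    return "".join(ch for ch in word if ch not in DIACRITICS)

def train(sentences):
    """Count word bigrams and record the diacritized forms of each bare form."""
    bigrams = defaultdict(lambda: defaultdict(int))
    candidates = defaultdict(set)
    for sentence in sentences:
        prev = "<s>"  # sentence-start state
        for word in sentence:
            bigrams[prev][word] += 1
            candidates[strip_diacritics(word)].add(word)
            prev = word
    return bigrams, candidates

def log_prob(bigrams, prev, word, vocab_size):
    """Add-one-smoothed log P(word | prev)."""
    row = bigrams.get(prev, {})
    return math.log((row.get(word, 0) + 1) / (sum(row.values()) + vocab_size))

def decode(tokens, bigrams, candidates):
    """Viterbi search for the most likely diacritized sequence."""
    vocab_size = sum(len(forms) for forms in candidates.values())
    best = {"<s>": (0.0, [])}  # state -> (log score, best path to state)
    for token in tokens:
        # Unknown tokens have no candidates and are kept as they are.
        options = sorted(candidates.get(token, {token}))
        new_best = {}
        for word in options:
            score, path = max(
                ((s + log_prob(bigrams, prev, word, vocab_size), p)
                 for prev, (s, p) in best.items()),
                key=lambda item: item[0],
            )
            new_best[word] = (score, path + [word])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]

# Toy diacritized corpus (case endings omitted).
KATABA, KUTUB = "كَتَبَ", "كُتُب"  # "he wrote", "books"
QARAA, AL_WALAD, DARS = "قَرَأَ", "الوَلَد", "دَرْس"  # "he read", "the boy", "lesson"
bigrams, candidates = train([[KATABA, AL_WALAD, DARS], [QARAA, AL_WALAD, KUTUB]])

bare = strip_diacritics
first = decode([bare(KATABA), bare(AL_WALAD), bare(DARS)], bigrams, candidates)
second = decode([bare(QARAA), bare(AL_WALAD), bare(KUTUB)], bigrams, candidates)
```

In this toy setting the same bare form is restored as `KATABA` at the start of the first sentence and as `KUTUB` at the end of the second, because the bigram counts favor a different reading in each context.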
Successful vowel restoration in Arabic script is mandatory for important applications such as Arabic Text-To-Speech (TTS) systems. Therefore, a robust method is needed that is not sensitive to words unseen in the training corpus and that is able to resolve the lexical ambiguity of words in Arabic texts.