The Arabic alphabet consists of twenty-eight letters. Twenty-five of these letters represent consonants, and the remaining three represent the long vowels of Arabic. The Arabic language has six vowels, divided into three pairs, each consisting of a short vowel and a long vowel, and each pair corresponds to a different phonetic value. A distinguishing feature of the Arabic writing system is that short vowels are not represented by letters of the alphabet. Instead, they are marked by so-called diacritics, i.e., short strokes placed either above or below the preceding consonant. The process of adding diacritics to non-diacritized text is called diacritization.
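By way of a non-limiting illustration, the relationship between diacritized and non-diacritized text can be sketched in Python. The sketch assumes that the diacritics are encoded as Unicode combining marks (general category "Mn"), which holds for the standard Arabic short-vowel marks. The function name `strip_diacritics` and the sample word are hypothetical choices made for this illustration only. Removing the marks is the inverse of diacritization and is trivial; restoring them is the difficult direction.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Drop every nonspacing combining mark (Unicode category Mn).

    Assumption: Arabic diacritics such as fatha, damma and kasra are
    encoded as separate combining code points that follow their consonant.
    """
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


# Sample word (illustrative, hypothetical choice): kaf + fatha, ta + fatha, ba + fatha.
diacritized = "\u0643\u064E\u062A\u064E\u0628\u064E"
undiacritized = "\u0643\u062A\u0628"  # the same three consonants without marks

print(strip_diacritics(diacritized) == undiacritized)
```

Each consonant in the diacritized string is followed by its own mark, so the diacritized string has six code points and the stripped string has three.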
Modern written Arabic text is almost never diacritized, i.e., it is composed of the letters of the Arabic alphabet alone and leaves out the short vowels of the words. However, diacritics perform an essential function in determining how a word is pronounced. In general, the same sequence of letters constituting a word admits many possible vowel combinations. On the one hand, each word formed from such a vowel combination is correct in the sense that its form is valid. On the other hand, not every word formed in this way is correct in the context in which it is used. To illustrate, consider the word “”, which can be pronounced as either “” (college) or “” (kidney). Thus, an undiacritized Arabic word may have many possible pronunciations, whereas a diacritized Arabic word has only one. Despite the importance of diacritics, Arabic texts are commonly left undiacritized, and readers are accustomed to inferring the meaning from the context and from their knowledge of the grammar and lexicon of the Arabic language.
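The one-to-many relationship described above may be sketched as follows, again in Python and without limitation. The sketch groups diacritized forms by the letter sequence that remains once the diacritics are removed. The three sample forms share the consonants kaf, ta and ba and were chosen for this illustration (they are not the example word given above); the helper names `strip_diacritics` and `group_by_skeleton` are likewise hypothetical.

```python
import unicodedata
from collections import defaultdict


def strip_diacritics(text: str) -> str:
    """Drop nonspacing combining marks (category Mn), assumed to cover the diacritics."""
    return "".join(ch for ch in text if unicodedata.category(ch) != "Mn")


def group_by_skeleton(words):
    """Map each undiacritized letter sequence to the diacritized forms that share it."""
    groups = defaultdict(list)
    for word in words:
        groups[strip_diacritics(word)].append(word)
    return dict(groups)


# Illustrative forms over the same three consonants (an assumption of this sketch):
forms = [
    "\u0643\u064E\u062A\u064E\u0628\u064E",  # kataba, "he wrote"
    "\u0643\u064F\u062A\u064F\u0628",        # kutub, "books"
    "\u0643\u064F\u062A\u0650\u0628\u064E",  # kutiba, "it was written"
]

groups = group_by_skeleton(forms)
for skeleton, candidates in groups.items():
    print(skeleton, len(candidates))
```

All three forms collapse to a single undiacritized key, so a diacritization system presented with that key must select among the candidates using the surrounding context.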
Therefore, lexical ambiguity exists at the word level in Arabic text. Recent studies have revealed that about 74% of the words in Arabic text are lexically ambiguous. Contributing to this ambiguity is the complexity of Arabic morphology: on average, there are five possible morphological analyses per Arabic word. In addition, prefixes and suffixes can be attached to words in Arabic text in a concatenative manner. Word-level lexical disambiguation and vowel restoration in Arabic text are therefore challenging tasks.
As such, because of this word-level lexical ambiguity, the pronunciation of a non-diacritized Arabic text cannot be determined from its letters alone. Moreover, multiple pronunciations are possible for many words in the Arabic language. Hence, automatically restoring the diacritized form of Arabic texts would be helpful for non-native speakers. In addition, the task of diacritizing texts for beginners, such as children's school books, which is currently done manually, could be performed automatically and with little effort.
Therefore, there is a need for a method and system to automatically diacritize non-diacritized text in the Arabic language.