The present application relates generally to automatic recognition of Arabic text.
Text recognition, that is, the automatic reading of text, is a branch of pattern recognition. The objective of text recognition is to read printed text with human accuracy and at a higher speed. Most text recognition methods assume that text can be isolated into individual characters. Such techniques, although successful with Latin typewritten or typeset text, cannot be applied reliably to cursive scripts such as Arabic. Previous research on Arabic script recognition has confirmed the difficulty of segmenting Arabic words into individual characters.
The Arabic language poses several challenges for text recognition algorithms. Arabic script is inherently cursive, and writing isolated characters in block letters is not acceptable. Moreover, the shape of an Arabic letter can be context sensitive; that is, it can depend on the location of the letter within a word. For example, a single letter can take four different shapes: isolated, beginning, middle, and end. Furthermore, not all Arabic characters are connected within a word, so it can be difficult to automatically determine boundaries between words because spacing may also separate certain characters within a word. Additionally, some Arabic texts are written with vowelization while others are written without it; some Arabic texts ignore Hamza and the points under the letter Ya at the end of a word; and some Arabic texts contain words from non-Arabic languages.
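The positional shapes described above can be inspected directly in the Unicode character database, which encodes them as separate code points in the Arabic Presentation Forms-B block. The following is a minimal illustrative sketch using Python's standard `unicodedata` module, not part of any recognition method described here. The helper name `positional_forms` and the choice of the letters Ain (U+0639) and Alef (U+0627) are assumptions made for illustration only; Unicode labels the beginning, middle, and end shapes as "initial", "medial", and "final".

```python
import unicodedata

# The Arabic Presentation Forms-B block (U+FE70..U+FEFF) stores the
# positional glyph variants of Arabic letters as separate code points.
FORMS_B = range(0xFE70, 0xFF00)


def positional_forms(letter):
    """Return {positional tag: presentation-form character} for one letter.

    Hypothetical helper for illustration; it relies only on the
    compatibility decompositions recorded in the Unicode database.
    """
    target = f"{ord(letter):04X}"
    forms = {}
    for cp in FORMS_B:
        # A single-letter decomposition looks like "<initial> 0639".
        # Unassigned code points and ligatures yield other lengths.
        parts = unicodedata.decomposition(chr(cp)).split()
        if len(parts) == 2 and parts[1] == target:
            forms[parts[0].strip("<>")] = chr(cp)
    return forms


AIN = "\u0639"   # illustrative letter that joins on both sides
ALEF = "\u0627"  # illustrative letter that does not join to the next letter

for base in (AIN, ALEF):
    tags = sorted(positional_forms(base))
    print(unicodedata.name(base), tags)
```

In this sketch, a letter that joins on both sides reports four positional forms, whereas a letter such as Alef reports only isolated and final forms. This is the encoding-level counterpart of the observation that not all Arabic characters are connected within a word.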
Different classification systems, such as statistical models, have been applied to the recognition of Arabic text. However, properly extracting text features remains a major hurdle to achieving accurate Arabic text recognition.