Optical Character Recognition (OCR) is one of the oldest problems in computer pattern recognition, and has been described as the oldest data entry technique after keypunching. OCR can be defined as the mechanical or electronic conversion of scanned or photographed images of typewritten or printed text into machine-encoded, computer-readable text. While OCR is a well-established technique for many writing systems, especially Latin script and Chinese, for Arabic it is still at an early stage.
Due to characteristics of the Arabic writing system, optical character recognition thereof is far more complex than for other languages. These characteristics are: the text runs from right to left; the writing is cursive in both handwritten and machine-printed text; each character takes different shapes depending on its position in a word; dots and diacritical signs appear above and below the characters; the connecting lines between characters can be elongated by a variable length; ligatures may be vertical or horizontal; and characters differ in size (height and width). All of these characteristics influence the processing and recognition of Arabic characters in different ways, and make it impossible to use the conventional character-based processing applied to Latin-script languages.
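As an illustration of the position-dependent shapes noted above, the following is a minimal sketch using Python's standard `unicodedata` module. It lists the four Unicode presentation forms of a single letter and shows that all four fold back to the same base character. The choice of the letter BEH and the helper names `base_letter` and `describe_forms` are assumptions made for illustration only; they are not part of any recognition method discussed here.

```python
import unicodedata

# Unicode presentation forms of ARABIC LETTER BEH (base letter U+0628).
# One letter, four position-dependent shapes.
BEH_FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}


def base_letter(glyph: str) -> str:
    """Fold a positional presentation form back to its base letter (NFKC)."""
    return unicodedata.normalize("NFKC", glyph)


def describe_forms(forms: dict) -> list:
    """Return (position, code point, Unicode name, base code point) rows."""
    rows = []
    for position, glyph in forms.items():
        rows.append((
            position,
            f"U+{ord(glyph):04X}",
            unicodedata.name(glyph),
            f"U+{ord(base_letter(glyph)):04X}",
        ))
    return rows


for row in describe_forms(BEH_FORMS):
    print(row)
```

A recognizer that treats each shape as a separate class must therefore distinguish several glyphs per letter, whereas a Latin-script recognizer typically handles one printed glyph per letter and case.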
A main issue with existing Arabic OCR methods is that none of them has treated the above-mentioned characteristics of Arabic text as advantages in the recognition process. Instead, they describe these characteristics only as a source of complexity.