Optical character recognition (OCR) generally involves translating images of text into an encoding representing the actual text characters. OCR techniques for text based on a Latin script alphabet are widely available and provide very high success rates. Handwritten text generally presents different challenges for recognition than typewritten text.
Character recognition of handwritten text may be divided into offline recognition and online recognition. An offline recognition system first scans a handwritten text document, and then processes the scanned text. Offline recognition does not require immediate interaction with the user, and accordingly, is not performed in real-time. In contrast, online recognition involves a user writing text on a digital device (e.g., a graphics tablet) using a compatible writing instrument (e.g., a digital pen). The online recognition system samples the text as a sequence of two-dimensional points in real-time. Therefore, online recognition requires tracking temporal data as well as spatial data.
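By way of illustration only, the difference between the two kinds of input may be sketched as follows. The structure and function names (PenSample, to_offline, writing_order) are hypothetical and are not taken from any particular recognition system; the sketch merely assumes that each online sample carries a timestamp in addition to its two coordinates.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class PenSample:
    """One sampled pen position (hypothetical structure)."""
    x: int
    y: int
    t: float  # time of the sample, in seconds


def to_offline(samples: List[PenSample]) -> Set[Tuple[int, int]]:
    """Discard timing and order, keeping only the inked positions."""
    return {(s.x, s.y) for s in samples}


def writing_order(samples: List[PenSample]) -> List[Tuple[int, int]]:
    """Recover the order in which the points were drawn from the timestamps."""
    return [(s.x, s.y) for s in sorted(samples, key=lambda s: s.t)]


# The same three points, drawn in two different directions.
forward = [PenSample(0, 0, 0.00), PenSample(1, 0, 0.01), PenSample(2, 0, 0.02)]
backward = [PenSample(2, 0, 0.00), PenSample(1, 0, 0.01), PenSample(0, 0, 0.02)]

print(to_offline(forward) == to_offline(backward))        # same spatial data
print(writing_order(forward) == writing_order(backward))  # different temporal data
```

The two traces are indistinguishable once reduced to spatial data alone, whereas the timestamps preserve the direction in which the stroke was written.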
Recognition of handwritten non-cursive Latin script generally involves some form of segmentation. The segmentation process divides the entire text into separate text lines, divides each text line into individual words, and separates each word into individual characters. The recognition of each character then proceeds accordingly. However, with cursive text, segmentation is inherently difficult. This difficulty is further compounded with particular languages, such as those using Arabic script.
The Arabic script contains 28 basic letters, along with additional special letters, and several diacritics. Arabic text is written from right to left, in a cursive style, and is unicase (i.e., there is no uppercase and lowercase). Many features inherent in Arabic script complicate character recognition. One such feature is the interconnectedness of the letters. As the cursive nature of the Arabic script indicates, most letters are written attached to one another. Therefore, it is difficult to determine where a certain letter begins and where that letter ends, which in turn makes segmentation problematic. Furthermore, certain letters do not connect to the following letter in the word. As a result, a single Arabic word may be composed of several “word-parts”, each such word-part being a group of interconnected letters.
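The division of a word into word-parts may be sketched as follows. This is a simplified illustration rather than a complete treatment of Arabic joining behavior: it assumes that only the six basic non-connecting letters (alif, dal, dhal, ra, zay, and waw) end a word-part, and it ignores hamza-carrying variants and other special letters. The function name split_word_parts is hypothetical.

```python
from typing import List

# Letters that never join to the following letter (simplified set):
# alif, dal, dhal, ra, zay, waw.
NON_CONNECTING = set("\u0627\u062f\u0630\u0631\u0632\u0648")


def split_word_parts(word: str) -> List[str]:
    """Split a word (in logical, i.e. reading, order) into word-parts."""
    parts: List[str] = []
    current = ""
    for letter in word:
        current += letter
        if letter in NON_CONNECTING:
            # This letter does not attach to the next one: close the word-part.
            parts.append(current)
            current = ""
    if current:
        parts.append(current)
    return parts


# ghayn-ra-ba ("grb"): ra does not connect forward, giving two word-parts.
print(split_word_parts("\u063a\u0631\u0628"))
# ayn-zay-alif-meem ("EzAm"): zay and alif both break the word, giving three.
print(split_word_parts("\u0639\u0632\u0627\u0645"))
```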
Another feature of Arabic script is context dependence. The shape of certain letters depends on the position of the letter within the word. For example, the letter ع (ayn) appears as: ﻉ (isolated); ﻋ (initial); ﻌ (medial); and ﻊ (final). The Arabic alphabet also contains a number of ligatures, such as ﻻ (lam alif, isolated).
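The context dependence of a single letter can be observed in the Unicode character database, which assigns a separate presentation-form code point to each positional shape of ayn while treating all four as compatibility variants of one base letter. The following sketch uses Python's standard unicodedata module; the dictionary name AYN_FORMS is chosen here for illustration only.

```python
import unicodedata

# Presentation-form code points for the four positional shapes of ayn.
AYN_FORMS = {
    "isolated": "\ufec9",
    "final": "\ufeca",
    "initial": "\ufecb",
    "medial": "\ufecc",
}

for position, glyph in AYN_FORMS.items():
    # NFKC normalization folds each positional shape back to the base letter.
    base = unicodedata.normalize("NFKC", glyph)
    print(position, "|", unicodedata.name(glyph), "->", unicodedata.name(base))

# The isolated lam-alif ligature decomposes into its two component letters.
LAM_ALIF = "\ufefb"
components = unicodedata.normalize("NFKC", LAM_ALIF)
print([unicodedata.name(c) for c in components])
```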
A further feature of Arabic script is the presence of dots and strokes. Most Arabic letters contain dots in addition to the letter body. For example, the letter ش (sheen) is made up of the same letter body as س (seen) with three dots above. Several Arabic letters also contain strokes that attach to the letter body, such as with the letters: ك (kaf), ط (tah), and لا (lam alif). In general, these dots and strokes are known as “delayed strokes”, since they are usually drawn last in a handwritten word. Many letters are differentiated solely by the number and position of the dots or strokes relative to the letter body. By eliminating, adding, or moving a dot or stroke, a different letter may be generated, which may in turn result in a completely different word. For example, the word عزام (EzAm) [lion] differs from the word غرام (grAm) [love] only in the position of the sole dot in the word. Another example is the word عرب (Erb) [Arab], which differs from the word غرب (grb) [west] only in the absence of the dot above the first letter. The existence of delayed strokes associated with different letters clearly causes difficulties for a segmentation approach to recognition.
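The dependence of word identity on delayed strokes may be illustrated by separating each letter into a letter body and its dots. The table below is a deliberately partial, hypothetical mapping covering only the letters needed for the two word pairs mentioned above (transliterated EzAm/grAm and Erb/grb); it is a sketch for illustration and not a complete description of the script.

```python
# letter -> (letter body, number of dots, dot position); partial table only.
DOTTED = {
    "\u0634": ("\u0633", 3, "above"),  # sheen = seen body + three dots above
    "\u063a": ("\u0639", 1, "above"),  # ghayn = ayn body + one dot above
    "\u0632": ("\u0631", 1, "above"),  # zay = ra body + one dot above
    "\u0628": ("\u066e", 1, "below"),  # ba = dotless ba body + one dot below
}


def letter_bodies(word):
    """Replace every dotted letter by its bare letter body."""
    return "".join(DOTTED[ch][0] if ch in DOTTED else ch for ch in word)


def dots(word):
    """List (letter index, dot count, dot position) for each dotted letter."""
    return [(i, DOTTED[ch][1], DOTTED[ch][2])
            for i, ch in enumerate(word) if ch in DOTTED]


EZAM = "\u0639\u0632\u0627\u0645"  # ayn-zay-alif-meem
GRAM = "\u063a\u0631\u0627\u0645"  # ghayn-ra-alif-meem
ERB = "\u0639\u0631\u0628"         # ayn-ra-ba
GRB = "\u063a\u0631\u0628"         # ghayn-ra-ba

# Each pair shares the same sequence of letter bodies ...
print(letter_bodies(EZAM) == letter_bodies(GRAM))
print(letter_bodies(ERB) == letter_bodies(GRB))
# ... and differs only in where (or whether) a dot is placed.
print(dots(EZAM), dots(GRAM))
print(dots(ERB), dots(GRB))
```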
Furthermore, variants of delayed strokes may further complicate recognition. For example, the two dots written above or below a letter (••) are sometimes written as a dash (-). Similarly, the three dots written above a letter (∴) are sometimes written as a circumflex (^). Arabic script also contains diacritics, which are optional marks placed above or below letters, mainly for representing short vowels or consonant doubling.
In addition, a top-down writing style is very common in Arabic script, where letters in a word are written above subsequent letters. As a result, it is difficult to predefine the position of letters relative to the base line of the word, further complicating the task of recognition.
Most of the existing techniques for Arabic character recognition are directed to offline recognition. Several techniques for online recognition of Arabic script focus on isolated Arabic letters only, rather than comprehensive text. The majority of such techniques operate based on segmentation, and attempt to distinguish between individual letters in each word. Other techniques for online Arabic handwriting recognition known in the art utilize decision-tree modeling, or neural networks.
U.S. Pat. No. 5,933,525 to Makhoul et al., entitled “Language-independent and segmentation-free optical character recognition system and method”, is directed to an OCR system that is independent of the text language, and does not involve word or character segmentation. The OCR method involves a training component and a recognition component. During training, the system receives scanned text as input, along with a sequence of characters that correspond to the different lines in the input. After preprocessing and feature extraction, the system estimates context-dependent character models, using a lexicon and grammar (also built during training using language modeling), and a set of orthographic rules. The orthographic rules specify certain aspects of the writing structure of the language (e.g., direction of text, ligatures, diacritics, syllable structure, and the like). The OCR system generates a Hidden Markov Model (HMM) for each character in the text, with the HMM parameters estimated from the training data. The training enables recognition for the particular language in which the system was trained, and for which the lexicon and grammar were established. During recognition, the system preprocesses the text page (i.e., deskewing the page and locating the text lines), and divides each line into a sequence of overlapping frames. Each frame is a narrow strip (i.e., a vertical strip for horizontally read text) arranged sequentially in the direction of the scan. Each frame is further divided into a plurality of cells, aligned along an axis perpendicular to the direction of the scan (i.e., vertically aligned for horizontally read text). Each cell may employ a matrix of detectors for detecting whether the portion of the image is light (i.e., not part of the scanned text) or dark (i.e., part of the scanned text). The system extracts features of the data for each frame. The features preferably include the percentage of black pixels within each of the cells (i.e., intensity as a function of vertical position). The features further preferably include the vertical and horizontal derivatives of intensity (i.e., a measure of the boundaries between the light and dark portions of the text), and the local slope and correlation across a window of a plurality of cells (i.e., the angle of a line). The recognition stage then searches for the sequence of characters with maximum probability, given the sequence of feature vectors that represent the input text, the lexicon, and the language model. The OCR system uses the Viterbi algorithm, or a multi-pass search algorithm, to calculate this most likely sequence of characters.
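The first of the frame features described above, namely the proportion of dark pixels in each cell of each overlapping frame, may be sketched as follows. This is an illustrative reading of the description and not the implementation of the cited patent. The function name frame_features and its parameters are hypothetical; the sketch assumes a binarized line image (1 for dark, 0 for light), a line height divisible by the number of cells, and a left-to-right scan, whereas for Arabic text the frames would be taken in the opposite direction.

```python
from typing import List


def frame_features(image: List[List[int]], frame_width: int,
                   step: int, cells: int) -> List[List[float]]:
    """Return, for each frame, the fraction of dark pixels in each cell.

    Frames overlap whenever step < frame_width.
    """
    height = len(image)
    width = len(image[0])
    cell_height = height // cells  # assumes height is divisible by cells
    features = []
    for left in range(0, width - frame_width + 1, step):
        frame = []
        for c in range(cells):
            top = c * cell_height
            dark = sum(image[row][col]
                       for row in range(top, top + cell_height)
                       for col in range(left, left + frame_width))
            frame.append(dark / (cell_height * frame_width))
        features.append(frame)
    return features


# A toy 4 x 6 text-line image.
line_image = [
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]

# Frames two pixels wide, advanced one pixel at a time, two cells per frame.
features = frame_features(line_image, frame_width=2, step=1, cells=2)
for vector in features:
    print(vector)
```

Each printed vector describes intensity as a function of vertical position within one frame; the derivative, slope, and correlation features mentioned above would be computed from such vectors and are omitted from this sketch.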
U.S. Pat. No. 6,370,269 to Al-Karmi et al., entitled “Optical character recognition of handwritten or cursive text in multiple languages”, is directed to a method and apparatus for recognition of cursive text in one or more languages from a scanned image. The method involves retrieving a first set of language-specific rules, which contains at least one representation of each character in the language. The method identifies sub-words in the text, where each sub-word is an intra-connected portion of a word bounded by a break in the cursive text. Each sub-word is encoded into a sequence of directional vectors in a plane. The sequence of directional vectors is processed by a non-deterministic state machine, to determine the sequence of characters in the text corresponding to the sub-words. The state machine retrieves the next vector sequence from the encoded sequence of vectors, and compares the vector sequence with the first set of language-specific rules. If a vector sequence is recognized as a character (i.e., matches an entry in the language-specific rules), the state machine parses the text by entering a character marker after the vector sequence. Once all vector sequences of a particular sub-word are recognized, the state machine proceeds to the next sub-word. However, if not all vector sequences contribute to recognition of all the characters (i.e., if an accept marker does not immediately precede the end of a sub-word), then the entire sequence of vectors corresponding to the sub-word is reparsed, by moving the character marker forward or backward one vector sequence at a time, until each vector sequence contributes to the recognition of all the characters of the sub-word. If the state machine cannot recognize all characters even after reparsing, another set of language-specific rules is retrieved. The state machine then compares the vector sequence to the elements in the second set of language-specific rules, in an attempt to recognize the characters. If all available sets of language-specific rules have been consulted and there is no match, the vector sequence is indicated as unrecognized.
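The step of encoding a traced sub-word into directional vectors may be illustrated by a conventional eight-direction chain code. The cited patent does not necessarily use this particular quantization; the eight-way scheme, the function name direction_codes, and the mathematical convention of a y axis pointing upward are all assumptions made for this sketch. The state machine and the rule matching are not shown.

```python
import math
from typing import List, Tuple


def direction_codes(points: List[Tuple[int, int]]) -> List[int]:
    """Quantize each step between consecutive points to one of 8 directions.

    Code 0 points along +x, and codes increase counter-clockwise in
    45-degree increments (2 is +y, 4 is -x, 6 is -y).
    """
    codes = []
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if (x0, y0) == (x1, y1):
            continue  # a repeated point carries no direction
        angle = math.atan2(y1 - y0, x1 - x0)
        codes.append(round(angle / (math.pi / 4)) % 8)
    return codes


# A short trace: right, up-right, up, then left.
trace = [(0, 0), (1, 0), (2, 1), (2, 2), (1, 2)]
print(direction_codes(trace))
```

A sequence of such codes is the kind of compact, order-preserving description that can then be compared against stored character representations.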
U.S. Pat. No. 6,920,247 to Mayzlin et al., entitled “Method for optical recognition of a multi-language set of letters with diacritics”, is directed to a method for character recognition in text containing characters with diacritics. The first step in the method involves digitizing a form document containing text. A text field is located and selected in the digitized image for subsequent recognition. The text field is processed to ensure that the individual characters are confined within predetermined dimensional boundaries. The number of characters in the text field is determined, each character is isolated, and the recognition is performed upon the isolated character. The character is classified based on character type and language, and noise removal is performed. The base of the character is then separated from the diacritic, by constructing a bounding box encompassing the base, and a bounding box encompassing each of the diacritics. If no diacritic is detected, the character undergoes whole character recognition. If a diacritic is detected, the boundaries of the diacritic are first determined, followed by recognition of the diacritic. A copy of the character image is created without the diacritic boundary marks, and recognition is performed on the base of the character. Subsequently, a diacritic matching algorithm is performed to determine if the diacritic can be used in combination with the base of the character. If an acceptable matching combination between the base and diacritic is found for a particular language, the base and diacritic combination is recognized accordingly.
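The construction of separate bounding boxes for the base and the diacritics may be sketched as follows. The cited patent does not state how the base is told apart from the diacritics in the terms used here; treating the connected component with the largest bounding box as the base is purely an assumption of this sketch, as are the function names. The connected components themselves are assumed to have been extracted beforehand.

```python
from typing import List, Tuple

Pixel = Tuple[int, int]          # (x, y)
Box = Tuple[int, int, int, int]  # (min_x, min_y, max_x, max_y)


def bounding_box(pixels: List[Pixel]) -> Box:
    """Smallest axis-aligned box containing all pixels of one component."""
    xs = [x for x, _ in pixels]
    ys = [y for _, y in pixels]
    return (min(xs), min(ys), max(xs), max(ys))


def box_area(box: Box) -> int:
    return (box[2] - box[0] + 1) * (box[3] - box[1] + 1)


def split_base_and_diacritics(components: List[List[Pixel]]):
    """Return (base box, list of diacritic boxes).

    Assumption: the component with the largest bounding box is the base.
    """
    boxes = [bounding_box(c) for c in components]
    base_index = max(range(len(boxes)), key=lambda i: box_area(boxes[i]))
    diacritics = [b for i, b in enumerate(boxes) if i != base_index]
    return boxes[base_index], diacritics


# A toy character: a small mark above a larger body (y grows downward).
mark = [(2, 0), (3, 1)]
body = [(0, 3), (4, 3), (0, 8), (4, 8)]
base_box, diacritic_boxes = split_base_and_diacritics([mark, body])
print(base_box, diacritic_boxes)
```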
“A New Pattern Matching Approach to the Recognition of Printed Arabic” by Obaid, pp. 106-111 [http://acl.ldc.upenn.edu/W/W98/W98-1015.pdf] discusses a segmentation-free method for optical character recognition of printed Arabic text. The text first undergoes pre-processing, which includes minor noise removal, skew correction, line separation, normalization of text lines, word separation, dot extraction, thinning of isolated words, and smoothing of word skeletons. Special points are identified in the interior of the characters of the text (referred to as “focal points”). The focal points are, for example, line ends, junctions, or special patterns, and are selected to be easy to detect, immune to distortions, and of pronounced appearance in all font variations. A series of markers (referred to as “N-markers”) are distributed over each character, in a certain position relative to the focal point. The different characters are then classified based on the focal points and marker configurations. The precise combination of the presence or absence of different types of N-markers serves to distinguish the characters from one another. Afterwards, post-processing is performed to correct recognition errors and other side effects. Post-processing includes utilizing “redundancy removal rules” for designing efficient N-marker configurations; utilizing “dot and ‘Hamza’ association rules” for recognition of characters differentiated solely by the presence of dots or ‘Hamza’; utilizing “ambiguity resolution rules” for handling ambiguities between ‘Hamza’ and dots in poor quality text; and utilizing “combining shape rules” for connecting sub-characters into characters.