Digitizing documents into an electronic form for easy storage, retrieval, searching, and indexing is of major importance in the digital age. Highly reliable and robust document analysis and processing systems are needed to convert a huge amount of information from paper form to digital form.
A text recognition system is a core component of converting documents to digital form. Text recognition systems are generally trained and used for handwritten and printed text. Major challenges remain in recognizing degraded documents, irregular and unaligned text, and polyfont text. In addition, alphabets and scripts differ widely from one another. Therefore, a text recognition system may work successfully for one alphabet or script, but not for another alphabet or script with different characteristics.
Research in optical character recognition started as early as the 1940s with commercial Optical Character Recognition (OCR) machines appearing in the 1950s [J. Mantas, “An overview of character recognition methodologies,” Pattern Recognit., vol. 19, no. 6, pp. 425-430, January 1986—incorporated herein by reference in its entirety]. The earlier systems were restricted in terms of the operating conditions and the document layout, as well as the fonts which could be recognized. The current state-of-the-art allows for flexible operating conditions and the ability to deal with complex document layouts and varied fonts (e.g. [I. Marosi, “Industrial OCR approaches: architecture, algorithms, and adaptation techniques,” Proc. SPIE, vol. 6500. pp. 650002-650010, 2007; Y.-Y. Chiang and C. A. Knoblock, “Recognition of Multi-oriented, Multi-sized, and Curved Text,” in 2011 International Conference on Document Analysis and Recognition, 2011, pp. 1399-1403—incorporated herein by reference in their entireties]).
Some of the earliest research on Arabic OCR dates to the 1970s [B. Al-Badr and S. A. Mahmoud, “Survey and bibliography of Arabic optical text recognition,” Signal Processing, vol. 41, no. 1, pp. 49-77, 1995—incorporated herein by reference in its entirety]. Interest in research on Arabic text recognition and related applications has increased appreciably in the last decade, as is clear from the number of publications that resulted from this research. The description herein will be limited to related work using Hidden Markov Models (HMMs). HMMs are among the most popular state-of-the-art techniques used for text recognition. Because the Arabic script is cursive, HMMs are mainly used for Arabic text recognition to avoid the need for explicit segmentation of images beyond text lines. A broader perspective on text recognition can be found in [B. Al-Badr and S. A. Mahmoud, “Survey and bibliography of Arabic optical text recognition,” Signal Processing, vol. 41, no. 1, pp. 49-77, 1995; J. Mantas, “An overview of character recognition methodologies,” Pattern Recognit., vol. 19, no. 6, pp. 425-430, January 1986; V. Märgner and H. El Abed, Eds., Guide to OCR for Arabic Scripts. London: Springer London, 2012; S. Impedovo, L. Ottaviano, and S. Occhinegro, “Optical Character Recognition—A Survey,” Int. J. Pattern Recognit. Artif. Intell., vol. 05, no. 01n02, pp. 1-24, June 1991; Q. Tian, P. Zhang, T. Alexander, and Y. Kim, “Survey: Omnifont printed character recognition,” Vis. Commun. Image Process Image Process, pp. 260-268, 1991; J. Trenkle, A. Gillies, E. Erlandson, S. Schlosser, and S. Cavin, “Advances in Arabic text recognition,” in Proc. Symp. Document Image Understanding Technology, 2001; M. S. Khorsheed, “Off-line Arabic character recognition—a review,” Pattern Anal. Appl., vol. 5, no. 1, pp. 31-45, 2002; N. Arica and F. T. Yarman-Vural, “An overview of character recognition focused on off-line handwriting,” IEEE Trans. Syst. Man Cybern. Part C (Applications Rev.), vol. 31, no. 2, pp. 216-233, May 2001; A. Amin, “Off-line Arabic character recognition: the state of the art,” Pattern Recognit., vol. 31, no. 5, pp. 517-530, March 1998—incorporated herein by reference in their entireties].
Bazzi et al. [I. Bazzi, R. Schwartz, and J. Makhoul, “An omnifont open-vocabulary OCR system for English and Arabic,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 6, pp. 495-504, June 1999—incorporated herein by reference in its entirety] presented work on omnifont text recognition for English and Arabic. The text recognition system was adapted from their HMM-based speech recognition system. Bakis topology was used with the same number of states for all of the models. Each Arabic character shape was modeled with a separate HMM. Additionally, six more models were added for six common ligatures appearing in printed Arabic text. A careful distribution of training data was proposed based on different styles (e.g. bold, italics) so that the recognizer would not be biased towards the dominant style of the training data. The results for polyfont recognition were below the average result for monofont recognition, which is expected. No special treatment for polyfont text recognition was proposed, apart from training the recognizer on text images from multiple fonts so that the model could generalize to a certain degree.
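As an illustration of the Bakis topology mentioned above, the following minimal sketch builds a left-to-right transition matrix in which each state may repeat, advance by one state, or skip one state. The function name bakis_transitions and the initial probabilities are assumptions chosen for illustration and are not taken from the cited work.

```python
def bakis_transitions(n_states, p_stay=0.5, p_next=0.3, p_skip=0.2):
    """Return an n_states x n_states left-to-right (Bakis) transition matrix.

    Each state may loop on itself, move to the next state, or skip one state.
    The probabilities are illustrative starting values (an assumption); in
    practice they would be re-estimated during HMM training.
    """
    matrix = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_states):
        weights = {i: p_stay, i + 1: p_next, i + 2: p_skip}
        # Drop transitions that would run past the final state.
        allowed = {j: w for j, w in weights.items() if j < n_states}
        total = sum(allowed.values())
        for j, w in allowed.items():
            matrix[i][j] = w / total  # renormalize so each row sums to 1
    return matrix


if __name__ == "__main__":
    for row in bakis_transitions(5):
        print([round(p, 3) for p in row])
```

Using the same value of n_states for every character model corresponds to the "same number of states for all of the models" configuration described above.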
Khorsheed presented a discrete HMM-based system for printed Arabic text recognition [M. S. Khorsheed, “Offline recognition of omnifont Arabic text using the HMM ToolKit (HTK),” Pattern Recognit. Lett., vol. 28, no. 12, pp. 1563-1571, September 2007—incorporated herein by reference in its entirety]. The sliding window was divided into a number of cells vertically. Pixel density features were calculated from each cell of a sliding window and concatenated as a feature vector. These features were later discretized. Most of the characteristics of the system are similar to [I. Bazzi, R. Schwartz, and J. Makhoul, “An omnifont open-vocabulary OCR system for English and Arabic,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 21, no. 6, pp. 495-504, June 1999—incorporated herein by reference in its entirety], apart from the fact that the system was based on discrete HMMs. Experiments were conducted on a database of six different fonts. Again, no special treatment was proposed for polyfont text recognition.
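The sliding-window pixel density features described above can be sketched as follows. This is a minimal illustration under stated assumptions: the text line is a binary image given as a list of rows (1 for ink, 0 for background), the image height is divisible by the number of cells, and the discretization is a simple uniform quantizer. The function names density_features and quantize are hypothetical, and the cited system may have discretized its features differently.

```python
def density_features(image, window_width, n_cells, shift=1):
    """Slide a window across a binary text-line image and return, for each
    window position, the ink density of each vertically stacked cell."""
    height = len(image)
    width = len(image[0])
    cell_height = height // n_cells  # assumes height is divisible by n_cells
    features = []
    for x in range(0, width - window_width + 1, shift):
        vector = []
        for c in range(n_cells):
            top = c * cell_height
            ink = sum(
                image[y][x + dx]
                for y in range(top, top + cell_height)
                for dx in range(window_width)
            )
            vector.append(ink / (cell_height * window_width))
        features.append(vector)
    return features


def quantize(vector, levels=4):
    """Map each density in [0, 1] to an integer symbol in 0..levels-1.
    Uniform quantization is an assumption made for this sketch."""
    return [min(int(v * levels), levels - 1) for v in vector]


if __name__ == "__main__":
    image = [
        [1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 1, 1],
    ]
    for vec in density_features(image, window_width=2, n_cells=2, shift=2):
        print(vec, quantize(vec))
```

The resulting sequence of discrete symbol vectors is the kind of observation sequence a discrete HMM consumes.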
Natarajan et al. [P. Natarajan, Z. Lu, R. Schwartz, I. Bazzi, and J. Makhoul, “Multilingual Machine Printed OCR,” Int. J. Pattern Recognit. Artif. Intell., vol. 15, no. 01, pp. 43-63, February 2001—incorporated herein by reference in its entirety] presented an HMM-based OCR system for multiple scripts. Most of the system components were adapted from the speech recognition system, with the exception of feature extraction. Pixel percentile features were presented as a novelty. The features were, to a large extent, robust to image noise. Pixels were accumulated from top to bottom of a sliding window frame, and the image height at a given pixel percentile was taken as a feature. Values at twenty equally-separated pixel percentiles (from 0 to 100) were appended to form a feature vector. Horizontal and vertical derivatives of the features were also appended to the feature vector. In addition, angle and correlation features were computed from ten cells of a window frame (a window frame was divided into ten overlapping cells from top to bottom). The effectiveness of the features and of the overall OCR system was demonstrated by recognizing text from three different scripts: English, Arabic, and Chinese. Unsupervised HMM adaptation was used for text recognition of documents with fax-related degradation.
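The pixel percentile idea can be illustrated with a short sketch. It assumes a window frame given as a list of binary rows (1 for ink), and it reports, for each equally spaced percentile, the normalized height of the first row at which the cumulative ink count reaches that fraction of the frame's total ink. The function name percentile_features and the normalization by frame height are assumptions made for illustration; the derivative, angle, and correlation features are omitted.

```python
def percentile_features(frame, n_percentiles=20):
    """Return one normalized row height per percentile for a binary frame.

    Ink is accumulated from the top row to the bottom row. For percentile k
    (k = 0 .. n_percentiles-1, spanning 0 to 100), the feature is the
    position of the first row whose cumulative ink is nonzero and reaches
    k / (n_percentiles - 1) of the total ink.
    """
    height = len(frame)
    cumulative = []
    running = 0
    for row in frame:
        running += sum(row)
        cumulative.append(running)
    total = running
    if total == 0:
        return [0.0] * n_percentiles  # empty frame: no ink to locate

    features = []
    for k in range(n_percentiles):
        for r, c in enumerate(cumulative):
            # Integer comparison avoids floating-point rounding issues.
            if c > 0 and c * (n_percentiles - 1) >= total * k:
                features.append(r / height)
                break
    return features


if __name__ == "__main__":
    frame = [[0, 0], [1, 1], [1, 1], [0, 0]]
    print(percentile_features(frame, n_percentiles=3))
```

Because the features depend on cumulative ink rather than on individual pixels, a few isolated noise pixels shift the percentile heights only slightly, which is consistent with the robustness to image noise noted above.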
Prasad et al. [R. Prasad, S. Saleem, M. Kamali, R. Meermeier, and P. Natarajan, “Improvements in hidden Markov model based Arabic OCR,” in 2008 19th International Conference on Pattern Recognition, 2008, pp. 1-4—incorporated herein by reference in its entirety] presented some improvements to the Arabic OCR system of the BBN group. The use of Parts of Arabic Word (PAW) language models was presented, which showed better recognition rates than word or character language models. Position-dependent HMMs, where every character shape of Arabic is treated as a separate HMM, were compared with position-independent models, where each Arabic character had only one model. In addition, contextual tri-character HMMs were tested. Results showed that a position-dependent HMM modeling strategy gives better results than position-independent HMMs. However, contextual modeling along with position-dependent HMMs did not lead to improvements and actually lowered the recognition rates. Contextual HMMs for the position-independent approach do improve the results when compared to a simple position-independent modeling approach, which can be expected. Thus, it appears the use of position-dependent HMMs may be enough to capture the contextual variations in printed Arabic text recognition. The work did not report any special strategy to deal with text recognition in multiple fonts.
Al-Muhtaseb et al. proposed a hierarchical sliding window for printed Arabic text recognition [H. A. Al-Muhtaseb, S. A. Mahmoud, and R. S. Qahwaji, “Recognition of off-line printed Arabic text using Hidden Markov Models,” Signal Processing, vol. 88, no. 12, pp. 2902-2912, 2008—incorporated herein by reference in its entirety]. A window was divided into eight non-overlapping vertical segments. Eight features (counts of ink pixels) were extracted from the eight segments. Four additional features were computed from the eight features using virtual vertical sliding windows of one-fourth the height of the writing line. Three more features were calculated using a virtual vertical overlapping sliding window of one-half the writing line height with an overlap of one-fourth the writing line height. An additional feature was computed by summing the first eight features, giving sixteen features in total. These hierarchical windows resulted in features that had more weight in the center region of the writing line (baseline) and yielded very high recognition rates on synthesized data [H. A. Al-Muhtaseb, S. A. Mahmoud, and R. S. Qahwaji, “Recognition of off-line printed Arabic text using Hidden Markov Models,” Signal Processing, vol. 88, no. 12, pp. 2902-2912, 2008—incorporated herein by reference in its entirety]. However, experiments on text line images extracted from scanned documents showed poor results [I. Ahmed, S. A. Mahmoud, and M. T. Parvez, “Printed Arabic Text Recognition,” in Guide to OCR for Arabic Scripts, V. Märgner and H. El Abed, Eds. Springer London, 2012, pp. 147-168—incorporated herein by reference in its entirety].
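One way to read the hierarchical window described above is sketched below. The sketch assumes that the virtual windows simply sum the ink counts of the segments they cover: each quarter-height window covers two adjacent segments, and each half-height window covers four segments and advances by two segments. The function name hierarchical_features and this summation rule are assumptions made for illustration, not a statement of the cited implementation.

```python
def hierarchical_features(segment_counts):
    """Build a 16-element feature vector from eight per-segment ink counts.

    Layout: 8 segment counts, 4 quarter-height window sums, 3 overlapping
    half-height window sums, and 1 total. Summing the covered segments is
    an assumption of this sketch.
    """
    if len(segment_counts) != 8:
        raise ValueError("expected ink counts for exactly eight segments")
    f = list(segment_counts)
    # Quarter-height windows: segments (1,2), (3,4), (5,6), (7,8).
    quarters = [f[i] + f[i + 1] for i in range(0, 8, 2)]
    # Half-height windows overlapping by a quarter: (1-4), (3-6), (5-8).
    halves = [sum(f[i:i + 4]) for i in range(0, 5, 2)]
    return f + quarters + halves + [sum(f)]


if __name__ == "__main__":
    print(hierarchical_features([1, 2, 3, 4, 5, 6, 7, 8]))
```

Under this reading, the four middle segments fall inside two of the three half-height windows while the outer segments fall inside only one, which matches the observation that the features weight the center region of the writing line more heavily.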
Slimane et al. [F. Slimane, O. Zayene, S. Kanoun, A. Alimi, J. Hennebert, and R. Ingold, “New features for complex Arabic fonts in cascading recognition system,” in Proc. of 21st International Conference on Pattern Recognition, 2012, pp. 738-741—incorporated herein by reference in its entirety] proposed some font-specific features for complex Arabic fonts such as DecoType Thuluth, DecoType Naskh, and Diwani Letters. These fonts are difficult due to their complex appearances and ligatures. A large number of features were proposed; some are common to all the fonts, while others are specific to each font. HMMs were used as the recognition engine. Good improvements were reported over the baseline system for all three fonts. The system was evaluated on the APTI database of printed Arabic text in multiple fonts with low resolution and different degradation conditions [F. Slimane, R. Ingold, S. Kanoun, A. M. Alimi, and J. Hennebert, “A New Arabic Printed Text Image Database and Evaluation Protocols,” in 10th International Conference on Document Analysis and Recognition, 2009, pp. 946-950—incorporated herein by reference in its entirety]. The database was generated synthetically.
Ait-Mohand et al. [K. Ait-Mohand, T. Paquet, and N. Ragot, “Combining structure and parameter adaptation of HMMs for printed text recognition,” IEEE Trans. Pattern Anal. Mach. Intell., 2014—incorporated herein by reference in its entirety] presented work on polyfont text recognition using HMMs. The main contribution of the work was related to HMM model length adaptation techniques integrated with HMM data adaptation techniques, such as maximum likelihood linear regression (MLLR) and maximum a posteriori (MAP) adaptation. The proposed techniques were effective in polyfont text recognition tasks, and significant improvements were reported over the traditionally used HMM adaptation, which addresses only the data part of the HMM. The two main limitations of the work, as pointed out by the authors, are the need for a small amount of labeled data for the test font and the assumption that the test line images will be from only a single font.