This invention relates generally to automatic optical recognition of two-dimensional image data. More specifically, it relates to an apparatus and method for recognizing and analyzing two-dimensional image data, such as typewritten or printed text, by converting the two-dimensional data into three-dimensional data.
The problem of recognizing written characters by machine has been investigated in great detail. Many digital library projects have focused on applying automated character recognition to create computer-accessible collections of historic and cultural materials to support scholarship in the humanities. Once the content is in digital form, the ability to store, enhance, search and interact with it permits powerful analysis and comparison. The original written resources involved in these projects range from typewritten or printed text to handwritten manuscripts that can vary dramatically in the form and structure of the written characters.
Most digital library projects involving handwritten manuscripts work with images and human translations of handwritten materials. Some previous projects, however, have attempted to digitize and automatically translate handwritten manuscript materials using optical character recognition (OCR) techniques to convert the materials to machine-readable form. These projects include those described by Rath, T. and Manmatha, R., “Features for Word Spotting in Historical Manuscripts,” Proceedings of ICDAR 2003 Conference, Vol. 1, 218-22 (2002); T. Theeramunkong, C. Wongtapan, S. Sinthupinyo, “Offline Isolated Handwritten Thai OCR Using Island Based Projection with N-Gram Models and Hidden Markov Models,” Proceedings of ICADL 2002, Vol. 2555, 340-51 (2002); and T. Theeramunkong, V. Sornlertlamvanich, T. Tanhermhong, W. Chinnan, “Character Cluster Based Thai Information Retrieval,” Proceedings of the Fifth International Workshop on Information Retrieval with Asian languages, Hong Kong, China, 75-80 (2000); as well as the Shuhai Wenyuan Classical Chinese Digital Database and Interactive Internet Worktable project at the University of Hawaii at Manoa.
Automated recognition to provide access to printed textual materials involves scanning and computer-assisted character recognition, typically using an OCR program to identify the letter forms and “translate” them from the graphic image output of the scanner. The general process of character recognition of handwritten manuscripts is based on two-dimensional image recognition or stochastic methods applied to the two-dimensional image. These include, for example, neural networks and Hidden Markov Models (HMMs). In all cases, the script (page, letters, words, etc.) is represented as two-dimensional bitmaps. Various methods then try to infer the letter from this two-dimensional image.
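As a purely illustrative sketch of inferring a letter from a two-dimensional bitmap, the following Python fragment compares an input bitmap against a small set of stored templates and selects the closest one. Nearest-template matching is used here only because it is the simplest two-dimensional approach; it is not the neural network or HMM methods named above, and it is not the method of the cited projects. The 5x5 bitmaps, the `TEMPLATES` table, and the function names are all hypothetical.

```python
# Hypothetical 5x5 binary bitmaps: "1" = ink, "0" = background.
TEMPLATES = {
    "I": ["00100",
          "00100",
          "00100",
          "00100",
          "00100"],
    "L": ["10000",
          "10000",
          "10000",
          "10000",
          "11111"],
    "T": ["11111",
          "00100",
          "00100",
          "00100",
          "00100"],
}


def hamming(a, b):
    """Count the pixels at which two equally sized bitmaps differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def classify(bitmap):
    """Return the label of the template with the fewest differing pixels."""
    return min(TEMPLATES, key=lambda label: hamming(bitmap, TEMPLATES[label]))


# A "T" with one stray ink pixel, standing in for scanner noise.
noisy_t = ["11111",
           "00100",
           "00110",
           "00100",
           "00100"]

print(classify(noisy_t))  # prints "T"
```

A scheme of this kind tolerates a few stray pixels in regular, mechanically produced glyphs, but it presupposes that each character has already been isolated and scaled to the template grid, which is rarely the case for cursive handwriting.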
The key to machine recognition of handwritten materials is the ability to differentiate between the ascenders, descenders, loops, curls, and endpoints that define the overall letter forms. Identification of contractions, abbreviations, and punctuation creates similar challenges. Development of the techniques to extract the features from a handwritten line of characters is a significant challenge for computer scientists and has resulted in techniques to extract, sequence, cluster, categorize, and compare features to attempt to recognize and assign meaning to a given character.
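By way of a hedged illustration of one such feature, the sketch below tests whether an isolated glyph bitmap has an ascender or a descender by comparing its inked rows against assumed x-height and baseline rows. The 7-row glyphs, the row positions, and the function names are hypothetical, and the sketch assumes the glyph has already been separated from its neighbors, which is itself a hard problem for cursive text.

```python
def vertical_extent(bitmap):
    """Return (top, bottom) indices of the first and last rows containing ink.

    Assumes the bitmap contains at least one ink pixel ("1").
    """
    inked = [i for i, row in enumerate(bitmap) if "1" in row]
    return inked[0], inked[-1]


def zone_features(bitmap, x_height_row, baseline_row):
    """Flag ink above the x-height row (ascender) or below the baseline row (descender)."""
    top, bottom = vertical_extent(bitmap)
    return {"ascender": top < x_height_row, "descender": bottom > baseline_row}


# Hypothetical glyphs on a 7-row grid; rows 2-4 form the x-height band.
GLYPH_B = ["100", "100", "111", "101", "111", "000", "000"]
GLYPH_P = ["000", "000", "111", "101", "111", "100", "100"]
GLYPH_O = ["000", "000", "111", "101", "111", "000", "000"]

for name, glyph in [("b", GLYPH_B), ("p", GLYPH_P), ("o", GLYPH_O)]:
    print(name, zone_features(glyph, x_height_row=2, baseline_row=4))
```

Features such as these could then be sequenced, clustered, or compared as described above; loops, curls, and endpoints would each require their own detectors.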
The accuracy of the OCR process depends directly on the quality, regularity and resolution of the letter forms. For mechanically replicated letter forms, such as typed or typeset materials, this process is relatively straightforward. For handwritten materials with variation in letter form, however, the problem of recognition is extremely complex because the letter forms can vary significantly throughout a document. Human intervention can be used to identify and flag problem letter forms and to assist in “training” the OCR programs, which can significantly improve accuracy as similar documents are scanned.
Even with human intervention, however, complete recognition is complicated dramatically for handwritten materials, such as manuscripts or musical scores, due to variables in the content. One of these variables is that the letter forms vary significantly throughout the document. This problem is compounded when the letter forms overlap or merge with adjacent characters. Consequently, accuracy rates for OCR drop sharply and fall well short of complete recognition, even with human intervention to correct and train the recognition software. For handwritten materials the accuracy rate reaches 85-90% in closed tests, but in open tests, where “trained” OCR programs are used to recognize new text similar to that of the proposed project, current accuracy rates range from 60-75%.
FIGS. 1 and 2 show two examples of complex handwritten materials of the type that presents problems in the automatic recognition of such text. FIG. 1 shows an example of handwritten text, and FIG. 2 shows an example of a rubric from Spanish archives from the 17th century. These examples have been provided through the courtesy of the Hispanic Research Center at Arizona State University in Tempe, Ariz. The example of FIG. 1 shows cursive writing with varying letter forms and overlapping characters that must be recognized automatically. In addition, these manuscripts include abbreviations, rubrics, and signatures that must be interpreted, such as the rubric shown in FIG. 2. These examples demonstrate some of the problems presented in attempting to automatically recognize handwritten text. For instance, the cursive writing style of the examples results in a lack of separation between characters, which makes automatic recognition more difficult. For example, the bottom loop of the letter “y” in FIG. 1 runs into the line below it. The examples also demonstrate variation of letter forms. Unlike typewritten or printed text, each character varies slightly in shape and size from identical characters elsewhere in the document, even when created by the same individual. In addition, the examples show different writing styles. They are written by different people and include different styles of cursive writing. The examples also show rubrics and abbreviations, i.e., graphic images sprinkled throughout the manuscripts that convey meaning and must be recognized and associated with a meaning more complex than a single letter form. They also illustrate accents and omitted characters: accents, notations and sometimes missing characters are inserted between the rows of characters. These features of handwritten manuscripts greatly increase the complexity of the problem of automatic recognition of such manuscripts.
It is an object of the present invention to provide an apparatus and method for addressing the problems presented in attempting to automatically identify complex two-dimensional image data, such as handwritten text, manuscripts, signatures and the like.
Additional objects and advantages of the invention will be set forth in the description that follows, and in part will be apparent from the description, or may be learned by practice of the invention. The objects and advantages of the invention may be realized and obtained by means of the methods and apparatus pointed out in the appended claims.