Two major components of document conversion are page decomposition and text recognition (OCR). Page decomposition identifies the overall layout of a document page, whereas OCR identifies the ASCII components found within the page components.
A page decomposition or segmentation module accepts an input document page, and processes it into its constituent parts, including text, tables, references, procedural data, and graphics. Because page decomposition occurs first in the processing chain, it is one of the most important modules of any automatic document understanding system. It is very difficult to recover from any errors that occur at this stage of processing, and as such it is important that a very reliable page decomposition module be developed.
Most page segmentation methods can be classified into three broad categories: bottom-up, top-down and hybrid. The bottom-up strategies usually begin with the connected components of the image and merge them into larger and larger regions. The components are merged into words, words into lines, lines into columns, etc., until the entire page is completely assembled. In the top-down approaches, the page is first split into blocks, and these blocks are identified and subdivided appropriately, often using projection profiles. The hybrid methods combine aspects of both the top-down and bottom-up approaches. The page can be roughly segmented by a sequence of horizontal and vertical projections and then connectivity analysis is used to complete the segmentation.
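The projection-profile step used by the top-down and hybrid approaches can be illustrated with a minimal sketch. The toy bitmap and the function names below are assumptions made for illustration only; a practical system would work on scanned images and tolerate noise in the gaps.

```python
def projection_profile(image, axis):
    """Count ink pixels per row (axis=0) or per column (axis=1)."""
    if axis == 0:
        return [sum(row) for row in image]
    return [sum(col) for col in zip(*image)]


def split_on_gaps(profile):
    """Return (start, end) index pairs for each run of non-empty entries."""
    runs, start = [], None
    for i, count in enumerate(profile):
        if count > 0 and start is None:
            start = i
        elif count == 0 and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


# Hypothetical binary page: 1 = ink, 0 = background.
page = [
    [1, 1, 0, 1, 1],
    [1, 1, 0, 1, 1],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]

# Horizontal projection splits the page into bands of rows.
rows = split_on_gaps(projection_profile(page, axis=0))

# Vertical projection then splits the first band into columns.
first_band = page[rows[0][0]:rows[0][1]]
cols = split_on_gaps(projection_profile(first_band, axis=1))
```

Here the blank row separates two bands, and the blank column splits the first band into two blocks. Applying the two projections alternately to each resulting block gives a recursive top-down decomposition.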
Existing document conversion techniques include: 1) OCR (optical character recognition) document conversion, 2) form-based OCR, and 3) combined image and OCR systems. The first technique understands and translates the document into its individual components. The second utilizes spatial and content constraints of document forms to increase OCR reliability. The third leaves the document in its image form, but also utilizes OCRed text to provide indexing into the image database.
An OCR document conversion system which translates document images into ASCII and graphics components is illustrated in FIG. 1. The approach begins with page decomposition, wherein each page is segmented into graphic and text regions. Algorithms to process these regions are incorporated into the system. Graphical operations include extracting text from within graphics and raster-to-vector conversion. OCR is computed for the text regions, as well as any text located within the graphical regions. An integration step combines the results of the graphical and textual processes into a final electronic format.
The ASCII and graphics format obtained in this system is the only record of the document which is stored, as no images of the original document are kept. Since any errors introduced during page decomposition are propagated forward in the system, it is important that the graphical and textual regions of the page be correctly identified. Thus, the automatically generated regions must be manually checked, or the page must be manually separated into regions. It is also important that the text be correctly recognized to ensure that the document is correctly captured. This requires that the OCR results be manually scanned and any errors corrected. Similarly, the text within graphics must be correctly extracted and recognized, again requiring manual checking. The cost of this manual checking can be large and is prohibitive for many applications.
Form-based OCR systems use the spatial and content constraints found in document forms to increase OCR conversion performance. According to this approach, a "form" may be defined as a document containing data written in fields that are spatially stable on the document. All form processing systems require the creation of a template showing the system what a particular form looks like and where to find the fields to read.
Form identification decides which master form was used in a given image, and passes that information to the form removal functions. Form removal strips all standard data from the scanned forms, including lines, instructions and examples, leaving only the information entered into the form by the applicant. The space necessary to store images is reduced by allowing users to save stripped forms and later add back the master form data before displaying or printing.
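Form removal and later restoration can be sketched on binary bitmaps as follows. The sketch assumes that the scanned form is perfectly registered with the master form and that the applicant's entries do not overlap any master-form ink; neither assumption is stated in the text above, and the function names are hypothetical.

```python
def remove_form(scanned, master):
    """Keep only pixels that are ink in the scan but not in the master form."""
    return [
        [int(s and not m) for s, m in zip(scan_row, master_row)]
        for scan_row, master_row in zip(scanned, master)
    ]


def restore_form(stripped, master):
    """Add the master form data back before displaying or printing."""
    return [
        [int(s or m) for s, m in zip(strip_row, master_row)]
        for strip_row, master_row in zip(stripped, master)
    ]


# Hypothetical 3x3 example: two ruled lines with one entered mark between them.
master = [
    [1, 1, 1],
    [0, 0, 0],
    [1, 1, 1],
]
scanned = [
    [1, 1, 1],
    [0, 1, 0],
    [1, 1, 1],
]

stripped = remove_form(scanned, master)
restored = restore_form(stripped, master)
```

Only the stripped bitmap needs to be stored per document, since the master form is shared by every instance of the same form type.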
Full-function form processors read hand-printed characters and optical marks such as checkboxes in addition to machine print. Form processing systems offer a broad array of tools, some standardized and some custom, which exploit information specific to a particular form type for error detection or correction. Restrictions can be applied to fields to increase recognition accuracy. For example, the field masking tool permits individual characters within a field to be recognized exclusively as either an alpha character or a numeric character. Thus, a social security number field can be required to contain only nine numeric characters and possibly two dashes. Segmentation of the field is then limited to 9 or 11 characters, and the recognition alphabet is limited to the 10 digits and the dash. Additional tools include external table lookups, where field contents can be compared to a table of possible responses and the best match selected; check-digit computations, where errors are detected by performing a mathematical operation on a field's contents and comparing the result to a predetermined total contained in the same field; and range limits, where the number in the field is checked to determine if it is within a valid range.
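The field-level tools described above can be sketched as four small checks. This is an illustration under stated assumptions, not the method of any particular form processor: the Luhn checksum stands in for the unspecified check-digit computation, the table lookup uses the Python standard library's `difflib`, and all function names and sample values are hypothetical.

```python
import difflib
import re

# Field mask: nine digits, either undashed (9 chars) or as ddd-dd-dddd (11 chars).
SSN_MASK = re.compile(r"\d{9}|\d{3}-\d{2}-\d{4}")


def matches_ssn_mask(text):
    return SSN_MASK.fullmatch(text) is not None


def luhn_valid(digits):
    """Check-digit test. ASSUMPTION: Luhn is used only as a familiar example."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def in_range(value, low, high):
    """Range limit: accept the field only if low <= value <= high."""
    return low <= value <= high


def best_table_match(text, table):
    """External table lookup: return the closest table entry, or None."""
    matches = difflib.get_close_matches(text, table, n=1, cutoff=0.6)
    return matches[0] if matches else None


states = ["Nevada", "Nebraska", "New York"]
corrected = best_table_match("Nevnda", states)  # OCR read 'a' as 'n'
```

A field that fails any of these checks would typically be routed to an operator rather than accepted automatically.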
Unlike generic OCR systems, which only allow the error tolerance level to be set globally, form processing systems permit error tolerance levels (confidence levels) to be set on a field level. Thus, important fields, where field importance differs based on the form, can have their confidence levels set to a threshold requiring higher OCR accuracy.
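Field-level error tolerance can be sketched as a lookup of per-field thresholds with a global fallback. The field names, threshold values, and confidence scores below are hypothetical.

```python
DEFAULT_THRESHOLD = 0.80                          # global tolerance, as in generic OCR
FIELD_THRESHOLDS = {"ssn": 0.99, "amount": 0.95}  # stricter for important fields


def fields_needing_review(results, thresholds=FIELD_THRESHOLDS,
                          default=DEFAULT_THRESHOLD):
    """Return names of fields whose OCR confidence is below their threshold."""
    flagged = []
    for name, (_text, confidence) in results.items():
        if confidence < thresholds.get(name, default):
            flagged.append(name)
    return flagged


# Hypothetical recognizer output: field name -> (text, confidence).
results = {
    "ssn": ("123-45-6789", 0.97),
    "amount": ("100.00", 0.96),
    "comment": ("n/a", 0.85),
}
flagged = fields_needing_review(results)
```

In this example only the social security number is flagged: its confidence of 0.97 would pass the global threshold but not the stricter one set for that field.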
Combined OCR and image systems utilize both data sources to provide robust retrieval, as illustrated in FIG. 2. The approach does not convert the document page directly to ASCII and graphics format, but rather saves a bit-mapped image of the document page and devises indexing schemes to retrieve specific pages. When a page is retrieved, operations can then be performed on the areas of interest. For example, OCR can be computed on selected areas, or graphics can be extracted.
Once the text regions of a document have been identified, the characters within the regions need to be recognized. There are many different approaches to character recognition, but they can be generally grouped into two main categories: template-based methods and feature-based methods. Template methods maintain a collection of sample letters and identify a component in question by finding the closest-matching template. Feature methods, on the other hand, try to break the component into a collection of "features" by identifying where strokes join or curve significantly.
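The template approach can be sketched as a nearest-template search. The 3x3 bitmaps and the use of Hamming distance as the similarity measure are assumptions chosen for brevity; real template sets are far larger and are normalized for size and position.

```python
# Hypothetical 3x3 templates: '1' = ink, '0' = background.
TEMPLATES = {
    "I": ("010", "010", "010"),
    "L": ("100", "100", "111"),
    "T": ("111", "010", "010"),
}


def hamming(a, b):
    """Number of pixel positions at which two equally sized bitmaps differ."""
    return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def classify(glyph, templates=TEMPLATES):
    """Return the label of the closest-matching template."""
    return min(templates, key=lambda label: hamming(glyph, templates[label]))


noisy_t = ("111", "010", "011")   # a 'T' with one stray pixel
label = classify(noisy_t)
```

The stray pixel leaves the glyph at distance 1 from the "T" template and farther from the others, so the noisy component is still labeled "T".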
The classic template solutions compare each component to a collection of models representing all possible letters in all possible fonts. Thus, templates must be created for each of the different fonts. In contrast, feature-based recognition algorithms need not be tuned to individual typefaces, because they are based on finding characteristic features of each letter. For example, regardless of the typeface, a lowercase "t" consists of a strong vertical stroke crossed with a horizontal stroke. Thus, the feature-based methods attempt to find this essence of the letter.
Each of the OCR techniques has its benefits and shortcomings. Combining the various methods in a voting scheme can overcome the limitations of each of the individual methods. In a voting scheme, the results of each of the OCR modules are passed to a decision module to determine a final recognition result. Since the decision module has knowledge about each of the OCR modules, it can determine the best possible answer.
The decision module can keep track of the character results and which OCR methods presented the correct response to the decision module. For example, if three methods report that the input character is a "B" and one method decides the character is an "8", the decision module will likely choose "B" as the best result. Further, the module that made the mistake will be noted for the next time. This adaptive learning approach allows the system to learn from its mistakes.
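A decision module of this kind can be sketched as weighted voting with a simple penalty for dissenting engines. The class name, engine names, and penalty factor are hypothetical, and the sketch assumes that the winning answer is treated as correct when updating the weights, which a real system would confirm against ground truth where available.

```python
from collections import defaultdict

PENALTY = 0.9  # ASSUMPTION: weight multiplier for an engine that was outvoted


class VotingDecider:
    def __init__(self, engine_names):
        # Every OCR engine starts with equal influence.
        self.weights = {name: 1.0 for name in engine_names}

    def decide(self, votes):
        """votes maps engine name -> recognized character."""
        scores = defaultdict(float)
        for engine, char in votes.items():
            scores[char] += self.weights[engine]
        winner = max(scores, key=scores.get)
        # Note which engines disagreed so they count for less next time.
        for engine, char in votes.items():
            if char != winner:
                self.weights[engine] *= PENALTY
        return winner


decider = VotingDecider(["ocr_a", "ocr_b", "ocr_c", "ocr_d"])
result = decider.decide({"ocr_a": "B", "ocr_b": "B", "ocr_c": "B", "ocr_d": "8"})
```

After this call the engine that reported "8" carries a weight of 0.9, so its vote counts slightly less on the next character.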
It is important to note that voting systems perform best when the hypotheses from the OCR systems are of high accuracy. When a text region is degraded and difficult to read, there is usually much disagreement among the recognizers, which is difficult for a voting system to resolve.
Each year, the Information Science Research Institute (ISRI) at the University of Nevada, Las Vegas (UNLV) conducts a test of the performance of various OCR systems, many of which are commercially available. Although recently tested OCR systems do not quite reach 100%, current recognition rates are impressive and improvement is ongoing. Achieving the last few percent is always the most difficult part, but OCR developers are steadily increasing their performance. With the incorporation of a voting scheme, the recognition rates increase even more.
If the OCR-generated text is to be used in a text retrieval application, the percentage of words correctly recognized by the OCR system is of considerable interest. In a text retrieval system, the documents are retrieved from a database by matching search terms with words in the document. Thus, the word accuracy of the OCR-generated text is very important. Common words, such as "and," "of," "the," etc., usually provide no retrieval value in an indexing system. These words are termed stop words, and all other words are termed non-stop words. It is the recognition rate of these non-stop words that is of greatest importance to text retrieval applications.
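Non-stop word accuracy can be sketched as the fraction of ground-truth non-stop words that also appear in the OCR output. The stop-word list and sample strings are hypothetical, and the bag-of-words comparison is a simplification: it ignores word order and does no alignment between the two texts.

```python
from collections import Counter

STOP_WORDS = {"and", "of", "the", "a", "in", "to"}  # hypothetical short list


def non_stop_word_accuracy(truth, ocr, stop_words=STOP_WORDS):
    """Fraction of ground-truth non-stop words found in the OCR text."""
    truth_counts = Counter(
        w for w in truth.lower().split() if w not in stop_words
    )
    ocr_counts = Counter(ocr.lower().split())
    total = sum(truth_counts.values())
    if total == 0:
        return 1.0
    found = sum(min(n, ocr_counts[w]) for w, n in truth_counts.items())
    return found / total


truth = "the decomposition of the page and the recognition of text"
ocr = "the decomposition of tbe page and the recogniton of text"
accuracy = non_stop_word_accuracy(truth, ocr)
```

The misread stop word "tbe" does not affect the score, while the misread search term "recogniton" does, giving three of four non-stop words correct.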
If OCR is to be used as a conversion process to input technical manuals into ASCII form, manual checking and correcting of the OCR output will be necessary. Assuming an OCR character accuracy rate of 99%, a page with 4000 characters would contain about 40 character errors. The issue becomes the cost of this manual correction versus the effectiveness of OCR, i.e., is it cheaper to correct the OCRed version or simply retype it?
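The error estimate above follows directly from the character accuracy. The second function below goes one step further and estimates how often a word contains at least one error; it assumes that character errors are independent of one another, which is an assumption of this sketch and not a claim made in the text.

```python
def expected_char_errors(chars_per_page, char_accuracy):
    """Expected number of misrecognized characters on a page."""
    return chars_per_page * (1.0 - char_accuracy)


def word_error_rate(char_accuracy, chars_per_word):
    """Probability that a word has at least one character error.

    ASSUMPTION: character errors are statistically independent.
    """
    return 1.0 - char_accuracy ** chars_per_word


errors_per_page = round(expected_char_errors(4000, 0.99))
five_letter_word_error = word_error_rate(0.99, 5)
```

Under the independence assumption, roughly one five-letter word in twenty would contain an error at 99% character accuracy, which is why word-level accuracy is noticeably lower than character-level accuracy.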
To answer this question, we conducted a test using the OmniPage Professional OCR product to determine the time needed to correct an OCRed document versus the time needed to retype the document. It was assumed that the documents can be scanned and OCRed in a batch mode with the results saved to a file for future manual correction. Thus, only the actual labor costs are measured, not any time spent scanning and recognizing the document.
Seven pages from various documents were scanned and OCRed. The pages chosen for this test were quite simple, but included different fonts, and bold and italic characters. They contained single columns of text, no graphics and very few underlined sentences, since underlines tended to present a problem to the OmniPage recognizer. A bibliography page was also included in the set to introduce digits (from dates and page numbers) and proper nouns (authors' names) which cannot be automatically corrected by dictionary look-ups.
OmniPage offers a method to check its OCR results. Any characters that the system has a difficult time recognizing are highlighted and the original image of the word in the context of the original page is presented to the user for possible correction. This method does not flag all OCR errors and presents numerous correct characters for viewing. Thus, this process was not incorporated in our timing test. The person correcting the text did not use this feature of the OmniPage system, but was allowed to use Microsoft Word spell checker to flag possible misspellings for correction.
Each OCRed page was manually corrected, and the correction time recorded. It took approximately 56 minutes to correct the seven pages. Assuming a typist can type 50 wpm, the time to retype these pages is 74 minutes or about 62 minutes at 60 wpm. From these numbers, it appears that OCRing the documents may be slightly more beneficial. However, a closer review of the manual corrections is needed.
The typical OCR errors include character omissions, additions and substitutions, bold and italic typeface errors, and incorrect spacing. Most of the missed errors were words recognized as bold typeface which were not bold in the original documents. The bibliography page (page five in the tables above) proved to be quite a challenge for the OCR system with fifty detected errors. The page was included because of its intermingling of digits and characters, and its inclusion of proper names and acronyms. This type of text must be carefully reviewed for errors. Unlike regular paragraphs, it cannot be checked simply by reading the flow of the sentences. Dates, page numbers, and authors' names must be carefully checked. Indeed, although the time required for retyping and the time required for manually correcting the pages in our test set were similar, the manual correction stage still left many errors uncorrected. Depending on the accuracy required, each page may need to be corrected by more than one person, thus doubling the time of manual correction.
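The timing comparison can be reproduced with simple arithmetic. The word count of the seven pages is not reported; the value used below is back-computed from the 74 minute figure at 50 wpm and is therefore an inference, as are the function names.

```python
CORRECTION_MINUTES = 56          # reported time to correct the seven pages
RETYPE_MINUTES_AT_50_WPM = 74    # reported retyping estimate at 50 wpm

# ASSUMPTION: word count inferred from the reported retyping estimate.
implied_words = RETYPE_MINUTES_AT_50_WPM * 50


def retype_minutes(words, wpm):
    """Time to retype a given number of words at a given typing speed."""
    return words / wpm


def correction_minutes(single_pass_minutes, reviewers):
    """Total correction time when each page is checked by several people."""
    return single_pass_minutes * reviewers


at_60_wpm = retype_minutes(implied_words, 60)
two_reviewers = correction_minutes(CORRECTION_MINUTES, 2)
```

A single correction pass (56 minutes) is somewhat faster than retyping at either speed, but a second reviewer brings the total to 112 minutes, which exceeds both retyping estimates.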
The results of this experiment confirmed that OCR technology cannot be used to convert documents (either automatically or semi-automatically) in a cost-effective manner. More cost-effective methods are badly needed, however, to convert existing large-scale, paper-document databases into electronic form. Within the U.S. Government community, for example, reauthoring technical manuals into hypertext format costs between $200 and $1500 per page.
The use of hypertext documents has proven to be a cost-effective tool for supporting military equipment maintenance through the Department of Defense (DoD) Computer-aided Acquisition and Logistic Support (CALS) program. In this program, a hypertext format (IETM) was used for storing textual, graphical, audio, or video data in a revisable database. The IETM form enables the electronic data user to locate information easily, and to present it faster, more comprehensibly, more specifically matched to the configuration, and in a form that requires much less storage than paper. Powerful troubleshooting procedures not possible with paper Technical Manuals are possible using the computational capability of the IETM Display Device.
At the center of the IETM concept is the Interactive Electronic Technical Manual DataBase (IETMDB). This data structure is constructed from composite nodes which form the basic units of information within the IETMDB. These nodes are composed of primitives, relationships to other pieces of information, and context attributes. The primitives include text, tables, graphics, and dialogs. The IETMDB is "format-free" in that it does not contain presentation information. As such, it does not impose structural requirements on the actual Data Base Management System (DBMS) methodology in use.
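The composite node described above can be sketched as a small data structure. Every class, field, and value name below is hypothetical and chosen for illustration; the sketch does not reproduce the actual IETMDB schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class PrimitiveKind(Enum):
    TEXT = "text"
    TABLE = "table"
    GRAPHIC = "graphic"
    DIALOG = "dialog"


@dataclass
class Primitive:
    kind: PrimitiveKind
    content: str          # content only; no presentation information is stored


@dataclass
class CompositeNode:
    node_id: str
    primitives: list = field(default_factory=list)
    # Relationship name -> ids of related nodes.
    relationships: dict = field(default_factory=dict)
    # Context attributes, e.g. the equipment configuration the node applies to.
    context: dict = field(default_factory=dict)


# Hypothetical maintenance step linked to a follow-on step.
step = CompositeNode(
    node_id="step-001",
    primitives=[Primitive(PrimitiveKind.TEXT, "Remove the access panel.")],
    relationships={"next": ["step-002"]},
    context={"configuration": "model-A"},
)
```

Because the node holds no layout or styling, the same record could be rendered by different display devices or stored in different database systems.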
In summary, a hypertext-based approach to document conversion has potential for large-scale projects. However, in order to serve a greater technical and digital library community, existing hypertext approaches will need to be extended to include more general encoding, revising, and distribution capabilities applicable to electronic technical data and documents.