1. Field of the Disclosure
The present disclosure relates generally to tools for managing electronic content and, more particularly, to methods for automatically selecting and extracting relevant data from the strings returned by optical character recognition of scanned printed documents having columnar data, such as academic transcripts.
2. Description of the Related Art
To select and extract relevant data from a scanned document having data arranged in a table or in columns, such as an academic transcript, the optical character recognition (OCR) output must contain the correct information strings and their positions. A “string” is an ordered collection of characters, digits and punctuation marks. To this end, the printed documents are scanned, and optical character recognition is applied to return the strings and their respective positions for all printed text.
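By way of illustration only, the output of such an OCR pass can be pictured as a list of strings, each paired with its position on the page. The following Python sketch rests on assumptions: the `OcrString` type, its field names, and the pixel coordinates are hypothetical and do not correspond to any particular OCR engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OcrString:
    """One returned string and its position (hypothetical layout)."""
    text: str
    left: int    # x coordinate of the left edge, in pixels
    top: int     # y coordinate of the top edge, in pixels
    width: int
    height: int


# Hypothetical OCR output for two printed lines, returned in arbitrary order.
ocr_strings = [
    OcrString("3.0", 420, 100, 30, 12),
    OcrString("MATH 101", 40, 100, 90, 12),
    OcrString("Course", 40, 60, 60, 12),
    OcrString("Credits", 420, 60, 70, 12),
]


def reading_order(strings):
    """Sort strings top to bottom, then left to right."""
    return sorted(strings, key=lambda s: (s.top, s.left))


ordered_text = [s.text for s in reading_order(ocr_strings)]
```

Sorting by position in this way recovers the printed order of the strings only when the reported positions are accurate.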
OCR systems can be trained to recognize characters in any user-defined font, not just fonts created specifically for optical character recognition (OCR-A, OCR-B, MICR, SEMI). OCR systems can be taught to recognize a full character set in any font created for any language. Problems can arise, however, when data in the document is presented in tabular or columnar form with header information for each column. Lines and other graphic elements may interfere with recognition of the text. With multi-line or split-line header information, a header composed of two or more lines may be recognized as two or more separate elements rather than as a single element. OCR may also be incomplete or inaccurate due to dirt, different shades of printing, stamps and the like, which can produce returned strings that are meaningless, that are reported at unrecognized positions, or both. It would be beneficial to have a method for the automatic extraction of data from an OCR document that analyzes the strings in the OCR document for tabular or columnar information, selects the header information and the information corresponding to each header, and assigns each to a table cell (i.e., a labeled column and row) in a table that is returned to the user for review or made available to a calling application for further use. It would be a further advantage to allow the user to arrange the information in the returned table to meet the user's needs.
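The split-line header problem can be illustrated with a short Python sketch. It assumes, hypothetically, that each returned string carries a bounding box, and it treats two header fragments as a single header when their horizontal extents overlap and the lower fragment begins within a small vertical gap of the upper one. The `Box` type, the function names, and the `max_gap` threshold are assumptions made for illustration and are not presented as the method of this disclosure.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A returned string with its bounding box (hypothetical layout)."""
    text: str
    left: int
    top: int
    width: int
    height: int


def overlaps_horizontally(a, b):
    """True when the horizontal extents of the two boxes intersect."""
    return a.left < b.left + b.width and b.left < a.left + a.width


def merge_split_headers(fragments, max_gap=10):
    """Join vertically adjacent, horizontally overlapping header fragments."""
    merged = []
    for frag in sorted(fragments, key=lambda f: (f.top, f.left)):
        for head in merged:
            gap = frag.top - (head.top + head.height)
            if overlaps_horizontally(head, frag) and 0 <= gap <= max_gap:
                # Grow the existing header so that it covers both fragments.
                right = max(head.left + head.width, frag.left + frag.width)
                head.text = f"{head.text} {frag.text}"
                head.left = min(head.left, frag.left)
                head.width = right - head.left
                head.height = frag.top + frag.height - head.top
                break
        else:
            # No existing header sits directly above this fragment.
            merged.append(Box(frag.text, frag.left, frag.top,
                              frag.width, frag.height))
    return merged


# Hypothetical fragments: "Credit" printed above "Hours", and a one-line "Course".
fragments = [
    Box("Credit", 400, 50, 60, 12),
    Box("Hours", 402, 64, 50, 12),
    Box("Course", 40, 57, 60, 12),
]
headers = merge_split_headers(fragments)
header_texts = [h.text for h in headers]
```

Without such a merging step, "Credit" and "Hours" in this example would be treated as two separate header elements.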