Copyright 1999 Computer Services, Inc. A portion of the disclosure of this patent document contains material which is subject to copyright protection. The owner has no objection to the facsimile reproduction by anyone of the patent document or patent disclosure, as it appears in the Patent and Trademark Office patent file or records, but otherwise reserves all copyright rights whatsoever.
1. Field Of The Invention
This invention generally relates to systems and methods for the extraction of data from digital images and, more particularly, to a system and method for the extraction of textual data from digital images.
2. Background Information
Systems are known which import data from scanned paper documents. Typically, these systems identify by physical location a data field in a scanned image of a blank document. When the system scans documents conforming to that blank document type, the data field location information is used to identify the area in the scanned document where the corresponding data appears and that data is then converted from bit mapped image data to text data for storage in a database.
In U.S. Pat. No. 4,949,392 entitled “Document Recognition and Automatic Indexing for Optical Character Recognition,” issued Aug. 14, 1990, preprinted lines appearing on the form are used to locate text data and then the preprinted lines are filtered out of the image prior to optical character recognition processing. In U.S. Pat. No. 5,140,650 entitled “Computer Implemented Method for Automatic Extraction of Data from Printed Forms,” issued Aug. 18, 1992, lines in the image of a scanned document or form are used to define a data mask based on pixel data which is then used to locate the text to be extracted. In U.S. Pat. No. 5,293,429 entitled “System and Method for Automatically Classifying Heterogeneous Business Forms,” issued Mar. 8, 1994, the system uses a definition of lines within a data form to identify fields where character image data exists. Blank forms are used to create a form dictionary which is used to identify areas in which character data may be extracted. In U.S. Pat. No. 5,416,849 entitled “Data Processing System and Method for Field Extraction of Scanned Images of Document Forms,” issued May 16, 1995, the system of document image processing uses Cartesian coordinates to define data field location. In U.S. Pat. No. 5,815,595 entitled “Method and Apparatus for Identifying Text Fields and Checkboxes in Digitized Images,” issued Sep. 29, 1998, a system locates data fields using graphic data such as lines. In U.S. Pat. No. 5,822,454 entitled “System and Method for Automatic Page Registration and Automatic Zone Detection during Forms Processing,” issued Oct. 13, 1998, the system uses positional coordinate data to identify areas within a scanned document for data extraction. In U.S. Pat. No. 5,841,905 entitled “Business Form Image Identification Using Projected Profiles of Graphical Lines and Text String Lines,” issued Nov. 24, 1998, the system uses cross-correlation of graphical image data to identify a form and the areas within the form for data extraction.
Each of these patents discloses a system which uses graphical data to identify forms or regions within a form for data extraction. Because these systems rely on graphical data to identify areas for data extraction, any additional textual data, or textual data of the wrong type, that is present in such areas will also be extracted and stored. It would be advantageous to be able to determine whether the extracted data matches the type of data that is expected to be on the form. Also, if the data were of the correct type but mispositioned somewhat with respect to its expected position on the document, it would be advantageous to be able to locate and extract such mispositioned data. Further, where data is on multiple pages, such as a two page phone bill, the systems mentioned above would require each page of the phone bill that looks different to be defined as a new template. It would be advantageous to have a system that can process data from multiple page forms without requiring additional preprocessing effort.
The present invention is a system and method for the extraction of textual data from digital images using a predefined pattern of visible and invisible characters contained in the textual data. The system comprises an image comparator, a template mapper, a zone optical character reader (OCR), a zone pattern comparator and data extractor, an extracted data parser, and a datastore. The datastore comprises a master document image database comprised of at least one table containing at least one master document image, a template database, and an extracted data database. The template database comprises at least one table comprising at least one template associated with a master document image. The template has at least one zone and associated with each zone is a unique pattern comprised of one or more data segments. Each data segment comprises a predefined sequence of visible and invisible characters, with selected ones of the data segments being associated with an extracted data field in an extracted database record. The extracted data database comprises at least one table of extracted database records and each record comprises at least one data field for storing textual information extracted from the digital image.
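By way of illustration only, the relationship among templates, zones, patterns and data segments described above might be modeled as follows. This is a minimal sketch and not the disclosed implementation: the class names, the use of regular-expression fragments to express data segments, the pixel bounding box, and the phone bill example are all assumptions introduced here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class DataSegment:
    """One piece of a zone pattern: a sequence of visible and/or invisible characters."""
    regex: str                        # assumption: the segment is written as a regex fragment
    field_name: Optional[str] = None  # extracted-record field fed by this segment, if any


@dataclass
class Zone:
    """A region of the template together with the pattern expected inside it."""
    name: str
    box: Tuple[int, int, int, int]    # assumption: (left, top, right, bottom) in pixels
    segments: List[DataSegment] = field(default_factory=list)


@dataclass
class Template:
    """A template associated with one master document image."""
    master_image_id: str
    zones: List[Zone] = field(default_factory=list)


def empty_record(template: Template) -> Dict[str, Optional[str]]:
    """Create an unpopulated record with one field per field-bearing data segment."""
    return {
        seg.field_name: None
        for zone in template.zones
        for seg in zone.segments
        if seg.field_name is not None
    }


# Hypothetical template for the first page of a phone bill.
phone_bill = Template(
    master_image_id="phone_bill_page_1",
    zones=[
        Zone("account", (50, 40, 400, 80), [
            DataSegment(r"Account"),                   # visible label, not stored
            DataSegment(r"[ \t]+"),                    # invisible characters (spaces, tabs)
            DataSegment(r"\d{10}", "account_number"),  # visible data, stored
        ]),
        Zone("total", (50, 600, 400, 640), [
            DataSegment(r"Amount Due"),
            DataSegment(r"[ \t]+\$"),
            DataSegment(r"\d+\.\d{2}", "amount_due"),
        ]),
    ],
)

record = empty_record(phone_bill)
print(record)  # {'account_number': None, 'amount_due': None}
```

Only the segments that carry a field name contribute to the record; label and whitespace segments serve to locate the data but are not stored.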
The image comparator receives from the master document image database in the datastore a master document image for comparison with a digital image. The image comparator provides an output indicative of the success of the comparison. The template mapper, on receiving the image comparator output indicating a successful comparison, retrieves from the template database in the datastore the template associated with the successfully compared master document image and applies this template to the digital image. The template mapper provides as an output an image of each zone associated with the applied template. The zone optical character reader (OCR) receives the zone images and creates as an output a zone data file of the characters in each zone image. The zone pattern comparator receives from the template database the pattern associated with the zone and compares the pattern to the zone data file. In the event that the pattern is found, the data matching the pattern is extracted. The extracted data parser receives the extracted data, parses it based on the pattern, and populates the data field of the database record associated with the digital image, which record is stored in the extracted data database.
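The zone pattern comparator and the extracted data parser described above can be illustrated with a short sketch. It assumes, purely for illustration, that each data segment is expressed as a regular-expression fragment and that the zone data file is available as a text string; the function names and the sample text are hypothetical.

```python
import re


def build_pattern(segments):
    """Join (regex, field_name) data segments into one compiled pattern.

    Segments that feed a database field become named groups; the others
    (labels, spaces, line ends) become non-capturing groups.
    """
    parts = []
    for regex, field_name in segments:
        if field_name:
            parts.append(f"(?P<{field_name}>{regex})")
        else:
            parts.append(f"(?:{regex})")
    return re.compile("".join(parts))


def extract_zone(zone_text, segments):
    """Return {field: value} if the pattern occurs in the zone text, else None."""
    match = build_pattern(segments).search(zone_text)
    if match is None:
        return None
    return match.groupdict()


# Pattern: a visible label, invisible spacing, a visible date, an invisible line end.
segments = [
    (r"Invoice Date:", None),
    (r"[ \t]+", None),
    (r"\d{2}/\d{2}/\d{4}", "invoice_date"),
    (r"\r?\n", None),
]

zone_text = "ACME TELECOM\nInvoice Date:  03/15/1999\nPage 1\n"
print(extract_zone(zone_text, segments))                   # {'invoice_date': '03/15/1999'}
print(extract_zone("Invoice Date: pending\n", segments))   # None
```

Because the pattern is searched for anywhere within the zone text, data that is somewhat mispositioned within the zone is still found, while text of the wrong type (the second call) is rejected rather than stored.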
The method for the extraction of textual data comprises:
a) selecting from a database a master document image having associated therewith a template, zone, and associated with each zone a pattern comprised of one or more data segments containing a data sequence of one or more characters;
b) creating an unpopulated database table having one or more data records, each data record having one or more data fields for containing visible character data extracted from the digital image and associating the database table with the master document image and the database record with the digital image, and, for at least one of the data segments containing visible data associating it with a database field;
c) comparing the digital image to the master document image and upon a successful match occurring:
applying the template and zone therein to the digital image,
performing optical character recognition on the character images within the zone,
creating a zone data file containing the characters optically read from the zone;
comparing the zone data file with the pattern associated with the zone;
extracting the data in the zone data file that matches the pattern, and, for each data segment associated with a data field, populating the data field with the visible data extracted from the zone data file corresponding to that data segment.
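The control flow of step c) can be sketched as follows, under stated assumptions. The image comparison, zone cropping and optical character recognition are passed in as callables because the text above does not specify them; the stand-ins used in the example treat an "image" as a dictionary of text lines and are purely hypothetical.

```python
import re


def process_image(image, master, zones, matches_master, crop, ocr):
    """Apply a template to one digital image and return the populated record.

    zones is a list of (box, compiled_pattern) pairs. matches_master, crop and
    ocr are caller-supplied stand-ins for the image comparison, zone mapping
    and OCR steps. Returns None when the image does not match the master.
    """
    if not matches_master(image, master):
        return None
    record = {}
    for box, pattern in zones:
        zone_text = ocr(crop(image, box))   # the "zone data file" for this zone
        match = pattern.search(zone_text)   # compare the zone data file with the pattern
        if match:
            record.update(match.groupdict())  # populate the associated data fields
    return record


# Toy stand-ins: an "image" is a dict holding a form id and its text lines,
# and a box is (left, top, right, bottom) measured in text rows.
master = {"form": "phone_bill"}
image = {"form": "phone_bill",
         "lines": ["ACME TELECOM", "Account  5551234567", "Amount Due  $42.10"]}

zones = [
    ((0, 1, 0, 2), re.compile(r"Account[ \t]+(?P<account_number>\d{10})")),
    ((0, 2, 0, 3), re.compile(r"Amount Due[ \t]+\$(?P<amount_due>\d+\.\d{2})")),
]

result = process_image(
    image, master, zones,
    matches_master=lambda img, m: img["form"] == m["form"],
    crop=lambda img, box: img["lines"][box[1]:box[3]],
    ocr=lambda rows: "\n".join(rows),
)
print(result)  # {'account_number': '5551234567', 'amount_due': '42.10'}
```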
In an alternate embodiment, if the extracted data cannot be successfully matched, a validation file of the unmatched data is created for review by an operator. In a further embodiment, if the scanned digital image cannot be matched with an existing master document image, a new master document image can be created from the unmatched digital image. In another alternate embodiment, alternate patterns can be used to search the data files allowing for variation in format of the data being extracted.
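As a hedged illustration of the alternate-pattern and validation embodiments, the sketch below tries each pattern for a zone in turn and collects the text of any zone that no pattern matches, so that it can be written to a validation file for an operator. The patterns, zone names and function names are assumptions made for this example.

```python
import re


def extract_with_alternates(zone_text, patterns):
    """Try each alternate pattern in order; return the first match's fields or None."""
    for pattern in patterns:
        match = re.search(pattern, zone_text)
        if match:
            return match.groupdict()
    return None


def process_zones(zone_texts, patterns_by_zone):
    """Return (record, unmatched); unmatched lists (zone, text) pairs for operator review."""
    record, unmatched = {}, []
    for zone, text in zone_texts.items():
        fields = extract_with_alternates(text, patterns_by_zone[zone])
        if fields is None:
            unmatched.append((zone, text))
        else:
            record.update(fields)
    return record, unmatched


# Two alternate formats for the same date field (hypothetical).
date_patterns = [
    r"(?P<due_date>\d{2}/\d{2}/\d{4})",
    r"(?P<due_date>[A-Z][a-z]{2}\.? \d{1,2}, \d{4})",
]

print(extract_with_alternates("Due 04/01/1999", date_patterns))    # {'due_date': '04/01/1999'}
print(extract_with_alternates("Due Apr. 1, 1999", date_patterns))  # {'due_date': 'Apr. 1, 1999'}

record, unmatched = process_zones(
    {"due": "Due upon receipt", "acct": "Account 5551234567"},
    {"due": date_patterns, "acct": [r"Account (?P<account_number>\d{10})"]},
)
print(record)     # {'account_number': '5551234567'}
print(unmatched)  # [('due', 'Due upon receipt')]
```

Writing the unmatched entries to disk is omitted here; the list returned as `unmatched` holds what such a validation file would contain.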
There has thus been outlined, rather broadly, the more important features of the invention in order that the detailed description thereof that follows may be better understood, and in order that the present contribution to the art may be better appreciated. In the several figures where there are the same or similar elements, those elements will be designated with similar reference numerals. There are, of course, additional features of the invention that will be described hereinafter and which will form the subject matter of the claims appended hereto. In this respect, before explaining at least one embodiment of the invention in detail, it is to be understood that the invention is not limited in its application to the details of construction and to the arrangements of the components set forth in the following description or illustrated in the drawings. The invention is capable of other embodiments and of being practiced and carried out in various ways. Also, it is to be understood that the phraseology and terminology employed herein are for the purpose of description and should not be regarded as limiting. As such, those skilled in the art will appreciate that the conception, upon which this disclosure is based, may readily be utilized as a basis for the designing of other structures, methods and systems for carrying out the several purposes of the present invention. It is important, therefore, that the claims be regarded as including such equivalent constructions insofar as they do not depart from the spirit and scope of the present invention.
Further, the purpose of the foregoing abstract is to enable the U.S. Patent and Trademark Office and the public generally, and especially the scientists, engineers and practitioners in the art who are not familiar with patent or legal terms or phraseology, to determine quickly from a cursory inspection the nature and essence of the technical disclosure of the application. The abstract is neither intended to define the invention of the application, which is measured by the claims, nor is it intended to be limiting as to the scope of the invention in any way.