The present invention relates to a method for the automatic identification of scanned documents in an electronic document capture and storage system.
In electronic document scanning systems, a page of a document is illuminated and then scanned using one of several existing techniques, such as a charge-coupled device (CCD), to create a digital image that represents the black and white points on the page as a matrix of 0's and 1's. This matrix is then transmitted to a digital computer where it can be processed, displayed, identified and stored. In some applications the complete digital image of the document is stored. In other applications the results of an additional process such as optical character recognition (OCR) permit the text contained on the document to be stored instead of the full image of the document.
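The representation of a scanned page as a matrix of 0's and 1's can be illustrated with a minimal sketch. It assumes the scanner delivers 8-bit grayscale samples (0 = black, 255 = white) and that a fixed threshold separates black from white; the function name `binarize`, the threshold of 128, and the choice of 1 for black are illustrative assumptions, not details taken from the text above.

```python
from typing import List


def binarize(gray: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Convert grayscale samples to a binary matrix.

    Assumption: samples below the threshold are treated as black (1),
    all others as white (0). The source text does not specify this mapping.
    """
    return [[1 if sample < threshold else 0 for sample in row] for row in gray]


# A hypothetical 2 x 3 patch of a scanned page.
page = [
    [250, 12, 240],
    [30, 20, 245],
]
bitmap = binarize(page)
```

In practice the threshold would depend on the scanner and the paper; a fixed value is used here only to keep the sketch short.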
In all cases, however, the document must be identified for later retrieval, and this step currently requires manual intervention. A data input operator normally identifies and provides indexing information for each page of the document as it is entered into the system. Because it is a manual step, identifying and indexing documents constitutes the largest operating expense of most electronic document capture and storage systems. This high cost limits the benefits of such systems in many applications.
In a typical application, a data entry operator will enter the identification of each page of a document into the data base of a system after the page is scanned. This identification may be composed of as few as five or six numbers or characters, but often is of significantly greater length to properly identify the page and allow an efficient retrieval search of the document at a later time.
Document identification and indexing information usually consists of the document class (e.g., letter, invoice, credit application, etc.) and a combination of numeric and alphanumeric characters which uniquely identifies the document within its class (e.g., name of the party sending the letter, vendor invoice number, etc.). Additionally, certain "key words" may be added to allow a group of documents within the class to be simultaneously retrieved (e.g., subject of the letter, type of merchandise ordered, etc.).
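The three parts of the indexing information described above (document class, an identifier unique within the class, and optional key words for group retrieval) can be sketched as a simple record and lookup structure. All names here (`IndexRecord`, `DocumentIndex`, and the sample values) are hypothetical and chosen only for illustration; the sketch does not represent any particular system's data base layout.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class IndexRecord:
    doc_class: str                 # e.g. "letter", "invoice"
    identifier: str                # unique within its class
    keywords: List[str] = field(default_factory=list)  # optional key words


class DocumentIndex:
    """In-memory index keyed by (document class, identifier)."""

    def __init__(self) -> None:
        self._records: Dict[Tuple[str, str], IndexRecord] = {}

    def add(self, record: IndexRecord) -> None:
        key = (record.doc_class, record.identifier)
        if key in self._records:
            # The identifier must be unique within its class.
            raise ValueError(f"duplicate identifier within class: {key}")
        self._records[key] = record

    def find(self, doc_class: str, identifier: str) -> Optional[IndexRecord]:
        """Retrieve a single document by class and identifier."""
        return self._records.get((doc_class, identifier))

    def find_by_keyword(self, doc_class: str, keyword: str) -> List[IndexRecord]:
        """Retrieve every document in a class that carries the key word."""
        return [
            r for r in self._records.values()
            if r.doc_class == doc_class and keyword in r.keywords
        ]


index = DocumentIndex()
index.add(IndexRecord("invoice", "INV-1001", ["office supplies"]))
index.add(IndexRecord("invoice", "INV-1002", ["office supplies"]))
index.add(IndexRecord("letter", "J. Smith", ["contract renewal"]))
```

Keying on the pair of class and identifier mirrors the requirement that the identifier be unique only within its class, so the same identifier string may appear in two different classes.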
To achieve such data entry in a reasonably efficient manner, documents are often prepared in advance for data entry operators by manual sorting, pre-encoding, highlighting key fields or other techniques to guide the flow of the data entry operation.
In applications with standardized documents, such as the processing of Internal Revenue Service forms, automated processing for sorting and data entry is possible. In such cases OCR processing has been successfully used to identify documents created with well-known fonts. However, the vast majority of documents processed by document capture and storage systems do not have such standardized formats or well-known, OCR-readable fonts.
There is a need, therefore, for a method of automating the document identification and indexing steps of document capture and storage systems. The method must have sufficient flexibility to handle non-standardized document formats as well as varying types of printed fonts, images or handwritten characters.