1. Field of the Invention
The present invention generally relates to the field of conversion of paper documents to electronic data.
2. Related Art
As the number of documents being digitally captured and distributed in electronic form increases, there is a growing need for techniques to quickly classify the purpose or intent of digitally captured documents, protect the security of the content of the documents, efficiently display the content of the documents to different users, and allow users of such a system to monitor the process.
At one time, document classification was done manually. An operator would visually scan and sort the documents by document type. This process was tedious, time consuming, and expensive. As computers have become more commonplace, the quantity of new documents, including on-line publications, has increased greatly, and the number of electronic document databases has grown almost as quickly. As the number of documents being digitally captured and distributed in electronic form increases, the old, manual methods of classifying documents are simply no longer practical. Similarly, the conversion of the information in paper documents is an inefficient process which often involves data entry operators transcribing directly from original documents to create keyed data.
A great deal of work on document classification and analysis has been done in the areas of document management systems and document recognition. Specifically, the areas of page decomposition and optical character recognition (OCR) are well developed in the art. Page decomposition involves automatically recognizing the organization of an electronic document. This usually includes determining the size, location, and organization of distinct portions of an electronic document. For example, a particular page of an electronic document may include data of various types including paragraphs of text, graphics, and spreadsheet data. The page decomposition would typically be able to automatically determine the size and location of each particular portion (perhaps by indicating a perimeter), as well as the type of data found in each portion. Some page decomposition software will go further than merely determining the type of data found in each portion, and will also determine format information within each portion. For example, the font, font size, and justification may be determined for a block containing text.
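By way of illustration only, the kind of result a page decomposition step produces might be represented as in the following sketch. The names Region and PageLayout, the field names, and the pixel-coordinate convention are assumptions made for this example and are not taken from any particular page decomposition product.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Region:
    """One distinct portion of a page (hypothetical structure)."""
    # Perimeter expressed as a bounding box in pixels (assumed convention).
    x: int
    y: int
    width: int
    height: int
    # Type of data found in the portion, e.g. "text", "graphic", "spreadsheet".
    kind: str
    # Optional format information, meaningful only for text portions.
    font: Optional[str] = None
    font_size: Optional[float] = None
    justification: Optional[str] = None


@dataclass
class PageLayout:
    """Result of decomposing a single page (hypothetical structure)."""
    page_width: int
    page_height: int
    regions: List[Region] = field(default_factory=list)

    def regions_of_kind(self, kind: str) -> List[Region]:
        # Return every portion whose data type matches the requested kind.
        return [r for r in self.regions if r.kind == kind]


# Example page containing one text block and one graphic.
page = PageLayout(
    page_width=2550,
    page_height=3300,
    regions=[
        Region(150, 200, 2250, 600, "text",
               font="Times", font_size=12.0, justification="left"),
        Region(150, 900, 1000, 800, "graphic"),
    ],
)

text_blocks = page.regions_of_kind("text")
```

Keeping the format fields optional reflects the point made above that only some page decomposition software determines format information in addition to the type of data.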
OCR involves converting a digital image of textual information into a form that can be processed as textual information. Since electronically captured documents are often simply optically scanned digital images of paper documents, page decomposition and OCR are often used together to gather information about the digital image and sometimes to create an electronic document that is easy to edit and manipulate with commonly available word processing and document publishing software. In addition, the textual information collected from the image through OCR is often used to allow documents to be searched based on their textual content.
There have also been a number of systems proposed which deal with classifying and extracting data from multiple document types, but many of these rely on some sort of identity string printed on the document itself. There are also systems available for automatically recognizing a candidate form as an instance of a specific form contained within a forms database based on the structure of lines on the form. These systems rely, however, on the fixed structure and scale of the documents involved. Finally, there are expert systems that have been designed using machine learning techniques to classify and extract data from diverse electronic documents. One such expert system is described in U.S. patent application Ser. No. 09/070,439 entitled “Automatic Extraction of Metadata Using a Neural Network,” now U.S. Pat. No. 6,044,375. Machine learning techniques generally require a training phase which may demand a large amount of computational power. Therefore, these classification systems may be made to operate much more efficiently to extract data from documents if the document type of a new document is known.
From the foregoing it will be apparent that there is still a need for a method to quickly and automatically compare a new document to a number of previously seen documents of known type to classify the new document as either belonging to a known type, or as belonging to a new type.
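Purely to make the stated need concrete, the following sketch shows one simple way a new document could be compared against previously seen documents of known type and labeled either as a known type or as a new type. It is not the method of the present invention. The use of Jaccard similarity over sets of feature tokens, the function names jaccard and classify, the 0.5 threshold, and the sample data are all assumptions chosen for this example.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def classify(new_features, known, threshold=0.5):
    """Compare a new document to previously seen documents of known type.

    new_features: set of feature tokens for the new document (assumed input).
    known: dict mapping a type name to a list of feature sets, one per
           previously seen document of that type.
    Returns (type_name, score) when the best match reaches the threshold,
    otherwise (None, score) to indicate a new, unseen type.
    """
    best_type, best_score = None, 0.0
    for doc_type, examples in known.items():
        for features in examples:
            score = jaccard(new_features, features)
            if score > best_score:
                best_type, best_score = doc_type, score
    if best_score >= threshold:
        return best_type, best_score
    return None, best_score


# Hypothetical previously seen documents, one example per type.
known_documents = {
    "invoice": [{"invoice", "total", "due", "amount"}],
    "resume": [{"education", "experience", "skills"}],
}

# Shares three of five distinct tokens with the invoice example.
match = classify({"invoice", "total", "amount", "date"}, known_documents)

# Shares no tokens with any known document, so it is treated as a new type.
no_match = classify({"recipe", "ingredients", "oven"}, known_documents)
```

The threshold is what separates the two outcomes: a best score at or above it assigns the new document to the closest known type, while a lower score marks the document as belonging to a new type.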