1. Field of the Invention
The present invention relates generally to document classification systems and more particularly to a method of quickly and automatically classifying a new document by comparison against a number of documents of known type.
2. Description of the Related Art
As the number of documents being digitally captured and distributed in electronic form increases, there is a growing need for techniques to quickly classify the purpose or intent of digitally captured documents.
At one time, document classification was done manually. An operator would visually scan and sort the documents by document type. This process was tedious, time-consuming, and expensive. As computers have become more commonplace, the quantity of new documents, including on-line publications, has increased greatly, and the number of electronic document databases has grown almost as quickly. With this growth in digitally captured and electronically distributed documents, the old, manual methods of classifying documents are simply no longer practical.
A great deal of work on document classification and analysis has been done in the areas of document management systems and document recognition. Specifically, the areas of page decomposition and optical character recognition (OCR) are well developed in the art. Page decomposition involves automatically recognizing the organization of an electronic document. This usually includes determining the size, location, and organization of distinct portions of an electronic document. For example, a particular page of an electronic document may include data of various types including paragraphs of text, graphics, and spreadsheet data. The page decomposition would typically be able to automatically determine the size and location of each particular portion (perhaps by indicating a perimeter), as well as the type of data found in each portion. Some page decomposition software will go further than merely determining the type of data found in each portion, and will also determine format information within each portion. For example, the font, font size, and justification may be determined for a block containing text.
OCR involves converting a digital image of textual information into a form that can be processed as textual information. Since electronically captured documents are often simply optically scanned digital images of paper documents, page decomposition and OCR are often used together to gather information about the digital image and sometimes to create an electronic document that is easy to edit and manipulate with commonly available word processing and document publishing software. In addition, the textual information collected from the image through OCR is often used to allow documents to be searched based on their textual content.
There have also been a number of systems proposed which deal with classifying and extracting data from multiple document types, but many of these rely on some sort of identity string printed on the document itself. There are also systems available for automatically recognizing a new form as a particular form out of a forms database based on the structure of lines on the form. These systems rely, however, on the fixed structure and scale of the documents involved. Finally, there are expert systems that have been designed using machine learning techniques to classify and extract data from diverse electronic documents. One such expert system is described in U.S. patent application Ser. No. 09/070,439 entitled "Automatic Extraction of Metadata Using a Neural Network," now U.S. Pat. No. 6,044,375. Machine learning techniques generally require a training phase which may demand a good deal of computational power. Therefore these classification systems may be made to operate much more efficiently to extract data from documents if the document type of a new document is known.
From the foregoing it will be apparent that there is still a need for a method to quickly and automatically compare a new document to a number of previously seen documents of known type to classify the new document as either belonging to a known type, or as belonging to a new type.
The invention provides a method of quickly and automatically comparing a new document to a number of previously seen documents and identifying the document type. The method of the invention begins by providing a plurality of document type distributions. Each document type distribution describes layout characteristics of an independent document type and may include a plurality of data points. Each document type distribution includes data derived from at least one basis document signature. A basis document signature includes a plurality of data points which can be computed from an individual basis document. The data points may represent a low-resolution image of the basis document, a low-resolution representation of the document segmentation of the basis document, or some other similar representation of the basis document. The data derived from the at least one basis document signature may include a representative statistic value, such as a mean or median value, of each of the data points across each of the at least one basis document signatures.
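As an illustration only, the derivation of a document type distribution from basis document signatures might be sketched as follows. The function name `build_distribution`, the dictionary layout, and the use of the population standard deviation as the spread are assumptions made for this sketch and are not taken from the specification.

```python
from statistics import mean, pstdev


def build_distribution(signatures):
    """Summarize equal-length basis signatures of one document type.

    Returns the per-data-point mean and spread (population standard
    deviation). Both the layout of the result and the choice of spread
    measure are illustrative assumptions.
    """
    if not signatures:
        raise ValueError("need at least one basis document signature")
    length = len(signatures[0])
    if any(len(s) != length for s in signatures):
        raise ValueError("all signatures must have the same length")
    # zip(*signatures) groups the i-th data point of every signature together.
    columns = list(zip(*signatures))
    return {
        "mean": [mean(col) for col in columns],
        "spread": [pstdev(col) for col in columns],
    }


# Two hypothetical basis signatures of the same document type.
invoice_signatures = [[0.0, 1.0, 0.5], [0.2, 0.8, 0.5]]
invoice = build_distribution(invoice_signatures)
```

A median could be substituted for the mean by replacing `mean` with `statistics.median`; with a single basis signature the spread of every data point is zero.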
The next step is providing a new electronic document. Then a new document signature is created from the new electronic document. The new document signature describes the layout characteristics of the new electronic document and may include data defining pixels of a low-resolution image of the new electronic document, a low-resolution representation of the document segmentation of the new electronic document, or some other similar representation of the new electronic document.
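One way a low-resolution image signature could be computed is by averaging blocks of pixels, as in the minimal sketch below. It assumes the document is available as a grayscale image held as a list of pixel rows; the name `make_signature` and the block-averaging scheme are hypothetical choices for illustration, not the method prescribed by the specification.

```python
def make_signature(pixels, rows, cols):
    """Reduce a grayscale image to a rows x cols grid of block averages.

    `pixels` is a list of equal-length rows of intensity values. The
    result is a flat list of rows * cols data points in row-major order.
    """
    height, width = len(pixels), len(pixels[0])
    if rows > height or cols > width:
        raise ValueError("signature grid is finer than the image")
    signature = []
    for r in range(rows):
        top, bottom = r * height // rows, (r + 1) * height // rows
        for c in range(cols):
            left, right = c * width // cols, (c + 1) * width // cols
            block = [pixels[y][x]
                     for y in range(top, bottom)
                     for x in range(left, right)]
            signature.append(sum(block) / len(block))
    return signature


# A hypothetical 4 x 4 page reduced to a 2 x 2 signature.
page = [
    [0, 0, 255, 255],
    [0, 0, 255, 255],
    [255, 255, 0, 0],
    [255, 255, 0, 0],
]
new_signature = make_signature(page, 2, 2)
```

Because every document is reduced to the same number of data points, signatures from pages of different pixel dimensions remain directly comparable.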
Next, distances between the new document signature and each of the plurality of document type distributions are calculated. The distances may be calculated using distance measures known in the art, such as Euclidean distance, Mahalanobis distance, an algorithm based on a Bayesian framework for a Gaussian distribution, or other measures. Additionally, distance calculations may weight the value given each of a plurality of data points in the basis document signatures or the document type distributions based on the usefulness of that data point in distinguishing between the various document types or the reliability of that point in specifying a particular document type. The reliability of each of the plurality of data points may be calculated, for example, based on the ratio of the spread of that data point within all basis documents of that document type to a spread of that data point across all of the plurality of the basis documents.
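The weighted distance calculation described above might look like the following sketch. It shows only a weighted Euclidean distance; the function names and, in particular, the mapping from the spread ratio to a weight (one minus the ratio, clipped at zero) are assumptions chosen for illustration, since the text states only that reliability may be based on that ratio.

```python
import math


def weighted_euclidean(signature, type_mean, weights=None):
    """Euclidean distance with an optional weight per data point."""
    if len(signature) != len(type_mean):
        raise ValueError("signature and distribution lengths differ")
    if weights is None:
        weights = [1.0] * len(signature)
    return math.sqrt(sum(w * (s - m) ** 2
                         for s, m, w in zip(signature, type_mean, weights)))


def spread_ratio_weights(within_spread, overall_spread):
    """Derive weights from the ratio of within-type to overall spread.

    A small ratio means the data point is stable within the type but
    varies across types, so it receives a weight near 1. The formula
    1 - ratio is an assumed mapping, not one given in the text.
    """
    weights = []
    for within, overall in zip(within_spread, overall_spread):
        if overall == 0:
            # The point never varies, so it cannot distinguish types.
            weights.append(0.0)
        else:
            weights.append(max(0.0, 1.0 - within / overall))
    return weights


unweighted = weighted_euclidean([3.0, 4.0], [0.0, 0.0])
weights = spread_ratio_weights([0.0, 1.0], [2.0, 2.0])
```

A Mahalanobis distance would additionally scale each difference by the covariance of the document type distribution; it is omitted here to keep the sketch short.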
Based on the distances calculated, at least one candidate document type for the new electronic document is selected from among the independent document types described by the plurality of document type distributions. The selection of the at least one candidate document type may include selecting a preselected fixed number of the independent document types. The preselected fixed number of independent document types may be those described by the preselected fixed number of the plurality of document type distributions calculated to have the preselected fixed number of shortest distances. Alternatively, the selection of the at least one candidate document type may include selecting the independent document types described by those of the plurality of document type distributions having calculated distances that are within a preselected threshold distance of a shortest of the distances calculated. Further, the selection algorithm of the at least one document type may declare that the new electronic document is of a new type.
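The two selection strategies above can be sketched as follows, given a mapping from document type name to calculated distance. The function names are hypothetical, and the rule used here for declaring a new type (the shortest distance exceeding a cutoff) is an assumption, since the text does not state the criterion.

```python
def select_k_nearest(distances, k):
    """Return the k document types with the shortest distances."""
    ranked = sorted(distances.items(), key=lambda item: item[1])
    return [name for name, _ in ranked[:k]]


def select_within_threshold(distances, threshold):
    """Return types whose distance is within `threshold` of the shortest."""
    shortest = min(distances.values())
    close = [name for name, d in distances.items()
             if d - shortest <= threshold]
    return sorted(close, key=distances.get)


def classify(distances, threshold, new_type_cutoff):
    """Return candidate types, or an empty list to signal a new type.

    The cutoff test is an assumed criterion for declaring a new type.
    """
    if not distances or min(distances.values()) > new_type_cutoff:
        return []
    return select_within_threshold(distances, threshold)


# Hypothetical distances from a new signature to three known types.
distances = {"invoice": 1.0, "letter": 1.4, "memo": 5.0}
candidates = classify(distances, threshold=0.5, new_type_cutoff=2.0)
```

Returning several candidates rather than a single type leaves the final decision to a later, more expensive stage, such as a type-specific data extraction system.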
In addition, the invention provides for a program storage medium readable by computer, tangibly embodying a program of instructions executable by the computer to perform the method steps described above. Other aspects and advantages of the invention will become apparent from the following detailed description, taken in conjunction with the accompanying drawings and the attached pseudo code listing, illustrating by way of example the principles of the invention.