The present invention relates generally to document processing. Specifically, a new method is taught that uses document layout data to compare and/or classify documents by type.
In various applications it is desirable to classify documents by their type, e.g., business letter, article, fax cover sheet, etc. Obviously, documents can be classified as belonging to any identifiable type. Applications calling for document classification include database management and document routing. Prior art methods of document classification involve three general steps before document types can be identified. First, each document is preprocessed, which often includes page segmentation. Page segmentation is the process by which neighboring characters in the document are grouped into blocks of text and white space. Second, one of a variety of known optical character recognition (“OCR”) methods is applied to part or all of the document. Finally, keywords that reflect the document type are sought in the text recognized from each document.
The prior art classification processes are relatively inefficient because they require substantial OCR, which is costly in resources (memory, computational load and time). This is especially true where only a relatively small portion of the document is required to identify its type. Moreover, it may be that the user requires OCR only for a particular document type, yet OCR must be applied to all of the documents in a database to determine the set of documents characterized by the desired type. Accordingly, it would be advantageous to have a method for comparing and classifying document types without requiring OCR.
In accordance with the present invention, a new method uses document layout information to classify a document by type. Documents are first processed with a page segmentation method to obtain blocks of data. A grid of rows and columns, forming bins, is created on the page to intersect the blocks. Layout information is identified using a unique fixed-length vector scheme, referred to herein as interval encoding, to represent each row of the segmented document. Using this new vector scheme, documents can be compared using a warping function to compute the relative interval distances of two or more documents. In addition, documents stored in a database may be retrieved, deleted, or otherwise managed by type, using their corresponding vector sets.
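By way of illustration only, the gridding, row encoding and warped comparison described above might be sketched as follows. The summary does not define the interval encoding or the warping function in detail, so both are assumptions here: each bin is assumed to store the length of the run of like bins it belongs to (positive for text, negative for white space), and the warping is assumed to be ordinary dynamic time warping over rows. All function names and the block coordinate convention are hypothetical.

```python
from typing import List, Tuple

# A segmented block as (left, top, right, bottom), in fractions of the page (0..1).
Block = Tuple[float, float, float, float]


def occupancy_grid(blocks: List[Block], rows: int, cols: int) -> List[List[int]]:
    """Mark each bin 1 if its center falls inside any block, else 0."""
    grid = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        y = (r + 0.5) / rows
        for c in range(cols):
            x = (c + 0.5) / cols
            for left, top, right, bottom in blocks:
                if left <= x < right and top <= y < bottom:
                    grid[r][c] = 1
                    break
    return grid


def interval_encode(row: List[int]) -> List[int]:
    """ASSUMED encoding: every bin holds the length of its run of equal bins,
    positive for text runs and negative for white-space runs.
    The output has the same fixed length as the input row."""
    encoded: List[int] = []
    i = 0
    while i < len(row):
        j = i
        while j < len(row) and row[j] == row[i]:
            j += 1
        run = j - i
        encoded.extend([run if row[i] else -run] * run)
        i = j
    return encoded


def row_distance(u: List[int], v: List[int]) -> int:
    """L1 distance between two encoded rows of equal length."""
    return sum(abs(a - b) for a, b in zip(u, v))


def warped_distance(doc_a: List[List[int]], doc_b: List[List[int]]) -> float:
    """ASSUMED warping: dynamic time warping over the row sequences, so that
    documents whose layouts differ only by vertical stretching stay close."""
    n, m = len(doc_a), len(doc_b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = row_distance(doc_a[i - 1], doc_b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]


def encode_document(blocks: List[Block], rows: int, cols: int) -> List[List[int]]:
    """Grid the page, then interval-encode every row."""
    return [interval_encode(row) for row in occupancy_grid(blocks, rows, cols)]
```

Under these assumptions, a document compared with itself has a warped distance of zero, and repeating a row in one of the two documents does not change that distance, which is the behavior a warping function is meant to provide.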
Documents may also be classified by type using an extension of the foregoing layout scheme without requiring OCR. In this embodiment, an arbitrary number of clusters is formed for grouping interval encoding vectors. Each cluster is identified with a cluster center vector which relates to the interval encoding vectors of that group. A document to be processed in accordance with the present invention is first segmented into data blocks, as with document comparison. Interval encoding in accordance with the present invention is then performed on the segmented document. Thereafter, each interval encoding vector in the document is replaced with the cluster center for the cluster to which it belongs.
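The replacement of each interval encoding vector by its cluster center can be illustrated with a minimal sketch. It assumes the cluster centers have already been computed by some clustering procedure (the text above does not name one) and that nearness is measured by squared Euclidean distance, which is likewise an assumption. The function names are hypothetical, and the sketch returns the index of the chosen center, which identifies the center vector without copying it.

```python
from typing import List, Sequence


def nearest_center(vector: Sequence[float], centers: List[Sequence[float]]) -> int:
    """Return the index of the cluster center closest to `vector`
    (squared Euclidean distance, an assumed choice of metric)."""
    def sq_dist(k: int) -> float:
        return sum((a - b) ** 2 for a, b in zip(vector, centers[k]))
    return min(range(len(centers)), key=sq_dist)


def quantize_document(rows: List[Sequence[float]],
                      centers: List[Sequence[float]]) -> List[int]:
    """Replace every encoded row by the index of its cluster center,
    turning the page into a sequence of discrete symbols."""
    return [nearest_center(row, centers) for row in rows]
```

Representing a page as a short sequence of cluster indices is what makes it convenient to treat the page as an observation sequence for a statistical model.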
All document types desired for classification are each modeled with a Hidden Markov Model (“HMM”). Using known algorithms, new documents are compared with the document type models so that each document is classified by type.
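One known algorithm suited to this comparison is the forward algorithm, which scores an observation sequence against a discrete HMM. The text does not say which algorithm is used, so the following is a sketch under that assumption: each document is taken to be a sequence of cluster indices, each model is a hypothetical tuple of start, transition and emission probabilities, and the document is assigned to the model with the highest log-likelihood.

```python
import math
from typing import Dict, List, Sequence, Tuple

# (start probabilities, transition matrix, emission matrix); a hypothetical layout.
Model = Tuple[Sequence[float], Sequence[Sequence[float]], Sequence[Sequence[float]]]


def log_likelihood(obs: List[int], start, trans, emit) -> float:
    """Scaled forward algorithm: log P(obs | model) for a discrete HMM."""
    if not obs:
        return 0.0
    n = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n)]
    log_p = 0.0
    for t in range(len(obs)):
        if t > 0:
            symbol = obs[t]
            alpha = [
                emit[s][symbol] * sum(alpha[p] * trans[p][s] for p in range(n))
                for s in range(n)
            ]
        total = sum(alpha)
        if total == 0.0:
            return float("-inf")  # the model cannot produce this sequence
        log_p += math.log(total)
        alpha = [a / total for a in alpha]  # rescale to avoid underflow
    return log_p


def classify(obs: List[int], models: Dict[str, Model]) -> str:
    """Return the name of the document type model that best explains `obs`."""
    return max(models, key=lambda name: log_likelihood(obs, *models[name]))
```

Training the model parameters from example documents is outside the scope of this sketch.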
Furthermore, based on the classification, it is a simple matter to locate which blocks of data contain certain information. Where only that information is desired, it is not necessary to perform OCR on the entire document. Rather, OCR may be limited to those blocks where the particular information is expected based on the document type. For example, suppose it is desired to organize all business letters by addressee. A business letter has a predictable format, with the addressee information found in the upper left third of the first page of the document. Once the document is identified by layout to be a business letter, it is an easy matter to then examine only the upper left third of the document to recognize the addressee. There is no need to perform character recognition on the entire document before identifying the addressee.
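Restricting OCR to the blocks of interest might look like the sketch below. The region table, its coordinates and the function names are hypothetical: the text states only that the addressee lies in the upper left third of the first page, so the rectangle used for a business letter is an assumed reading of that phrase, and the OCR step itself is left to whatever recognizer is available.

```python
from typing import Dict, List, Tuple

# (left, top, right, bottom) in fractions of the page (0..1).
Box = Tuple[float, float, float, float]


def overlaps(block: Box, region: Box) -> bool:
    """True if the two rectangles share any area."""
    left, top, right, bottom = block
    r_left, r_top, r_right, r_bottom = region
    return left < r_right and r_left < right and top < r_bottom and r_top < bottom


def blocks_to_recognize(doc_type: str, blocks: List[Box],
                        regions_by_type: Dict[str, Box]) -> List[Box]:
    """Keep only the blocks inside the region expected for this document type.
    Unknown types fall back to all blocks."""
    region = regions_by_type.get(doc_type)
    if region is None:
        return list(blocks)
    return [b for b in blocks if overlaps(b, region)]


# ASSUMED region for the addressee of a business letter.
REGIONS: Dict[str, Box] = {"business_letter": (0.0, 0.0, 0.5, 1.0 / 3.0)}
```

Only the blocks returned by `blocks_to_recognize` would then be passed to character recognition.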