Methods of document classification typically rely solely on lexical features of a document. In the book entitled Foundations of Statistical Natural Language Processing, authors Manning and Schütze provide a comprehensive review of classification procedures for text documents. None of the methods cited therein use the physical characteristics of the source documents when classifying them. However, such physical information about the source documents can be very valuable in categorizing the documents correctly, particularly when the documents may be of disparate types and sizes. Thus, rather than relying on the lexical features of the documents alone, the present invention is designed to increase the accuracy of classification by using both the physical and the lexical features of the document in its classification schema. Such an approach has not been found in the art.
U.S. Pat. No. 6,892,193 relates to a system that combines different modalities of features in multimedia items. Specifically, multimedia information (media items) from disparate information sources, such as visual information and a speech transcript, is processed for supervised and unsupervised machine learning of categorization techniques. The information from these disparate information sources is combined in a coherent fashion. However, the kinds of features that are used in this system for classification cannot be used in the classification of text-based documents and do not include features relating to the physical characteristics of the information sources.
U.S. Pat. No. 7,233,708 describes a method for indexing and retrieving images using discrete Fourier transforms associated with the pixels of a picture to find statistical values associated with the textural attributes of the pixels. This method also does not take into consideration lexical or physical aspects of the information source.
U.S. Patent Publication No. 2005/0134935 describes a method for delineating document boundaries and identifying document types in which the graphical information of each image is used by a machine learning algorithm to learn classification rules that predict the document or subdocument type of the image. The machine learning algorithm may learn classification rules for each image based on the textual information in the image obtained by optical character recognition. Additionally, the outputs of these two classifiers may be combined to produce a single output score, or the two sets of features may be combined into one feature space so that a single machine learning algorithm uses all features simultaneously to construct document or subdocument classification rules. However, this system also does not use physical properties of the document to improve classification.
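By way of illustration only, the two combination strategies described in that publication (a single output score derived from two classifiers, or a single feature space shared by one learning algorithm) may be sketched as follows. This is a minimal sketch under stated assumptions and is not taken from the cited publication: the function names, the use of a weighted average, and the weight parameter are all hypothetical choices made for this example.

```python
from typing import List, Sequence


def late_fusion(graphical_score: float, textual_score: float,
                graphical_weight: float = 0.5) -> float:
    """Combine the scores of two separate classifiers into one score.

    A weighted average is assumed here for illustration; the cited
    publication does not specify this particular combination rule.
    """
    if not 0.0 <= graphical_weight <= 1.0:
        raise ValueError("graphical_weight must be in [0, 1]")
    return (graphical_weight * graphical_score
            + (1.0 - graphical_weight) * textual_score)


def early_fusion(graphical_features: Sequence[float],
                 textual_features: Sequence[float]) -> List[float]:
    """Concatenate two feature vectors into a single feature space,
    so that one learning algorithm can use all features at once."""
    return list(graphical_features) + list(textual_features)


# Hypothetical values, for illustration only.
combined_score = late_fusion(0.8, 0.4)             # one score from two classifiers
combined_vector = early_fusion([0.1, 0.9], [3.0])  # one joint feature vector
```

In the first strategy each classifier is trained separately and only the scores are merged; in the second, a single classifier is trained on the joint vector.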
Accordingly, the advantages of using the physical properties of the document to aid in the classification of the document are not known in the art.