1. Field of the Invention
The present invention is related to the field of document classification and, more particularly, to a method for automatic document classification based on a combined use of the projection and the distance of the differential document vectors to the differential latent semantics index (DLSI) spaces.
2. Description of the Related Art
Document classification is important not only in office document processing but also in implementing an efficient information retrieval system. The latter is gaining importance with the explosive growth in the use of distributed computer networks such as the Internet. Currently, even the most popular document classification tasks in YAHOO® are performed entirely by humans. However, manual classification is restricted by the limits of human capacity to a limited number of documents and is wholly insufficient to process the almost limitless number of documents available in today's highly networked environment.
The vector space model is widely used in document classification, where each document is represented as a vector of terms. To represent a document by a document vector, weights are assigned to its components, usually based on the frequency of occurrence of the corresponding terms. Standard pattern recognition and machine learning methods are then employed for document classification.
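As a minimal sketch of the vector space representation described above, the following Python fragment builds raw term-frequency vectors over a shared vocabulary. The function names, the whitespace tokenization, and the two sample sentences are illustrative assumptions; practical systems typically apply further weighting and normalization.

```python
from collections import Counter

def build_vocabulary(documents):
    """Return a sorted list of every distinct term in the collection."""
    return sorted({term for doc in documents for term in doc.lower().split()})

def term_frequency_vector(document, vocabulary):
    """Represent one document as raw term counts over the shared vocabulary."""
    counts = Counter(document.lower().split())
    return [counts[term] for term in vocabulary]

# Hypothetical two-document collection, for illustration only.
docs = ["the cat sat on the mat", "the dog chased the cat"]
vocab = build_vocabulary(docs)
vectors = [term_frequency_vector(d, vocab) for d in docs]
```

Each vector has one component per vocabulary term, so the dimension grows with the number of distinct terms in the collection, and most components are zero for any single document.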
In view of the inherent flexibility embedded within any natural language, a staggering number of dimensions is required to represent the feature space of any practical document, given the huge number of terms used. If a speedy classification algorithm is to be developed, the first problem to be resolved is a dimensionality reduction scheme that enables projection of the documents' terms onto a smaller subspace.
There are basically two types of approaches for projecting documents or reducing the documents' dimensions. One is a local method, often referred to as truncation, in which a number of “unimportant” or “irrelevant” terms are deleted from a document vector, the importance of a term often being evaluated by a weighting system based on its frequency of occurrence in the document. The method is called local because each document is projected into a different subspace, but its effect on document vectors is minimal because the vectors are sparse. The other approach is a global method, in which the terms to be deleted are chosen first, ensuring that all the document vectors are projected into the same subspace, with the same terms being deleted from each document. In the process, the global method loses some of its adaptability to the unique characteristics of each document. Accordingly, a need exists for ways to improve this adaptability.
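The contrast between the two approaches can be sketched as follows. This is an illustration only: the function names are hypothetical, and ranking terms for the global method by their total weight across the collection is an assumed criterion, since the text does not specify how the deleted terms are chosen.

```python
def truncate_local(vector, keep):
    """Local method: keep only the `keep` highest-weighted terms of this document."""
    ranked = sorted(range(len(vector)), key=lambda i: vector[i], reverse=True)
    kept = set(ranked[:keep])
    return [w if i in kept else 0 for i, w in enumerate(vector)]

def select_global_terms(vectors, keep):
    """Global method: choose one set of term indices for the whole collection.

    Assumption: terms are ranked by their total weight over all documents.
    """
    totals = [sum(column) for column in zip(*vectors)]
    ranked = sorted(range(len(totals)), key=lambda i: totals[i], reverse=True)
    return sorted(ranked[:keep])

def project_global(vector, kept_indices):
    """Project a document onto the shared subspace spanned by the kept terms."""
    return [vector[i] for i in kept_indices]

# Hypothetical term-weight vectors over a four-term vocabulary.
doc_a = [3, 0, 1, 0]
doc_b = [0, 4, 0, 1]

local_a = truncate_local(doc_a, keep=1)   # retains term 0 only
local_b = truncate_local(doc_b, keep=1)   # retains term 1 only

kept = select_global_terms([doc_a, doc_b], keep=2)
global_a = project_global(doc_a, kept)
global_b = project_global(doc_b, kept)
```

With local truncation the two documents retain different terms and therefore lie in different subspaces, whereas the global selection maps both onto the same two retained terms regardless of which terms matter most within each individual document.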
Like the eigendecomposition methods extensively used in image processing and image recognition, Latent Semantic Indexing (LSI) with Singular Value Decomposition (SVD) has proved to be a most efficient method of dimensionality reduction in document analysis and extraction, providing a powerful tool for the classifier when introduced into document retrieval, with good performance confirmed by empirical studies. A distinct advantage of LSI-based dimensionality reduction lies in the fact that, among the projections onto all possible spaces having the same dimension, the projection of the set of document vectors onto the LSI space has the lowest possible least-square distance to the original document vectors. This implies that LSI finds an optimal solution to dimensionality reduction. In addition to its role in dimensionality reduction, LSI with SVD is also effective in dampening the synonymy and polysemy problems, with which a simple scheme of deleting terms cannot be expected to cope. Also known as the word sense disambiguation problem, the synonymy and polysemy problems can be traced to the inherent characteristics of the context-sensitive grammar of any natural language. Having these two advantages, LSI has become a most popular dimensionality reduction tool.
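The least-square property stated above can be illustrated with a short sketch, assuming NumPy is available. The matrix here is random, hypothetical data standing in for a term-by-document matrix, and the function name `lsi_project` is introduced for illustration only. The sketch projects the document vectors onto the space spanned by the first k left singular vectors and compares the residual with that of an arbitrary subspace of the same dimension.

```python
import numpy as np

def lsi_project(term_doc, k):
    """Project the columns (documents) of term_doc onto a rank-k LSI space.

    Returns the basis U_k, the k-dimensional document coordinates, and the
    projected documents expressed back in the original term space.
    """
    U, s, Vt = np.linalg.svd(term_doc, full_matrices=False)
    U_k = U[:, :k]                 # first k left singular vectors
    coords = U_k.T @ term_doc      # k x n_docs reduced representation
    approx = U_k @ coords          # projection in the original term space
    return U_k, coords, approx

rng = np.random.default_rng(0)
X = rng.random((6, 4))             # hypothetical: 6 terms, 4 documents

U_k, coords, approx = lsi_project(X, k=2)
lsi_error = np.linalg.norm(X - approx)       # Frobenius-norm residual

# An arbitrary 2-dimensional orthonormal basis, for comparison.
Q, _ = np.linalg.qr(rng.standard_normal((6, 2)))
random_error = np.linalg.norm(X - Q @ (Q.T @ X))
```

By the Eckart-Young theorem, `lsi_error` cannot exceed `random_error` for any choice of the comparison subspace, and it equals the square root of the sum of the squared discarded singular values.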
The global projection scheme has difficulty adapting to the unique characteristics of each document, and a method must be developed to remedy the adverse effect of this inability on the performance of a document classifier.