1. Field of the Invention
The present invention generally relates to a document classification system and method and, more particularly, to a document classification system and method for classifying a document based on contents of the document. The present invention also relates to a processor readable medium storing a program code for causing a computer to perform the document classification method.
2. Description of the Related Art
Japanese Laid-Open Patent Application No. 7-36897 discloses a document classification system for automatically classifying documents in accordance with document vectors, each of which represents features of the words contained in a document. The classification is performed by grouping the document vectors according to a clustering method.
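By way of illustration only, the general idea of classifying documents by grouping document vectors can be sketched as follows. This is a minimal sketch under stated assumptions and not the procedure of the cited application: the term-frequency weighting, the cosine similarity measure, the simple single-pass grouping, the threshold of 0.5, the sample documents and all function names are hypothetical choices made for the example.

```python
import math
from collections import Counter

def document_vector(text, vocabulary):
    """Term-frequency vector of `text` over a fixed vocabulary (assumed weighting)."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocabulary]

def cosine(u, v):
    """Cosine similarity between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def group_vectors(vectors, threshold=0.5):
    """Single-pass grouping: put each vector in the first group whose
    first member is at least `threshold` similar, else start a new group."""
    leaders, labels = [], []
    for v in vectors:
        for idx, leader in enumerate(leaders):
            if cosine(v, leader) >= threshold:
                labels.append(idx)
                break
        else:
            leaders.append(v)
            labels.append(len(leaders) - 1)
    return labels

# Hypothetical sample documents.
docs = [
    "engine fuel engine piston",
    "fuel piston engine",
    "protein cell gene",
    "gene cell cell protein",
]
vocabulary = sorted({w for d in docs for w in d.split()})
vectors = [document_vector(d, vocabulary) for d in docs]
labels = group_vectors(vectors)
print(labels)
```

In this sketch, the first two documents fall into one group and the last two into another, because the vectors within each pair share most of their words.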
Generally, document data is created from a document so that the document can be registered in a database. Attribute data, such as the drafting date and the author's name, is typically added to the document data indicating the contents of the document. Additionally, in many cases, the document itself contains a plurality of items. For example, a patent publication contains a plurality of items including “claims”, “description of prior art”, “summary of the invention” and “detailed description of the preferred embodiment”.
According to the document classification system disclosed in the above-mentioned Japanese Laid-Open Patent Application No. 7-36897, if document data includes a plurality of items, an item of particular interest cannot be designated. Accordingly, the document data may include data that adversely influences the classification of the document. Additionally, data effective for classification may be insufficient since a plurality of items cannot be combined or designated. Thus, there is a problem in that an accurate classification result cannot be obtained from the document data.
Additionally, in recent years, a large amount of document information has become accessible as the Internet has become popular. This allows a user of the Internet to perform intellectual work such as classifying a large amount of document information into categories and analyzing the structure of the classified documents.
If a large amount of document information is classified manually by an operator, an extremely large cost is incurred with respect to time and labor. Additionally, since the classification is based only on the knowledge of an individual operator, the classification criteria may vary from operator to operator.
Accordingly, a very important issue is how to have a computer automatically classify documents according to the classification criteria that a human would normally apply. More specifically, it is desirable to develop a document classification system that classifies documents having similar contents or meanings into the same category, such that each category defined in the classification process is similar to a category intended by the operator before the classification is performed.
According to the document classification system disclosed in the above-mentioned Japanese Laid-Open Patent Application No. 7-36897, classification is performed by using the document vector which is defined by words contained in a document. Accordingly, there is a problem in that the true content of the document cannot always be represented by the document vector due to the synonymy and polysemy of certain words. Specifically, the meanings of some words must be judged in relation to other words in the document or to the contents of the document, and such judgment requires complex processing.
In order to solve such a problem related to synonymy and polysemy, U.S. Pat. No. 4,839,853 suggests a method in which a singular value decomposition is applied to a matrix of inner products between documents. That is, a document search that reflects relationships in meaning is performed by projecting documents and words onto a space, referred to as a latent semantic index, which is produced based on the co-occurrence of the documents and the words.
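As an illustration of such a projection, the following is a minimal sketch that assumes NumPy is available. The sample documents, the choice of two retained dimensions and all variable names are hypothetical and are not taken from the cited patent. A term-document matrix is decomposed by a singular value decomposition, and the words and documents are projected onto the dimensions having the largest singular values.

```python
import numpy as np

# Hypothetical sample documents; "car" and "automobile" never co-occur.
docs = [
    "car engine wheel",
    "automobile engine wheel",
    "flower petal garden",
    "rose petal garden",
]
terms = sorted({w for d in docs for w in d.split()})

# Term-document matrix: A[i, j] is the count of term i in document j.
# A.T @ A is then the matrix of inner products between documents.
A = np.array([[d.split().count(t) for d in docs] for t in terms], dtype=float)

# Singular value decomposition A = U @ diag(s) @ Vt, with s in descending order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of retained dimensions (an assumption for this example)
term_coords = U[:, :k] * s[:k]           # one row per word
doc_coords = (s[:k, None] * Vt[:k]).T    # one row per document

def cosine(u, v):
    """Cosine similarity between two NumPy vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

car, auto = terms.index("car"), terms.index("automobile")
raw_sim = cosine(A[car], A[auto])                         # similarity before projection
latent_sim = cosine(term_coords[car], term_coords[auto])  # similarity after projection
print(raw_sim, latent_sim)
```

In this sketch, the two words have no similarity before the projection because they share no document. After the projection they lie close together, because both co-occur with "engine" and "wheel".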
Additionally, “Projection for Efficient Document Clustering” by Hinrich Schutze and Craig Silverstein, Proceedings of SIGIR 1997, pp 74–81, suggests document classification in the above-mentioned latent semantic index. Further, “Representing Documents Using an Explicit Model of Their Similarities” by Brian T. Bartell, Garrison W. Cottrell and Richard K. Belew, Journal of the American Society for Information Science, 1995, vol. 46, No. 4, pp 254–271, teaches a generalization of the method of transformation into the above-mentioned latent semantic index. The matrix used for calculating a transforming function is the sum of a matrix of inner products between documents and a matrix produced from cross-reference information of other documents. A representation transforming function, which is used for projecting a document or a word onto a space in which the similarity of the documents is reflected, is produced by using such a matrix.
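The role of such a summed matrix can be sketched as follows. This is a simplified illustration assuming NumPy and is not the procedure of the cited paper: it merely factorizes the summed matrix by an eigendecomposition to obtain low-dimensional coordinates for the known documents. The term-document matrix, the cross-reference matrix and the number of retained dimensions are hypothetical.

```python
import numpy as np

# Hypothetical term-document matrix (rows: terms, columns: documents).
# Documents 0 and 1 share a term; document 2 shares no term with the others.
A = np.array([
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
], dtype=float)
G = A.T @ A  # matrix of inner products between documents

# Hypothetical cross-reference matrix: documents 0 and 2 refer to each other.
C = np.array([
    [0, 0, 1],
    [0, 0, 0],
    [1, 0, 0],
], dtype=float)

M = G + C  # summed matrix; symmetric because G and C are symmetric

# Eigendecomposition of the symmetric matrix (eigenvalues in ascending order).
eigvals, eigvecs = np.linalg.eigh(M)

k = 2  # number of retained dimensions (an assumption for this example)
order = np.argsort(eigvals)[::-1][:k]
# Negative eigenvalues, if any, are clipped to zero before the square root.
coords = eigvecs[:, order] * np.sqrt(np.maximum(eigvals[order], 0.0))

# Inner products between documents in the reduced space.
similarity = coords @ coords.T
print(G[0, 2], similarity[0, 2])
```

In this sketch, documents 0 and 2 have an inner product of zero when only their words are considered, but a positive inner product in the reduced space, since the cross-reference information contributes to the summed matrix.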
Each dimension of the projection space produced by the above-mentioned conventional method is a conceptual dimension defined by a plurality of words that are connected with respect to their meanings. The determination as to which feature dimensions should be used to classify or search documents is made based only on the magnitudes of the singular values calculated when the singular value decomposition is applied. Accordingly, it is difficult to reflect the operator's intention in the selection of the feature dimensions used for classification. Thus, there is a problem in that the result of classification differs from the expectation of the operator.
Additionally, according to other conventional document classification methods, in order to perform document classification that reflects the relationships between documents with respect to their meanings, a process for calculating a representation transforming function for transforming a document and a process for classifying the document transformed by the representation transforming function are performed in succession. However, there is a problem in that the process for calculating the representation transforming function takes a long time and, as a result, the document classification as a whole takes an extremely long time.