1. Field of the Invention
The present invention relates to a document classifying system consisting of a document data classifying system and a document classifying function building system. The document data classifying system has a retrieval means and a classification decision tree. The document classifying function building system has an extraction means, a setting means, an allocation means, etc., and supplies keywords, classifications, etc., to the document data classifying system to determine the classification decision tree.
The document data classifying system according to the present invention can easily define the classification to which document data belongs when it is input. Further, the document classifying function building system according to the present invention is provided for automatically building the classification decision tree of the document data classifying system.
2. Description of the Related Art
Recently, document-type databases and full-text type databases have come to be widely utilized in various fields, so that a large amount of document data or text data is stored electronically in these databases. Accordingly, users have desired the development of a high speed retrieval system for accessing these databases. In the development of a high speed retrieval system, there are two major problems to be solved, i.e., how to extract the keywords used for the classification of the document data and how to automatically classify the document data by using the extracted keywords.
Conventionally, various documents have disclosed methods of automatically extracting keywords and methods of automatically classifying document data, both by using a computer. Examples include, as a first document, Sugiyama et al., "An Automatical Extraction System for Index Words and its Analysis", a report given at the Information Processing Symposium, Dec. 2, 1989; and, as a second document, Uchiyama, et al., "An Extraction Method for Important Keywords", a report given at Information Processing Data Base Symposium, Aug. 4, 1991.
However, these two documents share a common problem in that both approaches depend on a so-called "language processing technique", for example, a processing technique for the Japanese language. For example, the dictionary and the thesaurus required by the processing technique must both be built through human interaction. That is, the dictionary and thesaurus are operated manually to prepare the document data, and keywords are determined from the document data by sequentially describing the document data. In this case, the document data are separated into several word sequences (a so-called "separated description" of word sequences) to determine keywords. Examples of the separated description are "system", "system's design", "system's design and research", and "system's design, research, and development", etc. These word sequences are sequentially retrieved to determine the keyword.
Further, as another method for determining the keyword in the conventional art, words having a higher frequency of appearance are extracted from the word sequence as keywords. Further, as still another method, unnecessary words which are not suitable as keywords (these words are determined by a user) are eliminated, and the remaining words are extracted as keywords.
However, there are some problems in the former method using "appearance frequency". That is, words which are unsuitable as keywords, for example, the words "problem" and "influence", may be extracted as keywords if these words have a high appearance frequency. Conversely, in this method, important words that are suitable as keywords, but have a low appearance frequency, may not be extracted as keywords.
Further, there are some problems in the latter method using "elimination of unnecessary words". That is, in this method, since the words are simply separated without any consideration of the contents of the document data, a large number of words are extracted as keywords.
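The two conventional methods discussed above, and the problems attributed to them, can be illustrated by the following minimal sketch in Python. The sketch is not taken from either cited document; the function names, the sample text, and the user-defined list of unnecessary words are all hypothetical and serve only to make the behavior concrete.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def keywords_by_frequency(text, top_n):
    """Former method (assumed form): keep the top_n most frequent words."""
    counts = Counter(tokenize(text))
    return [word for word, _ in counts.most_common(top_n)]


def keywords_by_elimination(text, unnecessary_words):
    """Latter method (assumed form): drop user-listed words, keep the rest."""
    kept = []
    for word in tokenize(text):
        if word not in unnecessary_words and word not in kept:
            kept.append(word)
    return kept


# Hypothetical sample text and user-defined unnecessary words.
SAMPLE = (
    "Noise problem: influence on retrieval. Problem influence grows. "
    "Thesaurus reduces problem influence."
)
UNNECESSARY_WORDS = {"on"}

# Frequent but unsuitable words win; the rare word "thesaurus" is missed.
print(keywords_by_frequency(SAMPLE, 2))

# Almost every distinct word in the text survives as a "keyword".
print(keywords_by_elimination(SAMPLE, UNNECESSARY_WORDS))
```

In this sample, the frequency-based function selects "problem" and "influence" and never reaches "thesaurus", while the elimination-based function returns nearly every distinct word, which mirrors the two drawbacks described above.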
As is obvious from the above explanations, the conventional art discloses neither a document data classifying system for defining the classification of document data nor a document classifying function building system for supporting such a document data classifying system.
Accordingly, to solve the above problems, in the present invention, a document classifying system includes a document data classifying system and a document classifying function building system. The document data classifying system includes a retrieval means and a classification decision tree. The document classifying function building system includes an extraction means, a setting means, an allocation means, etc., and supplies keywords, classifications, etc., to the document data classifying system to determine the classification decision tree in the document data classifying system.
The document data classifying system according to the present invention can easily define the classification to which document data belongs when it is input. Further, the document classifying function building system according to the present invention is provided for automatically building the classification decision tree of the document data classifying system.