1. Field of the Invention
The present invention relates to a document organizing apparatus and a method thereof for organizing a large number of document files stored in an information processing unit according to the contents thereof.
2. Description of the Related Art
As computer networks have become widely used, a large amount of online document information has flooded the networks. Thus, network users expect services that allow them to search and organize document information effectively and conveniently. For example, Internet home page searching services are roughly categorized into the following two types of services and combinations thereof.
(a) Directory services
These services hierarchically categorize and organize home pages.
(b) Full text searching services
These services search the full text of pages collected by a robot (a searching program).
In a well-known directory service, a directory is created by the following method.
1. The creator of a home page submits a desired URL (Uniform Resource Locator) to the service provider.
2. The service provider hierarchically categorizes the submitted URL into a particular category and registers the categorized URL.
3. The hierarchical categories are unique to the service provider. The hierarchical categories are frequently changed. In addition, each home page is categorized into a plurality of categories.
In this service, ten or more professionals (referred to as surfers) create a directory and maintain information so as to provide the users with high quality and up-to-date information. However, it is difficult to constantly employ a sufficient number of surfers. In addition, when a user categorizes a large amount of electronic mail, it is difficult to create a directory manually. Thus, an automatic document categorizing system using a computer is desired.
In taxonomy, information is categorized with a tree structure. At a branch of the tree, child nodes are independent of each other. In addition, cross-categories are not permitted. Each piece of information is placed at exactly one position in the tree structure.
When documents are searched by such a taxonomical method, each document is categorized according to the tree structure. Thus, only one path leads to each document. However, the categorizing criterion of the user does not always match that of the taxonomist. Consequently, the user may have difficulty in reaching a desired document. Thus, such a method is not always effective.
To solve such a problem, when a document is searched by category, a plurality of categories may be assigned to one document, as in the directory structures of Internet directory services. In a related art reference disclosed in “Document Information Categorizing Method and Document Information Categorizing Apparatus (translated title)” (Japanese Patent Laid-Open Publication No. 8-153121), hierarchical categories are created from keywords of a group of documents and each document is registered in a plurality of categories.
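The difference between a strict taxonomy and registration in a plurality of categories can be sketched as follows. This is only a minimal illustration and not the method of the cited reference; the category paths, the document names, and the helper functions build_index and paths_to are hypothetical.

```python
from collections import defaultdict


def build_index(assignments):
    """Map each category path to the sorted list of documents registered under it."""
    index = defaultdict(list)
    for doc, paths in assignments.items():
        for path in paths:
            index[path].append(doc)
    return {path: sorted(docs) for path, docs in index.items()}


def paths_to(index, doc):
    """Return every category path through which a document can be reached."""
    return sorted(path for path, docs in index.items() if doc in docs)


# Strict taxonomy: exactly one category path per document (hypothetical data).
taxonomy = {
    "soccer_news.html": [("Recreation", "Sports")],
    "stock_report.html": [("Business", "Finance")],
}

# Multi-category registration: a document may appear under several paths.
multi = {
    "soccer_news.html": [("Recreation", "Sports"), ("News", "Sports")],
    "stock_report.html": [("Business", "Finance")],
}

taxonomy_index = build_index(taxonomy)
multi_index = build_index(multi)
```

In the taxonomy index, a user who starts from the "News" category never reaches soccer_news.html, whereas in the multi-category index the same document is reachable from either starting point.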
Automatic document categorizing systems have been studied using the following two approaches. Each approach has advantages and disadvantages. Thus, it is necessary to select one of these approaches or combine them according to the intended application.
(a) Clustering
A given group of documents is divided into several suitable classes according to the statistical/apparent relation of keywords. An advantage of this approach is that categorized results reflecting the features of the original group of documents are obtained regardless of conventional categories. A disadvantage of this approach is that the accuracy of the automatic categorization is low.
(b) Categorization
In this approach, it is determined into which of the conventional categories a given document is to be categorized. As the conventional categories, a thesaurus or the like is used. In this approach, the document is categorized into a suitable category according to the distribution of keywords in the document. An advantage of this approach is that the accuracy of the automatic categorization is higher than that of the clustering approach. A disadvantage of this approach is that the categorized results are general and the features of the original group of documents are not reflected in the categorized results.
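The two approaches above can be contrasted with a small sketch. It assumes that each document is represented as a set of keywords; the Jaccard similarity measure, the greedy grouping rule, the threshold value, and all keyword and category names are illustrative assumptions and are not taken from the related art.

```python
def jaccard(a, b):
    """Keyword-set similarity: shared keywords divided by all keywords."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster(docs, threshold=0.3):
    """Clustering: group documents by mutual keyword similarity, with no predefined categories."""
    clusters = []  # each cluster is a list of document names; the first is its seed
    for name, words in docs.items():
        for members in clusters:
            if jaccard(words, docs[members[0]]) >= threshold:
                members.append(name)
                break
        else:
            # No existing cluster is similar enough, so start a new one.
            clusters.append([name])
    return clusters


def categorize(words, categories):
    """Categorization: pick the predefined category sharing the most keywords with the document."""
    return max(categories, key=lambda c: len(words & categories[c]))


# Hypothetical documents, each reduced to a keyword set.
docs = {
    "d1": {"goal", "match", "league"},
    "d2": {"match", "league", "coach"},
    "d3": {"stock", "market", "bond"},
}

# Hypothetical predefined categories (for example, derived from a thesaurus).
categories = {
    "Sports": {"match", "league", "team"},
    "Finance": {"stock", "market", "bank"},
}
```

Here cluster(docs) derives its classes from the documents themselves, while categorize(docs["d3"], categories) can only return one of the category names supplied in advance, which mirrors the trade-off described above.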
In most of the Internet directory services, documents are manually categorized into conventional categories. When one class becomes large, it is manually divided into clusters.
In the above-described related art reference (Japanese Patent Laid-Open Publication No. 8-153121), documents are clustered according to keywords added thereto. In addition, to compensate for a disadvantage of the keyword-based clustering approach, a conventional thesaurus is used. Using statistics of semantic attributes, the categorizing accuracy is improved. This method has been disclosed by Atsuo Kawai, “Automatic Document Categorizing System corresponding to Learning Results of Semantic Attributes (translated title)”, Journal of Information Processing Society of Japan, Vol. 33, No. 9, pp. 1114–1122, 1992.
However, the above-described conventional document categorizing system has the following problems.
In manual categorization, professionals who create and manage a directory are required. It is difficult for inexperienced users to categorize documents. When the hypertexts of a directory are manually maintained, the workload of the administrator becomes heavy. In addition, simple mistakes easily occur.
In addition, when documents are automatically categorized according to a taxonomy, each document is normally categorized into one category. Unless the categorizing criterion of the user matches that of the taxonomist, it is difficult for the user to reach the desired information. Moreover, in both the clustering approach and the categorization approach, documents cannot be fully automatically categorized. If there are unnecessary categories or if necessary categories are omitted, it is more difficult for the user to reach the desired information.
In addition, according to the paper of Kawai, the accuracy of the clustering approach is around 60%. Thus, the clustering approach is far from practical use. Moreover, in the categorization approach, since documents are categorized into general categories, the results do not reflect the features of the original group of documents.