1. Field of the Invention
The present invention relates to organizing documents retrieved from the World Wide Web (WWW) or intranets. In particular, the present invention relates to organizing such documents under a classification scheme for efficient access.
2. Discussion of the Related Art
Two approaches to document organization are clustering (i.e., non-supervised learning) and classification (i.e., supervised learning). The major difference between clustering and classification is that clustering does not rely on a training set but classification does.
In one clustering technique, documents are dynamically clustered based on similarity. However, such an approach suffers from several shortcomings. First, the classification accuracy depends heavily on the number of documents in the database. Second, choosing good labels for categories generated by clustering is difficult, since the labels selected may not be meaningful to the users. To choose good labels for generated categories, many techniques based on word frequency analysis have been proposed. In general, however, these techniques have not been found effective. Consequently, for navigation purposes, clustering techniques are inferior to manual classification and labeling.
Classification is a method for both organizing documents in a document database and facilitating navigation of such a document database. Existing classifiers, such as Library of Congress Classification (LCC), can be used to organize local collections of documents. However, LCC's classification and category labels are usually too fine (e.g., six to seven levels) for organizing relatively small local collections of documents.
For client side document categorization, such as organizing bookmarks and electronic mail ("emails") for individual users, the clustering approach is mainly chosen because a large document set is not available at the client side to train the classifier. On the other hand, at the server side, since abundant training data are available, the classification approach is often chosen.
Using the clustering approach to organize client documents (e.g., bookmarks and emails) suffers from many shortcomings resulting from the small document set at the client side. A small document set can generate clusters of no statistical significance. Consequently, adding even a small number of documents, which is proportionally large relative to the document set, can easily change the clusters.
The present invention provides a method for providing, on the client side, a navigation tree using an external classifier. The method comprises a maintenance method including a method for merging a parent internal node and leaf nodes, and a method for splitting an internal node in a parent internal node. In one embodiment, each leaf node represents a document in the navigation tree and each internal node is associated with a label representing a category of classification of the child internal nodes and leaf nodes associated with the parent internal node.
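By way of illustration only, the navigation tree described above may be sketched as follows. The class names `Leaf` and `Internal`, the `breadth` helper, and the sample labels and file names are assumptions introduced for this sketch and do not appear in the description.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Leaf:
    """A leaf node: represents one document in the navigation tree."""
    document: str


@dataclass
class Internal:
    """An internal node: a category label plus its child nodes."""
    label: str
    children: List[Union[Leaf, "Internal"]] = field(default_factory=list)

    def breadth(self) -> int:
        # Number of child internal nodes and leaf nodes of this node.
        return len(self.children)


# A parent internal node holding one child category and one document.
root = Internal("root", [
    Internal("Sports", [Leaf("nba.html"), Leaf("fifa.html")]),
    Leaf("misc.html"),
])
```

In this sketch, the breadth of `root` counts the "Sports" internal node and the "misc.html" leaf node together, which is the quantity the maintenance methods compare against their thresholds.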
According to one aspect of the invention, a document insertion method is provided which inserts a document into the navigation tree according to a classification obtained from an external classifier using keywords in the document. The method also provides a document deletion method for deleting a document from the navigation tree. The method for splitting an internal node of a parent internal node is invoked by the document insertion method when a predetermined criterion is met. Similarly, the method for merging a parent internal node is invoked by the document deletion method when another predetermined criterion is met.
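A minimal sketch of document insertion and deletion is given below for illustration. The function `classify` is a hypothetical stand-in for the external classifier, and its keyword table is invented for the example. For simplicity, the sketch places each document under the full category path returned by the classifier and omits the splitting and merging that the insertion and deletion methods may invoke.

```python
def classify(keywords):
    """Hypothetical stand-in for the external classifier.

    Returns a category path (list of labels) for the first known keyword.
    """
    table = {
        "basketball": ["Sports", "Basketball"],
        "python": ["Computers", "Programming"],
    }
    for word in keywords:
        if word in table:
            return table[word]
    return []  # unclassified documents stay at the root


def new_node(label):
    # An internal node: label, child categories, and documents (leaf nodes).
    return {"label": label, "categories": {}, "documents": []}


def insert_document(root, doc, keywords):
    """Insert doc under the category path obtained from its keywords."""
    node = root
    for label in classify(keywords):
        node = node["categories"].setdefault(label, new_node(label))
    node["documents"].append(doc)
    return node


def delete_document(root, doc):
    """Remove doc from the tree; return True if it was found."""
    if doc in root["documents"]:
        root["documents"].remove(doc)
        return True
    return any(delete_document(child, doc)
               for child in root["categories"].values())
```

For example, inserting a document with the keyword "basketball" into an empty tree creates the path Sports, Basketball and attaches the document there; deleting it afterwards returns `True`.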
In one embodiment, the document insertion method and the document deletion method each include a step tending to maintain the breadth of an internal node of the navigation tree at a predetermined value α, α being a desired number of child internal nodes and leaf nodes of a parent internal node.
The method of splitting an internal node of a parent internal node assigns leaf nodes to a new internal node, such that the total number of internal nodes and leaf nodes of the parent node is kept at a minimum after splitting. The predetermined criterion is met in the method for splitting a parent internal node when the total number of leaf nodes and internal nodes associated with the parent internal node is greater than the sum of a predetermined value α and a predetermined value δsplit. The method selects a minimum number of internal nodes for splitting, subject to a constraint ("constrained minimum"). In one embodiment, the constrained minimum applies when multiple internal nodes can be selected for splitting to result in the same net change in the total number of leaf nodes and internal nodes.
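The splitting criterion and the choice of a new internal node may be sketched as follows, for illustration only. The function names, the `leaf_categories` mapping (each document paired with a candidate category label, assumed to come from the classifier), and the numeric values are assumptions. The tie-breaking constraint described above is not modelled.

```python
from collections import Counter


def needs_split(num_children, alpha, delta_split):
    """Criterion: child count exceeds alpha + delta_split."""
    return num_children > alpha + delta_split


def best_split(leaf_categories):
    """Pick the category label covering the most leaf nodes.

    Moving k leaf nodes under one new internal node changes the parent's
    child count by 1 - k, so the largest group minimizes the count.
    """
    counts = Counter(leaf_categories.values())
    label, size = counts.most_common(1)[0]
    return label, 1 - size


if needs_split(13, alpha=8, delta_split=4):
    label, net_change = best_split({
        "a.html": "Sports", "b.html": "Sports",
        "c.html": "News", "d.html": "Sports",
    })
```

With α = 8 and δsplit = 4, a parent with 13 children satisfies the criterion while one with 12 does not. In the example, grouping the three "Sports" documents under one new internal node reduces the parent's child count by two.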
The method of merging a parent internal node assigns leaf nodes of an internal node to the parent internal node, such that the total number of internal nodes and leaf nodes of the parent node after merging is minimized. The predetermined criterion is met in the method of merging a parent internal node when the total number of leaf nodes and internal nodes associated with the parent internal node is less than the difference of a predetermined value α and a predetermined value δmerge. The method of merging a parent internal node selects a constrained maximum number of leaf nodes for merging. In one embodiment, the constrained minimum applies when multiple internal nodes can be selected for splitting to result in the same net change in the total number of leaf nodes and internal nodes.
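The merging criterion and the merge operation itself may be sketched as follows, for illustration only. The function names and the list representation (a document is a string, a child internal node is a `(label, documents)` tuple) are assumptions. The sketch does not model how the internal node to be merged is selected.

```python
def needs_merge(num_children, alpha, delta_merge):
    """Criterion: child count falls below alpha - delta_merge."""
    return num_children < alpha - delta_merge


def merge_child(parent_children, child_label):
    """Dissolve one child internal node into its parent.

    The child's leaf nodes are reassigned to the parent in its place.
    """
    merged = []
    for item in parent_children:
        if isinstance(item, tuple) and item[0] == child_label:
            merged.extend(item[1])  # child's documents join the parent
        else:
            merged.append(item)
    return merged


children = ["a.html", ("Sports", ["b.html", "c.html"])]
if needs_merge(len(children), alpha=8, delta_merge=4):
    children = merge_child(children, "Sports")
```

With α = 8 and δmerge = 4, a parent with 3 children satisfies the criterion while one with 4 does not. Merging an internal node holding k leaf nodes changes the parent's child count by k - 1.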
In one embodiment, the predetermined value α, the predetermined value δsplit and the predetermined value δmerge are each independently selectable by the user.
In one embodiment, the method of the present invention assigns a document (i.e., a leaf node) to internal nodes according to an access frequency of the document. The internal node selected for each document is intended to minimize the number of steps necessary to reach a frequently accessed document.
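The effect of frequency-based placement may be illustrated with a simple cost measure. The function `expected_steps` and the sample frequencies and depths are assumptions introduced for this sketch; the description does not specify a particular cost formula.

```python
def expected_steps(placements):
    """Average navigation steps, weighted by access frequency.

    placements: list of (access_frequency, depth) pairs, where depth is
    the number of steps from the root to the document's leaf node.
    """
    total = sum(freq for freq, _ in placements)
    return sum(freq * depth for freq, depth in placements) / total


# Frequently accessed document placed shallow, rarely accessed one deep.
shallow_first = expected_steps([(90, 1), (10, 3)])
# The reverse placement.
deep_first = expected_steps([(90, 3), (10, 1)])
```

Under this measure, placing the frequently accessed document near the root gives a lower average than the reverse placement, which is the behavior the embodiment aims for.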
The present invention is better understood upon consideration of the detailed description below and the accompanying drawings.