Categorization and clustering have been two fundamental approaches to organizing information and managing the content of information databases.
Categorization, or classification, is supervised in nature. A user defines a fixed number of classes or categories, and the task is to assign a pattern or object to one or more of those classes. Categorization provides good control in the sense that it organizes the information according to the structure defined by the user. However, because the structure is predefined, categorization is not well suited to handling novel data. In addition, much effort is needed to build a categorization system. It is necessary either to specify classification knowledge in terms of classification rules or keywords (disclosed in U.S. Pat. No. 5,371,807) or to construct a categorization system through a supervised learning algorithm (disclosed in U.S. Pat. Nos. 5,671,333 and 5,675,710). The former requires knowledge specification (e.g., written classification rules) and the latter requires example annotations (i.e., labeling information). Both are labor intensive.
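By way of illustration only, the keyword-rule style of categorization described above can be sketched as follows. This is a minimal sketch and not the method of any of the cited patents; the rule table `RULES`, the category names, and the function `categorize` are hypothetical and introduced purely for the example. It shows both properties noted above: the rules must be written by hand, and an item on a novel topic matches no predefined category.

```python
from typing import Dict, List, Set

# Hypothetical hand-written classification knowledge: category -> keywords.
# Writing and maintaining such a table is the labor-intensive step.
RULES: Dict[str, Set[str]] = {
    "sports": {"match", "league", "goal"},
    "finance": {"stock", "bond", "market"},
}


def categorize(text: str, rules: Dict[str, Set[str]] = RULES) -> List[str]:
    """Return every category whose keyword set shares a word with the text."""
    words = set(text.lower().split())
    return sorted(cat for cat, keywords in rules.items() if words & keywords)


if __name__ == "__main__":
    print(categorize("The league match ended"))        # one category
    print(categorize("Stock market rallies after league final"))  # two categories
    print(categorize("New vaccine trial begins"))      # novel topic: no category
```

An item may be assigned to more than one category, and an item that matches none of the predefined keywords is left unassigned, which reflects the difficulty of handling novel data with a fixed structure.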
Clustering is unsupervised in nature. For unsupervised systems (U.S. Pat. Nos. 5,857,179 and 5,787,420), there is no need to train or construct a classifier, since information is organized automatically into groups based on similarities. However, a user has very little control over how the information is grouped together. Although it is possible to fine-tune the parameters of the similarity measures to control the degree of coarseness, the effect of changing a parameter cannot be predicted; changing one parameter could affect all clustered results. In addition, the structure established through the clustering process is unpredictable. Whereas clustering is acceptable for a pool of relatively static information, in situations where new information is received every day, information with similar content may be grouped (based on different themes) into different clusters on different days. This ever-changing cluster structure is highly undesirable for a user who is navigating the information database to find desired information. Imagine the frustration of reading a newspaper with a different layout every day! U.S. Pat. No. 5,911,140 attempts to provide a solution by ordering document clusters based on user interests. However, the cluster ranking relies on the availability of a ranking for each document in the clusters, and only minimal user preferences are taken into account.
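The instability described above can be illustrated with a small sketch. The code below is an assumption made for illustration and is not taken from any of the cited patents: it uses a generic single-pass ("leader") clustering with Jaccard word overlap as the similarity measure, and the names `jaccard`, `leader_cluster`, and `threshold` are hypothetical. It shows that the same three items are grouped differently depending on arrival order, and that changing the single similarity threshold regroups all results.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Word-overlap similarity: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def leader_cluster(docs: List[str], threshold: float) -> List[List[str]]:
    """Single-pass clustering: each item joins the first cluster whose
    leader (first member) is at least `threshold` similar, otherwise it
    starts a new cluster."""
    leaders: List[Set[str]] = []
    clusters: List[List[str]] = []
    for doc in docs:
        words = set(doc.lower().split())
        for k, leader in enumerate(leaders):
            if jaccard(words, leader) >= threshold:
                clusters[k].append(doc)
                break
        else:  # no sufficiently similar leader was found
            leaders.append(words)
            clusters.append([doc])
    return clusters


if __name__ == "__main__":
    a, b, c = "oil price rise", "oil price market", "stock price market"
    print(leader_cluster([a, b, c], 0.5))  # b is grouped with a
    print(leader_cluster([c, b, a], 0.5))  # b is grouped with c
    print(leader_cluster([a, b, c], 0.6))  # stricter threshold: all separate
    print(leader_cluster([a, b, c], 0.1))  # looser threshold: one cluster
```

In this sketch the user controls only the threshold, and neither the resulting groups nor their number can be specified in advance, which is the lack of control noted above.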