The following relates to the information arts. It particularly relates to double-view categorization of documents using two sets or systems of categories, and will be described with particular reference thereto. It relates more generally to categorization of objects such as documents, images, video, audio, or so forth, using two or more sets of categories.
Categorization is the process of assigning a given object to one or more pre-defined categories. Automatic categorization methods usually start with a training set of categorized objects, and infer therefrom a categorization model that is then used to categorize new objects. Categorization can also be performed with respect to two or more sets of categories, in which case each set of categories defines a categorization dimension (also called a categorization view). Double-view categorization is the specific case that employs two categorization dimensions; three or more categorization dimensions can also be employed.
In multi-dimensional categorization, the training set of objects is used to optimize the parameters of a separate categorization model associated with each categorization dimension or view. Subsequently, the optimized categorization models are used to categorize new, uncategorized objects. In this approach, the categories of the various categorization dimensions are usually assumed to be statistically independent. However, there may be interdependency between categories of different categorization dimensions, and models trained separately for each view do not capture it.
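As a minimal sketch of the independence assumption described above, the following Python fragment categorizes an object in each view separately and scores a pair of categories as the product of the per-view scores. The function names and the score values are hypothetical, chosen only for illustration, and do not come from any trained model.

```python
def categorize_per_view(view_scores):
    """Pick the highest-scoring category in each view, ignoring the other views."""
    return {view: max(scores, key=scores.get)
            for view, scores in view_scores.items()}


def joint_score_if_independent(view_scores, assignment):
    """Score a full assignment as the product of the per-view scores.

    The product is only justified if the views are statistically independent.
    """
    score = 1.0
    for view, category in assignment.items():
        score *= view_scores[view][category]
    return score


# Hypothetical per-view scores for one document (illustrative values only).
scores = {
    "topic": {"art": 0.2, "music": 0.5, "science": 0.3},
    "language": {"French": 0.6, "English": 0.3, "German": 0.1},
}

best = categorize_per_view(scores)
joint = joint_score_if_independent(scores, best)
print(best, joint)
```

Because each view is decided on its own, any tendency of particular topics to co-occur with particular languages has no influence on the result.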
Interdependencies can be incorporated into multi-dimensional or multiple-view categorization by defining a combination view that explicitly combines the categories of different dimensions or views, and developing a complex categorization model for the combination view. For example, in the case of a double-view document categorization employing a “topic” categorization dimension (e.g., “art”, “music”, “science”, and so forth), and a “language” categorization dimension (e.g., “French”, “English”, “German”, and so forth), these two categorization dimensions are readily combined to define a “topic/language” categorization dimension having categories such as: “art/French”, “art/English”, “art/German”, “music/French”, “music/English”, “music/German”, “science/French”, “science/English”, “science/German”, and so forth. In this approach the combined “topic/language” dimension has a large number of categories. For example, if the “topic” categorization dimension includes 15 categories and the “language” categorization dimension includes 20 categories, the combined “topic/language” categorization dimension includes 15×20=300 categories. This large number of categories results in a correspondingly complex categorization model for the combined “topic/language” view that is difficult to train.
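The growth in the number of combined categories can be illustrated with a short Python sketch. The category lists reuse the examples given above; the helper name combined_size is an assumption introduced here for illustration.

```python
from itertools import product

topics = ["art", "music", "science"]
languages = ["French", "English", "German"]

# The combination view has one category per (topic, language) pair.
combined = [f"{topic}/{language}" for topic, language in product(topics, languages)]
print(combined)


def combined_size(*view_sizes):
    """Number of categories in a combination view built from the given view sizes."""
    total = 1
    for size in view_sizes:
        total *= size
    return total


print(combined_size(15, 20))  # 15 topics x 20 languages = 300
```

The same helper shows that each additional categorization dimension multiplies the category count again, which is why the combined model becomes harder to train as views are added.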
Moreover, this complex model approach is inflexible. For example, if some documents are to be categorized by both topic and language while other documents are to be categorized only with respect to language, the complex “topic/language” categorization model is an inefficient means of performing categorization by language only. A separate and distinct “language” categorization model can be constructed and trained for use in language-only categorization tasks, but this introduces inefficient redundancy. Conversely, if trained categorization models already exist for the component views (e.g., “topic” and “language”), these existing models are not readily usable for training the combined “topic/language” categorization model, again leading to inefficient redundancy.
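One way to see the inefficiency of language-only categorization with a combined model is sketched below in Python. The sketch assumes that the combined model outputs a score for every “topic/language” category and that a language-only decision is obtained by summing those scores over all topics. Both the summing scheme and the score values are assumptions made for illustration, not a method stated in the text above.

```python
from collections import defaultdict


def language_only(combined_scores):
    """Collapse combined "topic/language" scores to per-language totals."""
    totals = defaultdict(float)
    for label, score in combined_scores.items():
        _topic, language = label.split("/")
        totals[language] += score  # every combined category must be visited
    best_language = max(totals, key=totals.get)
    return best_language, dict(totals)


# Hypothetical combined-model scores for one document (illustrative values only).
combined_scores = {
    "art/French": 0.05, "art/English": 0.30,
    "music/French": 0.10, "music/English": 0.25,
    "science/French": 0.20, "science/English": 0.10,
}

best_language, totals = language_only(combined_scores)
print(best_language, totals)
```

Under these assumptions, every combined category is scored and then discarded down to a single language decision, whereas a dedicated language model would only score the language categories.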