1. Field of the Invention
The invention relates generally to information systems and, more particularly, to a novel categorization system and method for a clustering framework.
2. Description of the Related Art
With the explosion of data volumes and Internet usage, many everyday applications rely on some form of automated categorization to avoid information overload and improve user experience. While matching of objects to predefined categories can often be carried out via supervised classification, there are many cases where classification is difficult due to the inability to generate training examples for all categories. This is particularly true in situations where the predefined categories are highly dynamic due to constantly evolving applications and user needs. This makes it impossible to rely on manual labeling or sample selection, as either would have to be repeated whenever the taxonomy changes.
It would thus be highly desirable to provide a system tool and methodology to facilitate many types of categorization applications. For instance, such applications may involve: 1) a dataset with underlying basic features attached to every point (any features that can be computed for each sample point in the dataset); 2) a set of categories, each accompanied by a category description (which could be just a descriptive name); and/or 3) optional data descriptions generated independently of the category descriptions (e.g., outdated labels or labels assigned without the predefined taxonomy in mind).
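As an illustration only, the three kinds of input listed above can be pictured as a small data model. The following is a minimal sketch; the class names, field names, and the helper methods `text` and `described` are hypothetical and are not taken from the invention.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Category:
    """Item 2: a category with a description (possibly just its name)."""
    name: str
    description: str = ""

    def text(self) -> str:
        # Fall back to the descriptive name when no longer description exists.
        return self.description or self.name


@dataclass
class DataPoint:
    """Item 1: basic features, computed for every point in the dataset."""
    features: Sequence[float]
    # Item 3: optional description, written independently of the categories.
    description: Optional[str] = None


@dataclass
class CategorizationTask:
    points: List[DataPoint]
    categories: List[Category]

    def described(self) -> List[int]:
        # Indices of the points that carry an optional data description.
        return [i for i, p in enumerate(self.points) if p.description]
```

Under these assumptions, a task is fully specified even when most points have no description at all, since only the basic features are required for every point.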
Note that this set of assumptions captures the characteristics of many real-world applications well. For example, in a stock photo database, images are organized according to predefined categories, such as “Portraits” or “Macro”, with corresponding descriptions. Selected image features can be computed for all photos in the collection, forming the basic feature set, while some photographers enter descriptions of their photos, which then serve as the optional data descriptions.
Another example, in the area of business analytics, relates to categorization problems often encountered in project management tools. Such tools are used to track projects according to a set of predefined categories aligned with the products/services sold by the company, and to compute business metrics that assess the quality of service. In order to obtain meaningful metrics, it is critical to ensure accurate assignment of projects to the predefined solution categories. However, because of dynamic business environments and changing customer needs, solution portfolios are constantly evolving and frequently redefined, limiting the ability of project managers to categorize projects accurately. Hence, there is a need for an automated methodology to assist with project categorization.
Moreover, while past research has proposed various methods to incorporate a “constraint violation penalty” in clustering with pairwise constraints, these techniques have not been applied to cases that use seeds with varying confidence levels in a k-means setting.
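To make the notion of seeds with varying confidence levels concrete, the following is a minimal sketch of one possible formulation. It is a generic illustration and not the method of the invention. It assumes that each seed is a pair of a cluster index and a non-negative confidence weight, and that assigning a seeded point to any other cluster adds that weight to the usual squared-distance cost. The function name `seeded_kmeans` and its parameters are hypothetical.

```python
from math import inf


def seeded_kmeans(points, k, seeds, n_iter=20):
    """k-means with a confidence-weighted penalty for violating seeds.

    points: list of equal-length tuples of floats.
    seeds:  dict {point index: (cluster index, confidence weight >= 0)}.
    Returns one cluster label per point.
    """
    dim = len(points[0])

    def mean(indices):
        return tuple(sum(points[i][d] for i in indices) / len(indices)
                     for d in range(dim))

    # Initialise each centroid from its seeded points (assumption: if a
    # cluster has no seeds, fall back to an arbitrary data point).
    centroids = []
    for c in range(k):
        members = [i for i, (s, _) in seeds.items() if s == c]
        centroids.append(mean(members) if members else points[c % len(points)])

    labels = None
    for _ in range(n_iter):
        # Assignment step: squared distance plus penalty for leaving the seed.
        new_labels = []
        for i, x in enumerate(points):
            best, best_cost = 0, inf
            for c, mu in enumerate(centroids):
                cost = sum((a - b) ** 2 for a, b in zip(x, mu))
                if i in seeds and seeds[i][0] != c:
                    cost += seeds[i][1]
                if cost < best_cost:
                    best, best_cost = c, cost
            new_labels.append(best)

        # Update step: recompute centroids; empty clusters keep their centroid.
        for c in range(k):
            members = [i for i, lab in enumerate(new_labels) if lab == c]
            if members:
                centroids[c] = mean(members)

        if new_labels == labels:
            break
        labels = new_labels
    return labels


pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(seeded_kmeans(pts, 2, {0: (0, 5.0), 3: (1, 5.0)}))  # [0, 0, 0, 1, 1, 1]
```

In this sketch a low-confidence seed is overridden when the point lies much closer to another centroid, whereas a high-confidence seed keeps the point in its seeded cluster at the cost of a larger within-cluster distance.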
Thus, it would be highly desirable to provide a method for adaptive categorization for use with partial textual descriptions and basic features in a constrained clustering framework.