In data mining applications, it is often useful to identify categories to which data items within a database (or multiple databases) belong. Once the categories are identified, some type of quantification measure regarding data items in the various categories can be generated. Such quantification measure may be a simple count of data items or it may be the sum (or some other statistic) of some value associated with each data item. Conventional techniques for computing quantification measures associated with data items in one or more categories are not very accurate or efficient.
Often, the quantification is performed manually. In one example context, quantification is based on categorizations performed by customer support representatives when taking customer calls (where each call represents a data item or case that has to be categorized). However, manual categorizations and quantifications such as those performed by customer representatives or other personnel are usually inaccurate because the personnel are often not properly trained or incented to categorize data items correctly. Also, there may not be a complete list of categories available to such personnel, which often leads to mis-categorization of data items.
In some cases, quantification is based on a sample of cases in a data set, rather than an entire data set. It is assumed that the computed quantities in each category based on the sample apply proportionately to the remainder of the data set. However, such an assumption usually does not apply to other data sets, such as data sets for the next time period (e.g., next month, next year, etc.). Therefore, for each periodic data set, a new round of manual identification and quantification is performed, leading to further expense.
In some other cases, quantification may be based on outputs of automated categorizers. However, it is often difficult and expensive to develop, train, and maintain accurate conventional categorizers, especially when cases need to be categorized into one or more of a large set of categories.
Also, the computation of quantification measures may suffer from inaccurate identification of categories, which are often initially unknown or not very well known. There are typically two types of techniques to identify useful categories: manual techniques and automated techniques. If performed manually, categories are usually identified based on the experience or “gut feelings” of experts. The experts can look at a sample of data items and, based on this examination, identify the categories (e.g., problems associated with a product or products of a company). This type of manual identification of categories is relatively time consuming.
In other cases, there may be industry standard sets of categories that are useable to provide an initial set of categories. Alternatively, people (such as customers) can be asked to fill out surveys to enable identification of categories. However, the information that can be gathered from customers in a survey is usually limited, and customers often provide incomplete or inaccurate information.
Generally, manual identification of categories as conventionally done is often inaccurate and can be costly. Moreover, the list of categories that are manually created may be incomplete such that data items may be forced into a category that the data items do not really belong to.
Automated techniques of category identification often use a clustering process. Clustering is often inaccurate, as clustering algorithms tend to place every data item of a database into some cluster or other, even though some of the data items may not belong to the clusters. Also, clustering algorithms tend to place each data item into a single cluster, even though some data items may belong to multiple clusters. Also, the number of clusters usually must be specified ahead of time rather than discovered based on the content of the data items. Also, if multiple data sets are examined and clustering performed on each, usually there is no consistency between the clusters and thus no accurate mechanism is provided to compare the categories of different data sets. Also, clustering algorithms usually do not assign a meaningful semantic label to an automatically-discovered cluster.
Thus, quantification of data items in one or more categories is associated with at least two issues: (1) conventional quantification techniques are generally inaccurate and/or inefficient; and (2) computed quantification measures may not be very meaningful or accurate due to inaccurate identification of categories.