It is often desirable to build predictive models based on categorical predictors and responses. Such models may involve large numbers of categories (levels, unique values, etc.). For example, in building predictive models for industrial applications, it is not uncommon to encounter a categorical predictor attribute with hundreds or thousands of categories. Examples of such categorical predictors are the lot identifier of a product in semiconductor manufacturing, part IDs, zip codes, email domains, etc. In addition, the response variable may also be categorical (with more than two levels). Typically, the data have small numbers of observations per category of the predictor. A goal in building such a predictive model is an efficient, computationally fast way to discover value-groups (partitions) of such high-cardinality predictors. Such groups may be used directly to partition the categories of the predictor into sets with similar responses, or as input to further analyses such as decision trees, neural networks, support vector machines, discriminant analysis, etc.
Unfortunately, in existing systems, if there are large numbers of categories and both the predictor and the response are non-metric, then large amounts of time and computer resources are typically required. Alternatively, there may be limitations imposed on the level of analysis. For example, some existing systems enforce a binary partition by selecting one distinguished value of a categorical predictor as one group and combining the rest of the values into another group. As a further example, CART (classification and regression trees) uses an exhaustive search over all possible two-way groupings to minimize a selected measure of impurity (e.g., the cross-entropy measure or the Gini index). CART has O(2ⁿ⁻¹) complexity, where n is the number of levels to be grouped. Many CART implementations (commercial and academic) place restrictions on the number of levels of a categorical predictor (usually n=30).
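By way of illustration only, the exhaustive two-way search described above may be sketched as follows. This is a minimal sketch and not the implementation of any particular CART product: the function names `gini` and `best_binary_grouping`, the input layout (a mapping from each predictor level to a list of response-class counts), and the choice of a size-weighted Gini index as the impurity measure are all assumptions made for the example. Fixing one level in the left group counts each grouping once, giving 2ⁿ⁻¹ − 1 candidate groupings for n levels.

```python
from itertools import combinations


def gini(counts):
    """Gini impurity of a list of response-class counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return 1.0 - sum((c / total) ** 2 for c in counts)


def best_binary_grouping(table):
    """Exhaustively search all two-way groupings of the predictor levels.

    `table` (hypothetical layout) maps each level to a list of class counts.
    Returns (impurity, left_group, right_group, n_evaluated).
    """
    levels = sorted(table)
    n_classes = len(table[levels[0]])
    grand_total = sum(sum(row) for row in table.values())
    first, rest = levels[0], levels[1:]
    best = None
    evaluated = 0
    # Keep `first` in the left group so mirror-image groupings are not repeated.
    # k stops short of len(rest) so the right group is never empty.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = (first,) + extra
            right = tuple(lv for lv in levels if lv not in left)
            left_counts = [sum(table[lv][j] for lv in left) for j in range(n_classes)]
            right_counts = [sum(table[lv][j] for lv in right) for j in range(n_classes)]
            # Impurity of the grouping: size-weighted average of the two groups.
            impurity = (
                sum(left_counts) * gini(left_counts)
                + sum(right_counts) * gini(right_counts)
            ) / grand_total
            evaluated += 1
            if best is None or impurity < best[0]:
                best = (impurity, left, right)
    return best[0], best[1], best[2], evaluated


# Four levels, two response classes (made-up counts for illustration).
example = {"a": [10, 0], "b": [9, 1], "c": [1, 9], "d": [0, 10]}
impurity, left, right, n_evaluated = best_binary_grouping(example)
```

With four levels the search evaluates seven groupings; each additional level roughly doubles that number, which is consistent with the restrictions on the number of levels noted above.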
Additionally, other algorithms used by current systems, such as agglomerative clustering, correspondence analysis, and systems using a χ²-based distance measure of the difference between rows, also typically require comparatively large numbers of computations and have O(x²) complexity.
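As a hedged illustration of why such row-distance methods scale quadratically, the following minimal sketch computes a χ² distance between every pair of rows of a contingency table. The formula used is the standard squared χ² distance between row profiles from correspondence analysis, which is assumed here and is not necessarily the measure used by any given existing system; the function names and the input layout are hypothetical, and every row is assumed to contain at least one observation.

```python
from itertools import combinations


def chi2_distance_sq(row_a, row_b, col_mass):
    """Squared chi-square distance between the profiles of two rows."""
    total_a, total_b = sum(row_a), sum(row_b)
    return sum(
        (a / total_a - b / total_b) ** 2 / m
        for a, b, m in zip(row_a, row_b, col_mass)
        if m > 0  # skip response classes that never occur
    )


def pairwise_chi2(table):
    """Distance for every unordered pair of levels in `table`.

    `table` (hypothetical layout) maps each level to a list of class counts.
    """
    levels = sorted(table)
    n_classes = len(table[levels[0]])
    grand_total = sum(sum(row) for row in table.values())
    # Column masses: overall proportion of each response class.
    col_mass = [
        sum(table[lv][j] for lv in levels) / grand_total for j in range(n_classes)
    ]
    return {
        (u, v): chi2_distance_sq(table[u], table[v], col_mass)
        for u, v in combinations(levels, 2)
    }


# Four levels, two response classes (made-up counts for illustration).
rows = {"a": [10, 0], "b": [5, 0], "c": [0, 10], "d": [5, 5]}
distances = pairwise_chi2(rows)
```

For four rows this produces six distances; in general the number of pairs is n(n − 1)/2 for n rows, so a predictor with thousands of levels requires millions of distance evaluations before any grouping step begins.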
In view of the above, there is a need in the art for the embodiments of the present invention.