1. Field of the Invention
The invention relates to a system that incorporates an interactive graphical user interface for graphically visualizing clusters (specifically segments) of data. In particular, the system automatically categorizes incoming case data into clusters, summarizes those clusters into segments, determines similarity measures for those segments, and then forms and visually depicts hierarchical organizations of those segments. The system also compares two user-selected segments or segment groups and graphically displays normalized scored comparison results. Additionally, the system automatically and dynamically reduces, as necessary, a depth of the hierarchical organization (total number of hierarchical levels) based on scored similarity measures of the selected clusters; and, based on normalized scores, provides and displays a relative ranking of the displayed segments, as well as summarized characteristics of any such segment.
2. Description of the Prior Art
Computer systems have long been used for data analysis. For example, data may include demographics of users and web pages accessed by those users. A web master (i.e., a manager of a web site) may desire to review web page access patterns of those users in order to optimize links between various web pages or to customize advertisements to the demographics of the users. However, it may be very difficult for the web master to analyze the access patterns of thousands of users involving possibly hundreds of web pages. This difficulty may be lessened if the users can be categorized by common demographics and common web page access patterns. Two techniques of data categorization, classification and clustering, can be useful when analyzing large amounts of such data. These categorization techniques are used to categorize data represented as a collection of records, each containing values for various attributes. For example, each record may represent a user, and the attributes describe various characteristics of that user. The characteristics may include the sex, income, and age of the user, or web pages accessed by the user. FIG. 1A illustrates a collection of records organized as a table. Each record (1, 2, . . . , n) contains a value for each of the attributes (1, 2, . . . , m). For example, attribute 4 may represent the age of a user and attribute 3 may indicate whether that user has accessed a certain web page. Therefore, the user represented by record 2 accessed the web page as represented by attribute 3 and is age 36 as represented by attribute 4. Each record, together with all its attributes, is commonly referred to as a “case”.
Classification techniques allow a data analyst (e.g., web master) to group the records of a collection (dataset or population) into classes. That is, the data analyst reviews the attributes of each record, identifies classes, and then assigns each record to a class. FIG. 1B illustrates the results of classifying a collection. The data analyst has identified three classes: A, B, and C. In this example, records 1 and n have been assigned to class A; record 2 has been assigned to class B; and records 3 and n−1 have been assigned to class C. Thus, the data analyst determined that the attributes for records 1 and n are similar enough to be in the same class. In this example, a record can only be in one class. However, certain records may have attributes that are similar to more than one class. Therefore, some classification techniques, and more generally some categorization techniques, assign a probability that each record is in each class. For example, record 1 may have a probability of 0.75 of being in class A, a probability of 0.1 of being in class B, and a probability of 0.15 of being in class C. Once the data analyst has classified the records, standard classification techniques can be applied to create a classification rule that can be used to automatically classify new records as they are added to the collection. (See, e.g., R. Duda et al, Pattern Classification and Scene Analysis (© 1973, John Wiley and Sons) (hereinafter the “Duda et al” textbook), which is incorporated by reference herein.) FIG. 1C illustrates the automatic classification of record n+1 when it is added to the collection. In this example, the new record was automatically assigned to class B.
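By way of illustration only, the probabilistic assignment described above can be sketched as follows. The function name `most_probable_class` and the dictionary layout are assumptions introduced for this sketch and are not part of the described techniques; the probabilities are the ones given in the example for record 1.

```python
def most_probable_class(probabilities):
    """Return the class label with the highest membership probability."""
    total = sum(probabilities.values())
    # A record's membership probabilities over all classes should sum to 1.
    if abs(total - 1.0) > 1e-9:
        raise ValueError("membership probabilities must sum to 1")
    return max(probabilities, key=probabilities.get)


# Record 1 from the example: 0.75 for class A, 0.1 for class B, 0.15 for class C.
record_1 = {"A": 0.75, "B": 0.10, "C": 0.15}
assigned = most_probable_class(record_1)
```

Under these assumptions, a technique that must place each record in exactly one class could select the most probable class, while retaining the full set of probabilities for later comparison.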
Clustering techniques provide an automated process for analyzing the records of the collection and identifying clusters of records that have similar attributes. For example, a data analyst may request a clustering system to cluster the records into five clusters. The clustering system would then identify which records are most similar and place them into one of the five clusters. (See, e.g., the Duda et al textbook.) Also, some clustering systems automatically determine the number of clusters. FIG. 1D illustrates the results of the clustering of a collection. In this example, records 1, 2, and n have been assigned to cluster A, and records 3 and n−1 have been assigned to cluster B. Note that in this example the values stored in the column marked “cluster” in FIG. 1D have been determined by the clustering algorithm.
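As a minimal sketch of how a clustering system might place records into a requested number of clusters, the following uses plain k-means. This is a stand-in chosen for brevity and is not asserted to be the algorithm of any particular clustering system; the seeding rule (first k records), the function name `kmeans`, and the sample records are all assumptions made for illustration.

```python
def kmeans(points, k, iterations=10):
    """Cluster numeric records into k groups with plain k-means (illustrative only)."""
    centers = [list(p) for p in points[:k]]  # assumption: seed with the first k records
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each record joins the nearest center (squared Euclidean distance).
        for i, p in enumerate(points):
            labels[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Update step: each center moves to the mean of the records assigned to it.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical records: (age, pages viewed per visit).
records = [(16, 12), (62, 3), (17, 11), (65, 2), (15, 14)]
labels = kmeans(records, k=2)
```

Here the returned labels play the role of the “cluster” column of FIG. 1D: they are determined by the algorithm rather than by the data analyst.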
Once the categories (e.g., classes and clusters) are established, the data analyst can use the attributes of the categories to guide decisions. For example, if one category represents users who are mostly teenagers, then a web master may decide to include advertisements directed to teenagers in the web pages that are accessed by users in this category. However, the web master may not want to include advertisements directed to teenagers on a certain web page if users in a different category, who are senior citizens, also happen to access that web page frequently. Even though the categorization of the collection may reduce the amount of data from thousands of records, a data analyst still needs to review possibly 10 or 20 categories. The data analyst still needs to understand the similarity and dissimilarity of the records in the categories so that appropriate decisions can be made.
Currently, the Internet is revolutionizing commerce by providing a relatively low cost platform for vendors and a very convenient platform for consumers through which consumers, in the form of Internet users, and vendors can engage in commerce. Not only are certain vendors appearing solely through a so-called web presence, but existing traditional, so-called “bricks and mortar”, retail establishments are augmenting their sales mechanisms through implementation of electronic commerce web sites. To facilitate this commerce, various computer software manufacturers have developed and now have commercially available software packages which can be used to quickly implement and deploy, and easily operate, a fully-functional electronic commerce web site. One such package is a “Commerce Server” software system available from the Microsoft Corporation of Redmond, Wash. (which is also the present assignee hereof). In essence and to the extent relevant, the “Commerce Server” system provides a very comprehensive, scalable processing infrastructure through which customized business-to-consumer and business-to-business electronic commerce web sites can be quickly implemented. This infrastructure, typically implemented on a web server computer, provides user profiling, product cataloguing and content management, transaction processing, targeted marketing and merchandizing functionality, and analysis of consumer buying activities.
With the rapid and burgeoning deployment of electronic commerce web sites, web site owners have realized that voluminous consumer data gathered and provided through such a site, and particularly its electronic commerce server, provides a wealth of useful information. Through this information, on-line consumer buying patterns can be discerned and targeted advertising can be rendered, even to the point of advertising directed to a particular individual based on that person's particular buying habits and/or interests. Such advertising, in turn, generally yields significantly higher response rates and improved user experiences over that resulting from traditional mass media advertising, and at significantly lower costs to the vendor.
Yet, a practical difficulty has arisen. While categories (also known as classes) can be readily and automatically extracted from data, such as on-line consumer transaction data, through well-known conventional clustering techniques such as the “EM” algorithm, it has proven to be rather difficult to present category data in a simple, meaningful and easily understood manner to a business manager who is making marketing or other decisions based on that data. Generally, in the past, category data was simply provided as textual lists that typically listed a number of consumers in each category and an associated probabilistic or other numeric measure (collectively “metrics”) associated with each user and each category. These users and categories could then be compared against each other through assessing their metrics to discern trends or other information of interest.
However, textual data, particularly if it is voluminous, which is very often the case with consumer purchasing data, is extremely tedious for an analyst to quickly comprehend (i.e., “digest”), particularly when looking for trends or other relationships that are “hidden” in the data. Furthermore, while conventional clustering techniques, such as the “EM” algorithm, are rather effective in clustering the data, based on discerned relationships amongst different cases in the data (a case being a single record with all its associated attribute data, as discussed above), oftentimes the resulting clusters are simply mathematical constructs in a flat list. The resulting clusters have little, and often no, physically discernible basis in reality, i.e., the qualitative meaning and physical distinctions (apart from differences in mathematical metrics) between different clusters are unclear, if not very difficult to comprehend. In essence, the question of “What do the clusters represent?” can become very difficult for the data analyst to answer. Hence, useful distinctions effectively become lost in the results, thus frustrating not only a data analyst who is then working with that data but also ultimately a business manager who, in an effort to reduce business risk, may need to make costly marketing and sales decisions, such as how to effectively market a given product and to whom and when, based on that data.
Given the difficulty associated with assessing text-based clustering results, various techniques have been developed in the art for visualizing clustered data, and particularly its classifications, in an attempt to facilitate and aid, e.g., the analyst or business manager in extracting useful relationships from the data.
One technique that exists in the art is described in published International patent application WO 90/04321 to S. R. Barber et al (published on Apr. 19, 1990). This technique relies on dynamically classifying data into non-exclusive pre-defined categories with those categories then being displayed as leaves in a semantic network. While this technique is certainly useful, it is not applicable to situations where the categories are not known beforehand, as often occurs with consumer data.
A basic need of any such visualization system is to provide cluster information in a manner that allows its viewer to readily appreciate essential differences between the cases in a cluster, i.e., those distinctions that characterize the data.
Thus far, the visualization tools available in the art for depicting clusters and their inter-relationships have proven to be quite deficient in practice in meeting this need, particularly, though certainly not exclusively, when utilized in an electronic commerce setting.
In that regard, a visualization tool needs to automatically cluster data without prior knowledge of categories, i.e., the tool must discern the categories from the data itself.
Furthermore, data relationships are often far more complex than those depicted through a two-level network. Often, categories form parts of multi-level hierarchies, with the qualitative basis for those relationships only becoming evident when all or most of the hierarchy is finally extracted from the data and exposed. Moreover, as noted, hierarchical distinctions that are often quite granular are the product of mathematical clustering techniques and, from a qualitative standpoint, may be essentially meaningless; hence the need to dynamically reduce a depth of the hierarchy to eliminate these distinctions and thus provide meaningful visual results to, e.g., the data analyst and business manager.
Moreover, to enhance understanding of what individual clusters mean and their inter-relationships, a user of the visualization system should also be able to readily browse through a hierarchy of displayed clusters and, if desired, select individual clusters for comparison with each other, where, to facilitate browsing, the displayed clusters are organized based on their similarity to each other. That user should also be able to expand or contract the displayed hierarchy, as desired, to enhance understanding of the relationships that exist amongst the various clusters. In that regard, these clusters should also be scored, through similarity metrics, and ranked accordingly, with the results being visually displayed in a meaningful graphical manner. Summarized data for each cluster should also be meaningfully displayed.
Thus, the present invention is directed at providing an interactive cluster visualization tool which properly addresses and satisfies these heretofore unfilled needs in the art. Such a tool is particularly, though certainly not exclusively, suited for use in servers designed to support electronic commerce.
Advantageously, the present invention overcomes the deficiencies associated with cluster visualization systems known in the art.
In accordance with the inventive teachings, one embodiment of the present invention provides a cluster (category) visualization (“CV”) system that, given a set of incoming data records, automatically determines proper categories for those records, without prior knowledge of any such categories; clusters the records accordingly into those categories; and thereafter presents a graphic display of the categories of a collection of those records, referred to as a “category graph.” The CV system may optionally display the category graph as a “similarity graph” or a “hierarchical map.” When displaying a category graph, the CV system displays a graphic representation of each category. The CV system displays the category graph as a similarity graph or a hierarchical map in a way that visually illustrates the similarity between categories. The display of a category graph allows a data analyst to better understand the similarity and dissimilarity between categories. A similarity graph includes a node for each category and an arc connecting nodes representing categories whose similarity is above a threshold. A hierarchical map is a tree structure that includes a node for each base category along with nodes representing combinations of similar categories.
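For illustration, the two forms of category graph described above can be sketched as follows, under stated assumptions. The function names, the dictionary of pairwise similarities, the category names, and the rule used to score merged nodes (the maximum similarity over their member categories) are all assumptions introduced for this sketch; the text above does not fix how similarity is computed or how categories are combined.

```python
from itertools import combinations


def pair_similarity(similarity, x, y):
    """Look up a symmetric similarity value; unknown pairs count as 0."""
    return similarity.get((x, y), similarity.get((y, x), 0.0))


def similarity_graph(similarity, threshold):
    """Return (nodes, arcs): one node per category, one arc per pair above threshold."""
    nodes = sorted({c for pair in similarity for c in pair})
    arcs = [(x, y) for x, y in combinations(nodes, 2)
            if pair_similarity(similarity, x, y) > threshold]
    return nodes, arcs


def hierarchical_map(similarity):
    """Greedily combine the two most similar nodes until a single root remains.

    Leaves are category names; each combined node is a (left, right) tuple.
    Assumption: two combined nodes are as similar as their most similar members.
    """
    def leaves(node):
        return [node] if isinstance(node, str) else leaves(node[0]) + leaves(node[1])

    def sim(a, b):
        return max(pair_similarity(similarity, x, y)
                   for x in leaves(a) for y in leaves(b))

    nodes = sorted({c for pair in similarity for c in pair})
    while len(nodes) > 1:
        a, b = max(combinations(nodes, 2), key=lambda p: sim(*p))
        nodes = [n for n in nodes if n not in (a, b)] + [(a, b)]
    return nodes[0]


# Hypothetical pairwise similarities between three categories.
similarity = {
    ("teens", "students"): 0.9,
    ("teens", "seniors"): 0.1,
    ("students", "seniors"): 0.2,
}
nodes, arcs = similarity_graph(similarity, threshold=0.5)
tree = hierarchical_map(similarity)
```

In this sketch the similarity graph keeps every category as a node and draws an arc only between the two closely related categories, whereas the hierarchical map nests those two categories under a combined node before joining the remaining category at the root.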
The CV system calculates and displays various characteristic and discriminating information about the categories. In particular, the CV system displays information describing the attributes of a category that best discriminate the records of that category from another category. The CV system also displays information describing the attributes that are most characteristic of a category.
A second and more sophisticated embodiment of the present invention not only provides automatic category determination and record clustering and display, but also provides a visualization tool that, for summarized cluster data in the form of segments, calculates similarity measures therebetween, and, based on those measures, forms and graphically depicts multi-level hierarchical organizations of those segments. The system also compares two user-selected segments or segment groups and graphically displays normalized scored comparison results, and by so doing, readily enhances and facilitates user understanding of inter-relationships among a data population represented by the clusters.
Furthermore, since some clustering distinctions, which are the product of mathematical clustering techniques, may be rather granular from a quantitative perspective but essentially meaningless, from a qualitative standpoint, this embodiment automatically and dynamically changes the hierarchy, based on similarity measures, to eliminate these distinctions, by reducing, where appropriate, the number of hierarchical levels and inter-nodal links. By doing so, this embodiment provides meaningful results in a visual fashion that facilitates user discovery and understanding of inter-relationships then existing in the data population.
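One way such a reduction in hierarchical depth might be realized is sketched below. The tree encoding (a leaf is a segment name; an internal node is a `(score, children)` pair, where `score` is the similarity at which its children were combined), the `tolerance` parameter, and the collapse rule are all assumptions made for illustration; the embodiment's actual criterion is not specified in this passage.

```python
def flatten(node, tolerance):
    """Remove a level when its merge score is within tolerance of its parent's score."""
    if isinstance(node, str):
        return node
    score, children = node
    merged = []
    for child in (flatten(c, tolerance) for c in children):
        if not isinstance(child, str) and abs(child[0] - score) <= tolerance:
            merged.extend(child[1])  # promote grandchildren, dropping the intermediate level
        else:
            merged.append(child)
    return (score, merged)


def depth(node):
    """Number of hierarchical levels below and including this node (0 for a leaf)."""
    return 0 if isinstance(node, str) else 1 + max(depth(c) for c in node[1])


# Hypothetical hierarchy: the 0.80 and 0.85 levels differ only slightly.
tree = (0.30, ["seniors", (0.80, ["teens", (0.85, ["students", "gamers"])])])
reduced = flatten(tree, tolerance=0.1)
```

With these hypothetical scores, the two nearly identical lower levels are merged into one, so the hierarchy loses a level and an inter-nodal link while the coarse distinction at the root is preserved.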
In addition, to further enhance user understanding of these inter-relationships, this second embodiment also permits a user to readily browse through the hierarchy of displayed segments, and expand or contract the hierarchy, as desired, to further expose the relationships amongst the various segments. In that regard, the displayed segments are scored through similarity metrics, with the results being visually displayed. Attribute/value data that tends to meaningfully characterize each segment is also scored, rank ordered based on normalized scores and then graphically displayed.
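The rank ordering of attribute/value data by normalized score can be sketched as follows. The normalization shown (dividing each non-negative raw score by the largest score) and the sample attribute/value pairs are assumptions made for this sketch; the passage above does not state which normalization is used.

```python
def rank_normalized(scores):
    """Scale non-negative raw scores by the largest score and sort in descending order."""
    top = max(scores.values())
    if top <= 0:
        raise ValueError("expected at least one positive score")
    normalized = {name: value / top for name, value in scores.items()}
    return sorted(normalized.items(), key=lambda item: item[1], reverse=True)


# Hypothetical raw scores for attribute/value pairs characterizing one segment.
raw_scores = {"visited=/games": 2.0, "age=13-19": 4.0, "income=high": 1.0}
ranking = rank_normalized(raw_scores)
```

Normalizing before ranking places every score on a common 0-to-1 scale, which is convenient when the ranked values are to be drawn as bars of comparable length in a graphical display.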
In accordance with a feature of the present invention, segments and segment groups can be scored, based on their similarity, through various different alternate techniques, with one such technique being discriminant-based. Advantageously, this particular technique statistically balances the similarity measure between two segments or segment groups with the strength of its support, i.e., amount of the underlying evidence (e.g., number of records (event observations) in each segment or segment group).