1. Field of the Invention
The invention relates to a system that incorporates an interactive graphical user interface for visualizing clusters (specifically segments) of data. Specifically, the system automatically categorizes incoming case data into clusters, summarizes those clusters into segments, determines similarity measures for those segments, and then forms and visually depicts hierarchical organizations of those segments. The system also compares two user-selected segments or segment groups and graphically displays normalized scored comparison results. Additionally, the system automatically and dynamically reduces, as necessary, a depth of the hierarchical organization (i.e., the total number of hierarchical levels) based on scored similarity measures of the selected clusters and, based on normalized scores, provides and displays a relative ranking of the displayed segments, as well as summarized characteristics of any such segment.
2. Description of the Prior Art
Computer systems have long been used for data analysis. For example, data may include demographics of users and web pages accessed by those users. A web master (i.e., a manager of a web site) may desire to review web page access patterns of those users in order to optimize links between various web pages or to customize advertisements to the demographics of the users. However, it may be very difficult for the web master to analyze the access patterns of thousands of users involving possibly hundreds of web pages. This difficulty may be lessened if the users can be categorized by common demographics and common web page access patterns. Two techniques of data categorization, namely classification and clustering, can be useful when analyzing large amounts of such data. These categorization techniques are used to categorize data represented as a collection of records, each containing values for various attributes. For example, each record may represent a user, and the attributes describe various characteristics of that user. The characteristics may include the sex, income, and age of the user, or web pages accessed by the user. FIG. 1A illustrates a collection of records organized as a table. Each record (1, 2, . . . , n) contains a value for each of the attributes (1, 2, . . . , m). For example, attribute 4 may represent the age of a user and attribute 3 may indicate whether that user has accessed a certain web page. Therefore, the user represented by record 2 accessed the web page as represented by attribute 3 and is age 36 as represented by attribute 4. Each record, together with all its attributes, is commonly referred to as a “case”.
Classification techniques allow a data analyst (e.g., web master) to group the records of a collection (dataset or population) into classes. That is, the data analyst reviews the attributes of each record, identifies classes, and then assigns each record to a class. FIG. 1B illustrates the results of classifying a collection. The data analyst has identified three classes: A, B, and C. In this example, records 1 and n have been assigned to class A; record 2 has been assigned to class B; and records 3 and n−1 have been assigned to class C. Thus, the data analyst determined that the attributes for records 1 and n are similar enough to be in the same class. In this example, a record can only be in one class. However, certain records may have attributes that are similar to more than one class. Therefore, some classification techniques, and more generally some categorization techniques, assign a probability that each record is in each class. For example, record 1 may have a probability of 0.75 of being in class A, a probability of 0.1 of being in class B, and a probability of 0.15 of being in class C. Once the data analyst has classified the records, standard classification techniques can be applied to create a classification rule that can be used to automatically classify new records as they are added to the collection (see, e.g., R. Duda et al, Pattern Classification and Scene Analysis (© 1973, John Wiley and Sons) (hereinafter the “Duda et al” textbook), which is incorporated by reference herein). FIG. 1C illustrates the automatic classification of record n+1 when it is added to the collection. In this example, the new record was automatically assigned to class B.
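By way of illustration only, the creation and use of a classification rule with per-class probabilities may be sketched as follows. The sketch assumes a simple nearest-centroid rule, which is merely one of many possible techniques and is not drawn from the Duda et al textbook; the function names, the two attributes (age and a 0/1 page-access flag) and the sample records are all hypothetical.

```python
import math
from collections import defaultdict

def fit_centroids(records, labels):
    """Average the attribute vectors of the records assigned to each class."""
    sums = {}
    counts = defaultdict(int)
    for rec, lab in zip(records, labels):
        if lab not in sums:
            sums[lab] = [0.0] * len(rec)
        for i, value in enumerate(rec):
            sums[lab][i] += value
        counts[lab] += 1
    return {lab: [s / counts[lab] for s in vec] for lab, vec in sums.items()}

def class_probabilities(centroids, record):
    """Convert distances to the class centroids into probabilities summing to 1."""
    dists = {lab: math.dist(record, c) for lab, c in centroids.items()}
    nearest = min(dists.values())
    # Subtracting the smallest distance keeps exp() from underflowing to zero
    # for every class at once.
    weights = {lab: math.exp(-(d - nearest)) for lab, d in dists.items()}
    total = sum(weights.values())
    return {lab: w / total for lab, w in weights.items()}

def classify(centroids, record):
    """Assign the record to its most probable class."""
    probs = class_probabilities(centroids, record)
    return max(probs, key=probs.get)

# Hypothetical records already classified by the analyst: (age, accessed_page)
records = [(15, 1), (17, 1), (36, 0), (40, 0), (70, 1), (68, 1)]
labels = ["A", "A", "B", "B", "C", "C"]
centroids = fit_centroids(records, labels)

# A new record added to the collection is classified automatically.
new_record = (38, 0)
probs = class_probabilities(centroids, new_record)
assigned = classify(centroids, new_record)
```

In this hypothetical data the new record coincides with the centroid of class B and is therefore assigned to class B, while `probs` holds a probability for each of the three classes.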
Clustering techniques provide an automated process for analyzing the records of the collection and identifying clusters of records that have similar attributes. For example, a data analyst may request a clustering system to cluster the records into five clusters. The clustering system would then identify which records are most similar and place each record into one of the five clusters (see, e.g., the Duda et al textbook). Also, some clustering systems automatically determine the number of clusters. FIG. 1D illustrates the results of the clustering of a collection. In this example, records 1, 2, and n have been assigned to cluster A, and records 3 and n−1 have been assigned to cluster B. Note that in this example the values stored in the column marked “cluster” in FIG. 1D have been determined by the clustering algorithm.
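By way of illustration only, clustering a collection into a requested number of clusters may be sketched as follows. The sketch assumes the well-known k-means procedure (Lloyd's algorithm) as a simple stand-in; it is not asserted to be the technique of any particular clustering system. The function name, the evenly spaced seeding of the initial centroids and the sample records are hypothetical.

```python
import math

def kmeans(records, k, iterations=20):
    """Alternate between assigning records to the nearest centroid and
    recomputing each centroid as the mean of its assigned records."""
    if not 1 <= k <= len(records):
        raise ValueError("k must be between 1 and the number of records")
    # Assumption: seed the centroids with evenly spaced records so that the
    # sketch is deterministic; practical systems often seed randomly.
    step = len(records) // k
    centroids = [list(records[i * step]) for i in range(k)]
    assignment = [0] * len(records)
    for _ in range(iterations):
        # Assignment step: each record joins the cluster with the nearest centroid.
        assignment = [
            min(range(k), key=lambda j: math.dist(rec, centroids[j]))
            for rec in records
        ]
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [rec for rec, a in zip(records, assignment) if a == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centroids

# Hypothetical records: (age, accessed_page)
records = [(15, 1), (16, 1), (17, 0), (66, 0), (70, 1), (68, 1)]
assignment, centroids = kmeans(records, k=2)
```

Here the `assignment` list plays the role of the column marked “cluster”: its values are determined by the algorithm rather than by the analyst.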
Once the categories (e.g., classes and clusters) are established, the data analyst can use the attributes of the categories to guide decisions. For example, if one category represents users who are mostly teenagers, then a web master may decide to include advertisements directed to teenagers in the web pages that are accessed by users in this category. However, the web master may not want to include advertisements directed to teenagers on a certain web page if users in a different category, who are senior citizens, also happen to access that web page frequently. Even though the categorization of the collection may reduce the amount of data from thousands of records, a data analyst still needs to review possibly 10 or 20 categories. The data analyst still needs to understand the similarity and dissimilarity of the records in the categories so that appropriate decisions can be made.
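By way of illustration only, one way to summarize an attribute of each category so as to guide such a decision may be sketched as follows. The function name, the age range of 13 to 19 taken to denote a teenager, and the sample data are hypothetical.

```python
from collections import defaultdict

def teen_share_by_category(ages, categories):
    """Return, for each category, the fraction of its users aged 13 to 19."""
    totals = defaultdict(int)
    teens = defaultdict(int)
    for age, cat in zip(ages, categories):
        totals[cat] += 1
        if 13 <= age <= 19:
            teens[cat] += 1
    return {cat: teens[cat] / totals[cat] for cat in totals}

# Hypothetical users and the categories to which they were assigned
ages = [15, 17, 16, 36, 70, 68, 14]
categories = ["A", "A", "A", "A", "B", "B", "B"]
shares = teen_share_by_category(ages, categories)
```

With this hypothetical data, three of the four users in category A are teenagers, whereas category B consists mostly of senior citizens, which is the kind of distinction the web master would weigh before placing an advertisement.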
Currently, the Internet is revolutionizing commerce by providing a relatively low cost platform for vendors and a very convenient platform for consumers through which consumers, in the form of Internet users, and vendors can engage in commerce. Not only do certain vendors appear solely through a so-called web presence, but existing traditional, so-called “bricks and mortar”, retail establishments are augmenting their sales mechanisms through implementation of electronic commerce web sites. To facilitate this commerce, various computer software manufacturers have developed, and now have commercially available, software packages which can be used to quickly implement and deploy, and easily operate, a fully-functional electronic commerce web site. One such package is a “Commerce Server” software system available from the Microsoft Corporation of Redmond, Washington (which is also the present assignee hereof). In essence and to the extent relevant, the “Commerce Server” system provides a very comprehensive, scalable processing infrastructure through which customized business-to-consumer and business-to-business electronic commerce web sites can be quickly implemented. This infrastructure, typically implemented on a web server computer, provides user profiling, product cataloguing and content management, transaction processing, targeted marketing and merchandizing functionality, and analysis of consumer buying activities.
With the rapid and burgeoning deployment of electronic commerce web sites, web site owners have realized that voluminous consumer data gathered and provided through such a site, and particularly its electronic commerce server, provides a wealth of useful information. Through this information, on-line consumer buying patterns can be discerned and targeted advertising, even to the point of directed targeted advertising to a particular individual based on that person's particular buying habits and/or interests, can be rendered which, in turn, generally yields significantly higher response rates and improved user experiences over that resulting from traditional mass media advertising and at significantly lower costs to the vendor.
Yet, a practical difficulty has arisen. While categories (also known as classes) can be readily and automatically extracted from data, such as on-line consumer transaction data, through well-known conventional clustering techniques such as the “EM” algorithm, it has proven to be rather difficult to present category data in a simple, meaningful and easily understood manner to a business manager who is making marketing or other decisions based on that data. Generally, in the past, category data was simply provided as textual lists that typically listed a number of consumers in each category and an associated probabilistic or other numeric measure (collectively “metrics”) associated with each user and each category. These users and categories could then be compared against each other through assessing their metrics to discern trends or other information of interest.
However, textual data, particularly if it is voluminous, which is very often the case with consumer purchasing data, is extremely tedious for an analyst to quickly comprehend (i.e., “digest”), particularly when looking for trends or other relationships that are “hidden” in the data. Furthermore, while conventional clustering techniques, such as the “EM” algorithm, are rather effective in clustering the data, based on discerned relationships amongst different cases in the data (a case being a single record with all its associated attribute data, as discussed above), oftentimes the resulting clusters are simply mathematical constructs in a flat list. The resulting clusters provide little, and often no, physically discernible basis in reality, i.e., the qualitative meaning and physical distinctions (apart from differences in mathematical metrics) between different clusters are unclear and very difficult to comprehend. In essence, the question of “What do the clusters represent?” can become very difficult for the data analyst to answer. Hence, useful distinctions effectively become lost in the results, thus frustrating not only a data analyst who is then working with that data but also ultimately a business manager who, in an effort to reduce business risk, may need to make costly marketing and sales decisions, such as how to effectively market a given product and to whom and when, based on that data.
Given the difficulty associated with assessing text-based clustering results, various techniques have been developed in the art for visualizing clustered data, and particularly its classifications, in an attempt to facilitate and aid, e.g., the analyst or business manager in extracting useful relationships from the data.
One technique that exists in the art is described in published International patent application WO 90/04321 to S. R. Barber et al (published on Apr. 19, 1990). This technique relies on dynamically classifying data into non-exclusive pre-defined categories with those categories then being displayed as leaves in a semantic network. While this technique is certainly useful, it is not applicable to situations where the categories are not known beforehand—as often occurs with consumer data.
A basic need of any such visualization system is to provide cluster information in a manner that allows its viewer to readily appreciate essential differences between the cases in a cluster, i.e., those distinctions that characterize the data.
Thus far, the visualization tools available in the art for depicting clusters and their inter-relationships have proven to be quite deficient in practice in meeting this need, particularly, though certainly not exclusively, when utilized in an electronic commerce setting.
In that regard, a visualization tool needs to automatically cluster data without prior knowledge of categories, i.e., the tool must discern the categories from the data itself.
Furthermore, data relationships are often far more complex than those depicted through a two-level network. Often, categories form parts of multi-level hierarchies, with the qualitative basis for those relationships only becoming evident when all or most of the hierarchy is finally extracted from the data and exposed. In addition, as noted, hierarchical distinctions that are often quite granular are the product of mathematical clustering techniques and, from a qualitative standpoint, may be essentially meaningless. Hence, a need exists to dynamically reduce a depth of the hierarchy to eliminate these distinctions and thus provide meaningful visual results to, e.g., the data analyst and business manager.
Moreover, to enhance understanding of what individual clusters mean and their inter-relationships, a user of the visualization system should also be able to readily browse through a hierarchy of displayed clusters and, if desired, select individual clusters for comparison with each other, where, to facilitate browsing, the displayed clusters are organized based on their similarity to each other. That user should also be able to expand or contract the displayed hierarchy, as desired, to enhance understanding of the relationships that exist amongst the various clusters. In that regard, these clusters should also be scored, through similarity metrics, and ranked accordingly, with the results being visually displayed in a meaningful graphical manner. Summarized data for each cluster should also be meaningfully displayed.
Thus, the present invention is directed at providing an interactive cluster visualization tool which properly addresses and satisfies these heretofore unfilled needs in the art. Such a tool is particularly, though certainly not exclusively, suited for use in servers designed to support electronic commerce.