Computer systems have long been used for data analysis. For example, data may include demographics of users and web pages accessed by those users. A web master (i.e., a manager of a web site) may desire to review web page access patterns of those users in order to optimize links between various web pages or to customize advertisements to the demographics of the users. However, it may be very difficult for the web master to analyze the access patterns of thousands of users involving possibly hundreds of web pages. This difficulty may be lessened if the users can be categorized by common demographics and common web page access patterns.
Two techniques of data categorization—classification and clustering—can be useful when analyzing large amounts of such data. These categorization techniques are used to categorize data represented as a collection of records, each containing values for various attributes. For example, each record may represent a user, and the attributes describe various characteristics of that user. The characteristics may include the sex, income, and age of the user, or web pages accessed by the user. Each record, together with all its attributes, is commonly referred to as a “case”.
Classification occurs when each record has a “class” value, and an attempt is made to predict that value given other values in the record. For example, records corresponding to a user may be classified by gender given income, age, and web pages accessed. However, certain records may have attributes that indicate similarity to more than one class. Therefore, some classification techniques, and more generally some categorization techniques, assign a probability that each record is in each class.
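The probabilistic classification described above can be illustrated with a minimal sketch. The technique chosen here, a naive Bayes classifier with add-one smoothing, is only one of many possible classification techniques and is an assumption made for illustration. The function names (`train`, `class_probabilities`), the attribute names, and the sample records are all hypothetical.

```python
from collections import Counter, defaultdict

def train(records, class_attr):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter()
    value_counts = defaultdict(Counter)  # (class, attribute) -> value counts
    domains = defaultdict(set)           # attribute -> set of observed values
    for rec in records:
        cls = rec[class_attr]
        class_counts[cls] += 1
        for attr, val in rec.items():
            if attr == class_attr:
                continue
            value_counts[(cls, attr)][val] += 1
            domains[attr].add(val)
    return class_counts, value_counts, domains

def class_probabilities(model, record):
    """Return a probability for each class, given the record's other attributes."""
    class_counts, value_counts, domains = model
    total = sum(class_counts.values())
    scores = {}
    for cls, n in class_counts.items():
        score = n / total  # prior probability of the class
        for attr, val in record.items():
            if attr not in domains:
                continue  # skip attributes never seen during training
            # Add-one smoothing keeps unseen values from zeroing the score.
            score *= (value_counts[(cls, attr)][val] + 1) / (n + len(domains[attr]))
        scores[cls] = score
    norm = sum(scores.values())
    return {cls: s / norm for cls, s in scores.items()}

# Hypothetical cases: each record is a user, and "sex" is the class value.
records = [
    {"sex": "F", "age": "teen",   "income": "low",  "page": "music"},
    {"sex": "F", "age": "adult",  "income": "high", "page": "news"},
    {"sex": "M", "age": "teen",   "income": "low",  "page": "sports"},
    {"sex": "M", "age": "adult",  "income": "high", "page": "sports"},
    {"sex": "F", "age": "adult",  "income": "low",  "page": "music"},
    {"sex": "M", "age": "senior", "income": "high", "page": "news"},
]

model = train(records, "sex")
probs = class_probabilities(model, {"age": "teen", "income": "low", "page": "sports"})
print(probs)
```

Under these assumptions, the new record is not forced into a single class. Instead, it receives one probability per class, and the probabilities sum to one.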
Clustering techniques provide an automated process for analyzing the records of the collection and identifying clusters of records that have similar attributes. For example, a data analyst may request a clustering system to cluster the records into five clusters. The clustering system would then identify which records are most similar to one another and place each record into one of the five clusters. Also, some clustering systems automatically determine the number of clusters.
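As a rough illustration of clustering into a requested number of clusters, the following sketch uses the well-known k-means procedure. This choice is an assumption, since no particular clustering algorithm is specified here. The function names, the two numeric attributes (age and weekly page views), and the sample values are hypothetical, and a practical system would likely scale the attributes and choose initial centroids more carefully.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length numeric records."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(points, k, max_iters=100):
    """Assign each record to one of k clusters; return (assignments, centroids)."""
    # Naive initialization (an assumption): use the first k records as centroids.
    centroids = [tuple(p) for p in points[:k]]
    assignments = None
    for _ in range(max_iters):
        # Assignment step: each record joins the cluster with the nearest centroid.
        new_assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_assignments == assignments:
            break  # no record changed cluster, so the procedure has converged
        assignments = new_assignments
        # Update step: move each centroid to the mean of its member records.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids

# Hypothetical records: (age of user, web pages accessed per week).
points = [(15, 40), (17, 35), (16, 42), (68, 5), (72, 8), (70, 6)]
assignments, centroids = k_means(points, k=2)
print(assignments)
print(centroids)
```

With these sample values, the younger, more active users and the older, less active users end up in separate clusters, and each centroid summarizes the typical attribute values of its cluster.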
Once the categories (classes or clusters) are established, the data analyst can use the attributes of the categories to guide decisions. For example, if one category represents users who are mostly teenagers, then a web master may decide to include advertisements directed to teenagers in the web pages that are accessed by users in this category. However, the web master may not want to include advertisements directed to teenagers on a certain web page if users in a different category, who are mostly senior citizens, also happen to access that web page frequently. Even though the categorization may reduce the data from thousands of records to 10 or 20 summary buckets, a data analyst still needs to review the data in these buckets. The data analyst still needs to understand the similarity and dissimilarity of the records in the categories so that appropriate decisions can be made.
With the rapid and burgeoning deployment of electronic commerce web sites, web site owners have realized that the voluminous consumer data gathered and provided through such a site, and particularly its electronic commerce server, provides a wealth of useful information. Additionally, traditional commercial means (including so-called “bricks-and-mortar stores”) also often incorporate and use systems that collect customer information. By analyzing customer data from whatever source, consumer buying patterns can be discerned. Targeted advertising, even to the point of advertising directed to a particular individual based on that person's particular buying habits and/or interests, can then be rendered. Such targeted advertising generally yields significantly higher response rates and improved user experiences over those resulting from traditional mass media advertising, and at significantly lower costs to the vendor. Similarly, other types of data may be analyzed, and uses other than commercial uses are possible.
Yet, a practical difficulty has arisen. While both cluster models and classification models can be extracted from data, such as on-line consumer transaction data, through well-known conventional machine-learning techniques, it has proven to be rather difficult to present category data in a simple, meaningful, and easily understood manner, for example, to a business manager who is making marketing or other decisions based on that data. Generally, in the past, category data was simply provided as textual lists that typically listed a number of consumers in each category and an associated probabilistic or other numeric measure (collectively “metrics”) associated with each user and each category. These users and categories could then be compared against each other through assessing their metrics to discern trends or other information of interest.
However, textual data, particularly if it is voluminous, which is very often the case with consumer purchasing data, is extremely tedious to comprehend (i.e., “digest”) quickly, particularly when looking for trends or other relationships that are “hidden” in the data. Furthermore, while conventional categorization techniques are rather effective in categorizing the data based on discerned relationships amongst different cases in the data (a case being a single record with all its associated attribute data, as discussed above), oftentimes the resulting categories are simply mathematical constructs in a flat list. The resulting categories provide little, and often no, physically discernible basis in reality, i.e., the qualitative meaning of and physical distinctions between different categories (apart from differences in mathematical metrics) are unclear and can be very difficult to comprehend. In essence, the question of “What do the categories represent?” can become very difficult for the data analyst to answer. Hence, useful distinctions effectively become lost in the results, thus frustrating not only a data analyst who is working with that data but also, ultimately, a business manager who, in an effort to reduce business risk, may need to make costly marketing and sales decisions based on that data, such as how to effectively market a given product, to whom, and when.
Given the difficulty associated with assessing text-based categorization results, various techniques have been developed in the art for visualizing clustered data, and particularly its classifications, in an attempt to facilitate and aid, e.g., the analyst or business manager in extracting useful relationships from the data.
A basic need of any such visualization system is to provide category information in a manner that allows its viewer to readily appreciate essential differences between the cases in a cluster, i.e., those distinctions that characterize the data. Thus far, the visualization tools available in the art for depicting clusters and their inter-relationships have proven to be quite deficient in practice in meeting this need, particularly, though certainly not exclusively, when utilized in an electronic commerce setting.
Thus, there is a need for a cluster or classification visualization tool that properly addresses and satisfies heretofore unfilled needs in the art. Such a tool is particularly, though certainly not exclusively, suited for use in servers designed to support electronic commerce.