The present invention generally relates to data exploration and analysis techniques and, in particular, to systems and methods for ordering and visualizing categorical data for use in such data exploration and analysis techniques.
Visual representation has become increasingly important in conveying and interpreting information from large amounts of data, because human visual perception is remarkably good at identifying interesting patterns and spatial relationships. An effective visual representation presents information in a way that exploits these visual skills to reveal interesting trends and anomalies hidden in data. A large number of data attributes in real data sets are categorical. A categorical value conveys the category of an object. Typically, categorical values have neither a natural order nor natural distances associated with them. For example, consider a data set representing a temporal sequence of events with such attributes as host name, event name, and event severity. Although we can arguably define a meaningful order of event severity, there is no natural way of defining distances between, or an order of, host names and event names.
While considerable research has been done on visualizing numerical data by directly leveraging its inherent geometric properties in constructing a visualization, there has been much less work on visualizing and extracting a structure in categorical data. Clearly, the lack of an order of attribute values adds additional complexity. This is because there are exponentially many ways in which the categorical values can be totally ordered. However, it is unlikely that all such orders produce equally effective visualizations. Ma and Hellerstein identified the problem and showed that the quality of the ordering algorithm is crucial for effectively visualizing categorical data, see U.S. patent application identified by Ser. No. 09/422,708, filed on Oct. 21, 1999 in the names of S. Ma and J. L. Hellerstein and entitled "Systems and Methods for Ordering Categorical Attributes to Better Visualize Multidimensional Data"; S. Ma and J. L. Hellerstein, "Ordering categorical data to improve visualization," Proceedings of the IEEE Symposium on Information Visualization, 1999; and S. Ma and J. L. Hellerstein, "EventBrowser: exploratory analysis of event data for event management," DSOM 1999, the disclosures of which are incorporated by reference herein.
To illustrate the importance of ordering categorical values, we consider the same data set as used by Ma and Hellerstein in the above-referenced disclosures. The data set contains over 10,000 events generated by 160 hosts with 20 event types over a three-day period. FIG. 1 shows a scatter plot of the data set, in which the x-axis and the y-axis represent the time and the host name (e.g., an identifier (id) of a host in a network of computing devices) of an event, respectively. In this plot, since host names are categorical, they must somehow be mapped to geometric coordinates (on the y-axis). The order of host names in FIG. 1 is a random permutation of host names. Unfortunately, the scatter plot of FIG. 1 produces results that are not particularly revealing because of the random ordering scheme. Thus, it is evident that some better ordering or mapping is required to provide a higher quality visualization of the data set.
A key issue addressed by Ma and Hellerstein in the above-referenced disclosures, which is also a focus of the present invention, is how to find a mapping that results in an effective visualization. Clearly, a guiding principle behind the construction of such a mapping is to utilize the geometric proximity to capture relationships between objects. That is, we want similar, related objects to be placed close to each other.
Numerous research efforts and commercial products have applied visualization techniques to categorical data sets, e.g., M. O. Ward, "XmdvTool: Integrating multiple methods for visualizing multivariate data," Proceedings of the Conference on Visualization (Los Alamitos, Calif., USA) IEEE Computer Society Press, pp. 326-336, October 1994; Diamond software from IBM Corporation; and U.S. patent application identified by Ser. No. 09/359,874, filed on July 27, 1999, and entitled "Systems and Methods for Exploratory Analysis of Data for Event Management," the disclosures of which are incorporated by reference herein. Such efforts can be classified into four classes.
One simple approach is to order categorical values based on an auxiliary numerical attribute or to order the values alphabetically. In our previous example, we can order host names by the time of their first occurrence in the data set. This approach is based on the assumption that there is some causality in the order of events generated by a system. However, this approach is not related to a visual task, and thus cannot, in general, provide the best visualization quality. The performance associated with this approach gets worse as the size and the complexity of the data set grow.
The second class is mostly focused on clustering-based approaches, see, e.g., V. Ganti et al., "CACTUS: Clustering categorical data using summaries," Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, ACM Press, pp. 73-83, August 1999; D. Gibson et al., "Clustering categorical data: An approach based on dynamic systems," Proceedings of the 24th International Conference on Very Large Data Bases, VLDB, pp. 311-322, August 1998; S. Guha et al., "Rock: a robust clustering algorithm for categorical attributes," Proc. of the 15th Int. Conf. on Data Eng., 1999; and S. Ma and J. L. Hellerstein, "Ordering categorical data to improve visualization," Proceedings of the IEEE Symposium on Information Visualization, 1999, the disclosures of which are incorporated by reference herein.
Clustering is a natural way of gaining insight into the data set. However, three issues affect its value for visualization purposes. First, although clusters can be identified in the geometric space, cluster descriptions are still unordered, and some additional nontrivial methods are needed to order and visualize the clusters, and to order and visualize the elements within each cluster. Second, most clustering algorithms prefer certain, usually very structured cluster shapes (e.g., rectangular regions of the above-referenced CACTUS approach), and always tend to partition the data into clusters of such shapes even though there may be no clustering tendency in the data set at all. In particular, the CACTUS approach showed that the above-referenced Gibson approach is not able to discover several natural classes of clusters, such as clusters with overlapping projections on some subset of attributes. We feel that making any assumption about the clustering structure of the data defeats the whole purpose of using clustering algorithms to extract structures. The present invention is directed toward techniques for revealing the order without imposing any prior assumption on the data. To do so, the present invention formulates the problem using an optimization framework.
Related to the second class of algorithms, the third approach, proposed in the above-referenced U.S. patent application identified by Ser. No. 09/359,874, is based on hierarchical ordering. The approach provides for iteratively grouping the closest pair of points (with respect to some similarity function) and replacing the pair by a single point. The points are thus fashioned into a strict hierarchy of nested ordered subsets. The length of the shortest path between any two subsets corresponds to the degree of their similarity. Constructing a global ordering reduces to locally ordering pairs of subsets recursively in a bottom-up fashion. This approach has two drawbacks. First, a strict hierarchical tree cannot reflect the multiple different ways in which points can be related, and this situation gets more severe as the size and the complexity of the data grow. Second, the hierarchical ordering is deterministic in nature (once the points are grouped together, there is no opportunity to reevaluate the grouping, and thus the ordering).
The fourth approach is to use a projection method, such as multidimensional scaling (MDS) or the algorithm disclosed in C. Faloutsos et al., "FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia data sets," Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (San Jose, Calif.) pp. 163-174, May 1995, the disclosure of which is incorporated by reference herein. There is a fundamental limitation associated with the objectives of these techniques. MDS produces a low-dimensional visual representation of the data that preserves the distances between the original data points (with respect to the similarity function) as faithfully as possible. The problem is that for visual exploration purposes, one usually does not know in advance what one is looking for, and does not have a good similarity function. Therefore, preserving the distances with respect to a specific similarity function should not be a goal in itself; rather, the order should be consistent with a reasonable similarity function. That is, if the point u is more similar to v than to w, then u should be mapped closer to v than to w in the order; however, preserving the actual ratio of the similarities should not be a goal. In other words, it would be more desirable to have the order be topology-preserving rather than distance-preserving (with distance being measured with respect to some specific similarity function). Moreover, MDS is prohibitively expensive for large data sets and does not allow for incrementally mapping new points to existing projections (once a new point is added, the entire mapping basically has to be recomputed).
The present invention provides techniques for ordering categorical attributes so as to better visualize data. In a first aspect of the invention, a computer-based technique of ordering categorical values of one or more attributes associated with a data set comprises the following steps/operations. First, the categorical values to be ordered are obtained. Given these categorical values, the task of ordering the categorical values is then formulated as a continuous optimization ordering problem. Once the task is formulated as a continuous optimization ordering problem, at least one continuous (preferably optimal) ordering solution to the continuous optimization ordering problem is computed. The technique may also include mapping the computed continuous ordering solution from a continuous space to a discrete space. At least a portion of the computed continuous ordering solution may be made available for use in accordance with a data visualization system.
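As a rough illustration of what a continuous relaxation of an ordering task can look like, the following is the standard quadratic placement objective. It is an assumption offered for orientation, not a formulation quoted from this disclosure: $w_{ij}$ denotes a hypothetical symmetric, nonnegative similarity between categorical values $i$ and $j$, and $x_i$ denotes the real-valued coordinate assigned to value $i$.

```latex
\min_{x \in \mathbb{R}^{n}} \;\; \frac{1}{2} \sum_{i,j} w_{ij}\,(x_i - x_j)^2 \;=\; x^{\top} L\, x,
\qquad L = D - W, \quad D_{ii} = \sum_{j} w_{ij},
\qquad \text{subject to} \quad \sum_{i} x_i = 0, \quad \sum_{i} x_i^{2} = 1 .
```

Under these assumptions, and for a connected similarity graph, the minimizer is an eigenvector of $L$ for its smallest positive eigenvalue. Sorting the values by their coordinates $x_i$ is one way to map such a continuous solution back to a discrete order.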
In one embodiment, the step/operation of forming a task to order the categorical values as a continuous optimization ordering problem may comprise computing a similarity matrix based on the categorical values of the one or more attributes. The similarity matrix may be based on one or more multi-set operations. The one or more multi-set operations may comprise computing two types of similarity measurements, a first type being a similarity measure computed between two categorical values from the same attribute, and a second type being a similarity measure computed between two categorical values from different attributes. Further, the step/operation of forming a task to order the categorical values as a continuous optimization ordering problem may further comprise computing a Laplace matrix from the similarity matrix. Then, the step/operation of computing at least one continuous ordering solution to the continuous optimization ordering problem may comprise finding the smallest positive eigenvalue of the Laplace matrix, followed by obtaining a corresponding optimal eigenvector from the smallest positive eigenvalue of the Laplace matrix. The categorical values may then be ordered in accordance with corresponding values associated with the optimal eigenvector.
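The eigenvector steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the claimed implementation: the similarity matrix is taken as already computed, symmetric, and nonnegative; the Laplace matrix is assumed to be the usual degree matrix minus the similarity matrix; and the function name `spectral_order` and the tolerance `tol` are hypothetical.

```python
import numpy as np

def spectral_order(similarity, tol=1e-9):
    """Order items by the eigenvector of the smallest positive Laplacian eigenvalue.

    `similarity` is an n-by-n array of nonnegative pairwise similarities
    (an assumed input; how it is computed is outside this sketch).
    Returns (order, vector), where `order` lists item indices in sorted position.
    """
    W = np.asarray(similarity, dtype=float)
    W = (W + W.T) / 2.0          # enforce symmetry
    np.fill_diagonal(W, 0.0)     # self-similarity does not affect the ordering
    # Assumed definition: Laplace matrix = degree matrix - similarity matrix.
    L = np.diag(W.sum(axis=1)) - W
    # eigh returns eigenvalues of a symmetric matrix in ascending order.
    eigvals, eigvecs = np.linalg.eigh(L)
    positive = np.where(eigvals > tol)[0]
    if positive.size == 0:
        raise ValueError("no positive eigenvalue: all similarities are zero")
    vector = eigvecs[:, positive[0]]
    # Map the continuous solution to a discrete order by sorting its entries.
    return np.argsort(vector), vector

# Usage: five hypothetical hosts whose similarities form a hidden chain 2-0-4-1-3.
chain = [2, 0, 4, 1, 3]
W = np.zeros((5, 5))
for a, b in zip(chain, chain[1:]):
    W[a, b] = W[b, a] = 1.0
order, _ = spectral_order(W)
print(order)  # the chain, in one direction or the other
```

The sign of an eigenvector is arbitrary, so the recovered order may come out reversed; for visualization purposes the two directions are equivalent.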
In a second aspect of the invention, techniques for ordering categorical values relating to multiple attributes are provided. In accordance with such techniques, prior to forming a task to order the categorical values as a continuous optimization ordering problem, the categorical values relating to the multiple attributes are mapped into a set of objects such that the above forming and computing steps/operations are performed in association with the set of objects.
In a third aspect of the invention, multi-level framework techniques for ordering categorical values of one or more attributes associated with a data set comprise the following steps/operations. First, the categorical values to be ordered are obtained. Given these categorical values, the categorical values are modeled as an original graph structure with vertices being the categorical values to be ordered and the weight of an edge representing the similarity of connected vertices. The original graph structure is then approximated by a hierarchical sequence of one or more coarser graph structures, wherein vertices that have a similar local structure are merged into a vertex in a coarser graph structure. The coarsest graph structure is ordered in accordance with a continuous optimization ordering operation. The ordering of the coarsest graph structure is propagated back through to the original graph structure, and at least a portion of the propagated ordering associated with the original graph structure is made available for use in accordance with a data visualization system.
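The coarsen, order, and propagate steps above can be sketched as follows. This is an illustrative sketch only, with several assumptions that are not taken from the disclosure: vertices are merged by a greedy pairing of each vertex with its most similar unmatched neighbor (a stand-in for merging vertices with similar local structure), the coarsest graph is ordered with a Laplacian eigenvector, and propagation simply keeps merged vertices adjacent without further refinement. All function names and the `min_size` parameter are hypothetical.

```python
import numpy as np

def fiedler_order(W, tol=1e-9):
    """Order vertices by the eigenvector of the smallest positive Laplacian eigenvalue."""
    L = np.diag(W.sum(axis=1)) - W
    vals, vecs = np.linalg.eigh(L)
    pos = np.where(vals > tol)[0]
    if pos.size == 0:                      # no edges: any order is as good as another
        return np.arange(len(W))
    return np.argsort(vecs[:, pos[0]])

def coarsen(W):
    """Assumed merge rule: pair each unmatched vertex with its most similar unmatched neighbor."""
    n = len(W)
    group = -np.ones(n, dtype=int)
    k = 0
    for u in range(n):
        if group[u] >= 0:
            continue
        group[u] = k
        candidates = [v for v in range(n) if group[v] < 0 and W[u, v] > 0]
        if candidates:
            group[max(candidates, key=lambda c: W[u, c])] = k
        k += 1
    P = np.zeros((n, k))
    P[np.arange(n), group] = 1.0           # membership matrix: vertex -> coarse vertex
    Wc = P.T @ W @ P                       # coarse edge weight = summed similarities
    np.fill_diagonal(Wc, 0.0)
    return Wc, group

def multilevel_order(W, min_size=4):
    """Recursively coarsen, order the coarsest graph, and propagate the order back."""
    W = np.asarray(W, dtype=float)
    if len(W) <= min_size:
        return [int(i) for i in fiedler_order(W)]
    Wc, group = coarsen(W)
    if len(Wc) == len(W):                  # nothing merged; stop coarsening
        return [int(i) for i in fiedler_order(W)]
    order = []
    for c in multilevel_order(Wc, min_size):
        order.extend(np.where(group == c)[0].tolist())  # expand coarse vertex in place
    return order

# Usage: eight hypothetical values forming two interleaved groups (even and odd indices).
n = 8
W = np.full((n, n), 0.05)
for i in range(n):
    for j in range(n):
        if i % 2 == j % 2:
            W[i, j] = 1.0
np.fill_diagonal(W, 0.0)
print(multilevel_order(W))  # even-indexed and odd-indexed values end up in separate halves
```

A fuller implementation would likely refine the order at each level after propagation; that step is omitted here to keep the sketch short.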
These and other objects, features and advantages of the present invention will become apparent from the following detailed description of illustrative embodiments thereof, which is to be read in connection with the accompanying drawings.