1. Field of the Invention
The present invention relates to visualizing scattered data points using a computer display.
2. Related Art
Computer visualization tools are called upon to handle ever increasing amounts of data. Conventional scatter plots visually represent multivariate data points as graphical glyphs plotted along one, two, or three axes. Each data point has one or more data attributes, also called variables. These data attributes can be numerical or categorical. Each axis can represent a different data attribute. Additional data attributes can be represented by varying the color or size of the glyphs.
Problems are encountered in visualizing scattered data when the number of data points is large. In general, each data point in a conventional scatter plot is represented by a corresponding glyph. As the number of scattered data points increases, more glyphs crowd a scatter plot display, and the total time needed to render the glyphs increases. The time it takes to build and display a scatter plot can become too long, thereby precluding interactive, on-the-fly rendering of scattered data. Occlusion can also occur as data points in the foreground of a scatter plot hide data points behind them. A serious problem occurs when many data points occupy the same location.
To illustrate the above problem, consider a two-dimensional scatter plot containing millions of data points. It takes a very long time for a graphics processor to draw millions of glyphs covering all these data points. If each data point is represented by a single pixel on the screen, then there will be many overlapping data points. Only the data point for a glyph which is drawn last for a given pixel location will be seen.
The same problems occur in three-dimensional scatter plots where three-dimensional (3-D) glyphs (e.g., cubes, spheres, etc.) are used to represent data points. These 3-D glyphs are plotted with respect to three scatter plot axes. Rendering such a 3-D scatter plot for large numbers of data points can take a long time, as many glyphs must be processed. Moreover, if there are many data points to be covered, glyphs in the foreground occlude those in the back. Also, data is hidden when the data points are clustered together. There is no easy way to examine data inside a cluster.
What is needed is a data visualization tool that visually approximates a scatter plot when a large number of data points needs to be drawn. Further, what is needed is a visualization tool that handles the case where a categorical variable has been mapped to the color of the scattered data points. To accomplish this using the splatting technique described herein, it is necessary to first determine distribution weights that represent values of a categorical variable in each bin, and then map a distinct color to each of the weights corresponding to the different values of the categorical variable in the scatter plot.
The present invention provides a method, system, and computer program product for a new data visualization tool for representing distribution weights that represent values of a categorical variable and then mapping a distinct color to each of the weights so as to visually represent the different values of the categorical variable (or data attribute) in a scatter plot. A special type of splat is used to represent the distribution of colored data points in a bin. In an alternate embodiment, distribution weights mapped to distinct colors are used to represent values for a numerical variable. Through a binning process, bins of scattered data points are formed. Each axis of a scatter plot is discretized according to a binning resolution. Bin positions along each discretized scatter plot axis are determined from bin numbers.
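The binning step described above can be sketched as follows. This is a minimal illustration, not the implementation of the invention: the function names (bin_number, bin_position, bin_points), the uniform bin width, and the choice of the bin center as the bin position are all assumptions made for the example.

```python
from collections import Counter


def bin_number(value, lo, hi, resolution):
    """Map a value on an axis spanning [lo, hi] to a bin number in 0..resolution-1."""
    if hi == lo:
        return 0
    n = int((value - lo) / (hi - lo) * resolution)
    # Clamp so that value == hi falls into the last bin.
    return min(max(n, 0), resolution - 1)


def bin_position(number, lo, hi, resolution):
    """Return the axis position of a bin (assumed here to be the bin center)."""
    width = (hi - lo) / resolution
    return lo + (number + 0.5) * width


def bin_points(points, ranges, resolution):
    """Count the data points that fall into each bin.

    points: iterable of tuples, one coordinate per scatter plot axis.
    ranges: list of (lo, hi) pairs, one per axis.
    Returns a Counter keyed by a tuple of bin numbers.
    """
    counts = Counter()
    for point in points:
        key = tuple(
            bin_number(v, lo, hi, resolution)
            for v, (lo, hi) in zip(point, ranges)
        )
        counts[key] += 1
    return counts
```

With a resolution of 4 on axes spanning 0 to 1, the points (0.1, 0.1) and (0.2, 0.15) fall into the same bin (0, 0), so a single splat would stand in for both.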
According to one embodiment of the present invention, the bins, which represent a cloud of scattered data points, are volume rendered as splats. The opacity of each splat is a function of the number (count or weight) of data points in a corresponding bin. The distinct colors of the splat are based on the distribution of categorical variable values in a corresponding bin. The variable which is mapped to splat colors is typically a variable other than one mapped to a scatter plot axis.
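The text above states only that splat opacity is a function of the count of data points in a bin; it does not fix the function. The sketch below assumes one plausible choice, a saturating exponential, purely for illustration. The function name splat_opacity and the scale parameter are hypothetical.

```python
import math


def splat_opacity(count, max_count, scale=3.0):
    """Return an opacity in [0, 1) that grows with the bin count.

    Assumption: opacity = 1 - exp(-scale * count / max_count).
    Any monotonically increasing mapping would serve the same purpose.
    """
    if count <= 0 or max_count <= 0:
        return 0.0
    return 1.0 - math.exp(-scale * count / max_count)
```

A mapping of this shape keeps sparsely populated bins faintly visible while preventing dense bins from differing only at full opacity.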
In one example of the present invention, the mapping of a categorical variable to color involves storing a vector of weights (counts) for each bin. (Bins are represented as rows in a table, which are computed by aggregating an original data set.) The vector is used to represent the distribution of the categorical variable values in the bin. The vector contains as many locations as the number of different values for the categorical variable. The value stored in each vector location is typically the percentage of the total weight of data in the bin for that particular value of the categorical variable. Each location in the vector is also associated with a distinct color. The splat used to represent a bin graphically needs to show the distribution of categorical variable values. The present invention describes a method by which this can be accomplished. The method involves a random set of opaque triangles, where a percentage of the triangles are of each color, and the total number of triangles maps to bin weight. The coloring of a single splat with multiple colors involves the rendering of each vector by looping through each vector location, and then, based on the weight stored in that location, randomly selecting the same percentage (or weight) of triangles in the splat for the color associated with that vector location.
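The triangle-coloring loop described above can be sketched as follows. This is an illustrative reading of the text, not the implementation of the invention. It assumes the vector holds fractions that sum to 1, that the number of triangles has already been derived from the bin weight, and that rounding is resolved with a largest-remainder rule; the function name color_triangles and its parameters are hypothetical.

```python
import random


def color_triangles(fractions, colors, n_triangles, seed=0):
    """Assign a color to each triangle of a splat.

    fractions: per-category share of the bin weight (assumed to sum to 1).
    colors: one distinct color per vector location.
    n_triangles: total triangles in the splat (assumed derived from bin weight).
    Returns a list of length n_triangles with one color per triangle.
    """
    rng = random.Random(seed)

    # Convert fractions to whole triangle counts (largest-remainder rounding).
    exact = [f * n_triangles for f in fractions]
    counts = [int(e) for e in exact]
    leftovers = max(0, n_triangles - sum(counts))
    by_remainder = sorted(
        range(len(exact)), key=lambda i: exact[i] - counts[i], reverse=True
    )
    for i in by_remainder[:leftovers]:
        counts[i] += 1

    # Shuffle triangle indices so each color lands on a random subset.
    indices = list(range(n_triangles))
    rng.shuffle(indices)

    assignment = [None] * n_triangles
    start = 0
    # Loop through each vector location and color its share of the triangles.
    for color, count in zip(colors, counts):
        for t in indices[start:start + count]:
            assignment[t] = color
        start += count
    return assignment
```

For example, a bin whose vector is [0.5, 0.25, 0.25] drawn with 20 triangles would receive 10, 5, and 5 triangles of the three colors, scattered randomly within the splat.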
According to a further feature of the present invention, a threshold is used to help reduce confusion and decrease processing time by summing all weights below the threshold and assigning to it a single neutral color. A slider or other controller can be used to vary the value of the threshold.
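The thresholding feature can be sketched as follows, assuming the per-bin vector holds fractions and that "gray" stands in for the neutral color. The function name apply_threshold and the use of color names as strings are assumptions made for the example.

```python
def apply_threshold(fractions, colors, threshold, neutral="gray"):
    """Merge all vector entries below the threshold into one neutral entry.

    Returns a list of (fraction, color) pairs: entries at or above the
    threshold keep their own color, and the remainder is summed and
    assigned the neutral color.
    """
    kept = [(f, c) for f, c in zip(fractions, colors) if f >= threshold]
    merged = sum(f for f in fractions if f < threshold)
    if merged > 0:
        kept.append((merged, neutral))
    return kept
```

With a vector of [0.5, 0.25, 0.125, 0.125] and a threshold of 0.2, the two smallest entries are combined into a single neutral entry of 0.25, so the splat is drawn with three colors instead of four. A slider would simply re-run this step with a new threshold value.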
According to another embodiment of the present invention, interpolated data is used for animating an external query attribute of a scatter plot of data points in a computer system. An external query device (or slider) corresponding to an attribute of the data points is used to animate over that data attribute. If the slider control is positioned in between discrete positions of the slider, the displayed plot corresponds to interpolated data. First, adjacent data structures (or data tables) are determined corresponding to the position of the external query device. The adjacent data structures are merged together, then aggregated using the spatial columns of the data structure as a unique key. For a categorical variable, weights (or percentages) of the same value in vectors to be merged are aggregated together. An interpolated bin is generated, where the count (or weight) of the bin is interpolated and the weights in the vector are also interpolated, but weighted by the new count. The interpolated vector is mapped to color in the visualized scatter plot. The plot appears as rendered splats corresponding to bin positions of the interpolated bins, where each splat has an opacity that is a function of the interpolated count of data points in the corresponding bin. The present invention allows for the smooth animation of one or more external query attributes of the data points.
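One possible reading of the interpolation step is sketched below. It is an interpretation under stated assumptions, not the implementation of the invention: each adjacent table is modeled as a dictionary keyed by the spatial bin numbers, each entry holds a count and a vector of fractions, a bin missing from one table is treated as having a count of zero, and linear interpolation with a parameter t between 0 and 1 is used. The function name interpolate_bins and its parameters are hypothetical.

```python
def interpolate_bins(table_a, table_b, t, n_values):
    """Interpolate between two adjacent binned tables.

    table_a, table_b: dicts mapping a spatial bin key to (count, fractions).
    t: slider position between the two tables, 0.0 (table_a) to 1.0 (table_b).
    n_values: number of distinct values of the categorical variable.
    Returns a dict mapping each bin key to (count, fractions).
    """
    empty = (0.0, [0.0] * n_values)
    result = {}
    # Merge the two tables using the spatial bin key as the unique key.
    for key in set(table_a) | set(table_b):
        count_a, frac_a = table_a.get(key, empty)
        count_b, frac_b = table_b.get(key, empty)
        weight_a = (1.0 - t) * count_a
        weight_b = t * count_b
        count = weight_a + weight_b  # interpolated bin count
        if count == 0:
            continue  # nothing to draw for this bin
        # Interpolate the vector, weighted by each side's share of the new count.
        fractions = [
            (weight_a * fa + weight_b * fb) / count
            for fa, fb in zip(frac_a, frac_b)
        ]
        result[key] = (count, fractions)
    return result
```

Under these assumptions, a bin with count 10 and vector [1.0, 0.0] in one table and count 30 and vector [0.0, 1.0] in the next yields, at the midpoint t = 0.5, an interpolated count of 20 and a vector of [0.25, 0.75]. The denser table dominates the color mix, which is the effect of weighting by count.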
Further features and advantages of the present invention, as well as the structure and operation of various embodiments of the present invention, are described in detail below with reference to the accompanying drawings.