Computer visualization tools are needed for presenting the results of ever-increasing amounts of processed data. The conventional approach is to take a few variables at a time, process them and their relations, for example with a spreadsheet, and display the result, for example as bar charts and pie charts. In a complex domain, where each data point may have several attributes, this conventional approach typically produces a great number of charts with only a very weak connection to each other. The charts are typically presented as a sequence. From such a sequence of charts it is usually very difficult to see and comprehend the overall significance of the results. In a more advanced case the data is processed not with a spreadsheet but with more elaborate techniques, such as statistical methods or neural networks, but the results are still typically presented in sequential form using conventional charts.
In the following description, the term data vector having a certain number of components refers to a data point having a certain number of attributes. The attributes/components may have continuous or discrete numerical values, or they may have ordinal or nominal values. The data vectors are vectors of a data domain or a data space. In a visualization process, high-dimensional data vectors are typically displayed using a two- or three-dimensional device. Typically, a corresponding visualization vector is determined for each data vector; it usually has two or three coordinates, which determine the location on the display device of the point representing the data vector.
Efforts exist to display data in a low-dimensional presentation using, for example, conventional scatter plots that visually represent data vectors as graphical objects plotted along one, two, or three axes. If each data vector has a great number of components, which are usually called attributes, problems are encountered: besides the three dimensions offered by a three-dimensional display, only a few additional dimensions can be represented in this manner, for example by varying the color and shape of the graphical objects.
Another, even more significant limitation concerns the use of more elaborate conventional data dimension reduction methods that can be used to define a visualization vector for a data vector. The goal is to replace the original high-dimensional data vectors with much shorter vectors, while losing as little information as possible. Consequently, a pragmatically sensible data reduction scheme is one in which, when two data vectors are close to each other in the data space, the corresponding visualization vectors are also close to each other in the visualization space. Traditionally, these methods define the closeness of data vectors in the data space via a geometric distance measure such as the Euclidean distance. The attributes of the data can be various and heterogeneous, and therefore the various dimensions of the data space can have different scaling and meaning. The geometric distances between the data vectors do not properly reflect the properties of complex data domains, where the data typically is not coded in a geometric or spatial form. In domains of this type, changing one bit in a vector may totally change the relevance of the vector and make it in some sense a quite different vector, although geometrically the difference is only one bit. For example, many data sets contain nominal or ordinal attributes, which means that some of the data vector components have nominal or ordinal values, and finding a reasonable coding of such values with respect to a geometric distance metric, for example the Euclidean distance metric, is a difficult task. In a geometric distance metric, all attributes (vector components) are treated as equal. Consequently, an attribute whose values range, say, from −1000 to 1000 influences the distance more than an attribute whose values range from −1 to 1. To circumvent this problem, the attributes can of course be normalized, but it is not at all clear what the optimal way to implement the normalization is.
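The scaling problem described above can be illustrated with a minimal sketch. The data values, the attribute ranges, and the helper names euclidean and min_max_scale are hypothetical and chosen only for illustration, and min-max scaling is just one of several possible normalizations, not one prescribed by this description. The sketch shows that which of two candidate vectors counts as closer to a query vector can depend entirely on whether, and how, the attributes are normalized.

```python
import math


def euclidean(a, b):
    """Plain Euclidean distance; every component is weighted equally."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_max_scale(vector, ranges):
    """Map each component to [0, 1] using an assumed (low, high) range."""
    return tuple((x - lo) / (hi - lo) for x, (lo, hi) in zip(vector, ranges))


# Assumed attribute ranges: attribute 0 in [-1000, 1000], attribute 1 in [-1, 1].
RANGES = [(-1000.0, 1000.0), (-1.0, 1.0)]

# Hypothetical data vectors.
query = (0.0, 1.0)
cand_a = (50.0, 1.0)   # small relative change in the wide-range attribute
cand_b = (0.0, -1.0)   # opposite extreme of the narrow-range attribute

# Distances on the raw values: the wide-range attribute dominates.
raw_a = euclidean(query, cand_a)
raw_b = euclidean(query, cand_b)

# Distances after min-max scaling: both attributes contribute comparably.
scaled_query = min_max_scale(query, RANGES)
scaled_a = euclidean(scaled_query, min_max_scale(cand_a, RANGES))
scaled_b = euclidean(scaled_query, min_max_scale(cand_b, RANGES))

print("raw:    d(query, a) =", raw_a, " d(query, b) =", raw_b)
print("scaled: d(query, a) =", scaled_a, " d(query, b) =", scaled_b)
```

On the raw values cand_b is the nearer vector, whereas after scaling cand_a is the nearer one, so the neighborhood structure that a dimension reduction method tries to preserve already depends on the chosen normalization.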
In addition, in real-world situations the similarity of two vectors is not a universal property, but depends on the specific focus of the user: even if two vectors can be regarded as similar from one point of view, they may appear quite dissimilar from another point of view.
A third significant limitation is related to data mining. Data mining is a process that uses specific techniques to find patterns in data, allowing a user to conduct a relatively broad search in databases for relevant information that may not be explicitly stored in the data. In a typical data mining process, a user initially specifies a search phrase or strategy, and the system then extracts patterns and relations corresponding to that strategy from the stored data. Extracting the patterns usually takes some time, and therefore a data analyst presents the extracted patterns and relations to the user only after a delay. Any new requests prompted by these results start a new processing cycle, again with a relatively long delay. There is thus a need for a data visualization tool/method that visually approximates the whole data domain in a single view, even when the domain includes a large number of variables. Furthermore, there is a need for a tool/method in which the results of the data mining process are visualized instantly, so that the data mining process can typically be carried out in one session.