1.1 Field of the Invention
The present invention relates generally to the technical field of data mining and/or text mining. More particularly, the present invention proposes mining technology that improves performance and scalability in data spaces with a large number of dimensions.
1.2 Description and Disadvantages of Prior Art
Data mining and text mining (collectively referred to as data mining in the following), in their most fundamental sense, address the problem of looking at authentic information from different directions and thereby gathering derived information. This “information about information” (meta-information) is often not obvious, but it opens new horizons because it helps to abstract from the plain data and see the “big picture” from a higher level.
Data mining usually deals with high-dimensional data. Each item or data member consists of n attributes or features that characterize and specify the individual data item in more detail. If, for example, meteorological data sets are analyzed, then each data item could be a cube of air in the stratosphere that has n features such as temperature, humidity or pressure. Each feature is called a variable, and any algorithm that allows for data mining has to deal with a multitude of variables simultaneously. The goal is to discover interesting patterns in such an n-dimensional data set. “Interesting” in this context is defined by a data mining function (e.g. clustering, classification or regression) and a set of control parameters. In particular, these control parameters are used to specify properties of the mining result, to tailor the algorithmic procedure or, in general, to control the mining target of the data mining function. The original amount of data on which data mining operates is typically huge, as it usually describes a complex environment. As a result, new methods have been developed to keep the handling of such immense data efficient in terms of performance, usage of resources such as computer storage, and scalability of the applied mining technology with the increasing number of dimensions of the underlying data spaces.
Prior art data mining on high-dimensional information is performed with algorithms and mining technology that work in n-dimensional space. While the performance of these algorithms is acceptable with few dimensions, they do not scale well to a large number of dimensions. To overcome this limitation of data mining in high-dimensional data spaces, several strategies have been developed.
One proposed solution is to reduce high dimensionality by dropping those dimensions that are assumed to play a minor role in the subsequent analysis step. This method is most often performed on a “best guess” basis, as it intentionally drops information without knowing the exact impact on the final result. Another disadvantage of this approach is the need for human intervention in selecting the most relevant dimensions, i.e. features.
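The loss of information caused by dropping dimensions can be illustrated with a minimal sketch. The data values, the function name drop_dimensions and the choice of retained columns are hypothetical and serve only as an example; they are not taken from any particular prior art system.

```python
def drop_dimensions(rows, keep):
    """Project each data item onto the column indices listed in keep."""
    return [[row[j] for j in keep] for row in rows]


# Hypothetical data items with three features each:
# (temperature, humidity, pressure)
rows = [
    [20.0, 0.30, 1013.0],
    [20.0, 0.30, 950.0],
    [25.0, 0.60, 1013.0],
]

# "Best guess" of an analyst: pressure is assumed to play a minor role.
keep = [0, 1]
reduced = drop_dimensions(rows, keep)

# Count how many items can still be told apart before and after dropping.
distinct_before = len({tuple(r) for r in rows})
distinct_after = len({tuple(r) for r in reduced})
```

In this example the first two items differ only in the dropped feature, so they can no longer be distinguished after the reduction. Whether such a loss matters for the final mining result is not known at the time the dimension is dropped.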
Attempts have also been made to capture most of the information by defining a new set of (derived) variables, such that some of the new variables hold most of the information while others contribute only little and can therefore be neglected (Principal Component Analysis, PCA). Often, however, the number of remaining variables is still too large, or the loss of information too great, for this to be regarded as a practical approach.
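The idea behind derived variables can be sketched as follows. The sketch computes the sample covariance matrix of a small data set and approximates the first principal component by power iteration, a simple stand-in for a full eigendecomposition. The toy data, the function names and the iteration count are assumptions made for illustration only.

```python
import math
import random


def covariance_matrix(rows):
    """Return the sample covariance matrix of equal-length data items."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]


def leading_component(cov, iterations=200, seed=0):
    """Approximate the dominant eigenvector and eigenvalue by power iteration."""
    rng = random.Random(seed)
    d = len(cov)
    v = [rng.random() + 0.1 for _ in range(d)]
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector v gives the eigenvalue estimate.
    cov_v = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
    eigenvalue = sum(v[i] * cov_v[i] for i in range(d))
    return v, eigenvalue


# Hypothetical toy data: the second variable is roughly twice the first,
# and the third variable carries only small fluctuations.
rows = [
    [1.0, 2.1, 0.3],
    [2.0, 3.9, 0.1],
    [3.0, 6.2, 0.2],
    [4.0, 7.8, 0.4],
    [5.0, 10.1, 0.2],
]

cov = covariance_matrix(rows)
component, eigenvalue = leading_component(cov)

# Share of the total variance captured by the first derived variable.
total_variance = sum(cov[i][i] for i in range(len(cov)))
explained_ratio = eigenvalue / total_variance
```

For strongly correlated data such as this toy set, a single derived variable captures nearly all of the variance. In real high-dimensional data the variance is typically spread over many components, which is the limitation noted above.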
In another approach, specific algorithms and methods have been developed that are tailored to a specific problem in high-dimensional space. In this case, special assumptions about the data can allow efficient processing, but for any other problem where these assumptions do not hold, the algorithm will not work.