This invention relates to a method for the automatic, software-driven statistical evaluation of large amounts of data that is to be assigned to statistical variables in a database. The data to be evaluated can, in particular, be contained in one or several clusters.
Modern databases are able to store immense amounts of data. To evaluate the stored data and extract useful information from it, efficient, i.e. quick and targeted, database accesses are required because of the sheer volume of data.
In general, an evaluation requires finding all the data that conforms to a pre-determinable condition. Often the located data itself does not need to be known; only statistical information derived from that data is required.
If, for example, in a customer relationship management (CRM) system in which customer data is stored, it is to be determined what proportion of customers with specific features bought a certain product, a simple procedure could be to access all the customer entries in the database, request all the features of the customers, and among these find and count those entries that “match” the desired features and in which the customer bought the specific product. For example, such a request to the database could be as follows: how often were specific mobile telephones purchased by male customers who are at least 30 years old? Therefore, all the customer entries that conform to the requirements “male” and “at least 30 years old” must be found, and the matching entries must then be examined to determine how often each mobile telephone was purchased.
However, a disadvantage of this procedure is the fact that the entire database has to be read to find the matching entries. This can occasionally take a very long time in the case of very large databases.
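The simple procedure described above can be illustrated with a minimal sketch. The record layout, the field names (`sex`, `age`, `phone`) and the sample entries are assumptions made purely for illustration and are not taken from any particular CRM system.

```python
from collections import Counter

# Hypothetical customer entries; field names and values are illustrative only.
customers = [
    {"sex": "male", "age": 34, "phone": "A"},
    {"sex": "male", "age": 25, "phone": "B"},
    {"sex": "female", "age": 41, "phone": "A"},
    {"sex": "male", "age": 52, "phone": "B"},
    {"sex": "male", "age": 30, "phone": "A"},
]


def count_phones_full_scan(records, sex, min_age):
    """Read every entry and count the purchased phones among the matches."""
    counts = Counter()
    for record in records:  # every entry is read, matching or not
        if record["sex"] == sex and record["age"] >= min_age:
            counts[record["phone"]] += 1
    return counts


phone_counts = count_phones_full_scan(customers, "male", 30)
```

The loop visits every entry regardless of how few entries match, so the effort grows with the total size of the database rather than with the size of the result.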
The database can be searched more skillfully and more efficiently if all the variables are provided with selective indexes that can be queried. As a rule, the more exact and sophisticated the index technique applied to a database is, the quicker the database can be accessed. Statistical information about the database entries can accordingly also be provided more efficiently. This applies in particular if the database is specifically prepared, by means of a special index technique, for the requests to be expected.
Alternatively or in combination with index techniques, the results of all the statistical requests to be expected can be pre-calculated, which has the disadvantage that considerable effort is required for calculating and storing the results.
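One possible form of such an index is sketched below as a simple inverted index that maps each variable state to the set of matching entry numbers. This is only one of many index techniques; the binning of the age into a hypothetical `age_30_plus` flag and all names used here are assumptions for illustration.

```python
from collections import defaultdict

# Hypothetical customer entries; field names and values are illustrative only.
customers = [
    {"sex": "male", "age": 34, "phone": "A"},
    {"sex": "male", "age": 25, "phone": "B"},
    {"sex": "female", "age": 41, "phone": "A"},
    {"sex": "male", "age": 52, "phone": "B"},
    {"sex": "male", "age": 30, "phone": "A"},
]


def build_index(records):
    """Map each (variable, state) pair to the set of matching entry numbers."""
    index = defaultdict(set)
    for row_id, record in enumerate(records):
        index[("sex", record["sex"])].add(row_id)
        # Assumption: the index is prepared for the expected "at least 30" request.
        index[("age_30_plus", record["age"] >= 30)].add(row_id)
        index[("phone", record["phone"])].add(row_id)
    return index


def count_matching(index, *conditions):
    """Count entries satisfying all conditions by intersecting index sets."""
    row_sets = [index.get(condition, set()) for condition in conditions]
    return len(set.intersection(*row_sets)) if row_sets else 0


index = build_index(customers)
male_30_plus_phone_a = count_matching(
    index, ("sex", "male"), ("age_30_plus", True), ("phone", "A")
)
```

The count is obtained from the index sets alone, without reading the entries themselves. The sketch also shows the limitation noted above: a request for a different age threshold would not be served by this particular index.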
The term “online analytical processing” (OLAP) characterizes a class of methods for extracting statistical information from the data of a database. In general, such methods can be subdivided into “relational online analytical processing” (ROLAP) and “multidimensional online analytical processing” (MOLAP).
The ROLAP method performs only minor pre-calculations. When statistics are requested, the data required to answer the request is accessed by means of index techniques, and the statistics are then calculated from that data. The emphasis of ROLAP is therefore on a suitable organization and indexing of the data so that the required data can be found and loaded as quickly as possible. Nevertheless, the effort for large amounts of data can still be very great, and in addition the selected indexing is sometimes not optimal for all requests.
In the MOLAP method the focus is on pre-calculating the results for many possible requests. As a result, the response time for a pre-calculated request remains very short. For requests that have not been pre-calculated, the pre-calculated values can sometimes also lead to an acceleration if the desired quantities can be calculated from the pre-calculated results and doing so is less costly than directly accessing the data. However, the number of all possible requests increases rapidly with the number of variables and the number of states of these variables, so that the pre-calculation runs up against present limits with regard to memory space and computation time. Restrictions with regard to the variables considered, the different states of these variables or the permissible requests must then be accepted.
Even though the OLAP methods increase efficiency compared to merely accessing each database entry, they have the disadvantage that a great amount of redundant information has to be generated: statistics must be pre-calculated and extensive index lists created. In general, an efficient application of an OLAP method also requires that the method be optimized for specific requests, in which case the OLAP method is then also subject to these selected restrictions, i.e. arbitrary requests cannot be made to the database.
In addition, it is also true for the OLAP methods that the more quickly the information is to be provided and the more this information varies, the more structures must be pre-calculated and stored. Therefore, OLAP systems can become very large and are far less efficient than would be desired. In practice, response times of less than one second cannot be achieved for arbitrary statistical requests to a large database; often the response times are considerably more than one second.
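The pre-calculation approach and its growth in size can be sketched as follows. The sketch assumes three hypothetical variables (`sex`, `age_group`, `phone`) with already discretized states; it pre-calculates one count per combination of states and answers coarser requests by summing those counts instead of accessing the data.

```python
from collections import Counter
from math import prod

# Hypothetical variables and discretized entries; illustrative only.
VARIABLES = ("sex", "age_group", "phone")
rows = [
    ("male", "30+", "A"),
    ("male", "<30", "B"),
    ("female", "30+", "A"),
    ("male", "30+", "B"),
    ("male", "30+", "A"),
]

# Pre-calculation: one stored count per occurring combination of states.
cube = Counter(rows)


def query(cube, **fixed):
    """Answer a request by summing the pre-calculated cells that match."""
    total = 0
    for cell, count in cube.items():
        if all(cell[VARIABLES.index(name)] == state
               for name, state in fixed.items()):
            total += count
    return total


def cell_count(states_per_variable):
    """Upper bound on the number of cells: product of the state counts."""
    return prod(states_per_variable)


male_30_plus = query(cube, sex="male", age_group="30+")
small_cube_cells = cell_count([2, 2, 2])
large_cube_cells = cell_count([10] * 20)
```

With three two-state variables the bound is 8 cells, whereas twenty variables with ten states each already give a bound of 10^20 cells, which illustrates why restrictions on variables, states or permissible requests become necessary.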
Therefore, there is a need for more efficient methods for the statistical evaluation of data entries. If possible, the requests made in such methods should not be subject to any restrictions.