1. Field of the Invention
The present invention relates to a method for processing, on a parallel computer system, the aggregate queries that are required for database processing such as Data Mining. An aggregate query, or aggregation, includes a query that computes an aggregate for each group in the database.
2. Related Art
Aggregation, as used in fields such as Data Mining, is a process that calculates the total, maximum, minimum, or average value of one attribute of a relation in the database for each group, like "group-by" in the Structured Query Language (SQL) for relational databases. A group is defined as the set of records for which another attribute has the same value or the same set of values, or a value that falls inside a predetermined range. Aggregation has recently become indispensable to decision support systems, such as OLAP (On Line Analytical Processing) (for details, see "Beyond Decision Support," E. F. Codd, S. B. Codd and C. T. Salley, Computerworld, 27(30), Jul. 1993) and Data Mining.
An example aggregation will now be explained. Table 1 shows a relation stored in a database. In Table 1, product numbers, customer numbers and sales are entered in separate columns, and each time a sale occurs, a tuple is added. Assume a calculation that totals the sales for each product number in this relation. For this calculation, a group is formed for each product number, and the total is calculated for each group. This is one example of the above described aggregation. When the aggregation is executed, the results shown in Table 2 are obtained.
TABLE 1
product #    customer #    sold
G1           C1            3
G1           C2            10
G2           C2            5
G2           C3            10
G3           C4            2
TABLE 2
product #    sold
G1           13
G2           15
G3           2
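The aggregation that turns Table 1 into Table 2 can be sketched in a few lines of Python. This is only an illustration on a single processor; the names `TABLE_1` and `aggregate_sum` are hypothetical and do not appear in the original text.

```python
from collections import defaultdict

# Rows of Table 1: (product number, customer number, units sold).
TABLE_1 = [
    ("G1", "C1", 3),
    ("G1", "C2", 10),
    ("G2", "C2", 5),
    ("G2", "C3", 10),
    ("G3", "C4", 2),
]


def aggregate_sum(rows):
    """Form one group per product number and total the sales in each group."""
    totals = defaultdict(int)
    for product, _customer, sold in rows:
        totals[product] += sold
    return dict(totals)


# Reproduces Table 2.
print(aggregate_sum(TABLE_1))  # {'G1': 13, 'G2': 15, 'G3': 2}
```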
Algorithms for processing one aggregation in parallel have been studied, and these algorithms will now be described. It should be noted that, for each algorithm, the entire relation in the database is divided equally among the processors.
1. 2P Algorithm
Since this algorithm is performed in two phases, it is called 2P.
(1) As a first phase, each processor (also called a node) performs an aggregation on the data in its corresponding disk drive, on which a part of the database is stored.
(2) As a second phase, the results of the respective processors are collected by a totaling processor to obtain the final result.
This method is also described in Japanese Unexamined Patent Publication No. Hei 5-2610. In this publication, an aggregation is performed for only one group, and no consideration is given to aggregation for a plurality of groups, as is required in Data Mining described above.
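The two phases of the 2P algorithm can be sketched as follows. This is a sequential simulation under stated assumptions: each inner list stands for the part of the relation on one node's disk drive, rows are reduced to (group, value) pairs, and the function names are hypothetical. A real system would run phase 1 on the nodes concurrently and send the partial results over the network.

```python
from collections import Counter


def local_aggregate(partition):
    """Phase 1: one node totals each group found in its own partition."""
    totals = Counter()
    for group, value in partition:
        totals[group] += value
    return totals


def two_phase(partitions):
    """Run phase 1 on every partition, then merge at a totaling processor."""
    partials = [local_aggregate(p) for p in partitions]  # phase 1, per node
    final = Counter()
    for partial in partials:  # phase 2, on the totaling processor
        final.update(partial)  # adds the partial totals group by group
    return dict(final)


# Table 1 split across two nodes (the split itself is an assumption).
partitions = [
    [("G1", 3), ("G2", 5)],
    [("G1", 10), ("G2", 10), ("G3", 2)],
]
print(two_phase(partitions))
```

Note that every node may hold a partial total for every group, so the volume of data sent in phase 2 grows with the number of groups.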
2. Rep Algorithm
The repartition (Rep) algorithm is performed in the following steps (FIG. 1).
(1) First, it is determined for which group each node performs the aggregation (step 110). In the example in Table 1, it is determined that node 1 will handle product number G1 and node 2 will handle product number G2.
(2) Then, each node reads part of the data from its corresponding disk drive, on which a part of the database is stored. If the read part is data for a group for which another node is to perform aggregation, the read part is transmitted to the responsible node (step 120). In the example in Table 1, if Table 1 is present on the disk drive for node 1, when node 1 reads the third tuple in Table 1 it will transmit the data in the third tuple to node 2.
(3) Finally, each node performs aggregation for the groups assigned to that node, using its own data and the data transmitted by the other nodes (step 130).
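Steps 110 to 130 can be sketched as a sequential simulation. The assignment of groups to nodes by a CRC32 hash is an assumption made for illustration only; the text does not say how step 110 chooses the responsible node. The names `owner` and `repartition` are likewise hypothetical, and the per-node "inbox" lists stand in for network transmission.

```python
import zlib
from collections import defaultdict


def owner(group, num_nodes):
    """Step 110 (assumed rule): map a group to its responsible node."""
    return zlib.crc32(group.encode("utf-8")) % num_nodes


def repartition(partitions):
    """Simulate the Rep algorithm; returns one result dict per node."""
    n = len(partitions)
    inboxes = [[] for _ in range(n)]
    # Step 120: each node scans its partition and forwards every row
    # to the node responsible for that row's group (possibly itself).
    for partition in partitions:
        for group, value in partition:
            inboxes[owner(group, n)].append((group, value))
    # Step 130: each node aggregates only the rows it received.
    results = []
    for inbox in inboxes:
        totals = defaultdict(int)
        for group, value in inbox:
            totals[group] += value
        results.append(dict(totals))
    return results


partitions = [
    [("G1", 3), ("G2", 5)],
    [("G1", 10), ("G2", 10), ("G3", 2)],
]
per_node = repartition(partitions)
```

Because each group is owned by exactly one node, the per-node results are disjoint and no final merge step is needed; the cost is that raw rows, rather than partial totals, cross the network.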
Which algorithm is faster depends on the conditions. FIG. 2 shows an estimate of the time required to perform one aggregation using each algorithm. The estimate in FIG. 2 was obtained for an IBM SP2 having 16 nodes (SP2 is a trademark of International Business Machines Corp.), and indicates the relationship between the number of groups and the response time. In this system, when the number of groups is smaller than 2 × 10^5, the 2P algorithm provides faster processing. When the number of groups is greater, the Rep algorithm provides faster processing. An algorithm for dynamically switching between the 2P algorithm and the Rep algorithm has been proposed (for example, see "Adaptive Parallel Aggregation Algorithms," Ambuj Shatdal and Jeffrey F. Naughton, in Proceedings Of The ACM SIGMOD Conference On The Management of Data, pp. 104-114, May 1995).
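The crossover described above can be expressed as a simple static rule. This sketch is not the adaptive algorithm of the cited paper; it only encodes the 2 × 10^5 figure estimated for the 16-node system, and that threshold would differ on other hardware. The function name and the treatment of a group count exactly equal to the threshold are assumptions.

```python
# Crossover estimated for the 16-node system discussed above;
# other systems would need their own measured value.
CROSSOVER_GROUPS = 200_000


def choose_algorithm(num_groups, threshold=CROSSOVER_GROUPS):
    """Pick 2P below the threshold, Rep otherwise.

    The source does not say which is faster exactly at the threshold;
    choosing Rep there is an arbitrary assumption.
    """
    return "2P" if num_groups < threshold else "Rep"


print(choose_algorithm(1_000))      # 2P
print(choose_algorithm(1_000_000))  # Rep
```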
Another method involves broadcasting all the records in a database (see "Parallel Algorithms For The Execution Of Relational Database Operations," Dina Bitton, Haran Boral, David J. DeWitt and W. Kevin Wilkinson, ACM Trans. on Database Systems, 8(3):324-353, Sep. 1983). Such an algorithm (hereinafter called a BC algorithm), however, is impractical when the network connecting the processors is slow.
The BC algorithm will be generally explained below.
3. BC Algorithm
(1) It is determined for which of the groups each node performs the aggregation. This is the same as for the Rep algorithm.
(2) Each node broadcasts all the data on its disk drive to all the other nodes.
(3) Each node performs an aggregation for the groups for which it is responsible, using the broadcast data together with the data on its own disk drive.
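The three steps of the BC algorithm can be sketched as a sequential simulation. The mapping `owner_of` from group to node index, the function name, and the example split of Table 1 are assumptions made for illustration; the broadcast is modeled by letting every node see the concatenation of all partitions.

```python
def broadcast_aggregate(partitions, owner_of):
    """Simulate the BC algorithm; returns one result dict per node."""
    # Step 2: after the broadcast, every node holds every row.
    all_rows = [row for partition in partitions for row in partition]
    results = []
    for node in range(len(partitions)):
        totals = {}
        # Step 3: the node scans all rows but keeps only its own groups.
        for group, value in all_rows:
            if owner_of[group] == node:
                totals[group] = totals.get(group, 0) + value
        results.append(totals)
    return results


# Step 1 (assumed assignment): node 0 owns G1 and G3, node 1 owns G2.
owner_of = {"G1": 0, "G2": 1, "G3": 0}
partitions = [
    [("G1", 3), ("G2", 5)],
    [("G1", 10), ("G2", 10), ("G3", 2)],
]
print(broadcast_aggregate(partitions, owner_of))
# [{'G1': 13, 'G3': 2}, {'G2': 15}]
```

Every row reaches every node regardless of who needs it, which is why the method depends so heavily on network speed.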
The algorithms with which one aggregation is performed in parallel have been explained. For the case in which one processor is used to perform a plurality of aggregations, a method has been employed that speeds up the entire processing by adjusting the order of the calculations or by exploiting the correlation among the calculations (see, for example, "On The Computation Of Multidimensional Aggregates," Sameet Agarwal, Rakesh Agrawal, Prasad M. Deshpande, Ashish Gupta, Jeffrey F. Naughton, Raghu Ramakrishnan and Sunita Sarawagi, in Proceedings of the 22nd VLDB Conference, Sep. 1996). However, a method for simultaneously processing a plurality of aggregations by using a parallel computer has not been proposed.
A plurality of aggregations cannot always be performed in parallel at a high speed merely by repeating the above described methods multiple times. It is, therefore, one object of the present invention to provide a method for performing a plurality of aggregations in parallel and at a high speed.
A plurality of aggregations must be performed for OLAP and for Data Mining, which are core technologies of the decision support systems on which interest has recently been focused. For example, a Data Cube operator for performing a plurality of aggregations, for the analysis of data having multi-dimensional attributes, has been proposed (for example, see "Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals," Jim Gray, Adam Bosworth, Andrew Layman and Hamid Pirahesh, Technical Report, Microsoft, November 1995). An application of Data Mining automatically finds the relationship between the attributes by using the results obtained by a plurality of aggregations and then prepares a graphical display (for example, see "Data Mining Optimized Association Rules For Numeric Attributes," Takeshi Fukuda, Yasuhiko Morimoto, Shinichi Morishita and Takeshi Tokuyama, in Proceedings Of The Fifteenth ACM SIGACT-SIGMOD-SIGART Symposium On Principles Of Database Systems, pp. 182-191, June 1996, and "Data Mining Using Two-Dimensional Optimized Association Rules: Scheme, Algorithm And Visualization," Takeshi Fukuda, Yasuhiko Morimoto, Shinichi Morishita and Takeshi Tokuyama, in Proceedings Of The ACM SIGMOD Conference On Management Of Data, pp. 13-23, June 1996). These techniques require interactive operation, so the response time is an important factor. One way to reduce the response time is to perform the aggregations in advance. Thus, it is another object of the present invention to increase the speed of execution of OLAP and Data Mining by performing a plurality of aggregations in parallel and at a high speed.
It is an additional object of the present invention to switch between methods for executing a plurality of aggregations, depending on hardware conditions and on the properties of the plurality of aggregations, so that under various conditions the plurality of aggregations can be executed at a higher speed than by always using the same method.