The present invention relates to aggregation operations and, more particularly, to parallelizing distinct aggregation operations.
In a database management system (DBMS), data is stored in one or more data containers, each container contains records, and the data within each record is organized into one or more fields. In relational database systems, the data containers are referred to as tables, the records are referred to as rows, and the fields are referred to as columns. In object oriented databases, the data containers are referred to as object classes, the records are referred to as objects, and the fields are referred to as attributes. Other database architectures may use other terminology.
The present invention is not limited to any particular type of data container or database architecture. However, for the purpose of explanation, the examples and the terminology used herein shall be that typically associated with relational databases. Thus, the terms xe2x80x9ctablexe2x80x9d, xe2x80x9crowxe2x80x9d and xe2x80x9ccolumnxe2x80x9d shall be used herein to refer respectively to the data container, record, and field.
In typical database systems, users write, update and retrieve information by submitting commands to a database application. To be correctly processed, the commands must comply with the database language that is supported by the database application. One popular database language is known as Structured Query Language (SQL).
Multi-processing systems are typically partitioned into nodes, where each node may contain multiple processors executing multiple concurrent processes. To fully utilize the computing power of a multi-processing system, a database application may divide a large processing task required by a query into smaller work granules which may then distributed to processes running on one or more processing nodes. Because the various work granules are being performed in parallel, the processing required by the query can be completed much faster than if the processing were performed on a single node by a single process. One mechanism for implementing parallel operations in a database management system is described in U.S. patent application Ser. No. 08/441,527 entitled xe2x80x9cMethod and Apparatus for Implementing Parallel Operations in a Database Management Systemxe2x80x9d filed on May 15, 1995, by Gary Hallmark and Daniel Leary, incorporated herein by reference.
Unfortunately, not all types of operations can be efficiently performed in parallel. For example, distinct aggregation operations pose parallelization difficulties. A distinct aggregation operation is an operation in which some form of aggregation (e.g. SUM, COUNT, or AVERAGE) is performed on the results of a DISTINCT operation. A DISTINCT operation causes the elimination of duplicate values in specified sets of data. For example, the SQL query xe2x80x9cselect distinct deptno from empxe2x80x9d returns the set of unique departments xe2x80x9cdeptnoxe2x80x9d from the table xe2x80x9cemp.xe2x80x9d Even if a particular department number appears in fifty rows of the table xe2x80x9cempxe2x80x9d, it will only appear once in the result set of the query.
The parallelization difficulties posed by distinct aggregation operations may be illustrated with reference to the following query (Q1), where emp identifies table 100 shown in FIG. 1, which has columns xe2x80x9cregionxe2x80x9d, xe2x80x9cempnoxe2x80x9d, xe2x80x9cagexe2x80x9d and xe2x80x9cmgrxe2x80x9d:
select count (distinct mgr)
from emp
group by region
In this query, the column xe2x80x9cregionxe2x80x9d is referred to as the xe2x80x9cgroup byxe2x80x9d column because it contains the values that are used as the basis for forming the rows into groups. Thus, all rows with the same xe2x80x9cregionxe2x80x9d value belong to the same group. On the other hand, the column xe2x80x9cmgrxe2x80x9d is referred to as the xe2x80x9cdistinct-keyxe2x80x9d column because it is the column that holds the values that are involved in the duplicate elimination. Thus, within each region group, rows are eliminated if their xe2x80x9cmgrxe2x80x9d value duplicates the xe2x80x9cmgrxe2x80x9d value of another row in the group.
Query Q1 returns the number of distinct managers there are in each region. During execution of this query, the database server (1) groups together rows by region value, (2) eliminates, within each region group, rows that have duplicate manager values, and (3) counts how many rows remain in each region group after the duplicates have been removed. In table 100 illustrated in FIG. 1, this operation results in the values: 2 for region N, 2 for region S, 1 for region E, and 1 for region W.
For accurate results, the duplicate elimination is performed after the group by function and before the aggregate operation. The duplicate elimination is performed after the group by function because the duplicate elimination requires the elimination of only those values that are duplicated within the same group. That is, a value is only a xe2x80x9cduplicatexe2x80x9d if it matches another value within the same group, regardless of whether the value matches values in other groups. Therefore, all duplicates cannot be identified until all rows that belong to a group have been grouped together.
The duplicate elimination is performed before the aggregate function because the aggregate function requires the aggregation of only those rows associated with nonduplicate distinct-key-column values. For example, if Q1 is executed without performing the distinct operation before the count operation, the results of the count operation would be: 3 for region N, 3 for region S, 1 for region E, and 1 for region W. Once the count operation has produced these results, it is not possible to perform duplicate elimination on those results to produce the correct results.
Parallelizing operations generally involves distributing tasks among slave processes. The set of all processes responsible for performing a particular stage of an operation is referred to as a xe2x80x9crow sourcexe2x80x9d. When an operation involves multiple tasks, the operation is often performed in stages using multiple row sources, where the results produced by one row source are provided as input to a subsequent row source.
Using conventional techniques, distinct aggregation operations are parallelized by dividing the operation into two stages and using two row sources. During the first-stage of the operation, the tables identified in the query are scanned by a first row source (a first set of slave processes). Each slave process in the first row source is assigned to scan a different table or portion of table. In Q1, only one table xe2x80x9cempxe2x80x9d is referenced, so each slave process in the first row source may be assigned to scan a different portion of the xe2x80x9cempxe2x80x9d table.
At the second-stage, a second row source receives and processes the rows produced by the first row source. Each slave process in the second row source corresponds to a group (or set of groups) formed by the xe2x80x9cgroup byxe2x80x9d statement. For example, the xe2x80x9cgroup byxe2x80x9d statement of Q1 groups by region. There are four distinct region values stored in table emp (N, S, E, W). Therefore, the second row source used to execute Q1 may include up to four slave processes (i.e. one slave process for each distinct region value).
After reading a row, each process in the first row source determines the group to which the row belongs, and sends the row to the slave process in the second row source that corresponds to that group. Consequently, each slave process in the second row source will receive the rows for the group assigned to it, regardless of which of the first row sources retrieved the rows. Thus, for example, the slave process in the second row source that is associated with region xe2x80x9cNxe2x80x9d receives rows 110, 116 and 122. Each slave process in the second row source eliminates duplicates from the rows it receives, and then counts how many non-duplicate rows remain in each group.
Unfortunately, this technique for parallelizing distinct aggregation operations has severe limitations. For example, the maximum degree of parallelism available during the second-stage of the process is limited to the number of groups formed by the xe2x80x9cgroup byxe2x80x9d statement. The number of groups thus formed may be significantly less than the number of processors available to participate in the operation. For example, in a system that includes 64 processors, 60 processors may remain idle while the remaining 4 are busy for an extended period of time while executing the four slave processes in the second row source.
Further, the fact that the degree of parallelism for the second row source is so limited has a substantial effect on the overall efficiency of the query processing because the bulk of the processing is performed by the second row source. As illustrated in the example given above, each process in the first row source (whose degree of parallelism is not limited) is merely responsible for reading rows and transmitting them to the appropriate slave process in the second row source. On the other hand, each process in the second row source is responsible for removing duplicates and executing the aggregation operation.
Based on the foregoing, it is clearly desirable to provide a mechanism and technique for executing distinct aggregate operations in parallel in a manner that is more scalable and efficient than current techniques.
Techniques are provided for executing distinct aggregation operations in a manner that is more scalable and efficient than prior techniques. For distinct aggregate operations that involve a single distinct-key column, a two stage parallelization technique is provided in which much of the processing responsibility is pushed down to the first-stage row source, whose degree of parallelism is not limited by the group-by values. Specifically, the processes in the first-stage row source form the rows they scan into groups, and eliminate duplicates from those groups. The groups are then passed to a second-stage row source where the duplicate elimination is completed, and the set aggregation is performed.
According to another aspect of the invention, a technique is provided for parallelizing distinct aggregation operations that involve multiple distinct-key columns, but do not require grouping. Much of the processing responsibility is pushed down to the first-stage row source, whose degree of parallelism is not limited by number of distinct values. Specifically, the processes in the first-stage row source split the rows they scan into xe2x80x9crow piecesxe2x80x9d, where each piece created from a row is associated with a different distinct-key column. The first-stage row source groups the row pieces based on the distinct-key column with which they are associated, and eliminates duplicates from those groups. The groups are then passed to a second-stage row source where the duplicate elimination is completed, and the set aggregation is performed.
According to another aspect of the invention, a three-stage technique is provided to parallelize aggregation operations that involve both grouping and multiple distinct-key columns. According to one embodiment, such queries are handled by splitting rows into as many pieces as there are distinct aggregates in the query, and processing the row pieces. During the first-stage, a set of slave processes scans the rows of the base tables and performs partial duplicate elimination. During the second-stage, a set of slave processes completes the duplicate elimination and performs partial set function aggregation. During the third-stage, a third set of slave processes completes the set aggregation to produce the results of the distinct aggregation operation.