Structured Query Language (SQL) is a standardized language for accessing and updating relational databases. SQL provides a mechanism for creating and maintaining tables, each of which contains rows and columns of information that are collectively assembled into a database. Ideally, the tables are xe2x80x9cnormalizedxe2x80x9d in that the structure of the tables avoids data redundancy and allows the resulting data model to be mapped to many different physical database designs. In order to avoid redundancy, yet still be able to display data from multiple tables, SQL provides a mechanism called a xe2x80x9cjoin.xe2x80x9d
In a join, rows from one table are combined with rows from another table. The rows selected to be combined are determined by a xe2x80x9cjoin condition,xe2x80x9d which causes a comparison of one column from each of the tables being joined. In a massively parallel processing system in which the rows of each of the tables are distributed among a large number of data storage facilities, the database system xe2x80x9credistributesxe2x80x9d one of the tables among the data storage facilities prior to performing the join. The redistribution involves retrieving the table from the data storage facilities and then distributing the rows of the table on the data storage facilities using one side of the join condition as the basis for the distribution. If the other table in the join is distributed using the other side of the join condition as the basis for its distribution, the join for each row will be performed using a single data storage facility.
In one particular type of join query, called an xe2x80x9caggregate join query,xe2x80x9d data in a column of one of the tables being joined is aggregated (i.e., summed, averaged, counted, etc.). Performing such a query normally requires all the joins to be performed in order for the aggregation processing to be done on the join results.
In general, in one aspect, the invention features a method for optimizing aggregate join queries to a database. The query specifies one or more data-to-be-aggregated columns (DAC), one or more explicit aggregation keys (EAK) and one or more join conditions. The query involves table T1. T1 includes the one or more data-to-be-aggregated columns and an aggregation key. All join conditions that involve T1 join the aggregation key of T1 to the unique primary index of another table that is involved in the query. T1 is joined to a table T2 in accordance with a join condition of the aggregate join query. The joining process includes sorting T1 on a sort key that is the aggregation key. Aggregation processing is combined into the step of sorting if T1 is a significant table. Aggregation processing is also combined into the step of sorting if T2 is a significant table and the first join condition is between the aggregation key of T1 and the unique primary index of T2.
Implementations of the invention may include one or more of the following. T1 can be an intermediate join result IJR. T1 can also be a database table specified in the query. The aggregation key of T1 can be the EAK of a significant table. The aggregation key of T1 can also be a new aggregation key derived from a significant table. Determining if a table is a significant table can include checking if only one table includes the one or more EAKs. If so, then that table is a significant table. Determining if a table is a significant table can include checking if the set of tables TS that includes all tables that contain at least one of the one or more EAKs, has a member table for which the query includes a join condition between that member table and the unique primary index of every other table in TS. If so, then that member table is a significant table.
In general, in another aspect, the invention features a database system in which an aggregate join query can be optimized. The query specifies one or more data-to-be-aggregated columns (DAC), one or more explicit aggregation keys (EAK) and one or more join conditions. The query involves table T1. T1 includes the one or more data-to-be-aggregated columns and an aggregation key. All join conditions that involve T1 join the aggregation key of T1 to the unique primary index of another table that is involved in the query. The database system includes a massively parallel processing system. That massively parallel processing system includes one or more nodes and a plurality of CPUs, with each of the one or more nodes providing access to the one or more CPUs. Also included are a plurality of processes each of the one or more CPUs providing access to one or more virtual processes with each process configured to manage data stored in one of a plurality of data-storage facilities. The rows of each of the tables specified in the query are distributed among the plurality of data-storage facilities based on respective primary indices for each table. The system also includes a database-management component configured to execute the query in a manner where T1 is joined to a table T2 in accordance with a join condition of the aggregate join query. The joining process includes sorting T1 on a sort key that is the aggregation key. Aggregation processing is combined into the step of sorting if T1 is a significant table. Aggregation processing is also combined into the step of sorting if T2 is a significant table and the first join condition is between the aggregation key of T1 and the unique primary index of T2.
In general, in another aspect, the invention features a computer program, stored on a tangible storage medium, for use in optimizing aggregate join queries to a database. The query specifies one or more data-to-be-aggregated columns (DAC), one or more explicit aggregation keys (EAK) and one or more join conditions. The query involves table T1. T1 includes the one or more data-to-be-aggregated columns and an aggregation key. All join conditions that involve T1 join the aggregation key of T1 to the unique primary index of another table that is involved in the query. The computer program includes executable instructions that cause the computer to join T1 and a table T2 in accordance with a join condition of the aggregate join query. The joining process includes sorting T1 on a sort key that is the aggregation key. The instructions cause the computer to combine aggregation processing into the step of sorting if T1 is a significant table. The instructions also cause the computer to combine aggregation processing into the step of sorting if T2 is a significant table and the first join condition is between the aggregation key of T1 and the unique primary index of T2.