A database system (database) is a set of computer programs used by a user to store and retrieve data. Such data is often limited to text (letters, numerals, symbols, and other characters), but may include any data that may be stored by a computer. Most databases store data in tables. A table is a series of rows, also called records. Each row contains data about a particular thing, such as a merchandise order. The rows are divided into columns. The intersection of a row and column is referred to as a field. Each column specifies a particular type of data that is contained in each field of the column. Each field contains the data of the particular type for the intersecting row and column. Databases typically return data from a table one row at a time, though it is common for them to actually retrieve data in larger quantities, reflecting the probability that they will be requested to return succeeding rows from the same table. This technique is called buffering.
FIG. 1 is a table diagram showing the contents of a sample Orders table containing data about each of several sales orders. The Orders table 100 has seven rows, each containing information about a different sales order. The Orders table has columns as follows: a Division column 101 containing the name of the division that took each order, a Salesperson column 102 containing the last name of the salesperson who took each order, a Sales Price column 103 containing the price for each order, and an Order Date column 104 which contains the date on which each order was placed. The Orders table also contains a bookmark column 110, which contains a value uniquely identifying each row in the table that can be used to quickly locate and retrieve the row. Such a value is known as a bookmark. It should be appreciated the size of the Orders table has been severely constrained to provide a manageable example. It is common for actual tables of this type to contain thousands, or even millions, of records, each 200 bytes or more long.
Most databases also have a query processor to manipulate data stored in one or more tables in response to instructions, called "queries," from users. One way in which a query processor can manipulate data stored in a table is to aggregate the data. Aggregation is an operation in which the query processor generates a result table that contains the results of a computation performed on the data in a source table, such as the Orders table. An aggregation is defined by specifying a source table, and, within the source table, an aggregated column and a grouping column. The result table contains columns corresponding to the aggregated column and grouping column of the source table. The result table contains one row for each unique value in the grouping column of the source table. Each row contains the unique grouping column value and the aggregate of all of the aggregated column values for the rows of the source table having the unique grouping column value. Aggregates are formed using aggregation functions such as "sum, " "average," "standard deviation," "number," "minimum," or "maximum."
As an example of aggregation, the Sales Price column 103 of the Orders table may be aggregated over the Division column 101 using the "sum" aggregation function. In this aggregation, the Division column 101 is the grouping column, and the Sales Price column 103 is the aggregated column, which is aggregated using the "sum" function--that is, for each unique value of the Division column, all of the Sales Price column values are added. In Structured Query Language (SQL), this aggregation is expressed as follows: EQU SELECT sum([Sales Price]) FROM [Orders] GROUP BY [Division]
FIG. 2 is a table diagram showing the result table formed by the sample aggregation. The result table 200 contains a Division column 201 and a sum of Sales Price column 203, as well as a bookmark column 210, which contains a value uniquely identifying each row in the result table that can be used to quickly locate and retrieve the row. The result table contains one row for each distinct value in the division column 101 of the source table. For each row, the Division column 201 field contains the name of one of the divisions, and the sum of Sales Price column 203 field contains the sum of the Sales Price column 103 field values for all of the rows of the source table having that division value. For instance, the first row of the result table contains the division name "retail" in the Division column 201 field, and the value "$134.00" in the sum of Sales Price column 203 field--that is, the sum of sales prices from the first, fifth, and sixth rows of the source table, each of which has the value "retail" in the Division column 101 field.
The conventional technique for performing aggregations such as the one described above involves reading the entire source table, which can take a considerable period of time for source tables of typical sizes. According to the conventional aggregation technique, the query processor causes the database to return each row of the source table in succession. For each row of the source table, the query processor first reads the contents of the grouping column field from the returned row and identifies the corresponding row of the result table. The query processor then reads the contents of the aggregated column field from the returned row and aggregates these contents into the aggregated column field of the identified row of the result table. These steps are repeated for every row of the source table. Conventional techniques that read every row of the source table are often impractical, since reading an entire data table usually imposes a large demand on disk access services, resulting in relatively long execution time for the aggregation.