Relational database systems are a type of database or database management system that stores information in tables--rows and columns of data--and conducts searches by using data in specified columns of one table to find additional data in another table. Typically, the rows of a table represent records (collections of information about separate items) and the columns represent fields (particular attributes of a record). An example of a relational database system is the SQL (Structured Query Language) Server database system manufactured and sold by Microsoft Corporation.
A particular type of query that is frequently used in relational database systems is to group records according to a particular parameter. A query of this type is often referred to as a "GROUP BY" query. As an example, suppose an administrator wants to tally a billing total for a particular customer from a number of invoices. Each record contains a customer identification number and a dollar amount. The administrator might formulate a GROUP BY query that groups the invoices according to customer ID and then subsequently tallies the dollar amounts on the invoices for each customer.
The data records are commonly stored on disk arrays or other forms of non-volatile memory. Queries performed on relational databases, such as the GROUP BY query, can be more easily accommodated if all of the data records are loaded into volatile memory (i.e. RAM) for processing. However, relational databases often contain large amounts of data, which far surpasses the volatile memory resources. As a result, records are handled in volatile memory in batch, and then written to disk as they are processed.
FIG. 1 shows a traditional approach to evaluating a GROUP BY query, which uses sorting and aggregation techniques. A data stream 20 has a series of data records that represent invoices. Each data record has a customer ID (the top number) and a dollar amount (the bottom number). For convenience, suppose that only two records can fit into volatile memory at a time. As a batch 22 of two records is loaded into volatile memory, the query process sorts the two records according to customer ID so that the highest numbered customer ID is first and the lowest numbered customer ID is second. The sorted batch, usually called a run, is then written to the storage disks. The process is repeated for each batch of two records to produce the data set 24 storing the runs.
Next, the query process reads the sorted runs from disk and merges them to produce the data set 26. The records in the merged data set 26 are now sorted according to customer ID. The query process then aggregates records with like customer IDs. Since the GROUP BY query is interested in a final billing total for each customer based on one or more invoices, all records having the same customer number can be aggregated and represented as a single record having the customer ID and a dollar value equal to the sum of the amounts in the individual records. Here, the two records for customer ID of 60 are combined into a single record having the customer ID of 60 and a tallied dollar amount of $86 (i.e., 54+32). Similarly, the two records for customer ID of 52 are also combined into a single record. The aggregation phase produces a data set 28.
Notice that the amount of data handled throughout the query process remains essentially constant until the final aggregation stage. It would be beneficial to reduce the amount of data earlier in the process. One prior art technique for achieving this goal is known as "early aggregation," in which like data records found in initial batches are combined early on, before the records are sorted and merged.
FIG. 2 shows a conventional query process, which utilizes an early aggregation technique. Batch 30 of the data stream 20 contains two records with the same customer ID of 60. Using an early aggregation process, the records are combined in volatile memory to form a single record with a summed dollar value. The result of the early aggregation process is a reduced data set 32, which has one less record than stream 20. The reduced data set 32 is then processed according to the same sorting, merging and aggregating processes described above for FIG. 1 to produce data sets 34, 36, and 38, respectively.
Early aggregation reduced the number of data records handled during the query process. In this simple example, only one record is eliminated early on; however, significant savings can be achieved when processing large numbers of data records.
A separate technique that is often used in large-scale queries is known as "hash partitioning." This technique involves hashing the data records according to a query parameter to put the records into smaller work files. The early aggregation, sorting, and merging can then be applied to the smaller work files.
FIG. 3 shows a conventional query process, which uses in part a hash partitioning process to place the records in smaller work groups. In FIG. 3, the data records are partitioned into one of five buckets according to the last digit in the customer ID. For instance, bucket 0, 5 is for all records in which the last digit of the customer ID is a zero or five. Each bucket has a dedicated piece of memory sized to hold two records.
The records are placed in the appropriate buckets according to the hashing function. Record 52, 10 is placed in bucket 2, 7 because the customer ID of 52 has a last digit of 2; record 77, 44 is placed in bucket 2, 7; and record 65, 12 is placed in bucket 0, 5. The next record 52, 30 has the same customer ID as the first record. An early aggregation process can be used at the bucket level to aggregate this record with the earlier record.
FIG. 4 shows the result of placing the record 52, 30 into bucket 2, 7. The records 52, 10 and 52, 30 are combined to form a single record 52, 40, which remains in the bucket 2, 7. Similarly, the next two data records 60, 54 and 60, 32 are combined in bucket 0, 5 to form a single record 60, 86. Notice that both buckets 0, 5 and 2, 7 are full. When a next record is received that cannot be aggregated with an existing record, an overflow condition is reached. To free up memory space in the bucket, one of its records is written to an output buffer. When the output buffer becomes full, its records are posted to a temporary file on the storage disk. Typically, the bucket memory operates on a first in, first out (FIFO) policy, and hence the first data record written into the memory is the one selected to be written out to the output buffer when the memory becomes full.
FIG. 5 shows the effect of adding a record to each of the buckets 0, 5 and 2, 7, thereby causing an overflow condition for each bucket. The record 65, 12 from bucket 0, 5, along with the record 52, 40 from bucket 2, 7, are written to the output buffer. The records remain in the output buffer until that buffer is full. Once full, the entire contents of the output buffer file are flushed to a temporary file on the storage disk to free the output buffer for more overflow records.
FIG. 6 shows an alternative construction in which an output buffer and associated temporary file on disk are allocated for each bucket. In this structure, the overflow records are posted to the output buffer associated with the bucket. As the output buffer for a corresponding bucket becomes full, its contents are written out to the associated temporary file on disk.
When the system reaches the end of the input data stream, the system repeats the aggregation process on the temporary files to further combine data records. The process is completed when all combinable data records are aggregated in main memory to produce a final output.
While these conventional techniques are effective, they do not fully capitalize on the benefits of early aggregation or efficiently use the allocated memory within the partitions. The inventor has developed a query process and data structure that overcomes these drawbacks.