1. Technical Field
This invention relates to relational database systems. More specifically, the invention relates to estimating the size of a Group-By operation in a relational database.
2. Description Of The Prior Art
Relational database systems store large amounts of data, including business data that can be analyzed to support business decisions. Typically, data records in a relational database management system in a computing system are maintained in tables, each of which is a collection of rows all having the same columns. Each column maintains information on a particular type of data for the data records that make up the rows. Tables that are accessible to the operator are known as base tables, and tables that store data describing base tables are known as catalog tables. The data stored in the catalog tables is not readily visible to an operator of the database. Rather, the data stored in the catalog tables is meta-data. In the case of a database, the meta-data stored in the catalog tables describes operator-visible attributes of the base tables, such as the names and types of columns, as well as the statistical distribution of column values. Typically, a database includes both catalog tables and base tables. The catalog tables and base tables function in a relational format to enable efficient use of data stored in the database.
A relational database management system uses relational techniques for storing, manipulating, and retrieving information, and is further designed to accept commands to store, retrieve, and remove data. Structured Query Language (SQL) is a commonly used and well known example of a command set utilized in relational database management systems, and shall serve to illustrate a relational database management system. An SQL query often includes predicates, also known as user-specified conditions. The predicates are used to limit query results. One common operation in an SQL query is a Group-By operation, where data is segmented into groups and aggregate information is derived for these groups. The Group-By operation partitions a relation into non-overlapping sets of rows from one or more tables, and then computes aggregate values separately over each set. The number of results produced by a Group-By operation depends on the number of non-overlapping sets of rows, which in turn depends on the number of columns of the Group-By operation.
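By way of illustration only, the partitioning behavior of a Group-By operation can be sketched in Python as follows. The sample rows, the column layout, and the helper name group_by_sum are hypothetical and are not taken from any particular database system. The sketch shows that the number of result rows equals the number of non-overlapping sets, and that this number can change when a column is added to the grouping.

```python
from collections import defaultdict

# Hypothetical sample rows: (region, product, amount)
rows = [
    ("east", "pen", 10),
    ("east", "ink", 5),
    ("west", "pen", 7),
    ("east", "pen", 3),
    ("west", "pen", 1),
]

def group_by_sum(rows, key_indexes, value_index):
    """Partition rows by the grouping columns and sum one column per set."""
    groups = defaultdict(int)
    for row in rows:
        # Rows with equal values on all grouping columns fall in the same set.
        key = tuple(row[i] for i in key_indexes)
        groups[key] += row[value_index]
    return dict(groups)

# Grouping on one column (region) yields one result row per region.
by_region = group_by_sum(rows, [0], 2)

# Grouping on two columns (region, product) yields one result row
# per distinct (region, product) combination.
by_region_product = group_by_sum(rows, [0, 1], 2)

print(len(by_region), by_region)
print(len(by_region_product), by_region_product)
```

In this sample, grouping on region alone produces two result rows, while grouping on region and product produces three.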
In most database systems, a cost-based query optimizer uses query predicates to estimate resource consumption and memory requirements in determining the most efficient query execution plan. Because resource consumption and memory requirements depend on the number of rows that need to be processed, knowledge of the number of rows resulting from each sequence of operators in a query plan is important. Such operators may include: scan, which looks at one table and provides a stream of rows from the table; join, which combines two streams of rows that have been scanned into one stream; equal join, which only joins rows together that satisfy an equality condition; group, which segments rows from an input stream and puts aggregated rows into an output stream; and sort, which sorts an input stream according to user specification to produce an output stream in order. A grouping operation gathers together rows having the same value on specified columns to produce a single row. Accordingly, a Group-By operation needs to look at all input rows before producing a result.
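For illustration, the operators described above can be sketched as Python generators that pass streams of rows to one another. The table contents, function names, and the choice of a hash-based equal join are assumptions made for this sketch and do not represent the implementation of any particular query engine. The sketch shows that the group operator must consume its entire input stream before it emits its first output row.

```python
from collections import defaultdict

def scan(table):
    """Scan: provide a stream of rows from one table."""
    for row in table:
        yield row

def equal_join(left, right, left_col, right_col):
    """Equal join: combine rows whose join columns hold equal values."""
    index = defaultdict(list)
    for r in right:
        index[r[right_col]].append(r)
    for l in left:
        for r in index.get(l[left_col], []):
            yield l + r

def group_count(stream, col):
    """Group: count rows per distinct value of one column."""
    counts = defaultdict(int)
    for row in stream:
        counts[row[col]] += 1
    # No output row is produced until every input row has been seen.
    for key, n in counts.items():
        yield (key, n)

def sort(stream, col):
    """Sort: order an input stream on one column."""
    return iter(sorted(stream, key=lambda row: row[col]))

# Hypothetical base tables: orders(order_id, region), items(order_id, product)
orders = [(1, "east"), (2, "west"), (3, "east")]
items = [(1, "pen"), (1, "ink"), (2, "pen"), (3, "pen")]

# Plan: scan both tables, join on order_id, count items per region, sort.
plan = sort(group_count(equal_join(scan(orders), scan(items), 0, 0), 1), 0)
result = list(plan)
print(result)
```

The number of rows flowing out of each operator in such a plan is the quantity that a cost-based optimizer would need to estimate.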
Intermediate Group-By results are stored in memory. Memory requirements for a Group-By operation are determined by an estimate of the result size of the Group-By operation. Accurate estimation of the result size of a Group-By operation is therefore important in estimating the memory requirement of the operation. Failure to allocate sufficient memory for the Group-By operation will cause the excess to overflow to disk, which will reduce the efficiency of query execution. However, if more memory than necessary is allocated to the Group-By operation, the amount of memory available for other concurrent operations will be reduced, thereby reducing the efficiency of the entire system. Accordingly, increased accuracy in estimating the result size of a Group-By operation will improve the usage of available memory and thereby increase the overall efficiency of the system.
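The trade-off described above can be illustrated with a simple arithmetic sketch. The function names, the fixed per-group byte cost, and the group counts are hypothetical, and the sketch assumes that memory is allocated in direct proportion to the estimated number of groups, which is a simplification.

```python
def groupby_memory_bytes(group_count, bytes_per_group):
    """Memory needed to hold one intermediate result row per group."""
    return group_count * bytes_per_group

def assess_allocation(estimated_groups, actual_groups, bytes_per_group):
    """Compare memory allocated from the estimate with memory actually needed."""
    allocated = groupby_memory_bytes(estimated_groups, bytes_per_group)
    needed = groupby_memory_bytes(actual_groups, bytes_per_group)
    if needed > allocated:
        # Underestimate: the shortfall overflows to disk.
        return ("spill_to_disk", needed - allocated)
    # Overestimate: the surplus is withheld from concurrent operations.
    return ("fits", allocated - needed)

# Hypothetical figures: 64 bytes per group.
underestimate = assess_allocation(1000, 5000, 64)
overestimate = assess_allocation(5000, 1000, 64)
print(underestimate)
print(overestimate)
```

Under these assumed figures, an estimate of 1000 groups against 5000 actual groups leaves 256000 bytes to overflow to disk, while the reverse error reserves 256000 bytes that go unused.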
There is therefore a need for an efficient and accurate method of estimating a result size of a Group-By operation in order to more accurately predict system memory requirements.