1. Field of the Invention
This invention relates in general to database systems, and, in particular, to the determination of records with a specified number of largest or smallest values in a parallel database system
2. Description of Related Art
Databases are computerized information storage and retrieval systems. A Relational Database Management System (RDBMS) is a database management system (DBMS) which uses relational techniques for storing and retrieving data. Relational databases are organized into tables which consist of records and fields of data. The records are formally called tuples. A database will typically have many tables and each table will typically have multiple tuples and multiple fields. The tables are typically stored on random access storage devices (RASD) such as magnetic or optical disk drives for semi-permanent storage.
RDBMS software using a Structured Query Language (SQL) interface is well known in the art. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American National Standards Institute (ANSI) and the International Standards Organization (ISO). The SQL interface allows users to formulate relational operations on the tables either interactively, in batch files, or embedded in host languages, such as C and COBOL. SQL allows the user to manipulate the data.
One of the most common SQL queries executed by the RDBMS software is the SELECT statement. In SQL, the typical SELECT statement may comprise the format: xe2x80x9cSELECT  less than clause greater than  FROM  less than clause greater than  WHERE  less than clause greater than  GROUP BY  less than clause greater than  ORDER BY  less than clause greater than .xe2x80x9d The clauses generally must follow this sequence, but only the SELECT and FROM clauses are required. The result of a SELECT statement is a subset of data retrieved by the RDBMS software from one or more existing tables stored in the relational database, wherein the FROM clause identifies the name of the table or tables from which data is being selected. The subset of data is treated as a new table, termed the result table, which typically comprises a temporary table. In general, the items specified in the SELECT clause of the SELECT statement determine the fields that will be returned in the result table from the table(s) identified in the FROM clause.
The WHERE clause determines which records should be returned in the result table. Generally, the WHERE clause contains a search condition that must be satisfied by each record returned in the result table. The records that meet the search condition form an intermediate set, which is then processed further according to specifications in the SELECT clause. The search condition typically comprises one or more predicates, each of which specify a comparison between two values comprising fields, constants or correlated values. Multiple predicates in the WHERE clause are themselves typically connected by Boolean operators.
The GROUP BY clause determines which records should be returned in the result table. Generally, the GROUP BY clause contains a search condition that causes the records in an intermediate result set to be grouped according to the values in the field(s) specified by the GROUP BY clause. The GROUP BY clause follows the WHERE clause, if there is one, and is most commonly used when the SELECT clause contains one or more field functions.
The sequence in which the records of the result tables are presented can be specified by using the ORDER BY clause. The ORDER BY clause causes the records to be returned in the order of the field name specified by the ORDER BY clause. The ORDER BY clauses may be followed by the clauses DESC (descending) and ASC (ascending) to indicate the order of a particular set of records.
When retrieving the maximum and/or minimum values from a set of data stored in a table, users often include an ORDER BY clause in the SELECT statement. The maximum value is the largest value in the set of data and the minimum value is the smallest value in the set of data. For example, assume a set of data contains the following values: 1, 2, 3, 4, 5, 6, 7, 8, 9, and 10. The maximum value is 10 and the minimum value is 1.
Often times users may desire to identify records that contain a specified number of the largest or smallest values from a set of data stored in a table, instead of just computing the maximum and minimum values. The specified number of largest or smallest values are referred to as extrema values. Referring to the example above, assume a user desires to identify the two largest values and the two smallest values. The two largest values are 9 and 10, and the two smallest values are 2 and 1.
In an exemplary SQL request, the number of largest or smallest values is specified by defining the following: a value n that represents the number of largest or smallest values; an optional grouping expression; and an order specification (i.e., an ORDER BY clause) that specifies either an ascending or a descending order. An exemplary SQL request is shown below.
SELECT FIRST (10) stores, prod, sales
FROM sales_table
GROUP BY store
ORDER BY sales DESC.
In this example, the SELECT statement contains FROM, GROUP BY, and ORDER BY clauses. The SELECT and the FROM clauses identify the number of records, the fields of the table, and the table from which the records are to be retrieved. In particular, the number of records is 10, and the fields of the sales_table are stores, prod, and sales. The fields are grouped relative to the store field, and the fields are ordered in descending order relative to values of the sales field.
An alternative example is shown below:
SELECT store, prod, sales
FROM sales_table
GROUP BY store
QUALIFY RANK (sales DESC) less than =10.
This example is similar to the first example. The difference between the first and second example is that a QUALIFY RANK clause is used instead of an ORDER BY clause. The QUALIFY RANK clause is a clause that was introduced in the Teradata7 Version 2 Release 3 extended SQL, offered by NCR Corporation in Dayton, Ohio, to support the filtering of records based on function computations, such as RANK, Cumulative Totals, etc. Hence, QUALIFY RANK (sales DESC) less than =10 requests records that satisfy the qualification criterion RANK (sales DESC) less than =10. It is noted that in some cases, RANK may not cause the physical ordering of records. For example, the RANK values may be computed by ordering the records relative to the value of the sales field.
Another technique for determining n extrema values includes the following single processor large memory technique. Assume that a computer system is a single processor system that has an unlimited (or infinite) amount of memory. With this computer system, a user could determine the records containing the n extrema values, for each grouping expression, in one pass as follows: (1) track the current set of n extrema values; (2) read the next value from the next record; (3) compare the next value with the current set of n extrema values; and (4) update the current set of n extrema values with the new value, when the new value is larger than the nth largest value. The main problem with this technique is that it requires an unlimited amount of memory. In particular, a large amount of memory is required when a large number of grouping expressions exist. The large memory requirement makes this technique impractical in a multiuser, single processor system. The large memory requirement also makes this technique impractical in a multi-user, multi-processing unit environment of a parallel database system.
Like the single processor large memory technique discussed above, many of the traditional techniques for computing n extrema values require a large memory. These other traditional techniques frequently involve either (1) ordering an entire set of records (i.e., a table) relative to a specified value, and fetching only the records with the first n extrema values; or (2) when a rank feature is available in the database system, using the rank feature to rank all of the values, and returning the records with the first n ranking values (also referred to as the n extrema values). In both cases, the database system has to order all of the records. Consequently, the request processing is very inefficient and slow.
Some of the traditional systems also lack the ability to accurately rank duplicate records. When duplicate values exist, traditional systems may fail to calculate the n extrema values because several records may have the same rank. For example, if n equals 10 and the set of extrema values are as follows: {10, 9, 9, 9, 9, 9, 8, 8, 8, 8, 8, 8, 8, 7, 7, 5, 4, 4, 3, 2, 1}.
In some traditional systems, only the first ten records, {10, 9, 9, 9, 9, 9, 8, 8, 8, 8}, are produced. In these traditional systems, even if one million records had the same (or duplicate) value, the system would only return ten records.
Thus, there is a need in the art for an improved technique of computing the n extrema values that uses less memory and incorporates all duplicate records into the result.
To overcome the limitations in the prior art described above, and to overcome other limitations that will become apparent upon reading and understanding the present specification, the present invention discloses a method, apparatus, and article of manufacture for selecting one or more of the highest ranking records in a database system that has one or more processing units.
At each processing unit, a candidate list of records is generated in a memory space of the processing unit. Each record contains a field value. The generated candidate list of records is ranked based on the field value as each new candidate is added to the generated candidate list. When the memory space of the processing unit is insufficient, the ranked candidate list of records are transferred from the memory space of the processing unit to a data store. At the data store, a predefined number of extrema values are identified by using the transferred candidate list of records.