1. Field of the Invention.
This invention relates in general to database management systems performed by computers, and in particular, to the optimization of queries by parallel execution using replicated and partitioned tables.
2. Description of Related Art.
Computer systems incorporating Relational DataBase Management System (RDBMS) software using a Structured Query Language (SQL) interface are well known in the art. The SQL interface has evolved into a standard language for RDBMS software and has been adopted as such by both the American National Standards Institute (ANSI) and the International Organization for Standardization (ISO).
Achieving interactive response times for data-intensive and/or logic-intensive queries in decision support, on-line analytical processing, and data mining applications of an RDBMS is a key challenge for commercial database management systems. Parallel query execution is one of the most promising approaches to achieving this goal.
One method of achieving parallel query execution is through the exploitation of database replication and partitioning. The replicated portions or partitions of the database are known as distributions. Using these techniques, queries can be deconstructed into subtasks based upon the replication and/or partitioning of the database. These subtasks are executed by parallel instances of the RDBMS, wherein each subtask is executed by an instance that manages a distribution of the database. Typically, the results of these subtasks are merged for delivery to a requesting application.
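The deconstruct, execute, and merge pattern described above can be sketched as follows. This is a minimal illustration, not the method of the invention: the hash partitioning, the thread pool standing in for parallel RDBMS instances, and the names `partition_rows`, `subtask`, and `run_query` are all assumptions introduced for the example.

```python
from concurrent.futures import ThreadPoolExecutor

def partition_rows(rows, key, num_nodes):
    """Hash-partition rows into num_nodes distributions on the given key (assumed scheme)."""
    distributions = [[] for _ in range(num_nodes)]
    for row in rows:
        distributions[hash(row[key]) % num_nodes].append(row)
    return distributions

def subtask(distribution):
    """Work done by one instance: filter its local rows and compute a partial sum."""
    return sum(row["amount"] for row in distribution if row["amount"] > 10)

def run_query(distributions):
    """Run one subtask per distribution in parallel, then merge the partial results."""
    with ThreadPoolExecutor(max_workers=len(distributions)) as pool:
        partials = list(pool.map(subtask, distributions))
    return sum(partials)

# Hypothetical table: (cust, amount) pairs spread over three distributions.
rows = [{"cust": c, "amount": a} for c, a in [(1, 5), (2, 20), (3, 30), (4, 8), (5, 15)]]
distributions = partition_rows(rows, "cust", 3)
total = run_query(distributions)  # equivalent to SUM(amount) WHERE amount > 10
```

Because each subtask touches only the rows of its own distribution, no rows move between instances; only the small partial results are merged.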
Optimization choices regarding how queries are deconstructed into subtasks are determined by the distributions of the database. Often, the database has to be replicated or partitioned dynamically to satisfy the requirements of a given query operation. Such dynamic replication or partitioning is an expensive operation and should be optimized or avoided altogether.
There is a need in the art for general query optimization strategies that take into account prior replication or partitioning as a general distribution property of database tables and derived tables. Specifically, there is a need in the art for techniques that determine when parallel RDBMS operations can be carried out without data movement based on prior data movements.
To overcome the limitations in the prior art described above, and to overcome other limitations that will become apparent upon reading and understanding the present specification, the present invention discloses a method, apparatus, and article of manufacture for optimizing database queries. The query is analyzed to determine whether at least a portion of the query can be evaluated using a plurality of parallel operations without data redistribution. If so, then the most efficient query execution plan that uses these parallel operations is generated and executed.
Thus, it is an object of the present invention to take advantage of data that was previously replicated or partitioned across a plurality of nodes in the computer system. The data may have been distributed when a table was created, or redistributed as a result of a dynamic operation.
In addition, it is an object of the present invention to analyze a query by taking into account a distribution property of a data stream for an operation of the query. The distribution property describes a set of nodes that may contain tuples of the data stream, a distribution function used for assigning the tuples to nodes, and a distribution key to which the distribution function is applied.
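The three components of a distribution property named above can be pictured as a small record. The sketch below is illustrative only: the class name `DistributionProperty`, the field names, and the `hash_distribution` function are hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple

def hash_distribution(key_values: Tuple, nodes: Tuple[int, ...]) -> int:
    """Assumed distribution function: map key values to one of the nodes by hashing."""
    return nodes[hash(key_values) % len(nodes)]

@dataclass(frozen=True)
class DistributionProperty:
    nodes: Tuple[int, ...]       # nodes that may contain tuples of the data stream
    function: Callable[[Tuple, Tuple[int, ...]], int]  # assigns tuples to nodes
    key: Tuple[str, ...]         # columns to which the distribution function is applied

    def node_for(self, row: Dict) -> int:
        """Return the node that holds the given tuple under this property."""
        key_values = tuple(row[column] for column in self.key)
        return self.function(key_values, self.nodes)

# Hypothetical stream partitioned on cust_id across four nodes.
prop = DistributionProperty(nodes=(0, 1, 2, 3), function=hash_distribution, key=("cust_id",))
node = prop.node_for({"cust_id": 42, "amount": 10})
```

Two tuples with equal distribution key values always map to the same node, which is the fact an optimizer relies on when deciding whether an operation can run without moving data.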
It is also an object of the present invention to add an operator to a query execution plan (QEP) to dynamically change the distribution properties of the data streams in response to distribution requirements of operations within the QEP.
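One way to picture such an operator is as a step that is inserted only when the current distribution of a stream does not satisfy what the consuming operation requires. The sketch below assumes hash partitioning and represents a distributed stream as a dictionary from node to rows; the names `repartition` and `ensure_distribution` are hypothetical.

```python
def repartition(streams, key, target_nodes):
    """Move every row to the target node chosen by hashing the new key (assumed scheme)."""
    out = {node: [] for node in target_nodes}
    for rows in streams.values():
        for row in rows:
            node = target_nodes[hash(row[key]) % len(target_nodes)]
            out[node].append(row)
    return out

def ensure_distribution(streams, current_key, required_key, nodes):
    """Return (streams, moved). Data is moved only if the requirement is not already met."""
    if current_key == required_key and set(streams) == set(nodes):
        return streams, False  # requirement already satisfied: no data movement
    return repartition(streams, required_key, nodes), True

# Hypothetical stream, currently hash-partitioned on column "a" over nodes 0 and 1.
nodes = (0, 1)
streams = {
    0: [{"a": 0, "b": 1}, {"a": 2, "b": 3}],
    1: [{"a": 1, "b": 1}, {"a": 3, "b": 2}],
}
on_b, moved_b = ensure_distribution(streams, "a", "b", nodes)  # requirement differs: repartition
on_a, moved_a = ensure_distribution(streams, "a", "a", nodes)  # requirement met: stream reused
```

The second call returns the input unchanged, which illustrates why tracking prior distributions lets the expensive redistribution step be skipped.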
Another object of the present invention is to generate efficient QEPs for parallel execution by having basic operators understand how to handle input streams with replicated or partitioned distributions. This requires that the basic operators understand when their operations can be performed locally. It also requires that the basic operators compute the distribution property of the stream produced by their operations.
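As an illustration of both requirements, the sketch below decides whether an inner equi-join can be performed locally and, if so, reports the distribution of the join output. The rules shown are simplifying assumptions for the example (matching partitioning on the join columns, or one input replicated on every node of the other), and the names `Dist` and `local_join_output` are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Set, Tuple

@dataclass(frozen=True)
class Dist:
    nodes: FrozenSet[int]
    function: str                 # "hash" or "replicated" (assumed labels)
    key: Tuple[str, ...] = ()     # distribution key; empty for replicated streams

def local_join_output(left: Dist, right: Dist,
                      join_pairs: Set[Tuple[str, str]]) -> Optional[Dist]:
    """Return the output distribution if the inner join is local, else None.

    join_pairs holds the (left_column, right_column) equality predicates.
    """
    # A replicated input is present on every node of the other input.
    if right.function == "replicated" and left.nodes <= right.nodes:
        return left
    if left.function == "replicated" and right.nodes <= left.nodes:
        return right
    # Both partitioned: same nodes, same function, keys equated pairwise by the join.
    if (left.function == right.function != "replicated"
            and left.nodes == right.nodes
            and len(left.key) == len(right.key) > 0
            and all(pair in join_pairs for pair in zip(left.key, right.key))):
        return left
    return None  # a redistribution would be needed before the join

all_nodes = frozenset({0, 1, 2})
orders = Dist(all_nodes, "hash", ("cust_id",))
customers = Dist(all_nodes, "hash", ("id",))
nation = Dist(frozenset({0, 1, 2, 3}), "replicated")

colocated = local_join_output(orders, customers, {("cust_id", "id")})
mismatch = local_join_output(orders, customers, {("order_id", "id")})
with_replica = local_join_output(orders, nation, {("nation_id", "id")})
```

Here `colocated` and `with_replica` both carry the distribution of `orders`, while `mismatch` is `None`, signalling that one input would first have to be redistributed. Outer joins and other operators need additional rules that this sketch does not attempt to cover.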
It is yet another object of the present invention to provide specific optimization techniques for joins, aggregations, subquery evaluations, set operations, error checking for scalar subselects, and table function access.