1. Field of the Invention
The present invention generally relates to data processing and, more particularly, to executing queries against a partitioned database.
2. Description of the Related Art
Databases are computerized information storage and retrieval systems. A relational database management system (RDBMS) is a database management system (DBMS) that uses techniques for storing and retrieving data. The most prevalent type of database is the relational database, a tabular database in which data is defined so that it can be reorganized and accessed in a number of different ways.
Databases are typically partitioned to improve availability, performance, and scalability. Partitioning a database involves dividing the database or its constituent elements into distinct individual parts. For example, a database may be partitioned by building smaller separate databases, each with its own tables, indexes, transaction logs, etc. or by splitting a selected element, for example a field of a table. The database may be partitioned within a single server, or distributed or replicated across multiple servers. Therefore, database partitioning provides multiple benefits including scalability to support large databases, the ability to handle complex workloads, and increased parallelism.
When queries are run against a partitioned database, the query may be run against each partition. The results from each partition may then be integrated to provide a complete query result for the query. To further improve performance of querying a database, the query may not be run against one or more partitions which are known to not contain results for the query. For example, an underlying database may be partitioned based on location. The locations, for example, may be divided into 4 partitions, each partition being associated with data from one of the eastern states, western states, northern states, and southern states.
If a query containing a condition STATE=‘MAINE’ is run against the underlying database, the query need not be run against partitions containing data for southern and western states. Therefore, by eliminating the number of partitions against which a query is executed, the performance may be improved. However, even with elimination of partitions, the query may still be run against multiple partitions. For example, the above query may be executed against the northern states partition and the eastern states partition.
One problem with running a query against multiple partitions is that result sets from different partitions may include redundant results. For example, assume that the above query is run against the partitions containing data for the northern and eastern states in order to identify different age groups of individuals driving a particular automobile. Assume further that data related to ten distinct age groups is included with the underlying database and that the eastern states partition is slower than the northern states partition. In other words, determination of a result set from the eastern states partition takes significantly longer than determination of a result set from the northern states partition. Assume now that a first result set is quickly returned from the northern states partition having data for all ten age groups. Accordingly, a second result set returned from the eastern states partition would include redundant data, as all age groups are already identified in the first result set. Thus, execution of the query against the underlying database is inefficient because the complete result set for the above query is not returned to the user until the results from the slower partition, i.e., the eastern states partition, are available. Furthermore, a significant amount of time may be wasted while waiting for the slower partition to retrieve results. Therefore overall query throughput may be adversely affected.
Therefore, there is a need for an efficient technique for managing execution of queries against partitioned databases.