1. Technical Field
This disclosure generally relates to database query optimizations, and more specifically relates to a query optimizer that rewrites a query to take advantage of multiple nodes and multiple network paths in a parallel computer system.
2. Background Art
Databases are computerized information storage and retrieval systems. A database system is structured to accept commands to store, retrieve and delete data using, for example, high-level query languages such as the Structured Query Language (SQL). The term “query” denotes a set of commands for retrieving data from a stored database. The query language requires that a particular data set be returned in response to a particular query.
Execution of a database query can be a resource-intensive and time-consuming process. A query optimizer attempts to optimize queries so that they make better use of system resources. To prevent an excessive drain on resources, many databases are also configured with query governors. A query governor prevents the execution of large and resource-intensive queries by referencing a defined threshold. If the cost of executing a query is predicted to exceed the threshold, the query is not executed.
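The threshold check performed by a query governor can be illustrated with a minimal sketch. The names `GovernorDecision` and `govern`, and the use of a single numeric cost estimate, are hypothetical and chosen only for illustration; an actual governor may weigh several resource estimates.

```python
from dataclasses import dataclass


@dataclass
class GovernorDecision:
    """Outcome of a governor check (hypothetical structure)."""
    allowed: bool
    reason: str


def govern(estimated_cost: float, threshold: float) -> GovernorDecision:
    """Refuse a query whose predicted cost exceeds the defined threshold.

    Both arguments are assumed to be in the same arbitrary cost unit.
    A cost exactly equal to the threshold is allowed, since the text
    only rejects queries predicted to exceed it.
    """
    if estimated_cost > threshold:
        return GovernorDecision(
            allowed=False,
            reason=f"estimated cost {estimated_cost} exceeds threshold {threshold}",
        )
    return GovernorDecision(allowed=True, reason="within threshold")
```

In this sketch the decision is made before execution, from the predicted cost alone, so a rejected query consumes no execution resources.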
Many large institutional computer users are experiencing tremendous growth of their databases. One of the primary means of dealing with large databases is to distribute the data across multiple partitions in a parallel computer system. The partitions over which the data is distributed can be logical or physical. Prior art query governors have limited features when used in parallel computer systems. The query governors do not consider network resources of multiple networks in a parallel system with a large number of interconnected nodes.
Massively parallel computer systems are one type of parallel computer system that has a large number of interconnected compute nodes. A family of such massively parallel computers is being developed by International Business Machines Corporation (IBM) under the name Blue Gene. The Blue Gene/L system is a scalable system in which the current maximum number of compute nodes is 65,536. The Blue Gene/L node consists of a single ASIC (application specific integrated circuit) with 2 CPUs and memory. The full computer is housed in 64 racks or cabinets with 32 node boards in each rack. The Blue Gene/L supercomputer communicates over several communication networks. The compute nodes are arranged into both a logical tree network and a 3-dimensional torus network. The logical tree network connects the compute nodes so that each node communicates with a parent and one or two children. The torus network logically connects the compute nodes in a three-dimensional lattice-like structure that allows each compute node to communicate with its six nearest neighbors in a section of the computer.
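The two topologies described above can be sketched in a few lines. The function names, the (x, y, z) coordinate scheme, and the heap-style node numbering used for the tree are assumptions made for illustration and are not taken from the Blue Gene/L design; the torus sketch also assumes each dimension has at least three nodes so that the six neighbors are distinct.

```python
def torus_neighbors(coord, dims):
    """Return the six nearest neighbors of a node in a 3-D torus.

    coord: (x, y, z) position of the node (hypothetical addressing).
    dims:  (dx, dy, dz) size of the torus in each dimension.
    The modulo wraps each axis around, which is what makes the
    lattice a torus rather than a plain mesh.
    """
    x, y, z = coord
    dx, dy, dz = dims
    return [
        ((x + 1) % dx, y, z),
        ((x - 1) % dx, y, z),
        (x, (y + 1) % dy, z),
        (x, (y - 1) % dy, z),
        (x, y, (z + 1) % dz),
        (x, y, (z - 1) % dz),
    ]


def tree_links(rank, node_count):
    """Return (parent, children) for a node in a logical binary tree.

    Assumes nodes are numbered 0..node_count-1 in heap order, so the
    root has no parent and every node has zero, one or two children.
    """
    parent = (rank - 1) // 2 if rank > 0 else None
    children = [c for c in (2 * rank + 1, 2 * rank + 2) if c < node_count]
    return parent, children
```

For example, in a hypothetical 4 x 4 x 4 torus the corner node (0, 0, 0) is adjacent to (3, 0, 0) through the wraparound link, so every node has the same number of neighbors regardless of its position.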
Database query optimizers have been developed that evaluate queries and determine how to best execute the queries based on a number of different factors that affect query performance. However, none of the known query optimizers rewrite a query or optimize query execution for queries on multiple networks. On parallel computer systems in the prior art, the query optimizer is not able to effectively control the total use of resources across multiple nodes with one or more networks. Without a way to more effectively optimize queries, computer systems administrators will continue to have inadequate control over database queries and their use of system resources.