1. Technical Field
This disclosure generally relates to database query execution and optimization, and more specifically relates to query execution and optimization with autonomic error recovery from network failures in a parallel computer system of multiple nodes and multiple networks.
2. Background Art
Databases are computerized information storage and retrieval systems. A database system is structured to accept commands to store, retrieve and delete data using, for example, high-level query languages such as the Structured Query Language (SQL). The term “query” denominates a set of commands for retrieving data from a stored database. The query language requires the return of a particular data set in response to a particular query.
Many large institutional computer users are experiencing tremendous growth of their databases. One of the primary means of dealing with large databases is to distribute the data across multiple partitions in a parallel computer system. The partitions over which the data is distributed can be logical or physical.
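By way of illustration only, distributing data across partitions can be sketched as follows. This is a minimal sketch that assumes hash partitioning on a single key field, which is one common scheme and is not stated in this disclosure. The function names `partition_for` and `distribute`, the field name `customer_id`, and the partition count are hypothetical.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition index using a deterministic CRC32 hash."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def distribute(rows, key_field, num_partitions):
    """Place each row (a dict) into the partition selected by its key field."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        index = partition_for(str(row[key_field]), num_partitions)
        partitions[index].append(row)
    return partitions


# Hypothetical sample data: rows keyed by a customer identifier.
rows = [{"customer_id": i, "amount": 10 * i} for i in range(100)]
partitions = distribute(rows, "customer_id", 4)
```

Because the hash is deterministic, a row with a given key always lands in the same partition, so a query that filters on that key may need to visit only one partition.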
Massively parallel computer systems are one type of parallel computer system that has a large number of interconnected compute nodes. A family of such massively parallel computers is being developed by International Business Machines Corporation (IBM) under the name Blue Gene. The Blue Gene/L system is a scalable system in which the current maximum number of compute nodes is 65,536. The Blue Gene/L node consists of a single ASIC (application specific integrated circuit) with 2 CPUs and memory. The full computer is housed in 64 racks or cabinets with 32 node boards in each rack. The Blue Gene/L supercomputer communicates over several communication networks. The compute nodes are arranged into both a logical tree network and a 3-dimensional torus network. The logical tree network connects the compute nodes so that each node communicates with a parent and one or two children. The torus network logically connects the compute nodes in a three-dimensional lattice-like structure that allows each compute node to communicate with its six nearest neighbors in a section of the computer.
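The two network topologies described above can be sketched as follows. This is an illustrative sketch only, not a description of the Blue Gene/L hardware or software. It assumes that torus nodes are addressed by (x, y, z) coordinates with wraparound in each dimension, and that tree nodes are numbered in heap order with the root at index 0. The function names `torus_neighbors` and `tree_links` and the example dimensions are hypothetical.

```python
def torus_neighbors(coord, dims):
    """Return the six nearest neighbors of a node in a 3-D torus.

    Each dimension wraps around, so a node on the edge of the lattice
    is adjacent to the node on the opposite edge.
    """
    x, y, z = coord
    dx, dy, dz = dims
    return [
        ((x + 1) % dx, y, z),
        ((x - 1) % dx, y, z),
        (x, (y + 1) % dy, z),
        (x, (y - 1) % dy, z),
        (x, y, (z + 1) % dz),
        (x, y, (z - 1) % dz),
    ]


def tree_links(index, num_nodes):
    """Return (parent, children) for a node in a heap-ordered binary tree.

    The root has no parent (None); a node has zero, one, or two children
    depending on how many nodes exist.
    """
    parent = None if index == 0 else (index - 1) // 2
    children = [c for c in (2 * index + 1, 2 * index + 2) if c < num_nodes]
    return parent, children


# Assumed example: a small 4 x 4 x 4 torus section and a 5-node tree.
corner_neighbors = torus_neighbors((0, 0, 0), (4, 4, 4))
root_links = tree_links(0, 5)
```

In this sketch the corner node (0, 0, 0) still has six distinct neighbors because of the wraparound links, which is the property that distinguishes a torus from a plain mesh.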
In massively parallel computer systems such as the Blue Gene parallel computer system, recovering from hardware failures is important for utilizing the computer system efficiently. Recovering from a failure may allow a sophisticated application to continue to operate on different portions of the system, or at a reduced speed, and thereby avoid the total loss of data accumulated prior to the failure that would result from restarting the system.
Database query optimizers have been developed that evaluate queries and determine how best to execute the queries based on a number of different factors that affect query performance. In the related applications, a query optimizer rewrites a query or optimizes query execution for queries on multiple networks. On parallel computer systems in the prior art, the database and query optimizer are not able to effectively overcome a failure of a network while executing a query. Without a way to more effectively execute and optimize queries, multiple-network computer systems will continue to suffer from inefficient utilization of system resources to overcome network failures and process database queries.