The present invention relates to a database management system and, more particularly, to a database processing method suited to parallel query processing in a relational database management system.
A database management system (hereinafter abbreviated to DBMS), particularly a relational DBMS, processes a query expressed in a non-procedural database language, decides an internal processing procedure, and executes the query according to that procedure. The database language SQL, specified in Database Language SQL ISO 9075:1989, is widely used. The main conventional query processing methods are a method that decides a single internal processing procedure on the basis of predefined rules, and a method that uses cost evaluation based on various statistical information to select an optimum procedure from a plurality of candidate processing procedures. In the former, the load for generating the processing procedure is small, but the uniformly set rules are not always appropriate, and the selected internal processing procedure is not necessarily optimal.
The latter manages various statistical information, generates a plurality of candidate processing procedures, and calculates the load for cost evaluation for each of those procedures so as to select an optimum processing procedure. A technique obtained by combining the above two methods is indicated in, for example, Satoh, K., et al., “Local and Global Optimization Mechanism for Relational Database”, Proc. VLDB, 1985, pp. 405-417. According to the technique indicated in Satoh et al., the processing procedure is decided by inferring the amount of data to be processed from the query condition.
In a large number of DBMSs, the query process is implemented in two phases: the query analysis process and the query execution process. For example, when a query is embedded in an application program written in a host language such as COBOL or PL/I, the query analysis process is performed on the embedded query before the application program is executed, and an internal processing procedure is generated in executable form. The query process according to this internal processing procedure is executed when the application program is executed. In most cases, the retrieval condition expression in the query contains a variable of the host language. A constant is substituted for this variable when the internal processing procedure obtained from the query analysis process is executed, that is, when the query process is executed. The optimum processing procedure can therefore differ according to the value substituted for the variable at execution time, so a processing procedure obtained beforehand by the query analysis process is not always optimum. To solve this problem, a technique is known in which a plurality of processing procedures are generated beforehand during the query analysis process and one of them is selected according to the value substituted for the variable when the query process is executed. Such a technique is indicated in, for example, U.S. Pat. No. 5,091,852 or Graefe, G., et al., “Dynamic Query Evaluation Plans”, Proc. ACM-SIGMOD, 1989, pp. 358-366.
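The selection of a processing procedure at execution time can be illustrated by the following minimal sketch. It is not taken from the cited references. The sample statistics, the two candidate procedures, and their cost formulas are hypothetical values chosen only to show the idea: candidates are prepared at analysis time, and the cheapest one is picked once the host variable is bound.

```python
from bisect import bisect_right

# Hypothetical statistics: a sorted sample of column values (assumption).
SAMPLE = sorted([1, 2, 2, 3, 5, 8, 13, 21, 34, 55])
TABLE_ROWS = 1_000_000  # hypothetical table size

def estimate_selectivity(bound_value):
    """Estimated fraction of rows satisfying `column <= bound_value`."""
    return bisect_right(SAMPLE, bound_value) / len(SAMPLE)

# Candidate procedures generated at analysis time. Each maps an estimated
# selectivity to a cost; the cost formulas are illustrative assumptions.
CANDIDATE_PROCEDURES = {
    "index_scan": lambda sel: 4.0 * sel * TABLE_ROWS,  # cost grows with matches
    "table_scan": lambda sel: 1.0 * TABLE_ROWS,        # reads every row once
}

def select_procedure(bound_value):
    """Pick the cheapest candidate once the host variable has a value."""
    sel = estimate_selectivity(bound_value)
    return min(CANDIDATE_PROCEDURES,
               key=lambda name: CANDIDATE_PROCEDURES[name](sel))

if __name__ == "__main__":
    for value in (1, 34):
        print(value, select_procedure(value))
```

Under these assumed cost formulas, a highly selective value favors the index scan and a weakly selective value favors the table scan, which is the situation that motivates generating several procedures beforehand.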
Users have recently come to demand a parallel database system that scales with growth in transaction volume and database size, growth that exceeds the improvement in the CPU performance of computer systems and in the storage capacity of disk units. The performance requirements that users place on database systems are support for more than tens of thousands of concurrent users, retrieval transactions over terabytes of data, and a guaranteed response time that is not proportional to the table size. Together with the recent reduction in hardware cost, parallel database systems have attracted a great deal of attention as a way to meet these requirements. The parallel database system is described in, for example, DeWitt, D., et al., “Parallel Database System: The Future of High Performance Database Systems”, CACM, Vol. 35, No. 6, 1992, pp. 85-98. In the parallel database system, a plurality of processors are tightly or loosely coupled with each other, and the database process is distributed to these processors statically or dynamically. In each node (a processor, or a pair of a processor and a disk unit), database operations are executed in parallel or in a pipelined manner. Even in such a parallel processing system, the processing procedure can be selected in each node by applying the aforementioned technique.
Generally, in a parallel database system, the response performance improves as the parallelism increases. However, when the parallelism is increased excessively, problems such as increased overhead or increased transaction response time may arise. It is therefore important to set a moderate parallelism. In a conventional parallel database system, however, no criterion is defined for deciding the number of nodes to be used for database operations, so it is difficult to obtain an appropriate parallelism and to realize an optimum load distribution. In addition, the data used for database operations is stored separately in each node. If the amount of data stored in each node is uneven when database operations are performed in a pipelined manner, the processing time in each node is biased and the pipeline operation cannot be performed smoothly.
An object of the present invention is to eliminate the aforementioned difficulties in a conventional parallel database system and to provide a database management system and a database processing method for realizing a quicker query process.
The database management system of the present invention has a plurality of nodes for executing the database process in a suitable form and is structured so that these plurality of nodes are connected to other nodes via a network. The plurality of nodes include at least one distribution node having a storage means of distributing and storing the database to be queried and a distribution means of retrieving information from the above storage means and distributing the retrieved information to other nodes. The plurality of nodes also include at least one join node having a sorting means of sorting information distributed from the distribution node, a merge means of merging the plurality of sorted information, if any, and a join means of joining a query on the basis of the merged information.
Furthermore, the plurality of nodes include at least one decision management node having an analysis means of receiving a query, analyzing the query, and generating the query processing procedure, a decision means of deciding the distribution nodes and join nodes for performing the execution process on the basis of the query analysis result of the above analysis means, and an output means of outputting the result for the query obtained from the join node. The decision means of the decision management node desirably decides the distribution node on the basis of the query analysis result of the analysis means, calculates the expected processing time in the distribution node, and decides the join node on the basis of this processing time.
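As a rough illustration of deciding the join nodes from the expected processing time in the distribution nodes, the following sketch chooses a join node count so that the join stage takes about as long as the distribution stage. The balancing rule, the per-row cost parameters, and the function name are assumptions made for this example. The text above states only that the join node is decided on the basis of the expected processing time.

```python
import math

def decide_join_node_count(rows_per_distribution_node,
                           scan_cost_per_row,
                           join_cost_per_row,
                           max_join_nodes):
    """Return a join node count that keeps pace with the distribution stage.

    Assumed rule (not from the source): distribution nodes run in parallel,
    so their expected time is set by the most heavily loaded node. The join
    work is then spread over enough join nodes to finish in about that time.
    """
    distribution_time = max(rows_per_distribution_node) * scan_cost_per_row
    total_join_work = sum(rows_per_distribution_node) * join_cost_per_row
    if distribution_time == 0:
        return 1  # nothing to retrieve; one join node is enough
    needed = math.ceil(total_join_work / distribution_time)
    # Never use fewer than one node or more than are available.
    return max(1, min(needed, max_join_nodes))

if __name__ == "__main__":
    print(decide_join_node_count([1000, 1000, 2000], 1.0, 2.0, 8))
```

With three distribution nodes expected to retrieve 1000, 1000, and 2000 rows, and join work assumed to cost twice as much per row as retrieval, the sketch selects four join nodes. Capping the result at the number of available nodes reflects the concern that excessive parallelism increases overhead.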
The decision means distributes retrieval information equally to each join node on the basis of the expected retrieval information amount in the decided distribution node. Each of the distribution nodes decided by the decision means retrieves information from the storage means on the basis of the query analysis result and distributes the information to another node. The join node inputs information distributed from the distribution node one by one and processes each inputted information. The distribution node and join node process information independently. Each of the join nodes sorts information distributed from the distribution node, merges the sorted information when it consists of a plurality of information types, joins a query on the basis of the merged information, and outputs the result for the query obtained from the join node.
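The sort, merge, and join steps performed in a join node can be sketched as follows. This is a minimal single-process illustration, assuming that rows are `(key, value)` tuples and that each distribution node delivers one unsorted run. The function names are hypothetical, and network transfer and pipelining are omitted.

```python
import heapq
from operator import itemgetter

def sort_and_merge(runs):
    """Sort each run received from a distribution node, then merge them."""
    sorted_runs = [sorted(run, key=itemgetter(0)) for run in runs]
    return list(heapq.merge(*sorted_runs, key=itemgetter(0)))

def merge_join(left, right):
    """Equi-join two key-sorted lists of (key, value) tuples."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        left_key, right_key = left[i][0], right[j][0]
        if left_key < right_key:
            i += 1
        elif left_key > right_key:
            j += 1
        else:
            # Find the end of the group of right rows with this key.
            j_end = j
            while j_end < len(right) and right[j_end][0] == left_key:
                j_end += 1
            # Pair every left row of this key with every right row of it.
            while i < len(left) and left[i][0] == left_key:
                for _, right_value in right[j:j_end]:
                    out.append((left_key, left[i][1], right_value))
                i += 1
            j = j_end
    return out

if __name__ == "__main__":
    left = sort_and_merge([[(3, "c"), (1, "a")], [(2, "b")]])
    right = sort_and_merge([[(2, "x"), (2, "y")], [(3, "z"), (4, "w")]])
    print(merge_join(left, right))
```

Because both inputs are sorted on the join key, the join advances through each input once and revisits only the group of rows that share the current key.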
To assign retrieval information equally to the join nodes by the decision means in a more desirable form, the decision management node has a storage means of storing column value frequency information relating to the information of the storage means of each node.
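One way column value frequency information could be used to assign retrieval information evenly is sketched below. The frequency table, the range-based assignment rule, and the function name are assumptions for illustration. The text above does not specify how the frequency information is converted into an assignment.

```python
def assign_values_to_join_nodes(value_frequencies, num_join_nodes):
    """Split key values into contiguous ranges with roughly equal row counts.

    value_frequencies maps each column value to its row count (hypothetical
    statistics). Returns one list of key values per join node. All rows of a
    given value go to the same node so that matching rows can be joined there.
    """
    total = sum(value_frequencies.values())
    assignment = [[] for _ in range(num_join_nodes)]
    if total == 0:
        return assignment
    rows_before = 0  # rows belonging to smaller key values
    for value in sorted(value_frequencies):
        # The node index follows the position of this value in the
        # cumulative row count, scaled to the number of join nodes.
        node = min(num_join_nodes - 1,
                   (rows_before * num_join_nodes) // total)
        assignment[node].append(value)
        rows_before += value_frequencies[value]
    return assignment

if __name__ == "__main__":
    frequencies = {1: 10, 2: 10, 3: 40, 4: 20, 5: 20}
    print(assign_values_to_join_nodes(frequencies, 2))
```

In this example the first join node receives values 1 to 3 (60 rows) and the second receives values 4 and 5 (40 rows). A single very frequent value cannot be split by this rule, so the balance is only approximate.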
According to the query processing method of the present invention, the number of nodes can be decided in correspondence with the database operation which is executed in each node. When the data distribution is uneven, the data is distributed equally to each node, each database operation to be executed in each node is parameterized, and the expected processing times are equalized. Therefore, the processing time in each node is not biased and the pipeline operation can be performed smoothly.