1. Field of the Invention
The present invention relates to a parallel processor system having a plurality of processors connected to respective secondary memory devices such as magnetic disk devices connected through a network so as to perform one process as a system, in particular, to an inter-processor data division system for equalizing data processed in each processor that composes a parallel processor system.
2. Description of the Related Art
Parallel processor systems are categorized into data sharing systems and data distributing systems. In a data sharing system, each processor in the system can freely access data stored in, for example, a plurality of magnetic disk devices through a switch. In a data distributing system, on the other hand, dedicated secondary memory devices are connected to respective processors, and the processors exchange data through a network.
FIG. 1 is a block diagram showing the construction of a data distributing type parallel processor system. In FIG. 1, dedicated secondary memory devices (for example, magnetic disk devices 11) are connected to processors 10 that compose the system. The processors 10 exchange data therebetween through a network 12.
Next, hash-join in a database process of the data distributing type parallel processor system shown in FIG. 1 will be described. The hash-join is an algorithm for performing the join that is referred to as equivalent-join.
In FIG. 1, the contents of a first table R and a second table S of a database are distributively stored in the magnetic disk devices 11. In the table R, identification numbers of employees and the names of the employees corresponding to the identification numbers are stored. In the table S, the identification numbers of the employees and the annual incomes of the employees are stored. In this case, the equivalent-join process is a process for retrieving the contents of the tables R and S and generating a third table that stores pairs of the names of the employees and the annual incomes of the employees with keys of the identification numbers of the employees. The hash-join is an equivalent-join that is performed in a data distributing type parallel processor system. In this process, the identification numbers are categorized as a plurality of groups. Each processor transmits data of the same group, namely the contents of the table R and the table S, to a processor that performs the equivalent join-process for the data of the group. After all data of the group has been transmitted, the processor performs the equivalent-join process.
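As a minimal illustration of the equivalent-join described above, the following sketch joins two small tables on the identification number within a single processor. The sample rows and the function name equi_join are hypothetical and serve only to show the shape of the third table.

```python
# Hypothetical sample rows; the contents are illustrative only.
R = [(1, "Sato"), (2, "Suzuki"), (3, "Tanaka")]        # (identification number, name)
S = [(2, 5000000), (3, 6200000), (4, 4100000)]         # (identification number, annual income)


def equi_join(r_rows, s_rows):
    """Pair names with annual incomes for rows whose identification numbers match."""
    names_by_id = {}
    for emp_id, name in r_rows:
        names_by_id.setdefault(emp_id, []).append(name)

    result = []
    for emp_id, income in s_rows:
        # Only identification numbers present in both tables produce output.
        for name in names_by_id.get(emp_id, []):
            result.append((name, income))
    return result


joined = equi_join(R, S)
print(joined)
```

Identification numbers 1 and 4 appear in only one of the two tables, so they contribute no row to the joined table.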
Next, the hash-join process will be described with steps 1 to 4.
Step 1: Each processor (processor numbers 0 to N_pe - 1) reads data that is logically treated as one block (this data is referred to as a record or table), applies a predetermined grouping function to the data, and determines a processor that processes the data.
Step 2: Each processor transmits the data to the determined processor.
Step 3: After the above steps have been performed for all data, each processor receives data to be internally processed.
Step 4: Each processor independently performs the join process.
Every processor should use the same grouping function, and the function should return the same output value for the same data value. When data is transmitted between processors, a data group with the same output value of the grouping function should be transmitted to the same processor.
Thus, since data with the same data value is transmitted to the same processor, a data process with the same data value can be executed in the designated processor as a closed process.
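Steps 1 to 4 can be sketched as a single-program simulation in which lists stand in for the processors and the network. The processor count, the sample rows, and the choice of a remainder operation as the grouping function are assumptions made for illustration; the description above does not prescribe them.

```python
from collections import defaultdict

N_PE = 3  # hypothetical number of processors


def grouping_function(emp_id):
    # Assumed grouping function: remainder of the key by the processor count.
    # Every processor applies the same function, so equal keys meet at one processor.
    return emp_id % N_PE


def local_join(r_rows, s_rows):
    """Step 4: join performed independently inside one processor."""
    names = defaultdict(list)
    for emp_id, name in r_rows:
        names[emp_id].append(name)
    return [(name, income)
            for emp_id, income in s_rows
            for name in names.get(emp_id, [])]


# Data held in the secondary memory of each processor before the join (hypothetical).
local_R = [[(1, "Sato"), (4, "Ito")], [(2, "Suzuki")], [(3, "Tanaka"), (5, "Kato")]]
local_S = [[(3, 620), (2, 500)], [(5, 410), (1, 450)], [(4, 380)]]

# Steps 1 to 3: each sender applies the grouping function and "transmits" each row.
recv_R = [[] for _ in range(N_PE)]
recv_S = [[] for _ in range(N_PE)]
for sender in range(N_PE):
    for row in local_R[sender]:
        recv_R[grouping_function(row[0])].append(row)
    for row in local_S[sender]:
        recv_S[grouping_function(row[0])].append(row)

# Step 4: each processor joins only the data it received.
parallel_result = sorted(
    pair for p in range(N_PE) for pair in local_join(recv_R[p], recv_S[p]))
print(parallel_result)
```

Because rows with the same identification number always reach the same processor, the union of the local joins equals the join of the complete tables.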
However, in the above-described process, if the distribution of the output values of the grouping function largely deviates (namely, if a large amount of data is transmitted to only a particular processor), the operation performance of this processor becomes a bottleneck and the performance of the entire system deteriorates.
For example, if the names of employees are stored in the two tables R and S and the names are grouped with a key of family names, the data amount of the groups for SUZUKI and TANAKA, which are typical Japanese family names, is larger than that of other groups. Thus, the load of the processor that processes the data of such groups becomes large, thereby deteriorating the performance of the entire system. To prevent the performance from being deteriorated, a bucket tuning process is performed.
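The effect of such a skewed key can be observed with a short sketch. The name frequencies, the processor count, and the use of CRC32 as the grouping function are all assumptions chosen for illustration.

```python
import zlib
from collections import Counter

N_PE = 4  # hypothetical number of processors

# Hypothetical key distribution: two family names dominate the table.
names = ["SUZUKI"] * 50 + ["TANAKA"] * 30 + ["ABE", "ENDO", "MORI", "OKADA"] * 5


def grouping_function(family_name):
    # Assumed grouping function: CRC32 of the name, reduced to a processor number.
    return zlib.crc32(family_name.encode("utf-8")) % N_PE


counts = Counter(grouping_function(name) for name in names)
loads = [counts.get(p, 0) for p in range(N_PE)]
ideal = len(names) / N_PE

print("records per processor:", loads)
print("ideal load per processor:", ideal)
print("heaviest load / ideal load:", max(loads) / ideal)
```

Whatever processor the name SUZUKI maps to receives at least 50 of the 100 records, which is at least twice the ideal load of 25, and no choice of grouping function on the whole family name can split that group.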
A bucket is a block of data grouped corresponding to, for example, identification numbers. In the bucket tuning process, a grouping function is properly selected so as to remarkably decrease the size of the buckets. In addition, one processor processes data of a plurality of buckets so as to equalize the total data of buckets processed by each processor in the parallel processor system. This process is performed with steps 11 to 14.
Step 11: The number of distinct output values of the grouping function should be remarkably larger than the number of processors. More practically, a block of data groups with the same output value of the grouping function is referred to as a sub-bucket. A grouping function is selected so that the size of the largest sub-bucket is satisfactorily smaller than the value of the total data amount divided by the square of the number of processors. All sub-buckets with the same output value of the same grouping function collected from all the processors in the system compose a bucket.
Step 12: The grouping function is applied for all data so as to determine the size of each sub-bucket.
Step 13: A combination of sub-buckets is considered so that the process data amount of each processor is equalized. To do that, a combination of sub-buckets is stored. (When a particular processor combines a sub-bucket B and a sub-bucket C and transmits the combination of these sub-buckets to a processor D, all processors should combine the sub-bucket B and the sub-bucket C and transmit the combination of these sub-buckets to the processor D.) Thus, the above-described evaluation is performed by each processor corresponding to the information of the size of each sub-bucket.
Step 14: When a real data process is performed, each processor combines sub-buckets corresponding to the above-described information and transmits the combination of these sub-buckets to another processor. When data with the same values of a grouping function is transmitted to a processor, this data is referred to as a bucket.
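Steps 11 to 13 can be sketched as follows. The description above does not specify how the combination of sub-buckets is chosen; the greedy rule used here (largest bucket first, assigned to the currently least loaded processor) is an assumption, as are the function names and the sample sizes. The rule is deterministic, so every processor that evaluates it on the same size information arrives at the same combination.

```python
import heapq


def largest_sub_bucket_is_small(sub_bucket_sizes, n_processors):
    """Step 11 check: largest sub-bucket below total data amount / (processors squared)."""
    total = sum(sum(row) for row in sub_bucket_sizes)
    largest = max(max(row) for row in sub_bucket_sizes)
    return largest < total / (n_processors ** 2)


def plan_bucket_assignment(sub_bucket_sizes, n_processors):
    """Assign each bucket to a processor so that the loads are roughly equal.

    sub_bucket_sizes[p][g] is the size of sub-bucket g measured on processor p (step 12).
    """
    n_groups = len(sub_bucket_sizes[0])
    # A bucket is the union of the sub-buckets with the same output value.
    bucket_sizes = [sum(row[g] for row in sub_bucket_sizes) for g in range(n_groups)]

    heap = [(0, p) for p in range(n_processors)]  # (current load, processor number)
    assignment = {}
    # Largest buckets first; ties broken by group number to stay deterministic.
    for g in sorted(range(n_groups), key=lambda g: (-bucket_sizes[g], g)):
        load, p = heapq.heappop(heap)
        assignment[g] = p
        heapq.heappush(heap, (load + bucket_sizes[g], p))

    loads = [0] * n_processors
    for g, p in assignment.items():
        loads[p] += bucket_sizes[g]
    return assignment, loads


# Hypothetical sub-bucket sizes for 2 processors and 6 output values.
sizes = [[5, 1, 3, 2, 4, 1],
         [4, 2, 2, 3, 1, 2]]
assignment, loads = plan_bucket_assignment(sizes, 2)
print(largest_sub_bucket_is_small(sizes, 2), assignment, loads)
```

In step 14, each processor would consult the stored assignment to decide to which processor each of its sub-buckets is transmitted.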
In step 14, each processor provides a plurality of data buffers corresponding to buckets in the main memory. While data is being generated (for example, while data is being read from a secondary memory), each processor applies the grouping function to the data, evaluates the data (namely, divides the data into sub-buckets), and stores the sub-buckets in the corresponding data buffers. When the amount of data stored in a data buffer exceeds a predetermined threshold value, the processor transmits the content of the data buffer (namely, part of the bucket) to the corresponding processor.
The data buffering process is performed because an inter-processor transmission means has a large overhead that is independent of the data amount. To prevent the transmission performance from being deteriorated, a definite amount of data should be transmitted at a time.
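The benefit of transmitting a definite amount of data at a time can be estimated with a simple cost model in which each transmission costs a fixed overhead plus a time proportional to its size. The model and all numeric values below are assumptions for illustration, not measurements of any particular network.

```python
def total_transfer_time(total_bytes, message_bytes, overhead_s, bandwidth_bytes_per_s):
    """Time to send total_bytes in messages of at most message_bytes each."""
    n_messages = -(-total_bytes // message_bytes)  # ceiling division
    return n_messages * overhead_s + total_bytes / bandwidth_bytes_per_s


TOTAL = 1000000          # hypothetical: 1 MB to transmit
OVERHEAD = 0.001         # hypothetical: 1 ms fixed cost per transmission
BANDWIDTH = 10000000     # hypothetical: 10 MB/s

per_record = total_transfer_time(TOTAL, 100, OVERHEAD, BANDWIDTH)        # one 100-byte record per message
buffered = total_transfer_time(TOTAL, 64 * 1024, OVERHEAD, BANDWIDTH)    # 64 KB per message

print(per_record, buffered)
```

Under these assumed values, sending each record separately is dominated by the fixed overhead (about 10.1 s), whereas buffered transmission takes about 0.116 s.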
Generally, the storage capacity of the main memory of one processor in the parallel processor system is remarkably smaller than the value of the total amount of data processed in the system divided by the number of processors (namely, the total amount of data transmitted to one processor as a combination of buckets). To speed up the above-described hash-join process, data being grouped should be stored in the main memory. In other words, a bucket should be stored in the main memory. Due to decrease of memory cost, a set of data blocks can be stored in the main memory and the data can be processed at high speed. Next, as a related art reference, a practical processor system in which buckets are generated so that one group of data is stored in the main memory and processed in each processor will be described in detail.
FIG. 2 is a block diagram showing a construction of a parallel processor system. In FIG. 2, (n+1) processors 15 (A0 to An) are connected to each other through an external bus 16. Each processor 15 comprises a CPU 20, a main memory device 21, a secondary memory 22, and an input/output port 23. The input/output port 23 inputs and outputs data between the main memory device 21 and the external bus 16. Next, data transmission performed between two processors will be described.
FIG. 3 is a block diagram for explaining a related art reference of data transmission performed between two processors. Referring to FIG. 3, data transmission performed from a processor Ai to a processor Aj will be described. Reference numeral 24 is data that is read from a secondary memory 22i of the processor Ai. The data is, for example, one record. Reference numeral 25 is a buffer group that temporarily stores the data 24 before the data 24 is transmitted to another processor (in this example, Aj). Reference numerals 26i and 26j are processes that perform the data transmission in the respective processors. Reference numeral 27 is data that is transmitted from the processor Ai. Reference numeral 28 is a conversion table that is an intermediate bucket storage region for determining in which buffer of the buffer group 25 the data 24 read from the secondary memory 22i is stored. Reference numeral 29 is a region of the secondary memory 22j that stores transmitted data in the processor Aj.
In FIG. 3, on the processor Ai side, the process 26i applies a grouping function to the data 24 read from the secondary memory 22i so as to group the data. The bucket that includes the data depends on the output value of the grouping function. In addition, the processor to which the data is transmitted is determined. The information about which bucket includes the data and to which processor the data is transmitted has been determined in advance, as a combination of buckets, by a parent processor (not shown) or by one of the processors 15 shown in FIG. 2, which preliminarily reads the data from all the processors so that the load of each processor is equalized. The result is stored in the conversion table 28.
FIG. 4 is a schematic diagram showing a conversion table in a sender processor corresponding to a related art reference. In FIG. 4, reference numeral 28 is a conversion table. The conversion table 28 is composed of a conversion table 28a and a conversion table 28b. The conversion table 28a is used to obtain an intermediate bucket identifier that represents the relation between the output values of the grouping function and the intermediate buckets. On the other hand, the conversion table 28b is used to obtain the receiver processor number of a receiver processor to which the data is transmitted corresponding to the intermediate bucket identifier. The intermediate bucket is data that is transmitted to receiver processors before they are grouped as a final bucket. The intermediate bucket is equivalent to a sub-bucket of a sender processor.
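The two-stage lookup through the conversion tables 28a and 28b can be sketched as follows. The table contents, the number of output values, and the function name route are hypothetical; only the two-level structure is taken from the description above.

```python
# Conversion table 28a (hypothetical contents):
# index = output value of the grouping function, value = intermediate bucket identifier.
table_28a = [0, 1, 2, 0, 1, 2, 3]

# Conversion table 28b (hypothetical contents):
# intermediate bucket identifier -> receiver processor number.
table_28b = {0: 0, 1: 1, 2: 1, 3: 0}


def route(group_value):
    """Return (intermediate bucket identifier, receiver processor number)."""
    bucket_id = table_28a[group_value]
    receiver = table_28b[bucket_id]
    return bucket_id, receiver


print(route(5))
```

Separating the two tables lets several output values share one intermediate bucket, and several intermediate buckets share one receiver processor, without changing the grouping function itself.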
There are many grouping functions applicable to the data 24. When the data is an integer and the number of (intermediate) buckets, that is, the number of groups resulting from the grouping process for all data in the system, is M, a remainder operation with a prime number that exceeds 5M can be used as a grouping function.
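One possible form of such a grouping function is sketched below, assuming that the bound is read as five times M and that the smallest prime above that bound is chosen. The helper names and the value of M are hypothetical.

```python
def is_prime(n):
    """Trial-division primality test, adequate for small moduli."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def smallest_prime_above(bound):
    """Smallest prime strictly greater than bound."""
    candidate = bound + 1
    while not is_prime(candidate):
        candidate += 1
    return candidate


M = 4                                 # hypothetical number of (intermediate) buckets
PRIME = smallest_prime_above(5 * M)   # assumption: the smallest prime exceeding 5M


def grouping_function(value):
    # Remainder operation with a prime modulus; Python returns a
    # non-negative remainder for a positive modulus.
    return value % PRIME


print(PRIME, grouping_function(100))
```

The function yields more distinct output values than there are buckets, so a table lookup is still needed to map each output value to an intermediate bucket.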
When the number of (intermediate) buckets that are present in the system is M, the buffer group 25 is composed of more than M small buffers (L+1 small buffers in total). Generally, a double buffering process is performed so that data can be stored in the buffer group 25 while the buffer group 25 is transmitting data to another processor. Thus, the number of small buffers, (L+1), is larger than twice the number of buckets, 2M.
The data 24 that is read from the secondary memory 22i on the processor Ai side is stored in the small buffer of the buffer group 25 corresponding to the bucket in which the data should be included. When the amount of data stored in the small buffer exceeds a predetermined threshold value A, the data stored in the small buffer is transmitted to the processor to which the intermediate bucket is to be transmitted. In this case, the processor is Aj. The transmitted data 27 is stored in the region 29 of the secondary memory 22j by the process 26j. A final bucket is composed of all the intermediate buckets transmitted from all the other processors.
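The behavior of the buffer group on the sender side can be sketched as follows. The class name, the callback used in place of the network, and the threshold value are hypothetical, and double buffering is omitted for brevity.

```python
class BufferGroup:
    """One small buffer per intermediate bucket, flushed when it exceeds threshold A."""

    def __init__(self, threshold_bytes, send):
        self.threshold = threshold_bytes
        self.send = send          # callback standing in for the inter-processor transmission
        self.buffers = {}         # bucket identifier -> (receiver, bytearray)

    def add(self, bucket_id, receiver, record):
        _, buf = self.buffers.setdefault(bucket_id, (receiver, bytearray()))
        buf += record
        if len(buf) > self.threshold:
            # Transmit part of the intermediate bucket in one transmission.
            self.send(receiver, bucket_id, bytes(buf))
            buf.clear()

    def flush_all(self):
        """Transmit whatever remains once all data has been read."""
        for bucket_id, (receiver, buf) in self.buffers.items():
            if buf:
                self.send(receiver, bucket_id, bytes(buf))
                buf.clear()


sent = []
group = BufferGroup(threshold_bytes=8,
                    send=lambda receiver, bucket_id, data: sent.append((receiver, bucket_id, len(data))))

for _ in range(3):
    group.add(bucket_id=0, receiver=1, record=b"abcd")   # third record pushes the buffer past 8 bytes
group.add(bucket_id=1, receiver=2, record=b"wxyz")
group.flush_all()
print(sent)
```

Every bucket needs its own buffer of at least the threshold size for the whole duration of the read, which is the source of the memory consumption discussed next.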
As described above (in FIGS. 2 to 4), in the conventional system, data to be transmitted to another processor is stored in a small buffer of the buffer group 25 and transmitted to the processor that processes the bucket. Since the number of small buffers becomes very large, they occupy most of the main memory.
As described above, the storage capacity of the main memory of the processor is generally very small in comparison with the value of the total data amount handled in the system divided by the number of processors. In addition, the number of buckets that are present in the system is very large. Since the order of data that is read from the secondary memory cannot be predicted, the number of small buffers should be larger than the number of buckets that will be generated. In addition, the storage capacity of each buffer should be larger than a threshold value A that depends on the overhead of the data transmission performed between processors.
Although the buckets that will be generated may be determined by a preliminary reading process, it is substantially impossible to store the determined result and reflect the result in the buffer management.
The number of buckets that are present in the system is equal to the value of the total data amount divided by the storage capacity of the main memory of each processor. Thus, in FIG. 3, the storage capacity of the buffer group 25, which temporarily stores the transmitted data to another processor, should be larger than the value given by the following expression.

$$A \times \frac{\text{total data amount}}{\text{storage capacity of main memory of processor}} \qquad (1)$$
where A is a threshold value for transmitting at a time data stored in one small buffer of the buffer group 25.
When the threshold value A is 64 KB, the total data amount is 64 GB, and the storage capacity of the main memory is 64 MB, the value obtained by the expression (1) becomes 64 MB. Thus, the buffer group 25 alone would use the entire storage capacity of the main memory. Since the storage capacity of the buffer given by the expression (1) is required for each processor in the system, it is impossible in reality to accomplish a parallel processor system that satisfies such a requirement.
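The arithmetic of the expression (1) for the values given above can be checked as follows. Binary units (1 KB = 1024 bytes) are assumed; the function name is hypothetical.

```python
KB, MB, GB = 2 ** 10, 2 ** 20, 2 ** 30  # assumption: binary units


def min_buffer_group_bytes(threshold_a, total_data, main_memory):
    """Expression (1): threshold A times the number of buckets in the system."""
    n_buckets = total_data // main_memory   # total data amount / main memory capacity
    return threshold_a * n_buckets


required = min_buffer_group_bytes(64 * KB, 64 * GB, 64 * MB)
print(required // MB, "MB required, against", 64, "MB of main memory")
```

With 1024 buckets of 64 KB each, the lower bound equals the full 64 MB of main memory, leaving no room for the data to be processed.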