1. Field of the Invention
The present invention generally relates to a multiprocessor system and a method of load balancing thereof, and more particularly to a multiprocessor system which has a plurality of processors and a network system and a method of load balancing processing in the multiprocessor system in which a given computational task or load is divided into a plurality of load segments and each of the load segments is dynamically assigned to a predetermined processor while the multiprocessor system operates.
2. Prior Art
The conventional multiprocessor system has a plurality of processors and a network system. In the case where a given computational task or load written in a logic programming language (e.g., Prolog) is executed in parallel in the conventional multiprocessor system, the given load (or an initial goal) is divided into plural initial load segments which are assigned to all of the processors at an initial load balancing stage. More specifically, a first initial load segment is given to a first processor, wherein data representative of the processing result of the first initial load segment is obtained, and such data must be transferred to a second processor, which starts to process its second initial load segment by use of such data. Thus, data representative of the processing result in the presently operating first processor must be transferred to the next processor, which is idle during the operation of the first processor but will start to process its initial load segment by use of the data from the first processor. As described heretofore, the initial load segments are sequentially assigned to the processors in turn. Hence, the conventional multiprocessor system requires a long processing time before the given load is executed in parallel.
At the time when the initial goal is given to one processor, all of the other processors within the multiprocessor system are idle. Hence, that one processor must divide the given initial goal into plural initial load segments, which must be assigned to the other processors. In addition, the conventional multiprocessor system must provide the network system for transferring information concerning the given initial goal which must be divided. For this reason, the conventional multiprocessor system cannot perform the initial load balancing of the initial goal at high speed. In principle, it is possible to obtain a performance improvement due to a parallel effect for shortening processing times (hereinafter, simply referred to as the parallel effect) when the given load is executed in parallel in the conventional multiprocessor system. However, the conventional multiprocessor system suffers from a problem in that it is not actually possible to obtain such parallel effect, for the reason described above.
Further, the above-mentioned one processor supplied with the initial goal must transfer a certain part of the information thereof to all of the other processors, so that the amount of information to be transferred is increased. Hence, the conventional multiprocessor system suffers from another problem in that its network system must be able to transfer a large quantity of data at high speed.
Next, a description will be given with respect to the above-mentioned problems in detail by considering that the logic programming language (i.e., Prolog) is executed in the conventional multiprocessor system.
In a process for sequentially executing Prolog (shown in FIG. 2), a predetermined priority (i.e., a depth-first-search) is given such that branches are searched from an upper side to a lower side and from a left side to a right side within an inference tree (or a proof tree) of Prolog. When the system fails to find the correct branch (or the desirable branch) while searching, the system backtracks to the preceding node and searches all branches connected thereto so as to find the correct branch.
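The sequential search described above can be illustrated by the following minimal sketch. The tree contents, the node names, and the set `SOLUTIONS` are hypothetical and are not taken from FIG. 2; the sketch only shows the top-to-bottom, left-to-right search order and the backtracking to the preceding node when a branch fails.

```python
from typing import Dict, List, Optional

# Hypothetical inference tree: each node lists its branches from left to right.
TREE: Dict[str, List[str]] = {
    "goal": ["a", "b"],
    "a": ["a1", "a2"],
    "b": ["b1"],
    "a1": [],
    "a2": [],
    "b1": [],
}
SOLUTIONS = {"b1"}  # assumption: the only leaf that satisfies the goal


def dfs(node: str, visited: List[str]) -> Optional[List[str]]:
    """Return the path from node to the first solution found, or None on failure."""
    visited.append(node)
    if node in SOLUTIONS:
        return [node]
    for child in TREE[node]:  # left side to right side
        path = dfs(child, visited)  # upper side to lower side
        if path is not None:
            return [node] + path
        # The child failed: backtrack to this node and try the next branch.
    return None


visited: List[str] = []
path = dfs("goal", visited)
print(visited)  # order in which nodes were searched
print(path)     # path to the solution
```

In this sketch the whole left subtree under `a` is searched and abandoned before the branch under `b` is tried, which is the behavior that a single sequential processor exhibits.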
In another process for executing Prolog in parallel, plural processors simultaneously search a certain section or all sections of the inference tree so as to find the correct branches in accordance with a predetermined breadth-first-search. Such a process is called an OR parallel execution in which all branches within the inference tree are divided into plural sections (hereinafter, referred to as OR processes) each having a certain number of the branches and all of the OR processes are respectively assigned to the idle processors when the initial goal is given to the system. In this case, information required to execute each OR process must be transferred to the corresponding idle processor.
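The division into OR processes may be sketched as follows. The function name `split_or_processes`, the branch and processor labels, and the round-robin assignment policy are assumptions made for illustration only; the description above does not specify how the branches are grouped.

```python
from typing import Dict, List


def split_or_processes(
    branches: List[str], idle_processors: List[str]
) -> Dict[str, List[str]]:
    """Group branches into OR processes, one OR process per idle processor.

    Round-robin assignment is a hypothetical policy chosen for simplicity.
    """
    assignment: Dict[str, List[str]] = {p: [] for p in idle_processors}
    for i, branch in enumerate(branches):
        target = idle_processors[i % len(idle_processors)]
        assignment[target].append(branch)
    return assignment


# Hypothetical branches of the inference tree and hypothetical idle processors.
branches = ["b1", "b2", "b3", "b4", "b5"]
assignment = split_or_processes(branches, ["P2", "P3"])
print(assignment)
```

The sketch covers only the assignment of branches. In the conventional system, the information required to execute each OR process must additionally be transferred to the receiving processor.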
As described before, one processor supplied with the initial goal must divide the given initial goal into plural initial load segments which must be assigned to the other idle processors at the initial load balancing stage. Hence, the conventional system cannot perform the initial load balancing at high speed.
Meanwhile, after the load balancing is performed between the first and second processors within the multiprocessor system, it is desirable that the first and second processors be able to independently proceed with their respective processes without transferring data representative of the working environment of the first processor from that processor to the second processor.
In order to realize the above-mentioned load balancing within the conventional multiprocessor system, a predetermined working environment required for the second processor must be extracted (or selected) from the working environments which are obtained by performing predetermined processes within the first processor, before performing the load balancing in the first processor, and such predetermined working environment must be transferred to the second processor.
In other words, the above predetermined working environment is identical to the information which is obtained by performing predetermined processes other than the load balancing process within the first processor. Such a predetermined working environment is necessary for the second processor in the case where a certain part of the load to be executed in the first processor is shared with and executed by the second processor. In addition, the amount of information representative of the working environments increases as the system proceeds to balance the load. Therefore, quite a large amount of information must be transferred to the other processors when the load balancing is performed after a long process is performed in each processor.
As described heretofore, the first processor must stop performing its original process and extract the predetermined working environment required for the load balancing from its working environments (at a load generation stage), and then such predetermined working environment, which has a large amount of information, must be transferred to the second processor. Thereafter, the second processor must store the transferred information (at a load storing stage) so it can proceed with its original process. Specifically, a data conversion is required in order to transfer such information by use of the network system. In the present specification, the meaning of the data conversion will be considered to be included in the meanings of the above load generation and load storing.
As shown in FIG. 1, the above-mentioned load balancing inevitably incurs overhead time in the conventional multiprocessor system. In FIG. 1, the first processor cannot avoid a first overhead time, and the second processor likewise cannot avoid a second overhead time.
Due to the overhead time accompanying the load balancing (or due to the stopping of the process in the first processor in particular), each processor cannot demonstrate its processing ability in every time unit. In addition, the load balancing is required to be performed between processors at an arbitrary and asynchronous time. Hence, the conventional multiprocessor system suffers from the problem in that it is not possible to demonstrate the parallel effect as described before. This parallel effect can be evaluated by the total ability which can be obtained from the following formula: (Total Ability) = (Processing Ability of each processor) × (Number of processors which are operable in parallel in order to process the given load). In addition, the conventional system needs a high-cost network system to transfer the large amount of information with arbitrary and asynchronous timing. In order to transfer the large amount of information, the network system must be occupied for a long time; hence, it becomes impossible to perform the load balancing between the processors properly. Therefore, the conventional system suffers from a problem in that the load inevitably becomes unbalanced.
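The total ability formula can be evaluated as in the following minimal sketch. The function name `total_ability` and the sample figures are hypothetical; in particular, the figure of 50 operable processors is chosen only to show how the total falls when few processors are able to work in parallel.

```python
def total_ability(per_processor_ability: float, operable_in_parallel: int) -> float:
    """(Total Ability) = (ability of each processor) x (processors operable in parallel)."""
    return per_processor_ability * operable_in_parallel


# Hypothetical figures: each processor delivers 1.0 unit of processing ability.
ideal = total_ability(1.0, 1000)   # every processor operable in parallel
degraded = total_ability(1.0, 50)  # most processors stalled or idle
print(ideal, degraded)
```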
The transfer speed of the network system within the multiprocessor system tends to improve less than the processing speed of the processors. As a result, the communication time of the network system tends to grow relative to the processing time of the processors. In this case, the above-mentioned problem becomes serious. As the number of processors within the multiprocessor system increases, such a tendency becomes more pronounced.
Next, a description will be given with respect to the above-mentioned problem in conjunction with FIG. 2 when Prolog is executed in parallel in the multiprocessor system.
In the case where the first processor performs the load balancing on the second processor in the OR parallel execution described before, the first processor divides an OR process from all branches of the inference tree, and the divided OR process is assigned to the second processor.
In this case, transfer data (to be transferred from the first processor to the second processor) can be classified as first and second transfer data. The first transfer data represent the information of the divided OR process. The second transfer data represent the information of the divided OR process and other information which is required to execute the divided OR process.
The first processor must transfer the above second transfer data to the second processor so that the first and second processors can independently proceed with their respective processes after the load balancing is performed. This is because, if the first processor transferred the first transfer data to the second processor instead of the second transfer data, the second processor would have to refer to the working environment of the first processor.
However, the second transfer data must include data representative of the large amount of information of the working environment of the first processor which is necessary for executing the divided OR process. This working environment in the Prolog execution includes "bind information" representative of a connection relation between variables and values and "control information" for controlling the backtracking of Prolog, for example.
The above-mentioned working environment is produced by the first processor before performing the load balancing. The second processor requires such working environment to execute the divided OR process after the load balancing is performed. This is because, when the second processor independently obtains a solution (or a processing result) of the initial goal by performing the divided OR process, the second processor may need all of the bind information which is produced by the first processor between the time when the initial goal is given and a later time when the first processor starts to perform the load balancing. In addition, the amount of such bind information increases nearly in proportion to the processing time. Therefore, the first processor must transfer quite a large amount of information representative of its working environment to the second processor when the first processor performs the load balancing on the second processor after a long processing time has passed.
Since the first processor must divide the OR process and transfer the large amount of information representative of its working environment every time the first processor performs the load balancing on the second processor, the original process of the first processor must be stopped, so that it is performed only intermittently. On the other hand, since the second processor receives the working environment of the first processor every time the load balancing is performed, the original process of the second processor must be stopped in order to receive the large amount of information representative of the working environment of the first processor and to store such transferred information.
Therefore, each processor cannot demonstrate its full processing ability. In addition, the load balancing is required between the processors at arbitrary and asynchronous times. Hence, the multiprocessor system suffers from the problem that it is impossible to obtain the parallel effect as described before.
Further, the conventional system requires an expensive network system to transfer large amounts of information at arbitrary and asynchronous times. Since the network system in this case is occupied for a long time in order to transfer the large amount of information, it becomes almost impossible to perform the load balancing between the processors. Therefore, the conventional multiprocessor system suffers from the above described problem in that the load becomes unbalanced.
The above-mentioned problem becomes serious in a recently developed sequential inference machine (or a Prolog machine), which can sequentially perform the inference by itself at high speed. When the multiprocessor system controls one thousand or more of such machines (i.e., the processors) in parallel, the improvement in the data transfer speed of the network system tends to be smaller than the improvement in the processing speed of each machine, as described before. As the number of the processors within the multiprocessor system increases, the above-mentioned tendency becomes even greater.
A sequential inference machine of 1 MLIPS (i.e., one Mega Logical Inference Per Second) produces a working environment having about 5 MW (i.e., five Mega Word) of information (in case of 40 Bit/W). For example, a serial link of 10 MBPS (i.e., ten Mega Bit Per Second) is actually required between two mutually adjacent processors as the network system which connects all one thousand of the sequential inference machines provided within the multiprocessor system. In this case, it is possible to transfer data of 0.25 MW per second (which is obtained by dividing 10 MBPS by 40 Bit/W) representative of the working environment between two mutually adjacent processors.
In this case, the ratio of the processing time for performing the inference to the communication time of the network system becomes equal to 1/20. The value 20 which appears in the denominator is obtained by dividing 5 MW by 0.25 MW. Due to the load balancing (or due to the transfer of the large amount of information in particular), the sequential inference machine (i.e., the processor) must stop performing the original inference process for a long time. Hence, the apparent processing ability of the sequential inference machine is lowered.
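The arithmetic behind the 1/20 ratio can be checked with the following short sketch. The constant names are hypothetical, and the sketch assumes that the 5 MW working environment is the amount produced in one second of inference, which is how the ratio above appears to be derived.

```python
LINK_BITS_PER_SECOND = 10_000_000  # 10 MBPS serial link between adjacent processors
BITS_PER_WORD = 40                 # 40 Bit/W
ENVIRONMENT_WORDS = 5_000_000      # about 5 MW of working environment
INFERENCE_SECONDS = 1              # assumption: time taken to produce that environment

# Words that the link can carry per second: 10 MBPS / 40 Bit/W = 0.25 MW per second.
words_per_second = LINK_BITS_PER_SECOND // BITS_PER_WORD

# Time needed to transfer the whole working environment over the link.
transfer_seconds = ENVIRONMENT_WORDS / words_per_second

# Processing time divided by communication time.
ratio = INFERENCE_SECONDS / transfer_seconds

print(words_per_second, transfer_seconds, ratio)
```

Under these figures, transferring the environment produced by one second of inference occupies the link for twenty seconds.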
Since the operating processors and the network system are occupied for a long time in order to transfer the information representative of the working environments, it becomes impossible to perform the required load balancing, so that the availability of the processors is lowered. Thus, the parallel effect of the multiprocessor system is lowered as described before.