Computational requirements in the scientific data processing field have increased to the point where it is difficult to build a single processor which has sufficient performance. This is due to limitations in the physical technology for building computers. One approach to solving this problem is to use multiple processors to increase the data processing power of a system. However, many problems are encountered in attempting to use multiple processors. Multi-processor systems can simultaneously execute several unrelated processes (programs), which increases system throughput, but this does not shorten the execution time of any one process. There is therefore a need for a multi-processor system which is capable of accelerating the execution of a single process.
A number of systems have been described which have the goal of increasing the computational rate for a single process, but each of these systems suffers from one or more deficiencies. One example of such a system is described in U.S. Pat. No. 4,636,942 to Chen et al. This patent describes a system in which multiple processors share a set of common registers. These registers are divided into clusters, and the processors which are assigned to execute the same task are given access to a common cluster by the operating system. The cluster registers, together with a set of hardware semaphores, provide fast synchronization between the cooperating processors. This system has a number of drawbacks which tend to slow its performance when used for parallel processing. When a process initiates parallel operations, the cluster allocation of registers must be performed by the operating system. This transfer of control from the executing process to the operating system and back is quite time-consuming. After processor allocation, task synchronization between the processors is performed by a run-time library which must be embedded in the process code and which requires substantial execution time. Further, the invocation of multi-tasking for parallel execution of a process in Chen et al. is done explicitly by the programmer. Thus, parallelization is invoked only where previously selected by the programmer. In short, the Chen et al. approach incurs a substantial time overhead to invoke multi-tasking within a process. As a result, the system described in the Chen et al. patent can be used efficiently only if there are very large segments of parallelizable code within a process. This is termed "coarse granularity parallelism." The system described in Chen et al. cannot be efficiently utilized to execute small granularity parallelism.
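The relationship between invocation overhead and granularity described above can be illustrated with a rough model. The following is only a sketch under simplifying assumptions that do not come from the Chen et al. patent: the overhead is treated as a single fixed cost per parallel segment, the work divides evenly among the processors, and the function name and all numeric values are hypothetical.

```python
def parallel_speedup(work, processors, overhead):
    """Speedup of one parallel segment relative to running it serially.

    work:       serial execution time of the segment (arbitrary time units)
    processors: number of cooperating processors
    overhead:   assumed fixed cost of invoking and synchronizing parallel
                execution for this segment (same time units)
    """
    # Assumption: the work splits evenly and the overhead is paid once.
    parallel_time = overhead + work / processors
    return work / parallel_time


# Hypothetical figures: 4 processors, 1,000 time units of overhead.
coarse = parallel_speedup(work=100_000, processors=4, overhead=1_000)
fine = parallel_speedup(work=500, processors=4, overhead=1_000)

print(f"coarse-grained segment speedup: {coarse:.2f}")
print(f"fine-grained segment speedup:   {fine:.2f}")
```

Under these assumed numbers, the large segment comes close to the ideal factor of four, while the small segment takes longer in parallel than it would serially, because the fixed overhead exceeds the work being divided.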
Another approach which has been proposed is to allocate processors to a task when the process is loaded for execution. By allocating processors in advance, the parallel segments of the process can be executed with reduced synchronization time. However, most processes have both serial and parallel code segments, and when a serial segment is being executed on one processor, the remaining allocated processors are idled. Thus, in most practical applications, the approach of allocating processors in advance for parallel execution results in a loss of system throughput due to the idling of unneeded processors during serial execution.
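The throughput loss from advance allocation can be estimated with simple arithmetic. The sketch below assumes, purely for illustration, that exactly one processor is busy during serial segments and that all allocated processors are fully busy during parallel segments. The function name and the example values are hypothetical.

```python
def utilization(serial_time, parallel_time, processors):
    """Fraction of allocated processor-time that does useful work.

    serial_time:   wall-clock time spent in serial segments
    parallel_time: wall-clock time spent in parallel segments
    processors:    number of processors held for the whole run
    """
    # Every allocated processor is held for the entire run.
    allocated = processors * (serial_time + parallel_time)
    # Assumption: one processor works during serial code, all during parallel.
    busy = serial_time + processors * parallel_time
    return busy / allocated


# Hypothetical run: equal wall-clock time in serial and parallel code.
u = utilization(serial_time=50, parallel_time=50, processors=4)
print(f"utilization: {u:.3f}, idle fraction: {1 - u:.3f}")
```

With these assumed values, three of the four processors sit idle for half of the run, so more than a third of the allocated processor-time is wasted.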
In view of the need for increased processing speed for single processes and the difficulties which have been experienced in attempting to utilize multi-processors, there exists a need for a method and apparatus which can execute the parallel segments of a process with a low time overhead while not idling processors when a process is being executed in a serial segment.