In processing system applications, workload management is an important consideration. That is, it is important to keep track of what tasks there are to be completed, and to allocate the work based on the available processing resources to facilitate completion of the tasks. It has been found that employing some form of redundant processing capability is necessary or desirable in order to continue processing in the face of equipment failure. For instance, a processing system might include two or more autonomous processors, and some means for allocating the available tasks to the processors. Thus, if one of the processors fails, another processor is still available to complete the pending tasks, and the work can be allocated to it.
In such a multiprocessing system, there arises the question of how to utilize the processing resources most efficiently, while still ensuring that all of the tasks are completed. One possible strategy is to have each of the processors separately execute each of the pending tasks. This strategy is feasible as long as the tasks are of such a nature that they can be executed more than once, or suspended and resumed, without harm. Tasks which can be executed more than once without harm are called "idempotent", and include, for instance, verifying a step in a formal proof, evaluating a Boolean formula at a particular assignment of the variables, opening a valve, sending a message to a large group of processes, or reading records in a distributed database.
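The every-processor-executes-every-task strategy can be illustrated with a minimal sketch, given here in Python purely for illustration. The processor names, the particular Boolean formula, and the helper `run_on_all` are hypothetical and do not come from any described system. The task evaluates a Boolean formula at a fixed assignment, which is one of the idempotent tasks mentioned above.

```python
def evaluate_formula(assignment):
    """Idempotent task: evaluate (a AND b) OR (NOT c) at a fixed assignment."""
    a, b, c = assignment["a"], assignment["b"], assignment["c"]
    return (a and b) or (not c)


def run_on_all(processors_up, task, arg):
    """Have every operational processor execute the same task.

    `processors_up` maps a (hypothetical) processor name to True if that
    processor is running. Failed processors simply produce no result.
    """
    return {name: task(arg) for name, up in processors_up.items() if up}


assignment = {"a": True, "b": False, "c": False}

# P2 has failed; P1 and P3 each execute the task independently.
results = run_on_all({"P1": True, "P2": False, "P3": True},
                     evaluate_formula, assignment)

# Because the task is idempotent, repeating it is harmless and every
# surviving processor arrives at the same value.
agreed = set(results.values())
```

In this sketch the task completes as long as one processor is up, but the work performed grows with the number of operational processors while the number of completed tasks does not.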
A strategy of having each processor execute all of the tasks ensures that the tasks will all be completed, as long as at least one of the processors is functioning. However, such a strategy utilizes the processing capability inefficiently. Most of the time, most of the processors will be operational, but no more tasks are completed than would be the case if a single processor were executing all of the tasks. In addition, each of the processors must be powerful enough to execute all of the tasks by itself. As a result, the processors either must have greater processing power or require a greater amount of processing time than if they only had to execute a subset of the tasks. Thus, costs related to equipment or processing time are disadvantageously increased.
Another strategy for utilizing multiple processors to execute tasks involves allocating tasks among the processors, so that different processors execute different tasks. A controller is also provided for managing the workload, allocating the tasks among the processors, and identifying faulty processors. However, this strategy has the disadvantage that a failure of the controller brings the whole system to a halt, even though all of the processors might still be operational. Redundant controllers may be provided, but doing so disadvantageously adds to the cost of the system.
Three United States Patents, commonly assigned to Bendix Corp., disclose aspects of a system which attempts to improve on the above-described conventional approaches to multi-processor task management. These three patents are U.S. Pat. No. 4,318,173, issued to Freedman et al., and titled "Scheduler for a Multiple Computer System", U.S. Pat. No. 4,323,966, issued to Whiteside et al., and titled "Operations Controller for a Fault-Tolerant Multiple Computer System", and U.S. Pat. No. 4,333,144, issued to Whiteside et al., and titled "Task Communicator for Multiple Computer System". In this system, different tasks are allocated to different processors, and the processors exchange messages.
The '966 patent discloses a controller which is associated with one of the processors (each processor has such a controller). The controller manages the operation of the associated processor, and communicates with the other controllers associated with the other processors. The controller includes a task communicator, which is the subject of the '144 patent, and which assembles input data required for the execution of tasks allocated to the associated processor, makes the input data available thereto, and sends the other controllers the results of completed tasks, so that the other controllers may make the results available, as necessary, as input data for other tasks allocated to the other processors.
The controller further includes a scheduler, which is the subject of the '173 patent, and which selects which tasks, not selected by other processors, are to be executed by its associated processor, and schedules the selected tasks for execution. Tasks are executed on a data-driven basis. That is, if a given task requires the results of one or more other tasks as inputs, then the given task is selected for execution only after all of the necessary inputs have been received from the other processors which executed the tasks that produced them. Each task is assigned to at least two of the processors, although assigned tasks are selected for execution individually. Thus, a task which was executed by a first processor and whose results have been transmitted might not be executed by a second processor to which it is also assigned, because the first processor has already executed it. Also, the selection of a task for execution by one processor is communicated to the other processors, and the selection deters another processor to which the task was assigned from also selecting it.
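The data-driven selection rule described above can be sketched as follows. This is an illustrative Python sketch only, not the scheduler of the '173 patent. The task names, the dependency table, and the function `ready_tasks` are assumptions introduced for the example. A task is treated as selectable when all of its inputs have been received, its own result is not yet known, and no other processor has announced that it selected the task.

```python
def ready_tasks(dependencies, results, claimed):
    """Return the tasks that a processor could select right now.

    dependencies: task name -> list of tasks whose results it needs as inputs
    results:      task name -> result received so far (from any processor)
    claimed:      tasks another processor has announced it selected
    """
    return sorted(
        task
        for task, inputs in dependencies.items()
        if task not in results                       # not already executed
        and task not in claimed                      # deterred by another's selection
        and all(dep in results for dep in inputs)    # every input has arrived
    )


# Hypothetical task graph: C needs the results of A and B; D needs C.
dependencies = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}

results = {"A": 1}   # A's result has been received
claimed = {"B"}      # another processor has selected B

# Nothing is selectable: B is claimed, C still waits on B, D waits on C.
first = ready_tasks(dependencies, results, claimed)

# B's result arrives from the other processor; C becomes selectable.
results["B"] = 2
claimed.discard("B")
second = ready_tasks(dependencies, results, claimed)
```

Under these assumptions, `first` is empty and `second` contains only task C, which shows how a task is held back until every required input has been received.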
Finally, the controller includes a fault handler, which identifies other processors as being faulty based on messages received therefrom, sends the messages from non-faulty processors to the scheduler, and sends the other controllers messages indicating which other processors it has identified as being faulty. Faults are detected by techniques such as comparing the results of an executed task with known range limits, comparing the results of the same task executed by two or more processors, using error detection codes, analyzing the scheduling sequence, or using watchdog timers. If it has been determined that a processor is faulty, such as by fault detection messages from two or more other processors, then the results of the faulty processor are discarded or ignored. However, the subsequent messages from the processor are monitored, and if the processor appears to have returned to normal operation, the processor is restored to full participation in the system.
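Two of the detection techniques named above, range-limit checking and comparison of results from several processors, together with the rule that a processor is treated as faulty once two or more other processors report it, can be sketched as follows. The sketch is in Python and is illustrative only. The function names, the strict-majority comparison rule, and the sample values are assumptions and are not taken from the Bendix patents.

```python
from collections import Counter


def out_of_range(value, low, high):
    """Range-limit check: True if a result falls outside its known limits."""
    return not (low <= value <= high)


def suspects_by_comparison(results):
    """Compare the same task's results across processors.

    Assumption for this sketch: a processor is suspect if its result differs
    from a value reported by a strict majority of the processors.
    """
    value, count = Counter(results.values()).most_common(1)[0]
    if count <= len(results) / 2:
        return set()  # no strict majority, so no basis for suspicion
    return {proc for proc, r in results.items() if r != value}


def faulty(accusations, threshold=2):
    """Processors accused by at least `threshold` other processors.

    accusations: reporting processor -> set of processors it accuses
    """
    tally = Counter(p for accused in accusations.values() for p in accused)
    return {p for p, n in tally.items() if n >= threshold}


# Hypothetical results of one task executed by three processors.
results = {"P1": 10, "P2": 10, "P3": 97}
suspects = suspects_by_comparison(results)        # P3 disagrees with the majority
p3_bad_range = out_of_range(results["P3"], 0, 50)

# P1 and P2 both report P3; P3 alone reports P1.
accusations = {"P1": {"P3"}, "P2": {"P3"}, "P3": {"P1"}}
declared_faulty = faulty(accusations)
```

Note that every check in this sketch operates on a result or message that was actually received. A processor that sends nothing gives these checks nothing to examine.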
It will thus be seen that the multiple-processor system described in the Bendix patents improves the efficiency of processor utilization by allocating different tasks to different processors, while allowing for processor failure. However, several disadvantages remain. First, since each task is allocated to at least two processors, there is inefficiency relating to throughput and overhead in scheduling and selecting each task. Moreover, if a task is assigned to several processors, they all may execute the task, thus utilizing processing resources inefficiently. The Bendix system does not provide load balancing or accommodate dynamic input of new pending tasks.
Also, the fault detection system is based on the receipt and identification of erroneous messages from a faulty processor. A crash failure, in which a processor ceases processing and sending messages altogether, is not efficiently detected, and the pending tasks assigned to the crashed processor are not efficiently reallocated. In addition, the system does not tolerate multiple processor failures. If a pending task is assigned to a subset of the processors, then the task is not executed if all of the processors in the subset fail, even though other processors might have available processing time.
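The multiple-failure limitation can be made concrete with a short sketch. It is given in Python for illustration, and the task names, processor names, and the function `stranded_tasks` are hypothetical. A task is stranded when none of the processors to which it was assigned remains operational, regardless of how many other processors are idle.

```python
def stranded_tasks(assignment, operational):
    """Return tasks whose assigned processors have all failed.

    assignment:  task name -> processors the task was assigned to
    operational: set of processors that are still running
    """
    return sorted(
        task
        for task, procs in assignment.items()
        if not (set(procs) & set(operational))
    )


# Each task is assigned to two of four processors.
assignment = {
    "T1": ["P1", "P2"],
    "T2": ["P2", "P3"],
    "T3": ["P3", "P4"],
}

# P1 and P2 have both crashed. P3 and P4 are running and may have spare
# capacity, but T1 was never assigned to either of them.
stranded = stranded_tasks(assignment, operational={"P3", "P4"})
```

In this example only T1 is stranded: two processors remain operational, yet T1 is never executed because both of its assigned processors have failed.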
Therefore, there remains a need for a multi-processing system which further improves the efficiency of processor utilization, which identifies crash failures and reallocates pending tasks efficiently, and which is sufficiently fault tolerant to guarantee that a pending task will be executed as long as one or more processors remain operational.