It is well known that most computer systems in current use consist of a single processor with concomitant memory and peripheral devices. Recently, however, multicomputer environments, consisting of the interconnection of multiple processors, have become available. In such environments, the computational tasks or loads are accomplished by distributing them across the available plurality of processors.
It is further known in the prior art that the preferable multicomputer operating environment is one in which parallel processing is performed. Generally, computer systems with parallel processors have either shared memory or dedicated memory. In shared memory computer systems, all of the available memory is shared among all of the parallel processors. Thus, the available memory is not associated with any individual processor but is a resource associated with the entire computer system. On the other hand, in a dedicated memory computer system, the available memory is allocated to each individual processor. Each quantum of memory allocated to a processor is for that processor's exclusive use. No sharing between processors occurs.
Regardless of whether the memory in a parallel computer system is shared or dedicated, for a particular process to be accommodated under this environment, its panoply of computational tasks must be subdivided into a set of parallel components. As is known to those skilled in the art, parallel components may be executed separately and independently of other parallel components. But as is further known to those skilled in this art, the subdivision of a process into parallel components is often a difficult task in itself. As an illustration, in U.S. Pat. No. 4,468,736, DeSantis, et al., disclose a method for decomposing a process into independent, disjoint tasks for parallel processing. Once the parallel components of a process have been established, it will become clear that they must be distributed among the processors of a multicomputer system to effectuate acceptable throughput.
The distribution or "balancing" of a multicomputer's load among its constituent processors may be referred to as "load balancing." Conventional load balancing methodologies have sought to allocate the various loads assigned to a multicomputer system by exploiting the architecture of a particular computer hardware configuration. This machine-dependency arises because the optimal distribution of tasks in a multicomputer environment may be achieved only by enumeration of all possible task configurations. Such enumeration is prerequisite to achieving the optimal balancing of tasks because the distribution of parallel tasks among a plurality of processors, like the traveling salesman and graph-partitioning problems, has been shown to be a member of the class of nondeterministic polynomial-time complete (NP-complete) problems. It is known to those skilled in the art that such NP-complete problems are intractable and defy analytical solution, as discussed by O. I. El-Dessouki and W. H. Huen in the IEEE Trans. Computers, vol. C-29, no. 9, September 1980, pp. 818-825, in their article entitled "Distributed Enumeration on Network Computers."
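The cost of the enumeration described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that each task is characterized by a single numeric size and that the quality of a distribution is measured by the heaviest processor load; the function name optimal_assignment and its parameters are hypothetical and do not come from the cited references.

```python
from itertools import product


def optimal_assignment(task_sizes, num_processors):
    """Exhaustively examine every task-to-processor assignment.

    Returns (smallest heaviest-load found, one assignment achieving it,
    number of configurations examined).
    """
    best_load = float("inf")
    best_assignment = None
    examined = 0
    # Each task may go to any processor, giving
    # num_processors ** len(task_sizes) configurations in total.
    for assignment in product(range(num_processors), repeat=len(task_sizes)):
        examined += 1
        loads = [0] * num_processors
        for size, proc in zip(task_sizes, assignment):
            loads[proc] += size
        heaviest = max(loads)
        if heaviest < best_load:
            best_load = heaviest
            best_assignment = assignment
    return best_load, best_assignment, examined


if __name__ == "__main__":
    load, assignment, examined = optimal_assignment([4, 3, 3, 2], 2)
    print(load, assignment, examined)
```

For four tasks on two processors only 2**4 = 16 configurations are examined, but the count grows as the number of processors raised to the number of tasks, so one hundred tasks on four processors would already require on the order of 10**60 configurations.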
In a parallel multiprocessor environment, the objective of load balancing is to distribute computational loads among the processors so that each processor executes an equivalent load. Indeed, the more uniformly tasks are distributed among the processors, the more effectively the multicomputer system is utilized, because the processors are more likely to be actively performing computational tasks. This balancing is generally performed either statically or dynamically.
Static load balancing is conventionally used when the parallel computational components of a process can be completely ascertained prior to their execution. Dynamic load balancing is usually used when the attributes of the parallel computational components of a process vary over time, or when none of these attributes can be ascertained prior to execution.
For a multicomputer system with many tasks, the enumeration method of distributing tasks is clearly impractical and unmanageable. Accordingly, it is well known in the prior art that heuristic methods may be used to achieve a reasonable, albeit suboptimal, distribution of tasks as herein discussed. It is apparent in the prior art that achieving optimal load balancing in a parallel processing environment requires a formidable expenditure of processing time. It is conventional to avoid these rigorous constraints by heuristically ascertaining a suboptimal load balance. Such a heuristic determination is achieved at a mere fraction of the system resources and without the hereinbefore mentioned information about the composition of the process load mix.
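As a hedged illustration of such a heuristic, the sketch below uses a common greedy rule: tasks are taken largest first, and each is placed on the processor that currently carries the least load. This particular rule is an assumption chosen for illustration and is not attributed to any of the references discussed herein; the name greedy_balance and the representation of a task as a single numeric size are likewise hypothetical.

```python
import heapq


def greedy_balance(task_sizes, num_processors):
    """Assign tasks largest-first to the currently least-loaded processor.

    Returns (list of per-processor loads, dict mapping task index to processor).
    """
    # Min-heap of (current load, processor id); ties resolve to the lower id.
    heap = [(0, proc) for proc in range(num_processors)]
    heapq.heapify(heap)
    assignment = {}
    # Stable sort, so equally sized tasks keep their original order.
    for task, size in sorted(enumerate(task_sizes), key=lambda item: -item[1]):
        load, proc = heapq.heappop(heap)
        assignment[task] = proc
        heapq.heappush(heap, (load + size, proc))
    loads = [0] * num_processors
    for load, proc in heap:
        loads[proc] = load
    return loads, assignment


if __name__ == "__main__":
    loads, assignment = greedy_balance([3, 3, 2, 2, 2], 2)
    print(loads)  # [7, 5]
```

In this example the heuristic yields loads of 7 and 5, whereas placing the two tasks of size 3 on one processor and the three tasks of size 2 on the other gives 6 and 6. The result is reasonable but suboptimal, and it is obtained after a single sorted pass over the tasks rather than an examination of every configuration.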
One such heuristic method known in the prior art is called "pipelining." This method is applicable to processes which can be subdivided into parallel processes which need minimal amounts of data. When the first available processor requests a load, a process and its concomitant data are pipelined thereto. As is known to those skilled in the art, this method is useful only if the computational time is longer than the time expended initiating the computation and communicating its results. It will become clear that if the contrary occurs, the processors tend to remain idle because too much time is expended on information flow.
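The break-even condition stated above can be expressed as a small worked model. The sketch assumes, for illustration only, that a pipelined load is described by three times in the same unit (initiation, computation, and communication of results) and that these do not overlap; the function and parameter names are hypothetical.

```python
def pipelining_worthwhile(compute_time, startup_time, transfer_time):
    """True when computation outlasts the initiation and communication overhead."""
    return compute_time > startup_time + transfer_time


def busy_fraction(compute_time, startup_time, transfer_time):
    """Fraction of a processor's time spent computing rather than waiting."""
    total = compute_time + startup_time + transfer_time
    return compute_time / total


if __name__ == "__main__":
    # Computation dominates: the processor is busy most of the time.
    print(pipelining_worthwhile(10.0, 1.0, 2.0), busy_fraction(10.0, 1.0, 2.0))
    # Overhead dominates: the processor is mostly idle.
    print(pipelining_worthwhile(1.0, 1.0, 2.0), busy_fraction(1.0, 1.0, 2.0))
```

Under this simplified model, a load with 10 units of computation and 3 units of overhead keeps its processor busy for 10/13 of the time, while a load with 1 unit of computation and the same overhead keeps it busy for only one quarter of the time.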
Another method known in the prior art is called "vectorizing." This method is applicable to independent processes for which identical computations are performed. Multiple identical computations are performed on large arrays during each iteration, and each such iteration is uniformly distributed among the available processors.
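The uniform distribution of each iteration may be sketched as follows. The example assumes that the array is cut into contiguous, nearly equal slices, one per processor, and it emulates the processors sequentially rather than running them in parallel; split_evenly and vectorized_step are hypothetical names introduced only for this illustration.

```python
def split_evenly(n_items, num_processors):
    """Return (start, stop) index bounds giving each processor a near-equal slice."""
    base, extra = divmod(n_items, num_processors)
    bounds = []
    start = 0
    for proc in range(num_processors):
        # The first `extra` processors take one additional element.
        stop = start + base + (1 if proc < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def vectorized_step(values, num_processors, op):
    """Apply the same operation to every element, one slice per processor.

    The slices are processed one after another here; on a multicomputer
    each slice would be handled by its own processor.
    """
    result = []
    for start, stop in split_evenly(len(values), num_processors):
        result.extend(op(v) for v in values[start:stop])
    return result


if __name__ == "__main__":
    print(split_evenly(10, 4))  # [(0, 3), (3, 6), (6, 8), (8, 10)]
    print(vectorized_step([1, 2, 3, 4, 5], 2, lambda v: v * v))
```

Because every element requires the identical computation, slices of equal length represent equal loads, and no knowledge of the individual elements is needed to balance them.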
Several methods and systems have been developed to improve the load balancing art. For example, Hartung, et al., in U.S. Pat. No. 4,633,387 teach a method of dynamic load balancing whereby work queues in a shared memory environment are examined to ascertain whether work-requesting thresholds have been met. Similarly, Ruhman, et al., in U.S. Pat. No. 4,491,932 disclose a method to partition shared memory for distributing the loads of disjoint processes into a reconfigurable array. In U.S. Pat. No. 4,495,570, Kitajima, et al. disclose a method for dynamically distributing the loads in a dedicated memory parallel processing environment whereby a processing request allocator executes service requests based upon process waiting and delay times.
Typically, the load balancing methodology used must handle an arbitrary set of tasks. That is, no a priori information about the number or size of the tasks is known. However, in applications where task information is known a priori, special methodologies incorporating the task sizes and respective interdependencies therein may be developed. An example of such an application might be a mail carrier who is assigned a maximum number of letters and packages to deliver in a predefined geographical area. A similar example might be the delivery of packages by Federal Express wherein each truck is allocated a maximum number of "loads" which are delivered to predefined locations. Another application might be the mapping of billions of stars in a galaxy whereby each connection between the stars exhibits an identical operation.
Another example of a set of tasks whose sizes and interdependencies are known is a simulated neural network. Such a network consists of multiple, similar processes, whereby nodes, called neurons, are systematically interconnected via synapses. The neurons may be subdivided into groups which, it will be seen, execute in parallel. For a typical neural network, consisting of hundreds of neurons and thousands of connections, it has been difficult to effectively distribute the processing loads absent using the costly and time-consuming enumeration method.
In such a neural network where the processing at each node is identical, the prior art has been faced with two problems. The first problem is how to effectively deal with the large memory requirements of the network, typically represented as arrays. The objective is for the processing units to perform the requisite calculations while utilizing minimal memory. The second problem is how to efficiently execute the myriad identical computations throughout the network. Since each node performs an action related solely to itself and to its interconnecting nodes, one solution might be to allocate each node to a processor in a multicomputer environment. Each of these processors would execute the computations for one node in parallel with the computations executed by the other processors. It is apparent that this solution is impractical because multicomputer systems typically do not consist of hundreds of processors.
It is well known in the prior art that the typical multicomputer system consists of from four to one hundred processors. Accordingly, to efficiently process a neural network requires a method of grouping the myriad computations into subsets which can be distributed among the available parallel processors. The paper "Design of a Neural Network Simulator on a Transputer Array" by Gary McIntire, et al., presented at the Space Operations-Automation and Robotics Workshop at NASA / Johnson Space Center on Aug. 5-7, 1987, elucidates the nature of the problem and subsetting strategies.
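One way such grouping might be carried out is sketched below. The sketch is an illustrative assumption only and is not the strategy of the cited paper: it supposes that the work of a neuron is proportional to its number of incoming connections, and it cuts the neurons, in index order, into contiguous groups of roughly equal total connection count. The name group_neurons and the fan_in list are hypothetical.

```python
def group_neurons(fan_in, num_processors):
    """Cut neurons 0..n-1 into contiguous groups with similar connection totals.

    fan_in[i] is the assumed number of incoming connections of neuron i.
    Returns one list of neuron indices per processor.
    """
    total = sum(fan_in)
    groups = [[] for _ in range(num_processors)]
    running = 0
    for neuron, count in enumerate(fan_in):
        if total == 0:
            proc = 0
        else:
            # The midpoint of this neuron's share of the total work selects
            # the processor whose equal-sized share contains it.
            midpoint = running + count / 2
            proc = min(int(midpoint * num_processors / total), num_processors - 1)
        groups[proc].append(neuron)
        running += count
    return groups


if __name__ == "__main__":
    fan_in = [4, 4, 4, 4, 8, 8]
    groups = group_neurons(fan_in, 2)
    print(groups)  # [[0, 1, 2, 3], [4, 5]]
    print([sum(fan_in[n] for n in g) for g in groups])  # [16, 16]
```

Here six neurons are distributed over two processors, each of which receives 16 connections to evaluate, even though one group holds four neurons and the other only two.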
As has been hereinbefore discussed, those skilled in the prior art are familiar with various static and dynamic methods which have attempted to distribute loads among parallel processors. For instance, the paper "Performance Tradeoffs in Static and Dynamic Load Balancing Strategies" by Ashraf Iqbal, Joel H. Saltz and Shahid H. Bokhari, under NASA contracts NAS1-17070 and NAS1-18107, describes the limitations of various static and dynamic load balancing methods. None of the methods referenced therein, however, has sought to accomplish such distribution concomitant with the utilization of minimal memory.