The present invention relates to a system and method for optimizing software performance, and more particularly to using evolutionary algorithms to automatically allocate application processing in complex multiprocessor environments.
Allocating application processing among the various processors in complex multiprocessor systems to achieve optimum performance is an extremely difficult problem. When developing applications for multi-processing systems, the software design is driven both by algorithmic considerations and by system performance considerations. With modern multi-stage interconnect systems (e.g., Mercury""s RACEWAY, Sky""s SkyChannel, and the Intel Paragon) the effect of inter-processor data routing on system performance is often difficult, if not impossible to predict, particularly for large systems.
As an example, consider the Simple Signal Processing (SSP) application 100 of FIG. 1. SSP application 100 includes a plurality of xe2x80x9cmatchedxe2x80x9d filters 110, a plurality of 2-D Fast Fourier Transform (FFT) beamformers 120, and a plurality of detectors 130. The output of every processor in each stage is broadcast to all processors in the next stage via data broadcasting functions 140 and 150. Data from M channels are fed through a simple time domain matched filter 110 using an overlap-add FFT technique (FFT length N). The overlap-add technique is a conventional signal processing technique as discussed in Digital Signal Processing by Alan V. Oppenheim and Ronald W. Schafer, Prentis Hall, Copyright 1975, incorporated herein by reference. All channels of each sample time are then fed or broadcast, via data broadcasting function 140 to a simple 2-D FFT beamformer 120, producing B beams. The beam data is then broadcast via data broadcasting function 150 to a simple threshold detector 130 that sets thresholds in each beam by examining adjacent beams.
Mapping this algorithm onto P processors involves the consideration of several tradeoffs, such as the size of the individual tasks, the relative speed of the processors, the interconnect, and the location of processors on the interconnect topology. Task size is chosen to maximize processor usage and minimize idle time. The ability to keep these processors busy is driven by the ability of the interconnect to move input and output data from functional units operating on different processors. Balancing I/O requirements and task processing size requires balancing utilization of the interconnection bandwidth and the processor execution speed. Proper placement of functional components on the multiprocessor can reduce the amount of communication necessary by confining intermediate results to a single processor. Many modem multi-processor connection circuitry divide the processor pool into sub-groups in such a way that communication with processors that are closely located is preferable (in terms of contention and speed) to communication with more remote processors. This affects function placement by imposing a preferred assignment so that functions that must share data are located close to one another.
Using the SSP application 100 as an example to illustrate the tradeoffs inherent in the processor placement, divide the filter processing into M parallel filters. Similarly, divide the beamformers 120 into S tasks, each generating all the beams for one time sample. Finally, break the detectors 130 into B parallel tasks, again one per beam. Thus, we would have M+S+B=T parallel tasks.
Several problems arise with this assignment scheme. First, the samples from each of the M matched filter 110 tasks must be routed to all of the B beamformer 120 tasks. Additionally, since many multiprocessor connection circuits are most efficient transferring contiguous blocks of data (as opposed to single data words), for certain systems it will be far more efficient to have the beamformer 120 tasks work on a contiguous block of samples (as opposed to single time slices). To simplify programming, this block size is often tied to the size of the FFT used for the matched filter 110 convolution (usually N divided by a power two), and is chosen to correspond to an efficient transfer block size. Therefore, the number of parallel tasks for beamforming becomes N/2i, where i is a design parameter. Finally, the output of the beamformer 120 tasks must each be sent to some of the B detectors 130, since these detectors use neighboring beams for threshold setting (in the most extreme case, all the beams would be used for each detection). Again, for performance and programming simplicity, this is best done in blocks.
An engineer with some knowledge of the system can make reasonable estimates for allocating application processing to achieve optimum performance given these processing parameters. However, optimum selections of these values will vary from machine to machine, and usually have to be determined by manually timing the tasks and hand-tuning the values. The difficulty of the problem rapidly increases with the number of processors that need to be allocated. With many modem applications requiring hundreds of processors, the manual approach is beyond the capability of even the most skilled engineer.
A method consistent with the present invention optimizes the allocation of processes among a plurality of processors in a target system. In operation, the method first generates a random population of trial allocations of the processes among the plurality of processors in a target system. Next, a fitness score is determined for each of the trial allocations. Then, a new trial allocation is created based on at least one of the trial allocations in the existing population. Finally, the system identifies the trial allocation having the best score.
Additional desires, features and advantages of the invention are set forth in the following description, apparent from the description, or may be learned by practicing the invention.
Both the foregoing general description and the following detailed description are exemplary and explanatory and are intended to provide further explanation of the invention as claimed.