A. Field of the Invention
The present invention relates generally to optimizing memory allocation, and more particularly to using evolutionary algorithms to automatically optimize the memory allocation of a software application in a multiple memory architecture.
B. Description of the Related Art
Allocating application processing to various processors and memory banks in complex embedded multiprocessor systems to achieve optimum performance is an extremely difficult problem. When developing a real-time application on a multi-processor non-uniform memory access (NUMA) system, there are a number of factors that can affect application execution speed including processor types, memory types, concurrent utilization of memory and bus resources, and hierarchical interconnection schemes. These factors all affect application execution in ways that are both difficult to model and to predict. Even when these effects can be measured, it can be difficult for a programmer to use this information to minimize the combined effect of all the factors on execution time.
FIG. 1 shows an example of a multi-processor NUMA system 100, such as a VME based quad Texas Instruments C40 Digital Signal Processor (DSP) board manufactured by Mizar Inc. (Model MZ7772), on which an application may be run. System 100 includes a plurality of ALU processor cards 110 each having an ALU 112, a small and very fast on chip static RAM (SRAM) 114, a private, small and fast memory 116 and a shared, small and fast memory 118. A buffer 120 interconnects each of the cards 110 to a local bus 130. A large, slow shared memory 140, such as a global SRAM or hard disk drive, is coupled directly to local bus 130. System 100 also includes an asynchronous interface 150, which couples local bus 130 with a VME bus 160 to which a very large and slow memory 170 is connected.
As shown in FIG. 1, the memory hierarchy in system 100 includes four distinct banks of memory, each with different performance levels including small and very fast on chip SRAM 114, small but not as fast memories 116 and 118, large and slow shared memory 140, and very large slow globally shared memory 170. In an architecture such as system 100 in FIG. 1, the choices a programmer makes in selecting where data and intermediate results are stored can have a significant impact on program performance, often by an order of magnitude or more. DSPs can use, for example, the Harvard bus architecture, which has multiple buses that allow a CPU to retrieve data from two different memory banks in parallel. Furthermore, instructions are used that can fetch two pieces of data and perform a mathematical calculation all in one clock cycle.
In a simple implementation of a DSP algorithm, only one bank of memory is used for both the program code and data. This implementation exhibits poor performance due to excessive contention for the single memory bank. The situation can be exacerbated by the use of on-chip direct memory access (DMA) engines, which are commonly used to move input and output data between processors and I/O subsystems and can cause further contention on the various data buses available to the system.
Performance can be improved by separating data that is retrieved in parallel into different banks. Another option is to place the buffers that require the greatest amount of access into the fastest memory banks, although these fast memory banks may not be large enough to allow this allocation in some memory-restricted architectures. Thus, the buffer allocation problem can be viewed as a combination of an optimal packing problem and a contention reduction problem.
The manual approach to optimizing the data buffer allocation is commonly done by trying various placements of buffers to different memory banks and measuring the resulting execution time of the software. Many iterations are usually necessary to optimize performance. The difficulty of the problem rapidly increases with the number of buffers that need to be allocated. With many modem applications requiring hundreds of buffers, the manual approach is beyond what can be effectively solved by a programmer, requiring the use of sub-optimal techniques such as local optimization of key execution loops, rather than the entire program.
A method consistent with the present invention for optimizing the allocation of data buffers referenced by a software program among a plurality of memory banks in a target system generates randomly a population of trial allocations of the data buffers among the plurality of memory banks, determines a fitness score for each of the trial allocations, creates a new trial allocation based on at least one of the trial allocations in the existing population, repeats the determining and creating steps until the occurrence of a predetermined event, and identifies the trial allocation having the best fitness score.
Both the foregoing general description and the following detailed description provide examples and explanations only. They do not restrict the claimed invention.