Parallel processing, which generally comprises employing a plurality of microprocessors coupled to the same system to concurrently process a batch of data, is of great importance in the computer industry. Generally, there are three major types of parallel processing systems: those employing shared memory, those employing distributed memory, and those employing a combination of the two. Typically, shared memory is memory that can be accessed in a single operation, such as a “load” or “read” command, by a plurality of processors. Distributed memory is memory that is localized to an individual processor. In other words, each processor can access its own associated memory in a single access operation, but typically cannot access memories associated with other processors in a single operation. Finally, there is hybrid, or “heterogeneous,” parallel processing, in which some memory is shared and some memory is distributed.
One example of a hybrid parallel processor system comprises a reduced instruction set computer (RISC) main processor unit (MPU), such as a PowerPC™ processor, and a specialized, or “attached,” processor unit (APU), such as a Synergistic™ processor unit (SPU). Typically, the MPU is employed to execute general purpose code, wherein the general purpose code comprises complex control flows, and to orchestrate the overall hybrid parallel processing function. The MPU has access to the full range of system memory. Although only one MPU is used in one embodiment, more than one MPU is used in other embodiments. The APU is generally directed to executing dataflow operations. In other words, the APU processes highly repetitive multimedia, graphics, signal, or network processing workloads, which are characterized by high ratios of computation to control decisions. In conventional hybrid systems, APUs do not have direct access to the system memory, and their own memory, the local store, is typically significantly smaller than the shared memory.
Generally, while employment of the hybrid system provides high computational performance, it poses significant challenges to the programming model. One such challenge relates to the APU, which cannot directly address system memory. Therefore, any code to be run on the APU has to be transferred to the associated local store of the APU before this code can be executed on the APU. Another significant challenge presented by such a hybrid system is that the main and attached processors may have distinct instruction sets and microarchitectures.
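The transfer-before-execute constraint described above can be illustrated with a small toy model. This is only a sketch under stated assumptions: the class names `SystemMemory` and `AttachedProcessor`, the method names, and the byte sizes are hypothetical and do not correspond to any actual MPU or APU interface. The copy in `transfer_in` merely stands in for whatever transfer mechanism a real system would use.

```python
class SystemMemory:
    """Toy model of MPU-visible system memory: a name -> bytes mapping (hypothetical)."""

    def __init__(self):
        self.images = {}

    def store(self, name, image):
        self.images[name] = bytes(image)


class AttachedProcessor:
    """Toy APU that can only run code already resident in its local store (hypothetical)."""

    def __init__(self, local_store_size):
        self.local_store_size = local_store_size
        self.local_store = {}  # name -> bytes currently resident

    def used(self):
        # Total number of bytes occupied in the local store.
        return sum(len(image) for image in self.local_store.values())

    def transfer_in(self, system_memory, name):
        # Copy an image from system memory into the local store.
        image = system_memory.images[name]
        if self.used() + len(image) > self.local_store_size:
            raise MemoryError(f"{name} does not fit in the local store")
        self.local_store[name] = image

    def execute(self, name):
        # The APU cannot address system memory, so non-resident code is rejected.
        if name not in self.local_store:
            raise RuntimeError(f"{name} is not resident in the local store")
        return f"ran {name} ({len(self.local_store[name])} bytes)"


if __name__ == "__main__":
    memory = SystemMemory()
    memory.store("fft", b"\x00" * 100)
    apu = AttachedProcessor(local_store_size=256)
    apu.transfer_in(memory, "fft")   # must happen before execute()
    print(apu.execute("fft"))
```

In this model, calling `execute` before `transfer_in` raises an error, which mirrors the requirement that code be placed in the local store first.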
One problem in managing program execution in an APU is that the code and data in the APU can reside in the same limited-size, unpartitioned local memory of the APU, thereby leading to information manipulation issues. Also, functions that execute in the APU, after having been invoked from the MPU, frequently need to access common, or “shared,” data sections and to execute other common, or shared, subroutines. Finally, the code destined to execute in an APU could be larger than can reasonably fit in the memory of the APU.
One component in extracting performance from a computer architecture such as that described above is the creation of an optimum partitioning between code and data. Such a partition should facilitate code and data reuse, and hence minimize data transfer. An efficient partitioning of code allows for the execution of programs that would otherwise be too large to fit in the local memory of the APU. However, in heterogeneous systems such as those described, other problems can arise in implementing this efficient partitioning of code. The small memory size of the APU means that a particular function targeted to the APU can often result in an APU code stream that is in fact too large to execute efficiently on the APU.
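As a rough illustration of why partitioning lets an oversized program run in a small local memory, the sketch below packs functions, in order, into segments that each fit within a fixed byte budget, so that only one segment needs to be resident at a time. It is a simple sequential greedy packing chosen for illustration, not an optimum partitioning and not a method taken from this document. The function name `partition_into_overlays`, the function names, and the sizes are all hypothetical, and splitting a single function that exceeds the budget is outside the scope of the sketch.

```python
def partition_into_overlays(function_sizes, budget):
    """Group (name, size) pairs, in order, into segments of at most `budget` bytes.

    Hypothetical helper: a sequential greedy packing, not an optimal partition.
    """
    overlays = []
    current, used = [], 0
    for name, size in function_sizes:
        if size > budget:
            # A single function larger than the budget cannot be placed at all
            # by this sketch; it would have to be split further.
            raise ValueError(f"{name} ({size} bytes) exceeds the budget of {budget}")
        if used + size > budget:
            # The current segment is full: close it and start a new one.
            overlays.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        overlays.append(current)
    return overlays


if __name__ == "__main__":
    sizes = [("a", 40), ("b", 50), ("c", 30), ("d", 70)]
    # 190 bytes of code in total, but no segment exceeds 100 bytes.
    print(partition_into_overlays(sizes, budget=100))  # [['a', 'b'], ['c', 'd']]
```

The trade-off is that every switch between segments costs a transfer, which is why a partition that groups code used together tends to reduce data movement.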
In one embodiment, multiple APU functions cooperate on a single problem through the potentially extensive sharing and reuse of various code and data sections within a single combined binary. In such cases, a naive memory management scheme has the potential to result in significant inefficiencies in terms of both time and space. Typically, this is because multiple copies of the code and data for each targeted APU function can reside in the combined binary in the MPU system memory. Moreover, this conventional approach requires significant overhead in memory transfer, since multiple instances of pre-bound APU modules can each carry identical copies of shared APU code and data. This can slow memory traffic considerably, and may introduce unnecessary delays in processing in the APU local store.
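The space cost of duplicated sections can be made concrete with a small calculation. The sketch below compares the total size of a combined binary in which every pre-bound module carries its own copy of each section it uses against one in which each distinct section is stored once. The module names, section names, sizes, and the two helper functions are hypothetical and chosen only for illustration.

```python
def naive_size(modules, section_sizes):
    """Total bytes when every module carries its own copy of each section it uses."""
    return sum(section_sizes[s] for sections in modules.values() for s in sections)


def shared_size(modules, section_sizes):
    """Total bytes when each distinct section is stored only once."""
    unique_sections = set().union(*modules.values())
    return sum(section_sizes[s] for s in unique_sections)


if __name__ == "__main__":
    # Hypothetical sections (bytes) and the modules that use them.
    section_sizes = {"main_a": 10, "main_b": 12, "math_lib": 40, "tables": 30}
    modules = {
        "module_a": ["main_a", "math_lib", "tables"],
        "module_b": ["main_b", "math_lib", "tables"],
    }
    print(naive_size(modules, section_sizes))   # 162
    print(shared_size(modules, section_sizes))  # 92
```

With these assumed numbers, the duplicated layout is 70 bytes larger, and the same 70 bytes would also be transferred redundantly whenever both modules are loaded.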
Therefore, there is a need for memory management that operates in a heterogeneous architecture and overcomes the deficiencies of conventional memory management.