1. Field of the Invention
The present invention relates generally to stack data management in multicore processors. More particularly, it relates to techniques for stack data management in scratch-pad based multicore processors and Limited Local Memory (LLM) multicore processors.
2. Description of Related Art
As processors transition from few-core to many-core designs, scaling the memory architecture is becoming an important challenge. Intel dual-core, quad-core, and Nehalem architectures are shared memory architectures, in which coherent caching mechanisms, typically implemented in hardware, provide the illusion of a single unified memory to applications. This allows applications written in the unicore era to run on multi-core processors. Even with recent advances in lazy cache coherence protocols, implementing hardware cache coherence for many-core processors incurs too much overhead in terms of both power and performance.
A promising option for a more power-efficient and scalable memory hierarchy is to have only scratchpad memory in the cores. Since scratchpads consume 30% less area and power than a direct-mapped cache of the same effective capacity, Scratchpad-based Multicore Processor (SMP) architectures can be extremely power efficient. A good example of an SMP memory architecture is the Cell processor used in the Sony PlayStation 3. Its power efficiency is around 5 GFlops per watt, while the power efficiency of an Intel i7 4-core Bloomfield 965 XE is only 0.5 GFlops per watt.
A Scratchpad-based Multicore Processor (SMP) architecture is a truly “distributed memory architecture on-a-chip.” Applications for it must therefore be written as a set of interacting tasks, which are then mapped to the cores of the SMP architecture. Conventionally, a main task executes on a main core and creates execution tasks, which are then distributed to and executed on execution cores. The main core has access to a large global or main memory, but each execution core has only a small local memory (the scratchpad memory). The execution cores can directly access only their local memory. To access other memories, including the global memory, explicit Direct Memory Access (DMA) instructions are needed in the application. In such architectures, the local memory is shared among the code and all data (stack, global, and heap) of the task executing on the core. If the task fits into the local memory, extremely power-efficient execution can be achieved, and this is indeed the promise of SMP architectures.
However, in the general case, when the code and data of the task do not all fit in the local memory, explicit data management must be performed to enable execution. The programmer can do this by bringing in data or code before it is needed and evicting it back to the global memory once it is no longer needed. This is very difficult, since the programmer must not only be aware of the local memory available in the architecture, but also be cognizant of the memory requirement of the task at every point in the execution of the program. Estimating the memory requirement is difficult for C/C++ programs: although the code and global data sizes are known at compile time, stack and heap sizes may be variable and input dependent. This programming difficulty has been the biggest roadblock to the success of extremely power-efficient SMP architectures.
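The input dependence of stack size can be illustrated with a minimal C sketch. Everything named here is hypothetical and introduced only for illustration: `FRAME_BYTES` is an assumed per-frame cost (the real figure depends on the compiler and ABI), and `peak_depth` is instrumentation added for the example rather than part of any described technique.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed bytes per stack frame; the real value depends on compiler and ABI. */
#define FRAME_BYTES 64u

/* Deepest recursion level observed so far (illustrative instrumentation). */
size_t peak_depth = 0;

/* Recursively sums v[0..n-1]. The call chain uses one frame per element
 * plus one frame for the base case, so its depth grows with the input. */
long sum_list(const int *v, size_t n, size_t depth)
{
    assert(v != NULL || n == 0);
    if (depth > peak_depth)
        peak_depth = depth;
    if (n == 0)
        return 0;
    return (long)v[0] + sum_list(v + 1, n - 1, depth + 1);
}

/* Estimated peak stack use for the deepest call chain seen so far. */
size_t peak_stack_bytes(void)
{
    return peak_depth * FRAME_BYTES;
}
```

Calling `sum_list(v, n, 1)` on an `n`-element array reaches a depth of `n + 1`, so the estimated stack footprint is known only once the input is known. The code size of `sum_list`, by contrast, is fixed at compile time.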
To enable execution on a core of an SMP architecture, all code and data must be managed on the local scratchpad, and researchers have started to develop techniques to manage code, stack data, and heap data for cores with only scratchpad memories. Of these techniques, efficient approaches to managing stack data are especially important, since an average of 64% of all accesses in embedded applications may be to stack variables.
Another type of processor architecture is the Limited Local Memory (LLM) architecture. LLM multi-core architectures are scalable, distributed memory architectures that are quite power efficient. In an LLM multi-core processor, each core has a scratchpad-like local memory, which is not cached. Any data transfers between the global memory and the local memory must be explicitly present as Direct Memory Access (DMA) commands in the application. The IBM Cell BE, which has a 256 KB local memory on each core, is a good example of an LLM multi-core architecture.
LLM multi-core architectures are programmed in a multithreaded paradigm with MPI-like (Message Passing Interface) explicit communication between the threads. The application threads are mapped to the cores. If the entire code and data of the thread executing on a core fit into the local memory of that core, the application will execute extremely power efficiently, and this is indeed the promise of LLM multi-core architectures. However, if the data requirements of the thread exceed the size of the local memory, there are two main options. First, the programmer can re-partition and re-parallelize the application by changing the algorithm. However, changing the natural parallelization of an application can be counterintuitive and a formidable task. Second, the programmer can manage thread data in the local memory. This implies inserting DMA calls to bring data in before it is needed and to evict less urgently needed data from the local memory, so that the thread can operate within the local memory size constraints.
The chief attraction of the second option, i.e., data management, is that it keeps application programming natural and easy, and the data management problem may be simpler, since it is local to a thread (and core). In the absence of tools or libraries that assist in data management (e.g., compiler support), it is typically done manually, and requires the programmer to know which variables are needed and should be brought into the local memory, and which ones are less urgently needed and can therefore be evicted from the local memory for a while.
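The manual data management described above can be sketched in C as follows. This is only an illustration under stated assumptions: `dma_get` and `dma_put` are hypothetical stand-ins implemented with `memcpy`, not the DMA interface of any particular processor, and `LOCAL_WORDS` is an arbitrary local buffer capacity chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical capacity of the local buffer, in ints. */
#define LOCAL_WORDS 4

/* Stand-in for a DMA transfer from global to local memory (assumption:
 * real hardware would issue a DMA command here instead of memcpy). */
void dma_get(int *local, const int *global, size_t n)
{
    memcpy(local, global, n * sizeof *local);
}

/* Stand-in for a DMA transfer from local back to global memory. */
void dma_put(int *global, const int *local, size_t n)
{
    memcpy(global, local, n * sizeof *global);
}

/* Doubles every element of a global array that is larger than the local
 * buffer, one chunk at a time. Returns the number of transfers issued. */
size_t double_all(int *global, size_t n)
{
    int local[LOCAL_WORDS];
    size_t transfers = 0;

    assert(global != NULL || n == 0);
    for (size_t base = 0; base < n; base += LOCAL_WORDS) {
        size_t left = n - base;
        size_t chunk = left < LOCAL_WORDS ? left : LOCAL_WORDS;

        dma_get(local, global + base, chunk);   /* bring data in before use */
        for (size_t i = 0; i < chunk; i++)
            local[i] *= 2;                      /* compute on local memory only */
        dma_put(global + base, local, chunk);   /* evict to free the buffer */
        transfers += 2;
    }
    return transfers;
}
```

Even in this small case the programmer has to pick the chunk size, place both transfers, and handle the final partial chunk by hand. That bookkeeping grows quickly when the data in question is a stack whose depth is not known in advance.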
Thus, there is a need for improved systems and methods for managing stack memory in SMP and LLM architectures.